Jackathon #7 Data Hacking

This week’s theme was “hacking data”. Initium Lab and six invited friends analyzed or visualized two data sets and turned the results into presentations. We introduced two of our own data sets to participants: 1) structured data from CCTV news broadcasts; 2) structured data from the Hong Kong Research Grants Council (RGC), which Initium Lab had lightly processed in advance.

The former was crawled from CCTV’s website and consists of the news scripts read by the anchors of CCTV’s daily news broadcast from 2011 to 2015. The latter was crawled from the website of Hong Kong’s Research Grants Council (RGC) and includes all funded projects, with their detailed information, at 8 institutions in Hong Kong over the last 10 years. Participants could pick either data set, and any work related to data journalism was welcome.

Idea Brainstorm

We started with a brainstorm, in which participants had a lively discussion and showed great interest in the two data sets we offered. Inspired by stock market prediction, Carlson Zhuo, a guest from the Chinese University of Hong Kong, came up with the idea of making political predictions by analyzing CCTV news. Chao Tianyi, a researcher with Initium Lab, suggested analyzing how deaths were reported on CCTV. As for the RGC data set, Andy Shu, our front-end engineer and editor, suggested investigating the most-funded professors, among other ideas. After this fruitful brainstorm, we started the Jackathon.

Jackathoning together

What Are the Fruits?

About CCTV Data Set

We produced several interesting outputs with this data set.

Andy created a Xinwenlianbo (CCTV News Simulcast) search engine in nearly 5 hours. With it, one can look up how many times a keyword is mentioned in the CCTV data set. The results are displayed as a bar chart along with the numbers. For example, to find out how often “计划生育” (family planning in China) has appeared in CCTV news in recent years, we can search for the keyword “计划生育” and see the result. Andy explained that the number represents the count of news reports rather than of word occurrences, i.e. if a word appears many times in one piece of news, it counts only as “1”.

Fig: Search results for “计划生育” in Xinwenlianbo
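
The counting rule Andy described, one hit per news report no matter how often the keyword appears inside it, can be sketched in a few lines of Python. This is only an illustration and not the code behind the search engine: the record layout (`date` and `text` fields), the function name `count_reports_by_year` and the sample reports are all assumptions made up for the example.

```python
from collections import Counter


def count_reports_by_year(reports, keyword):
    """Count the reports that mention `keyword`, grouped by year.

    A report counts once even if the keyword appears in it many times.
    """
    counts = Counter()
    for report in reports:
        if keyword in report["text"]:
            year = report["date"][:4]  # assumes ISO dates such as "2013-03-01"
            counts[year] += 1
    return dict(sorted(counts.items()))


# Invented sample records, only to exercise the function.
sample_reports = [
    {"date": "2013-03-01", "text": "计划生育工作会议召开，会议强调计划生育"},
    {"date": "2013-07-15", "text": "全国粮食丰收"},
    {"date": "2014-01-10", "text": "调整完善计划生育政策"},
    {"date": "2014-05-02", "text": "计划生育服务"},
]

result = count_reports_by_year(sample_reports, "计划生育")
for year, n in result.items():
    print(year, n)
```

The first sample report mentions the keyword twice but still adds only one to the 2013 total, which is the behavior described above. The per-year totals are the kind of numbers a bar chart could then display.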

Now that we have a search engine, what did we find? A joke about CCTV news circulates on the Internet in China: “CCTV daily news tells but three things: busy leaders, happy Chinese and suffering foreigners.” Playing with the data, Victoria Jin and Yan Rong, a student from Hong Kong Baptist University, found evidence to support this view. They found that headlines related to “busy leaders” made up 30% of the whole CCTV news data set. In addition, positive words were usually applied to China and Chinese people, whereas foreign countries and foreigners were usually accompanied by negative words.

Fig: Ratio of types in CCTV News. Rows: Total, Disaster, Happened, Attacked, Conflict, Turbulence, Loss, Failed. Columns: All news, International news, Ratio.

Bonnie Wang, also a student from Hong Kong Baptist University, analyzed the tone of the news on special days, including the Chinese Spring Festival and Tomb-sweeping Day. She found that CCTV used more positive words than negative ones on these two days, which confirmed the assumption she had made during the brainstorm. Carlson explored whether CCTV news data can be used to predict national policy and the stock market. Chao Tianyi studied how the deaths of prominent people were reported and counted how frequently they were mentioned.

Fig: Usage frequency of “逝世” (passed away) for different people

About RGC Data Set

We also made some interesting discoveries with the RGC data set.

Serina Xu, a PhD student in the Department of Computer Science at The University of Hong Kong, built a collaborator graph to look for potential collaborations among scholars at The University of Hong Kong. She visualized her data with KUMU and found that more outstanding schools usually showed more cooperation. In addition, professors usually sit at the center of a cluster, surrounded by several doctors.
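
A collaborator graph of this kind can be derived from project records by linking every pair of people who appear on the same project. The sketch below is a minimal illustration of that idea and not Serina’s actual workflow: the `investigators` field, the function names and the sample projects are hypothetical, and the real RGC data may be structured differently.

```python
from collections import Counter
from itertools import combinations


def collaboration_edges(projects):
    """Return a Counter mapping each pair of co-investigators to the
    number of projects they share."""
    edges = Counter()
    for project in projects:
        people = sorted(set(project["investigators"]))
        for a, b in combinations(people, 2):
            edges[(a, b)] += 1
    return edges


def collaborator_counts(edges):
    """Return how many distinct collaborators each person has."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {name: len(others) for name, others in neighbours.items()}


# Invented sample projects, only to exercise the functions.
sample_projects = [
    {"investigators": ["Prof. A", "Dr. B", "Dr. C"]},
    {"investigators": ["Prof. A", "Dr. D"]},
    {"investigators": ["Prof. A", "Dr. B"]},
]

edges = collaboration_edges(sample_projects)
degrees = collaborator_counts(edges)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

In this toy data “Prof. A” ends up with the most distinct collaborators, which mirrors the pattern of a professor sitting at the center of a cluster. The weighted edge list is the sort of table a graph visualization tool could take as input.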

Ronald Tse, from The Chinese University of Hong Kong, visualized the performance of different universities using a couple of indicators, as well as their efficiency. He used some specific measurements to evaluate the scholars who had obtained the most funding, along with their outputs.

Edited by Liu Xue