2013-08-29

By Jason Q. Ng

In 2008, Baidu’s chief scientist William Chang said, “There’s, in fact, no reason for China to use Wikipedia . . . It’s very natural for China to make its own products.” Today Hudong (baike.com) and Baidu Baike (baike.baidu.com) greatly eclipse the Chinese-language version of Wikipedia despite (or because of) the censorship known to take place on the sites. However, identifying outright instances or patterns of censorship can be difficult because the content is (mostly) user-generated and user-overseen. Instead, this project attempts a large-scale comparison of the three services, matching thousands of Chinese-language Wikipedia articles with their in-China counterparts in order to identify the “content gaps” in the two baike (Chinese for “encyclopedia,” which we use to refer to Hudong’s and Baidu’s online encyclopedias). Censorship, or at the very least anomalies in the generation of content, might be identified by articles that don’t exist, by “protected” articles that are not editable by regular users, and by articles that are much shorter than those on Wikipedia China. The word “might” is emphasized because of the distributed oversight of these online encyclopedias, where not only governments but also companies and users get to play the role of content gatekeeper. This decentralization makes it more difficult to attribute responsibility for apparent censorship, a topic this report explores in detail by examining how censorship functions in these online encyclopedias.

In addition to exploring the difficulties in identifying censorship, this post will also lay out the research methodology of the article matching, some of the initial results, and the next steps to be taken in the coming months as we continue to analyze this data and expand the project. As the data for this project has just been collected, this report is more data dump than fully composed critical analysis (see the final section for future avenues of research and thinking), but hopefully it serves as an introduction to the sorts of quality data available in this project.

Tables with lists of articles that are protected/locked on Wikipedia, Baidu Baike, and Hudong as well as a list of articles that are found on Wikipedia China but not found on the two baike are below. You can jump directly to them by clicking the links in the previous sentence, but as the data is still preliminary and has many limitations, one might be best served reading through the following sections.

Baike: Chinese encyclopedias

As William Chang of Baidu foretold, mainland Chinese netizens have gravitated toward local products such as Hudong and Baidu Baike, leaving Wikipedia China to be edited and read primarily by users in Taiwan, Hong Kong, and the rest of the Chinese diaspora. Today, in terms of raw visitors and article count, Hudong and Baidu Baike dwarf Wikipedia China, which has roughly 700,000 articles versus over 5 million in each of the two baike. While China’s sporadic blocking of access to Wikipedia at various points over the past ten years has certainly been a factor in limiting Wikipedia China’s growth among mainland users, Baidu and Hudong’s dominance may owe more to Baidu’s entrenched position as the dominant search engine in China (thus allowing for cross-site “partnerships” and synergies[1]) and to Hudong’s bevy of features built into its custom wiki and social networking platform.

Though the baike are incredible sources of information on China, they have been dogged by allegations that they have liberally “borrowed” content from other websites, including Wikipedia. That is not a damnable offense in and of itself, since Wikipedia’s content is free to share and re-use, but the two baike are for-profit and inform users that any content contributed is the property of Hudong and Baidu. In the past, Baidu was noted as the worst offender, plagiarizing without credit not only from Wikipedia but also from Hudong.[2] Though the goal of this project is not to analyze these plagiarism claims or to quantify the amount of shared content among the three encyclopedias, the data generated from the matching of articles, as explained in the methodology section below, would allow someone to easily perform this sort of follow-up analysis once the data from this project is cleaned up and released.

The difficulties of identifying censorship in an environment with distributed oversight

This project began with a question: everyone “knows” Hudong and Baidu Baike, like all Chinese websites, have to restrict certain kinds of content on their websites, but is it possible to empirically prove that censorship is taking place on the sites? Thinking about different ways of testing our assumption with the publicly available data on the websites drove this project.

First, what would we consider signs of censorship on Hudong and Baidu Baike? The most obvious would be the lack of articles on topics that are known to be notable. Thus, using Wikipedia China as a control, we can propose that if an article on, say, 上海帮 (The Shanghai Gang) exists on Wikipedia China, then barring censorship it should exist on Hudong and Baidu Baike, especially since Hudong and Baidu Baike have a much larger library of entries. However, attributing the lack of an entry to censorship is not a perfect science, since Wikipedia China itself isn’t a perfect control: though entries in Wikipedia China are assumed to be of interest to Hudong and Baidu Baike users, and thus should have articles in those encyclopedias, one should keep in mind that Wikipedia China does tend to have a Taiwanese bent due to its userbase. Still, using missing articles as a potential indicator of possible censorship, especially if the missing article is a long one, is a reasonable start.
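The missing-article indicator can be sketched in a few lines of Python. This is only an illustration: it assumes the matching step has already recorded, for each Wikipedia China title, its length and whether a counterpart was found on each baike. The titles, lengths, and the `missing_on` helper are placeholders invented here, not project data.

```python
from typing import Dict, List, Tuple

# Hypothetical match results: title -> (Wikipedia length in characters,
# found on Hudong?, found on Baidu Baike?). Placeholder values only.
MATCHES: Dict[str, Tuple[int, bool, bool]] = {
    "title_a": (12000, True, True),
    "title_b": (8000, False, True),
    "title_c": (300, False, False),
    "title_d": (15000, False, False),
}

HUDONG, BAIDU = 1, 2  # positions of the "found" flags in each tuple


def missing_on(matches: Dict[str, Tuple[int, bool, bool]],
               site_index: int) -> List[Tuple[str, int]]:
    """Return (title, wiki_length) for titles absent from one baike,
    longest Wikipedia article first, since a long article with no
    counterpart is the more striking gap."""
    missing = [(title, row[0]) for title, row in matches.items()
               if not row[site_index]]
    return sorted(missing, key=lambda item: item[1], reverse=True)


print("Missing on Hudong:", missing_on(MATCHES, HUDONG))
print("Missing on Baidu Baike:", missing_on(MATCHES, BAIDU))
```

Sorting by Wikipedia length is one simple way to surface the gaps most worth a manual look; it says nothing by itself about why an article is missing.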

Second, comparisons could also be made between the lengths of corresponding articles in the encyclopedias. An article might exist on all three services, but the baike versions might be drastically shorter than their Wikipedia counterpart. For instance, the main body of the Wikipedia entry for 艾未未 (Ai Weiwei) is over 20,000 characters long (spaces removed), while the Baidu entry for him clocks in at 2,000 characters and the Hudong one at 3,500. This is despite the fact that the Baidu articles sampled in this project are on the whole longer than Wikipedia’s: the mean Wikipedia China article length is 8,174 characters (median: 4,207) while Baidu’s is 12,679 (median: 9,029). Discrepancies of the sort in the Ai Weiwei article might simply be a case of greater interest in the topic outside mainland China than within, or, again, they might be another potential indicator of censorship.
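To put the Ai Weiwei figures side by side with the sample-wide means, here is a small Python sketch of the ratio involved. It uses the rounded character counts quoted above (the Wikipedia entry is “over 20,000,” so the true ratios are slightly lower); the `length_ratio` helper is a name made up for this example.

```python
def length_ratio(baike_chars: int, wikipedia_chars: int) -> float:
    """Baike article length as a fraction of the Wikipedia article length."""
    if wikipedia_chars <= 0:
        raise ValueError("Wikipedia length must be positive")
    return baike_chars / wikipedia_chars


# Approximate character counts quoted in the text for the Ai Weiwei entries.
WIKIPEDIA_CHARS = 20_000  # "over 20,000", treated here as a round figure
ratios = {
    "Baidu Baike": length_ratio(2_000, WIKIPEDIA_CHARS),
    "Hudong": length_ratio(3_500, WIKIPEDIA_CHARS),
}

# For contrast: mean Baidu length vs. mean Wikipedia China length in the sample.
sample_mean_ratio = length_ratio(12_679, 8_174)

for site, ratio in ratios.items():
    print(f"{site}: {ratio:.1%} of the Wikipedia length")
print(f"Sample means, Baidu vs. Wikipedia: {sample_mean_ratio:.2f}x")
```

The contrast is the point: across the sample Baidu entries run longer than Wikipedia’s on average, while for this one topic the Baidu entry is about a tenth of the Wikipedia length.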

Third, some especially sensitive or controversial articles cannot be edited except by users with much higher privileges than ordinary members. Being unable to change an article is not in and of itself a sign of censorship; for instance, Wikipedia “protects” certain articles to prevent vandalism and pointless back-and-forth “edit wars.” Baidu and Hudong no doubt have similar intentions in mind as well, but what matters here is a matter of transparency (while Wikipedia publishes a list of all protected pages, as far as we can tell no corresponding list exists for Hudong and Baidu Baike) and of authority. Was it the choice of editors and users at Hudong and Baidu to classify certain articles as locked, or was the decision made higher up? Was there an open discussion of such matters or was a list handed down from somewhere above? Interviewing regular users of Baidu Baike and Hudong might provide insight into such questions, but for now, we have some data to start with.

Finally, there are more subtle ways to disrupt access to information, many of which we will no doubt uncover as we continue to sift through the data. One that we’ve noticed is the “failure” by Hudong and Baidu to redirect certain article titles the same way that Wikipedia does. For instance, a search for 艾神, a laudatory nickname meaning “God Ai” for Ai Weiwei, properly redirects to the Wikipedia article for him. Hudong and Baidu Baike don’t perform such re-directs. Again, whether this is a conscious decision or merely an inadvertent one cannot be answered by looking at this one example. However, by looking at such cases in the aggregate one might be able to make a more legitimate claim that something might be going on.

Many new media outlets such as Sina Weibo privilege users with the ability to generate the content that goes on the website–in essence, to be not only their own programmer or broadcaster but also the producer. However, because such websites host their users’ content, they are also in charge of regulating and ensuring that such content complies with all Chinese laws–regardless of how vague such regulations might be. Thus attributing censorship that takes place on these sites can be unclear–is it the government that mandated certain topics are off-limits or is it the company that restricts the content?–an intentional feature of the decentralized system of information control that Chinese authorities have developed.

Censorship is further distributed on the baike because now not only are users their own programmer and producer, but they also serve in an oversight capacity as an editor. Unlike on Wikipedia, users who aren’t registered can’t begin editing and creating articles, but for the most part, registered users can edit most general articles, and as they engage with the site longer, they achieve greater and greater levels of ability to edit and oversee the website. Thus, there are always at least three potential reasons for why an article doesn’t exist, an article is shorter, or an article is locked on Hudong or Baidu Baike: government entities, private companies, or users themselves. Judging whether these factors are genuine instances of governmental censorship or due to explainable, organic reasons can be quite tricky. Because of the multiple layers of oversight, what may appear to be outright censorship may be a less malicious (though no less pernicious) case of self-censorship.

Methodology

As mentioned, this project is a large-scale attempt to quickly and automatically match thousands of Wikipedia China articles with their corresponding Hudong and Baidu Baike entry. A script was developed which did just that, taking a keyword, locating the correct Wikipedia article and scraping the desired data, and then repeating the process for Hudong and Baidu, respectively, before moving on to the next keyword–essentially relying on Hudong and Baidu to perform the proper title matching or, if the title did not match exactly, redirect to the appropriate article.
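The loop described above might look roughly like the following Python sketch. The project’s actual script is not reproduced here, so everything in it is an assumption for illustration: the `match_keywords` and `stub_lookup` names, the shape of the scraped record, and the dictionary-backed stand-ins that replace real network requests so the example runs offline.

```python
from typing import Callable, Dict, Iterable, List, Optional

Article = Dict[str, object]                   # scraped fields for one entry
Lookup = Callable[[str], Optional[Article]]   # keyword -> article, or None

SITE_ORDER = ("wikipedia", "hudong", "baidu")  # Wikipedia first, then the baike


def match_keywords(keywords: Iterable[str],
                   lookups: Dict[str, Lookup]) -> List[dict]:
    """For each keyword, query every site in order and collect one row."""
    rows = []
    for keyword in keywords:
        row: dict = {"keyword": keyword}
        for site in SITE_ORDER:
            # Each lookup stands in for "find or get redirected to the
            # article, then scrape it"; None means nothing was found.
            row[site] = lookups[site](keyword)
        rows.append(row)
    return rows


def stub_lookup(pages: Dict[str, Article]) -> Lookup:
    """Offline stand-in for a real fetcher, backed by a dict."""
    def lookup(keyword: str) -> Optional[Article]:
        return pages.get(keyword)
    return lookup


# Placeholder data only; none of these values come from the project.
lookups = {
    "wikipedia": stub_lookup({"kw1": {"title": "kw1", "length": 5000},
                              "kw2": {"title": "kw2", "length": 900}}),
    "hudong": stub_lookup({"kw1": {"title": "kw1", "length": 4200}}),
    "baidu": stub_lookup({"kw1": {"title": "kw1 (alt)", "length": 6100}}),
}

for row in match_keywords(["kw1", "kw2"], lookups):
    found = [site for site in SITE_ORDER if row[site] is not None]
    print(row["keyword"], "found on:", found)
```

Passing the lookups in as plain functions keeps the matching logic separate from the site-specific scraping, which is where most of the real complexity (redirects, search fallbacks) would live.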

Zhichun Wang, Juanzi Li, Zhigang Wang, and others at the Computer Science Department at Tsinghua University have already made great strides in “knowledge linking” between English Wikipedia and Hudong/Baidu Baike, and their approach to matching articles across different languages using machine learning to read semantic information is well beyond the scope of this report. Techniques from their work may be employed in the future, though the title matching approach currently employed for this project is already a fairly robust solution, especially since we are dealing with only one language. In “Cross-lingual knowledge linking across wiki knowledge bases,” Wang et al. reported that simple title matching gave them a precision in excess of 99% even when having to translate from Chinese to English; however, the recall rate was an atrocious 32%.

The data collected for this project still has to be evaluated properly, but thus far results seem very promising. Out of 5,143 keywords tested, Hudong and Baidu successfully found or redirected us to an appropriate article 3,539 and 3,600 times respectively, a recall rate of at least 68%. That number is likely much higher once we account for the fact that the script also attempted to use search results to identify and match instances when the keyword was found in the returned search results’ snippets (the article titles and summaries).

Of these 3,500+ times, the exact title (meaning a character-for-character match) was matched between Wikipedia and Hudong 2,037 times, and between Wikipedia and Baidu Baike 2,080 times. If one assumes that an exact match in the article title is a match in the article content as well, then the precision rating for matched articles is roughly 58% already, and that’s not yet taking into account all the article titles which are only slightly different between the services, either due to different styling conventions (e.g. Wikipedia’s article on national universities is titled with traditional characters [國立大學] while Hudong and Baidu both use simplified [国立大学]) or other minor variations. Overall, the matching procedure, at least for the terms tested thus far, seems more than acceptable for the goals of this project, though of course attempts will be made to verify the precision and increase the recall rate.
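The percentages in the last two paragraphs follow directly from the counts given. As a quick check, this Python snippet recomputes them; the counts are the ones reported above, while the variable names are just labels chosen for this example.

```python
KEYWORDS_TESTED = 5143

# Keywords for which each baike returned or redirected to an article.
FOUND = {"Hudong": 3539, "Baidu Baike": 3600}

# Of those, how many had a title identical to the Wikipedia title.
EXACT_TITLE = {"Hudong": 2037, "Baidu Baike": 2080}

found_share = {site: FOUND[site] / KEYWORDS_TESTED for site in FOUND}
exact_share = {site: EXACT_TITLE[site] / FOUND[site] for site in FOUND}

for site in FOUND:
    print(f"{site}: found {found_share[site]:.1%} of keywords; "
          f"{exact_share[site]:.1%} of those matched the title exactly")
```

The exact-title share is a floor, not an estimate of precision: near-identical titles (such as traditional versus simplified characters) are counted as non-matches here.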

As a test of the script and the methodology, a sample of 5,143 keywords involving a mix of topics known to be sensitive and those known to be popular was generated from five different sources: the titles of articles protected by Wikipedia (282); article titles from the list of Wikipedia China articles on GreatFire.org’s watchlist of censored Wikipedia pages (489); article titles of the top 1000 most viewed articles on Wikipedia China in April 2013 (995 after a few Wikipedia meta-pages were removed); the titles of articles that generated more than a total of 10 combined views on August 1, 0:00-1:00 and 12:00-13:00 (3,470); and finally, the only non-Wikipedia source, a list of keywords from the website Blocked on Weibo previously confirmed to have been prevented from returning search results on Sina Weibo (840). Some of the keywords in the final sample appeared on multiple lists, and the overlap of four of the sources is shown below (the August 1 source was dropped because 5-way Venn diagrams are not as pretty…).



4-way Venn diagram showing overlap of sources used for keywords (not shown: keywords taken from most viewed Wikipedia China articles on Aug 1)
generated with Venny
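The amount of overlap among the sources can be read off the reported counts, and the merging itself is a plain set union. The Python sketch below assumes that 5,143 is the deduplicated total across all five lists; the dictionary keys are labels made up here, and the `toy_sources` sets are placeholders standing in for the real keyword lists.

```python
from itertools import combinations

# Source sizes as reported in the text.
SOURCE_SIZES = {
    "wikipedia_protected": 282,
    "greatfire_watchlist": 489,
    "top_viewed_april_2013": 995,
    "aug_1_hourly_views": 3470,
    "blocked_on_weibo": 840,
}
UNIQUE_KEYWORDS = 5143  # assumed to be the deduplicated total

total_listed = sum(SOURCE_SIZES.values())
# A keyword on three lists contributes two repeat listings, and so on.
repeat_listings = total_listed - UNIQUE_KEYWORDS
print(f"{total_listed} listings, {repeat_listings} of them repeats")

# Toy keyword lists showing the union and pairwise overlap computation.
toy_sources = {
    "list_a": {"k1", "k2", "k3"},
    "list_b": {"k2", "k4"},
    "list_c": {"k3", "k2", "k5"},
}
sample = set().union(*toy_sources.values())
pairwise_overlap = {
    (a, b): len(toy_sources[a] & toy_sources[b])
    for a, b in combinations(toy_sources, 2)
}
print(len(sample), pairwise_overlap)
```

Pairwise intersections like these are the quantities a Venn diagram visualizes; with the real lists, the same computation would also cover the August 1 source that the figure omits.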

The following variables were scraped from all three encyclopedias: length of article, URL, title of the article (which can differ from the URL), locked/protected status, last modified date, and a number of other meta variables. From Hudong and Baidu, other variables included the number of “likes” (that is, positive votes that the page was “helpful”) and number of edits. Variables unique to Baidu were pageviews for the article, and whether or not the page contained multiple entries (unlike Wikipedia, Baidu Baike doesn’t serve up “disambiguation” pages when the user searches for a broad term which could refer to multiple topics, and instead serves up all the entries on a single page; see for instance this entry for “Jordan” which includes the basketball player Michael Jordan as well as the country Jordan). Variables unique to Hudong were the number of editors and the error message given when trying to edit a locked page, which was one of three: 对不起,您的题目中含有不当内容,请检查? (“Sorry, the topic contains inappropriate content, please check?”); 您不能编辑敏感词条 (“You are not allowed to edit sensitive entries”); or 您不能编辑普通敏感词条 (“You are not allowed to edit general sensitive entries”).
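One way to hold these variables is a single record type with the site-specific fields left optional. The Python sketch below is a guess at a convenient layout, not the project’s actual schema: the class name, field names, and example values (including the `example.invalid` URL) are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapedEntry:
    # Collected from all three encyclopedias.
    site: str
    url: str
    title: str              # may differ from the title implied by the URL
    length: int             # article length in characters
    locked: bool            # locked/protected status
    last_modified: str
    # Hudong and Baidu Baike only.
    likes: Optional[int] = None
    edits: Optional[int] = None
    # Baidu Baike only.
    pageviews: Optional[int] = None
    multiple_entries: Optional[bool] = None
    # Hudong only.
    editors: Optional[int] = None
    lock_message: Optional[str] = None


# Placeholder record; the values are not real scraped data.
example = ScrapedEntry(
    site="hudong",
    url="http://example.invalid/entry",
    title="placeholder title",
    length=3500,
    locked=True,
    last_modified="2013-08-01",
    likes=0,
    edits=0,
    editors=0,
    lock_message="您不能编辑敏感词条",
)
print(example.site, example.locked, example.pageviews)
```

Leaving the site-specific fields as `None` by default makes it explicit which values were never available for a given site, as opposed to values that were scraped and happened to be zero.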

Protected, missing, and censored(?) content

As mentioned previously, an article that is protected or locked isn’t necessarily being censored: Wikipedia often uses article protection as a means to prevent malicious behavior. However, the locking down of articles can itself be abused and used in a malicious manner, especially if the decision to do so is non-transparent, arbitrary, and/or doesn’t reflect the sentiment of the group. In such cases, Wikipedia fortunately has active “talk pages” which allow all users to debate and contest such issues. Baidu Baike also has discussion pages which allow users to do something similar: for instance, here’s one for the entry on the United Kingdom wherein users quibble about whether or not England can lay claim to being the place where the first “bourgeois revolution” took place (the consensus was no, the Dutch Revolt came before it). However, many protected Baidu Baike entries contain no such talk pages. For instance, trying to reach Xi Jinping’s talk page returns an error message. Hudong eschews separate talk pages in favor of comments at the bottom of each entry. However, again, oftentimes for sensitive and/or protected entries, there is no opportunity to leave a comment; the comments box simply doesn’t appear.

Furthermore, while the list of pages protected by Wikipedia is published by the site, as far as we can tell, no such list exists for Hudong or Baidu Baike. Thus, even with the caveats noted above that the articles in the following three tables don’t necessarily show that censorship is taking place in those entries, this exercise still seems like a useful one if only to make such information publicly available. Again, proper analysis is yet to be done on what is protected/locked and what sorts of patterns exist in the three different encyclopedias’ approaches to article protection.

Note: the machine translations come from Google Translate. Those that have been corrected or verified have notes or a dot in the third column; if the third column is empty, the translation is still to be confirmed–an ongoing part of this project. Please do not disseminate unconfirmed machine translations without first verifying that they are correct.

Protected articles on Wikipedia China

| Article title + Wikipedia link | Machine translation | Human translation / notes (if field is blank, translation still to be confirmed) |
| --- | --- | --- |
| 民主黨_(香港) | Democratic Party (Hong Kong) | . |
| 庾澄慶 | Harlem Yu | Taiwanese pop star |
| 東方報業集團 | Oriental Press Group | Hong Kong publisher of newspapers: Oriental Daily News |
| 丁部領 | Dinh Bo Linh | Vietnamese emperor |
| 蒼井空 | Sora Aoi | Japanese porn star |
| 蔡煌瑯 | Tsai Huang-lang | Taiwanese politician |
| 首页 | homepage | . |
| 蔡英文 | Tsai Ing-wen | |
| 新店救護車阻擋事件 | Store ambulance blocking event | |
| 草榴社区 | Grass garnet Community | Caoliu online forum |
| 丁文雄 | Dingwen Xiong | |
| AV女優 | AV Actress | . |
| 南陽郡 | Nanyang County | |
| 林書豪 | Jeremy Lin | basketball player |
| 阮文雄 | Ruanwen Xiong | |
| 星野亞希 | Aki Hoshino | |
| 濱田翔子_(演員) | Bin Shoko (Actor) | |
| 佐佐木希 | Sasaki Nozomi | |
| 朝河兰 | North River Portland | |
| 濱崎步 | Ayumi Hamasaki | Japanese pop singer |
| 原紗央莉 | Original Saori | |
| 小澤瑪麗亞 | Maria Ozawa | |
| Angelababy | Angelababy | |
| 佐山愛 | Ai Sayama | |
| 郭書瑤 | Guo Shu-Yao | |
| 大纪元时报 | The Epoch Times | Falun Gong-connected publication |
| 國立大學 | national university | . |
| 公立大學 | Public universities | . |
| 全球定位系统 | Global Positioning System | |
| 国父_(罗马帝国) | Pater Patriae (Roman Empire) | “Father of the Country” honorific |
| 潘佩珠 | Phan | |
| 哥德巴赫猜想 | Goldbach’s conjecture | |
| 彭淮南 | Perng | |
| 吳淑珍 | Wu Shu-chen | |
| 第十四世达赖喇嘛 | Fourteenth Dalai Lama | |
| 宋教仁 | Sung | |
| 司马南 | Sima Nan | |
| 天海翼 | Day sea wing | |
| 性高潮 | Orgasm | |
| 蔡依林 | Jolin | |
| 满族 | Manchu | |
| U-KISS | U-KISS | |
| 维基百科 | Wikipedia | |
| 印度神油 | Indian god oil | |
| 江泽民 | Jiang | |
| 东京热 | Tokyo Hot | |
| 正覺同修會 | Ching Practitioners Association | |
| 六合彩 | LOTTO | |
| 越南政黨列表 | Vietnamese party list | |
| | Dry | |
| 朱明 | Zhu | |
| 中华人民共和国行政区划 | People’s Republic of administrative divisions | |
| 齡記書店 | Ling Kee | |
| 2012年香港立法會選舉 | 2012 Hong Kong Legislative Council Election | |
| 真佛宗 | True Buddha School | |
| 海椰子 | Sea coconut | |
| 鶴佬陸上扒龍船 | Grilled Hoklo dragon boat onshore | |
| 馮光遠 | Peng says | |
| 蔡瑞月 | Tsai | |
| 衛星定位系統 | Satellite positioning system | |
| 大连理工大学 | Dalian University of Technology | |
| 拌麵 | Noodles | |
| MediaWiki | MediaWiki | |
| 利菁 | Lee Ching | |
| 狭义相对论 | Special Relativity | |
| 南京大學校友列表 | Nanjing University Alumni List | |
| 李长春 | Li Changchun | |
| 有栖川宮 | Arisugawanomiya | |
| 郑州加州工业城 | Zhengzhou City of Industry, California | |
| 莫莉花 | Molly Flowers | |
| 曾偉恩 | Zeng Wayne | |
| 捕捉、絕育、釋放 | Trap-Neuter-Release | |
| 自慰 | Masturbation | |
| 顺阳范氏 | Shun Yang Fan | |
| 港西镇_(崇明县) | Hong Kong West Town (Chongming County) | |
| 陳炳 | Chen Ping | |
| 發正念 | Hair Mindfulness | |
| 許瑜真 | Xu Yu True | |
| 十二年國教學生研討會 | Twelve national seminar teaching students | |
| 任天堂溥天 | Nintendo Popteam | |
| 南阳范氏 | Nanyang Fan | |
| 阿坎巴羅雕像 | Aquin Barrow Statue | |
| 伊卡黑石 | Ica Blackstone | |
| 瘂弦 | Ya Xian | |
| 呂泉生 | Lv Quansheng | |
| 中国大陆 | Chinese mainland | |
| 丁先皇 | Ding Xianhuang | |
| 天主教香港教區 | Catholic Diocese of Hong Kong | |
| 蔡衍明 | Tsai | |
| 小游戏 | Small game | |
| 曹昂 | Cao Ang | |
| 曹鑠 | Cao Shuo | |
| 加拉帕戈斯象龜 | Galapagos tortoises | |
| 北西摩島 | North Seymour Island | |
| Hao123网址之家 | Hao123 website | |
| 中华人民共和国政府认定的邪教组织列表 | PRC Government has identified a list of cult | |
| 被政府認定為邪教的團體列表 | Identified by the government as a cult group list | |
| 馬三家女子勞教所 | Masanjia Women’s Forced Labor Camp | |
| 神韵艺术团 | Shen Yun Performing Arts | |
| 多面体 | Polyhedron | |
| 巴丹群島 | Batanes Islands | |
| 蘇家屯事件 | Sujiatun | |
| 薄瓜瓜 | Bo Guagua | |
| 拳王_(電視劇) | Muhammad (TV series) | |
| 朱雪璋 | Zhuxue Zhang | |
| 黑魔女學園 | Black witch academy | |
| 缺宅男女 | Lack of house men and women | |
| 楊怡 | Tavia | |
| 伏見宮 | Fushimi Palace | |
| 对毛泽东的评价 | Evaluation of Mao | |
| 中国共产党中央委员会主席毛泽东同志支持美国黑人抗暴斗争的声明 | Chinese Communist Party Central Committee with Comrade Mao Zedong declared the struggle to support African-American uprising | |
| 蟾蜍 | Toad | |
| 闲院宫载仁亲王 | Court Palace, Prince Akishino idle load | |
| 皇室 | Imperial family | |
| 當旺爸爸 | When the busy dad | |
| 李克强 | Li Keqiang | |
| 幾米 | Jimmy | |
| 维权运动 | Rights movement | |
| 骨部 | Bony part | |
| 行政院環境保護署 | Environmental Protection Agency | |
| 南華足球隊 | South China Football Team | |
| 鍾嘉欣 | Linda | |
| 行政院 | Executive Yuan | |
| 朝鲜战争 | Korean War | |
| 蔡宇傑 | Cai Yujie | |
| 越南共和国 | Republic of Vietnam | |
| 陳智雄 | Chen Zhixiong | |
| 陈良宇 | Chen Liangyu | |
| 單戀雙城 | Unrequited love Twins | |
| 交通部中央氣象局 | Central Weather Bureau | |
| 中華民國教育部 | Republic of China Ministry of Education | |
| 中華民國交通部 | Republic of China Ministry of Transportation | |
| 巨輪 | Large ship | |
| 九江十二坊 | Jiujiang twelve Square | |
| 香港獨立媒體 | Hong Kong’s Independent Media | |
| 方志敏 | Fang Zhimin | |
| 梁振英 | Leung Chun-ying | |
| 表情符号 | Emoticons | |
| 雷霆掃毒 | Thunder antidrug | |
| 摘星之旅 | Reaching for the Stars Tour | |
| Yummy_Yummy | Yummy Yummy | |
| 宣萱 | Jessic | |
