By Jason Q. Ng
In 2008, Baidu’s chief scientist William Chang said, “There’s, in fact, no reason for China to use Wikipedia . . . It’s very natural for China to make its own products.” Today Hudong (baike.com) and Baidu Baike (baike.baidu.com) greatly eclipse the Chinese-language version of Wikipedia despite (or because of) the censorship known to take place on the sites. However, identifying outright instances or patterns of censorship can be difficult because the content is (mostly) user-generated and user-overseen. Instead, this project attempts a large-scale comparison of the three services, matching thousands of Chinese-language Wikipedia articles with their in-China counterparts in order to identify the “content gaps” in the two baike (Chinese for “encyclopedia,” which we use to refer to Hudong’s and Baidu’s online encyclopedias). Censorship, or at the very least anomalies in the generation of content, might be identified by articles that don’t exist, by “protected” articles that are not editable by regular users, and by articles that are much shorter than those on Wikipedia China. The word “might” is emphasized because of the distributed oversight of these online encyclopedias, where not only governments but also companies and users get to play the role of content gatekeeper. This decentralization makes it harder to attribute responsibility for apparent censorship, a topic this report explores in detail by examining how such oversight functions in these online encyclopedias.
In addition to exploring the difficulties in identifying censorship, this post will also lay out the research methodology of the article matching, some of the initial results, and the next steps to be taken in the coming months as we continue to analyze this data and expand the project. As the data for this project has just been collected, this report is more data dump than fully composed critical analysis (see the final section for future avenues of research and thinking), but hopefully it serves as an introduction to the sorts of quality data available in this project.
Tables with lists of articles that are protected/locked on Wikipedia, Baidu Baike, and Hudong as well as a list of articles that are found on Wikipedia China but not found on the two baike are below. You can jump directly to them by clicking the links in the previous sentence, but as the data is still preliminary and has many limitations, one might be best served reading through the following sections.
Baike: Chinese encyclopedias
As William Chang of Baidu foretold, mainland Chinese netizens have gravitated toward local products such as Hudong and Baidu Baike, leaving Wikipedia China to be edited and read primarily by users in Taiwan, Hong Kong, and the rest of the Chinese diaspora. Today, in terms of raw visitors and article count, Hudong and Baidu Baike dwarf Wikipedia China, which has roughly 700,000 articles versus over 5 million in each of the two baike. While China’s sporadic blocking of access to Wikipedia at various points over the past ten years has certainly been a factor in limiting Wikipedia China’s growth among mainland users, Baidu and Hudong’s dominance may be credited more to Baidu’s entrenched position as the dominant search engine in China (thus allowing for cross-site “partnerships” and synergies1) and Hudong’s bevy of features built into its custom wiki and social networking platform.
Though the baike are incredible sources of information on China, they have been dogged by allegations that they have liberally “borrowed” content from other websites, including Wikipedia. This is not a damnable offense in and of itself, since Wikipedia’s content is free to share and re-use, but the two baike are for-profit and inform users that any content contributed is the property of Hudong and Baidu. In the past, Baidu was noted as the worst offender, plagiarizing without credit not only from Wikipedia but also from Hudong.2 Though the goal of this project is not to analyze these plagiarism claims or to quantify the amount of shared content among the three encyclopedias, the data generated from the matching of articles, as explained in the methodology section below, would allow someone to easily perform this sort of follow-up analysis once the data from this project is cleaned up and released.
The difficulties of identifying censorship in an environment with distributed oversight
This project began with a question: everyone “knows” Hudong and Baidu Baike, like all Chinese websites, have to restrict certain kinds of content on their websites, but is it possible to empirically prove that censorship is taking place on the sites? Thinking about different ways of testing our assumption with the publicly available data on the websites drove this project.
First, what would we consider signs of censorship on Hudong and Baidu Baike? The most obvious would be the lack of certain articles on topics that are known to be notable. Thus, using Wikipedia China as a control, we can propose that if an article on, say, 上海帮 (The Shanghai Gang) exists on Wikipedia China, then, barring censorship, it should exist on Hudong and Baidu Baike, especially since Hudong and Baidu Baike have a much larger library of entries. However, attributing the lack of an entry to censorship is not a perfect science since Wikipedia China itself isn’t a perfect control: though entries in Wikipedia China are assumed to be of interest to Hudong and Baidu Baike users, and thus should have articles in those encyclopedias, one should keep in mind that Wikipedia China does tend to have a Taiwanese bent due to its userbase. Still, using missing articles as a potential indicator of censorship, especially if the article that is missing is a long one, is a reasonable start.
Second, comparisons could also be made between the lengths of the articles in the three encyclopedias. An article might exist on all three services, but the baike versions might be drastically shorter than their Wikipedia counterpart. For instance, the main body of the Wikipedia entry for 艾未未 (Ai Weiwei) is over 20,000 characters long (spaces removed) while the Baidu entry for him clocks in at 2,000 characters and the Hudong one at 3,500. This is despite the fact that the Baidu articles sampled in this project are on the whole longer than Wikipedia’s: the mean Wikipedia China article length is 8,174 characters (median: 4,207) while Baidu’s is 12,679 (median: 9,029). Discrepancies of the sort in the Ai Weiwei article might simply be a case of greater interest in the topic outside mainland China than within, or, again, they might be another potential indicator of censorship.
Third, some especially sensitive or controversial articles are unable to be edited except by users with much higher privileges than ordinary members. Being unable to change an article is not in and of itself a sign of censorship; for instance, Wikipedia “protects” certain articles to prevent vandalism and pointless back-and-forth “edit wars.” Baidu and Hudong no doubt have similar intentions in mind as well, but what matters here is a matter of transparency (while Wikipedia publishes a list of all protected pages, as far as we can tell, no such corresponding list exists for Hudong and Baidu Baike) and of authority. Was it the choice of editors and users at Hudong and Baidu to classify certain articles as locked, or was the decision made higher up? Was there an open discussion of such matters or was a list handed down from somewhere above? Interviewing regular users of Baidu Baike and Hudong might provide insight into such questions, but for now, we have some data to start with.
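The length comparison described here can be sketched in a few lines of Python. This is only an illustration, not the script used for the project: the function names and the 0.25 cutoff are assumptions made for the example, and the inputs are the rounded Ai Weiwei character counts quoted above.

```python
import re

def body_length(text: str) -> int:
    """Count the characters in an article body with all whitespace removed."""
    return len(re.sub(r"\s+", "", text))

def length_ratio(baike_len: int, wiki_len: int) -> float:
    """Return the baike article length as a fraction of the Wikipedia length."""
    if wiki_len <= 0:
        raise ValueError("Wikipedia article length must be positive")
    return baike_len / wiki_len

def is_much_shorter(baike_len: int, wiki_len: int, threshold: float = 0.25) -> bool:
    """Flag a baike article that is far shorter than its Wikipedia counterpart.

    The 0.25 cutoff is a hypothetical value chosen for illustration only.
    """
    return length_ratio(baike_len, wiki_len) < threshold

# Rounded Ai Weiwei figures from the text (characters, spaces removed)
wiki, baidu, hudong = 20_000, 2_000, 3_500
print(length_ratio(baidu, wiki), length_ratio(hudong, wiki))
print(is_much_shorter(baidu, wiki), is_much_shorter(hudong, wiki))
```

A ratio is used rather than a raw difference so that short and long Wikipedia articles can be compared on the same scale; any real cutoff would need to be tuned against the full sample.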
Finally, there are more subtle ways to disrupt access to information, many of which we will no doubt uncover as we continue to sift through the data. One that we’ve noticed is the “failure” by Hudong and Baidu to redirect certain article titles the same way that Wikipedia does. For instance, a search for 艾神, a laudatory nickname meaning “God Ai” for Ai Weiwei, properly redirects to the Wikipedia article for him. Hudong and Baidu Baike don’t perform such re-directs. Again, whether this is a conscious decision or merely an inadvertent one cannot be answered by looking at this one example. However, by looking at such cases in the aggregate one might be able to make a more legitimate claim that something might be going on.
Many new media outlets such as Sina Weibo privilege users with the ability to generate the content that goes on the website–in essence, to be not only their own programmer or broadcaster but also the producer. However, because such websites host their users’ content, they are also in charge of regulating and ensuring that such content complies with all Chinese laws–regardless of how vague such regulations might be. Thus attributing censorship that takes place on these sites can be unclear–is it the government that mandated certain topics are off-limits or is it the company that restricts the content?–an intentional feature of the decentralized system of information control that Chinese authorities have developed.
Censorship is further distributed on the baike because now not only are users their own programmer and producer, but they also serve in an oversight capacity as an editor. Unlike on Wikipedia, users who aren’t registered can’t begin editing and creating articles, but for the most part, registered users can edit most general articles, and as they engage with the site longer, they achieve greater and greater levels of ability to edit and oversee the website. Thus, there are always at least three potential sources for why an article doesn’t exist, is shorter, or is locked on Hudong or Baidu Baike: government entities, private companies, or users themselves. Judging whether these cases are genuine instances of governmental censorship or due to explainable, organic reasons can be quite tricky. Because of the multiple layers of oversight, what may appear to be outright censorship may be a less malicious (though no less pernicious) case of self-censorship.
Methodology
As mentioned, this project is a large-scale attempt to quickly and automatically match thousands of Wikipedia China articles with their corresponding Hudong and Baidu Baike entry. A script was developed which did just that, taking a keyword, locating the correct Wikipedia article and scraping the desired data, and then repeating the process for Hudong and Baidu, respectively, before moving on to the next keyword–essentially relying on Hudong and Baidu to perform the proper title matching or, if the title did not match exactly, redirect to the appropriate article.
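The keyword loop described above might look roughly like the following. This is a sketch under stated assumptions rather than the project's script: `match_keywords`, the `Lookup` type, and the toy dictionaries standing in for the real scrapers are all hypothetical, and no network requests are made.

```python
from typing import Callable, Dict, Iterable, List, Optional

# A lookup takes a keyword and returns the title of the article the site
# served (directly or via redirect), or None when nothing was found.
Lookup = Callable[[str], Optional[str]]

def match_keywords(keywords: Iterable[str], sites: Dict[str, Lookup]) -> List[dict]:
    """For each keyword, query every site in turn and record what came back."""
    rows = []
    for kw in keywords:
        row = {"keyword": kw}
        for name, lookup in sites.items():
            row[name] = lookup(kw)  # the site itself does the title matching
        rows.append(row)
    return rows

# Toy stand-ins for real scrapers (hypothetical data, for illustration only)
fake_wikipedia = {"國立大學": "國立大學", "艾神": "艾未未"}
fake_baidu = {"國立大學": "国立大学"}

rows = match_keywords(
    ["國立大學", "艾神"],
    {"wikipedia": fake_wikipedia.get, "baidu": fake_baidu.get},
)
for row in rows:
    print(row)
```

Passing the lookups in as functions keeps the loop independent of how each site is scraped, so a real scraper could be swapped in for each toy dictionary.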
Zhichun Wang, Juanzi Li, Zhigang Wang, and others at the Computer Science Department at Tsinghua University have already made great strides in “knowledge linking” between English Wikipedia and Hudong/Baidu Baike, and their approach to matching articles across different languages using machine learning to read semantic information is well beyond the scope of this report. Techniques from their work may be employed in the future, though the title matching approach employed for this project is already a fairly robust solution as is, especially since we are dealing with only one language. In “Cross-lingual knowledge linking across wiki knowledge bases,” Wang et al. reported that simple title matching gave them a precision in excess of 99% even when having to translate from Chinese to English; however, the recall rate was an atrocious 32%.
The data collected for this project still has to be evaluated properly, but thus far results seem very promising. Out of 5,143 keywords tested, Hudong and Baidu each successfully found or redirected us to an appropriate article 3,539 and 3,600 times respectively, a recall rate of at least 68%. That number is likely much higher once we account for the fact that the script also attempted to use search results to identify and match instances when the keyword was found in the returned search results’ snippets (the article titles and summaries).
Of these 3,500+ times, the exact title (meaning a character-for-character match) was matched between Wikipedia and Hudong 2,037 times, and between Wikipedia and Baidu Baike 2,080 times. If one assumes that an exact match in the article title is a match in the article content as well, then the precision for matched articles is already about 58%, and that’s not yet taking into account all the article titles which are only slightly different between the services, either due to different styling conventions (e.g. Wikipedia’s article on national universities is titled with traditional characters [國立大學] while Hudong and Baidu both use simplified [国立大学]) or other minor variations. Overall, the matching procedure, at least for the terms tested thus far, seems more than acceptable for the goals of this project, though of course attempts will be made to verify the precision and increase the recall rate.
As a test of the script and the methodology, a sample of 5,143 keywords involving a mix of topics known to be sensitive and those known to be popular was generated from five different sources: the titles of articles protected by Wikipedia (282); article titles from the list of Wikipedia China articles on GreatFire.org’s watchlist of censored Wikipedia pages (489); article titles of the top 1000 most viewed articles on Wikipedia China in April 2013 (995 after a few Wikipedia meta-pages were removed); the titles of articles that generated more than a total of 10 combined views on August 1, 0:00-1:00 and 12:00-13:00 (3,470); and finally, the only non-Wikipedia source, a list of keywords from the website Blocked on Weibo previously confirmed to have been prevented from returning search results on Sina Weibo (840). Some of the final sample of keywords appeared on multiple lists, and the overlap of four of the sources is shown below (the source of words from August 1 was dropped because 5-way Venn diagrams are not as pretty…).
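The percentages quoted in the last two paragraphs follow directly from the counts. The short calculation below reproduces them; the variable names are assumptions for the example, and “exact-title share” is simply the fraction of found articles whose title matched character for character.

```python
def rate(hits: int, total: int) -> float:
    """Return hits as a fraction of total."""
    return hits / total

KEYWORDS = 5143                                   # keywords tested
found = {"Hudong": 3539, "Baidu Baike": 3600}     # found or redirected
exact = {"Hudong": 2037, "Baidu Baike": 2080}     # exact title matches

for site in found:
    found_share = rate(found[site], KEYWORDS)
    exact_share = rate(exact[site], found[site])
    print(f"{site}: found {found_share:.1%}, exact-title share {exact_share:.1%}")
```

Both found-shares land between 68% and 70%, and both exact-title shares land just under 58%.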
4-way Venn diagram showing overlap of sources used for keywords (not shown: keywords taken from most viewed Wikipedia China articles on Aug 1)
generated with Venny
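Merging several keyword lists into one de-duplicated sample, and counting how many lists each keyword appears on, can be done with sets. The sketch below uses placeholder keywords (`kw1`, `kw2`, …) and made-up list memberships; only the source names echo the ones described above. For scale, the five list sizes given in the text sum to 6,076, so if the 5,143 figure counts unique keywords, 933 list entries repeat a keyword that also appears on another list.

```python
from collections import Counter
from itertools import chain
from typing import Dict, Iterable, Set, Tuple

def merge_sources(sources: Dict[str, Iterable[str]]) -> Tuple[Set[str], Counter]:
    """Return the unique keywords and, per keyword, how many lists contain it."""
    # De-duplicate within each list first so a keyword counts once per source.
    counts = Counter(chain.from_iterable(set(words) for words in sources.values()))
    return set(counts), counts

# Placeholder data: the memberships below are invented for illustration
sources = {
    "wikipedia_protected": ["kw1", "kw2", "kw3"],
    "greatfire": ["kw1", "kw4"],
    "blocked_on_weibo": ["kw4", "kw1", "kw5"],
}
unique, counts = merge_sources(sources)
shared = {kw for kw, n in counts.items() if n > 1}
total_entries = sum(len(words) for words in sources.values())
print(len(unique), total_entries - len(unique), sorted(shared))
```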
The following variables were scraped from all three encyclopedias: length of article, URL, title of the article (which can differ from the URL), locked/protected status, last modified date, and a number of other meta variables. From Hudong and Baidu, other variables included the number of “likes” (that is, positive votes that the page was “helpful”) and the number of edits. Variables unique to Baidu were pageviews for the article and whether or not the page contained multiple entries (unlike Wikipedia, Baidu Baike doesn’t serve up “disambiguation” pages when the user searches for a broad term which could refer to multiple topics, and instead serves up all the entries on a single page; see for instance this entry for “Jordan” which includes the basketball player Michael Jordan as well as the country Jordan). Variables unique to Hudong were the number of editors and the error message given when trying to edit a locked page, which was one of: 对不起,您的题目中含有不当内容,请检查? (“Sorry, the topic contains inappropriate content, please check?”); 您不能编辑敏感词条 (“You are not allowed to edit sensitive entries”); or 您不能编辑普通敏感词条 (“You are not allowed to edit general sensitive entries”).
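One way to hold these per-article variables is a single record type with optional fields for the site-specific values. The class below is a hypothetical layout, not the project's actual schema: the field names are assumptions, and the “other meta variables” mentioned above are left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ArticleRecord:
    """One scraped result for one keyword on one site (hypothetical schema)."""
    site: str                                 # "wikipedia", "hudong" or "baidu"
    keyword: str
    url: Optional[str] = None                 # None when no article was found
    title: Optional[str] = None               # may differ from the URL
    length: Optional[int] = None              # characters in the article
    locked: Optional[bool] = None             # protected/locked status
    last_modified: Optional[str] = None
    likes: Optional[int] = None               # Hudong and Baidu only
    edits: Optional[int] = None               # Hudong and Baidu only
    pageviews: Optional[int] = None           # Baidu only
    multiple_entries: Optional[bool] = None   # Baidu only
    editors: Optional[int] = None             # Hudong only
    lock_message: Optional[str] = None        # Hudong only

    @property
    def exists(self) -> bool:
        """True when the site returned an article for the keyword."""
        return self.url is not None

missing = ArticleRecord(site="hudong", keyword="example keyword")
locked = ArticleRecord(
    site="hudong",
    keyword="example keyword",
    url="http://example.invalid/entry",       # placeholder URL
    locked=True,
    lock_message="您不能编辑敏感词条",
)
print(missing.exists, locked.exists, locked.lock_message)
```

Using `None` for values a site does not expose keeps “not applicable” distinct from a genuine zero, which matters for fields such as likes or pageviews.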
Protected, missing, and censored(?) content
As mentioned previously, an article that is protected or locked doesn’t necessarily mean that it is being censored; Wikipedia often uses article protection as a means to prevent malicious behavior. However, the locking down of articles itself can be abused and used in a malicious manner, especially if the decision to do so is non-transparent, arbitrary, and/or doesn’t reflect the sentiment of the group. In such cases, Wikipedia fortunately has active “talk pages” which allow all users to debate and contest such issues. Baidu Baike also has discussion pages which allow users to do something similar: for instance, here’s one for the entry on the United Kingdom wherein users quibble about whether or not England can lay claim to being the place where the first “bourgeois revolution” took place (the consensus was no, the Dutch Revolt came before it). However, many protected Baidu Baike entries contain no such talk pages. For instance, trying to reach Xi Jinping’s talk page returns an error message. Hudong eschews separate talk pages in favor of comments at the bottom of each entry. However, again, oftentimes for sensitive and/or protected entries, there is no opportunity to leave a comment; the comments box simply doesn’t appear.
Furthermore, while the list of pages protected by Wikipedia is published by the site, as far as we can tell, no such list exists for Hudong or Baidu Baike. Thus, even with the caveats noted above that the articles in the following three tables don’t necessarily show that censorship is taking place in those entries, this exercise still seems like a useful one if only to make such information publicly available. Again, proper analysis is yet to be done on what is protected/locked and what sorts of patterns exist in the three different encyclopedias’ approaches to article protection.
Note: the machine translations come from Google Translate. Those that have been corrected or verified have notes or a dot in the third column; if the third column is empty, the translation is still to be confirmed–an ongoing part of this project. Please do not disseminate unconfirmed machine translations without first verifying that they are correct.
Protected articles on Wikipedia China
Article title + Wikipedia link | Machine translation | Human translation / notes (if field is blank, translation still to be confirmed)
民主黨_(香港) | Democratic Party (Hong Kong) | .
庾澄慶 | Harlem Yu | Taiwanese pop star
東方報業集團 | Oriental Press Group | Hong Kong publisher of newspapers: Oriental Daily News
丁部領 | Dinh Bo Linh | Vietnamese emperor
蒼井空 | Sora Aoi | Japanese porn star
蔡煌瑯 | Tsai Huang-lang | Taiwanese politician
首页 | homepage | .
蔡英文 | Tsai Ing-wen |
新店救護車阻擋事件 | Store ambulance blocking event |
草榴社区 | Grass garnet Community | Caoliu online forum
丁文雄 | Dingwen Xiong |
AV女優 | AV Actress | .
南陽郡 | Nanyang County |
林書豪 | Jeremy Lin | basketball player
阮文雄 | Ruanwen Xiong |
星野亞希 | Aki Hoshino |
濱田翔子_(演員) | Bin Shoko (Actor) |
佐佐木希 | Sasaki Nozomi |
朝河兰 | North River Portland |
濱崎步 | Ayumi Hamasaki | Japanese pop singer
原紗央莉 | Original Saori |
小澤瑪麗亞 | Maria Ozawa |
Angelababy | Angelababy |
佐山愛 | Ai Sayama |
郭書瑤 | Guo Shu-Yao |
大纪元时报 | The Epoch Times | Falun Gong-connected publication
國立大學 | national university | .
公立大學 | Public universities | .
全球定位系统 | Global Positioning System |
国父_(罗马帝国) | Pater Patriae (Roman Empire) | “Father of the Country” honorific
潘佩珠 | Phan |
哥德巴赫猜想 | Goldbach’s conjecture |
彭淮南 | Perng |
吳淑珍 | Wu Shu-chen |
第十四世达赖喇嘛 | Fourteenth Dalai Lama |
宋教仁 | Sung |
司马南 | Sima Nan |
天海翼 | Day sea wing |
性高潮 | Orgasm |
蔡依林 | Jolin |
满族 | Manchu |
U-KISS | U-KISS |
维基百科 | Wikipedia |
印度神油 | Indian god oil |
江泽民 | Jiang |
东京热 | Tokyo Hot |
正覺同修會 | Ching Practitioners Association |
六合彩 | LOTTO |
越南政黨列表 | Vietnamese party list |
干 | Dry |
朱明 | Zhu |
中华人民共和国行政区划 | People’s Republic of administrative divisions |
齡記書店 | Ling Kee |
2012年香港立法會選舉 | 2012 Hong Kong Legislative Council Election |
真佛宗 | True Buddha School |
海椰子 | Sea coconut |
鶴佬陸上扒龍船 | Grilled Hoklo dragon boat onshore |
馮光遠 | Peng says |
蔡瑞月 | Tsai |
衛星定位系統 | Satellite positioning system |
大连理工大学 | Dalian University of Technology |
拌麵 | Noodles |
MediaWiki | MediaWiki |
利菁 | Lee Ching |
狭义相对论 | Special Relativity |
南京大學校友列表 | Nanjing University Alumni List |
李长春 | Li Changchun |
有栖川宮 | Arisugawanomiya |
郑州加州工业城 | Zhengzhou City of Industry, California |
莫莉花 | Molly Flowers |
曾偉恩 | Zeng Wayne |
捕捉、絕育、釋放 | Trap-Neuter-Release |
自慰 | Masturbation |
顺阳范氏 | Shun Yang Fan |
港西镇_(崇明县) | Hong Kong West Town (Chongming County) |
陳炳 | Chen Ping |
發正念 | Hair Mindfulness |
許瑜真 | Xu Yu True |
十二年國教學生研討會 | Twelve national seminar teaching students |
任天堂溥天 | Nintendo Popteam |
南阳范氏 | Nanyang Fan |
阿坎巴羅雕像 | Aquin Barrow Statue |
伊卡黑石 | Ica Blackstone |
瘂弦 | Ya Xian |
呂泉生 | Lv Quansheng |
中国大陆 | Chinese mainland |
丁先皇 | Ding Xianhuang |
天主教香港教區 | Catholic Diocese of Hong Kong |
蔡衍明 | Tsai |
小游戏 | Small game |
曹昂 | Cao Ang |
曹鑠 | Cao Shuo |
加拉帕戈斯象龜 | Galapagos tortoises |
北西摩島 | North Seymour Island |
Hao123网址之家 | Hao123 website |
中华人民共和国政府认定的邪教组织列表 | PRC Government has identified a list of cult |
被政府認定為邪教的團體列表 | Identified by the government as a cult group list |
馬三家女子勞教所 | Masanjia Women’s Forced Labor Camp |
神韵艺术团 | Shen Yun Performing Arts |
多面体 | Polyhedron |
巴丹群島 | Batanes Islands |
蘇家屯事件 | Sujiatun |
薄瓜瓜 | Bo Guagua |
拳王_(電視劇) | Muhammad (TV series) |
朱雪璋 | Zhuxue Zhang |
黑魔女學園 | Black witch academy |
缺宅男女 | Lack of house men and women |
楊怡 | Tavia |
伏見宮 | Fushimi Palace |
对毛泽东的评价 | Evaluation of Mao |
中国共产党中央委员会主席毛泽东同志支持美国黑人抗暴斗争的声明 | Chinese Communist Party Central Committee with Comrade Mao Zedong declared the struggle to support African-American uprising |
蟾蜍 | Toad |
闲院宫载仁亲王 | Court Palace, Prince Akishino idle load |
皇室 | Imperial family |
當旺爸爸 | When the busy dad |
李克强 | Li Keqiang |
幾米 | Jimmy |
维权运动 | Rights movement |
骨部 | Bony part |
行政院環境保護署 | Environmental Protection Agency |
南華足球隊 | South China Football Team |
鍾嘉欣 | Linda |
行政院 | Executive Yuan |
朝鲜战争 | Korean War |
蔡宇傑 | Cai Yujie |
越南共和国 | Republic of Vietnam |
陳智雄 | Chen Zhixiong |
陈良宇 | Chen Liangyu |
單戀雙城 | Unrequited love Twins |
交通部中央氣象局 | Central Weather Bureau |
中華民國教育部 | Republic of China Ministry of Education |
中華民國交通部 | Republic of China Ministry of Transportation |
巨輪 | Large ship |
九江十二坊 | Jiujiang twelve Square |
香港獨立媒體 | Hong Kong’s Independent Media |
方志敏 | Fang Zhimin |
梁振英 | Leung Chun-ying |
表情符号 | Emoticons |
雷霆掃毒 | Thunder antidrug |
摘星之旅 | Reaching for the Stars Tour |
Yummy_Yummy | Yummy Yummy |
宣萱 | Jessic