Vol: 5 • Issue: 4 • April 2015
Military history, cricket, and Australia targeted in Wikipedia articles’ popularity vs. quality; how copyright damages economy
With contributions by: Niklas Laxström, Federico Leva, Masssly, Gamaliel and Piotr Konieczny
Contents
1 Popularity does not breed quality (and vice versa)
2 Excessive copyright terms proven to be a cost for society, via English Wikipedia images
2.1 Context
2.2 Findings
2.3 Methodological nitpicks
3 Briefly
3.1 “Automatic Text Summarization of Wikipedia Articles”
3.2 Relationship between Google searches and Wikipedia edits
3.3 How much of the Amazon rainforest would it take to print out Wikipedia?
3.4 Perceptions of bot services
4 Other recent publications
5 References
Popularity does not breed quality (and vice versa)
This paper[1] provides evidence that the quality of an article is not a simple function of its popularity, or, in the words of the authors, that there is “extensive misalignment between production and consumption” in peer communities such as Wikipedia. As the authors note, reader demand for some topics (e.g. LGBT topics or pages about countries) is poorly satisfied, whereas there is an over-abundance of quality on topics of comparatively little interest, such as military history.
Rank  Popular and underdeveloped topics  High-quality, not popular topics
1     Countries                          Cricket
2     Pop music                          Tropical cyclones
3     Internet                           Middle Ages
4     Comedy                             Politics
5     Technology                         Fungi
6     Religion                           Birds
7     Science Fiction                    Military history
8     Rock music                         Ships
9     Psychology                         England
10    LGBT studies                       Australia
Illustration from Wedding, cited as an example for start-class articles which ought to be featured articles if quality ratings were perfectly aligned with popularity
The authors arrived at this conclusion by comparing page view data for articles on the English, French, Russian, and Portuguese Wikipedias with their respective Wikipedia:Assessment (or equivalent) quality ratings. They note that quality and popularity are well aligned for at most 10% of Wikipedia articles; conversely, over 50% of high-quality articles concern topics of relatively little demand (as measured by their page views). The authors estimate that about half of the page views on Wikipedia – billions each month – go to articles that would be of better quality if popularity translated directly into quality. They identify 4,135 articles that are of high interest but poor quality, and suggest that the Wikipedia community may want to focus on improving such topics.
Specific examples of the extremes include articles of poor quality (start class) with a high number of views, such as wedding (1k views each day) or cisgender (2.5k views each day). For examples of topics of high quality and little impact, one need only glance at a random entry in Wikipedia:Featured articles; the authors use the example of the 10 Featured Articles about the members of the Australian cricket team in England in 1948 (itself a Good Article; 30 views per day).
Interestingly, based on their study of WikiProjects, popularity and quality, the authors find that, contrary to some popular claims, pop culture topics are also among those that are underdeveloped. The authors also note that even within WikiProjects, labor is not efficiently organized: for example, within the topic of military history there are numerous featured articles about individual naval ships, while broader and more popular topics, such as NATO, are less well attended to.
In conclusion, the authors encourage the Wikipedia community to focus on such topics, and to recruit participants for improvement drives using tools such as User:SuggestBot.
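The idea of comparing popularity with quality can be illustrated with a toy sketch. It is not the authors' method or dataset: it simply ranks a handful of articles by daily views and by assessment class and reports the difference between the two percentile ranks. The three example articles and their figures are the ones quoted above; the function names and the percentile-gap score are assumptions made for illustration.

```python
# Toy illustration only; not the authors' data or method.
# English Wikipedia assessment classes, ordered from lowest to highest.
QUALITY_CLASSES = ["Stub", "Start", "C", "B", "GA", "A", "FA"]


def percentile_ranks(values):
    """Percentile rank in [0, 1] for each value; tied values share the average rank."""
    n = len(values)
    ranks = []
    for v in values:
        lower = sum(1 for x in values if x < v)
        equal = sum(1 for x in values if x == v)
        average_rank = lower + (equal - 1) / 2
        ranks.append(average_rank / (n - 1) if n > 1 else 0.0)
    return ranks


def misalignment(articles):
    """Map title -> popularity percentile minus quality percentile.

    A positive gap means the article is more popular than its quality suggests.
    """
    popularity = percentile_ranks([a["daily_views"] for a in articles])
    quality = percentile_ranks([QUALITY_CLASSES.index(a["quality"]) for a in articles])
    return {a["title"]: p - q for a, p, q in zip(articles, popularity, quality)}


articles = [
    {"title": "Wedding", "daily_views": 1000, "quality": "Start"},
    {"title": "Cisgender", "daily_views": 2500, "quality": "Start"},
    {"title": "Australian cricket team in England in 1948", "daily_views": 30, "quality": "GA"},
]
gaps = misalignment(articles)
most_underserved = max(gaps, key=gaps.get)
```

With only three articles the numbers mean little, but the sign of each gap shows the direction of the misalignment: the two start-class articles come out as under-served, the Good Article as over-served.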
Excessive copyright terms proven to be a cost for society, via English Wikipedia images
Within a sample of US bestseller authors, what effect may the addition of this image to the article Michael Gold have had on its traffic?
Paul J. Heald and his coauthors at the University of Glasgow continued their extremely valuable studies of the public domain, publishing “The Valuation of Unprotected Works”.[2] The study finds that “massive social harm was done by the most recent copyright term extension that has prevented millions of works from falling into the public domain since 1998” which “provides strong justification for the enactment of orphan works legislation.”
Context
In recent years, authorities have started acknowledging possible errors in copyright legislation of the past, which would have been prevented by an evidence-based approach. Heald mentions the Hargreaves Report (2011), endorsed by the UK’s IP office, but other examples can be found in World Intellectual Property Organization reports. This awakening corresponds to the work by researchers and think tanks to prove the importance of public domain and certain damages of copyright.[supp 1]
The importance of evidence-based legislation can’t be overstated, especially in the current process of EU copyright revision.
As Heald notes, past copyright policy has relied on a number of incorrect assumptions, in short:
that private value equates to social welfare, i.e. that any payment associated with copyright makes society richer;
that the only private value is generated by sales under copyright monopoly;
that absence of copyright reduces both distribution and associated payments.
Recent studies, some of which are mentioned in this paper (Pollock, Waldfogel, Heald), have instead found strong indicators that:
consumer surplus (i.e. amounts saved by consumers) can be higher and hence contribute more to social welfare;
absence of copyright may produce higher private value as well;
works under traditional copyright, especially given the phenomenon of orphan works, don’t manage to cover the entire market, resulting in a loss of knowledge distribution as well as of potential sales.
In short, it seems that “the public is better off when a work becomes freely available”, insofar as copyright has been “robust enough to stimulate the creation of the work in the first place” and that a work “must remain available to the public after it falls into the public domain”.
Findings
However, it is impossible to measure the value of knowledge acquired by society and, even considering the mere monetary value, it is impossible to measure transactions which did not happen. The English Wikipedia is used by the authors as a dataset because its history is open to inspection and its content is unencumbered by copyright payments, so every “transaction” is public.
In particular, the study measures what it would cost if gratis images were not available for use on English Wikipedia articles, as a proxy of the consumer surplus generated by those images, as a proxy of their private value, and as a proxy of their contribution to social welfare. If a positive value is found, it is proven that a more restrictive copyright would be harmful, and we can reasonably infer that reducing copyright restrictions would make society richer.
The calculation is done in three steps.
1. 362 authors of New York Times bestsellers of 1895–1969 are considered. Their English Wikipedia articles are checked for the inclusion of portraits and the copyright status thereof; the increase in page views caused by the presence of the image is then calculated. To control for other factors, authors are compared in “matched pairs” of similar popularity, as suggested by Amazon reviews or pageviews in mid-2009. Only the lowest-scoring months are considered, the general increase in pageviews is discounted, etc.
   The first proxy considered is how much it would cost to buy the images from traditional image sellers, in the hypothetical (and absurd) case that article authors were allowed to. Such an image typically costs around 100 $ even if it is in the public domain or identical to the one used by our articles.
   The second proxy is how much the added pageviews are worth in terms of potential advertising revenue (0.0053 $/view, according to [1]).
2. The values are then validated on a different dataset of a few hundred composers and lyricists.
3. The amounts are then extrapolated proportionally to all English Wikipedia articles by considering the images and pageviews of a sample of 300 articles.
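The arithmetic behind the two proxies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual computation: the advertising rate and the typical image price are the figures quoted in this review, while the pageview counts, the number of images and all function names are hypothetical.

```python
# Figures quoted in the review.
AD_RATE_USD_PER_VIEW = 0.0053   # potential advertising revenue per page view
IMAGE_PRICE_USD = 100.0         # typical price of one image from a traditional seller


def matched_pair_uplift(views_with_image, views_without_image):
    """Fractional pageview increase of the article with an image over its matched pair."""
    return (views_with_image - views_without_image) / views_without_image


def advertising_proxy(views_without_image, uplift, rate=AD_RATE_USD_PER_VIEW):
    """Second proxy: value of the added page views as potential advertising revenue."""
    return views_without_image * uplift * rate


def purchase_proxy(n_images, price=IMAGE_PRICE_USD):
    """First proxy: cost of buying the same number of images from image sellers."""
    return n_images * price


# Hypothetical matched pair: 10,000 views without an image, 11,700 with one.
uplift = matched_pair_uplift(views_with_image=11_700, views_without_image=10_000)
ad_value = advertising_proxy(10_000, uplift)

# Hypothetical count of images that would have to be bought.
replacement_cost = purchase_proxy(n_images=250)
```

In this made-up pair the uplift is 17%, so the added views are worth roughly nine dollars at the quoted rate. The study's further steps (restricting to the lowest-scoring months, discounting the general growth in pageviews, validating on a second dataset) are not modelled here.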
Clearly, the number of inferences is great, but the authors believe the findings to be robust. The pageview increase, depending on the method, was 6%, 17% or 19%, and at any rate positive. The authors with the most images were those who died before 1880, an outcome which has no possible technological reason nor any welfare justification: it’s clearly a distortion produced by copyright.
For those fond of price tags, the English Wikipedia images were estimated to be worth about 30,000 $/year for those 362 writers, or about 30 M$ in hypothetical advertising revenue for English Wikipedia, or 200–230 M$ in hypothetical costs of image purchase.
At any rate, this reviewer thinks that the positive impact of the lack of copyright royalties is proven and confirms the authors’ thesis. It is quite challenging to extend the finding to the whole English Wikipedia, all Wikimedia projects, the entire free knowledge landscape and finally the overall cultural works market; and it is even more fragile to put a price tag on it. However, this kind of one-number communication device is widely used to explain the impact of legislation, and the numbers traditionally used by legislators are way more fragile than this. Moreover, the study makes it possible to prove a positive impact on important literary authors and their life, i.e. their reputation, which is supposed to be the aim of copyright laws, while financial transactions are only a means.
Methodological nitpicks
There are several possible observations to be made about details of the study.
Only a few hundred articles were considered, and only on the English Wikipedia. How pageviews were measured is not explained in detail, but the study clearly relied on stats.grok.se, on whose limitations see the stats.grok.se FAQ and Research:Page view.
Special:Random is not able to produce a representative set of the English Wikipedia, let alone of the whole Wikipedia. In fact, it relies on a pseudo-random method which is not very random. (A more random method, based on ElasticSearch, was briefly enabled but then disabled for performance reasons.)
The authors use an artificial definition of “public domain” to match the cases which the study was able to measure, i.e. gratis images. Only 67% of the images were in the public domain, while 13% were used under fair use and 19% were released in some way by the author. As for the releases by the authors, all cases are confusingly conflated: in particular, “a Creative Commons” and “unprotected” are two incorrect terms used, which fail to recognise that CC images are copyrighted works and that not all CC images are free cultural works. This mix makes it hard to extend the results to the public domain proper, i.e. works without any copyright protection, as well as to Wikimedia projects other than the English Wikipedia, where fair use is less common. This may not affect the result on the welfare impact for the English Wikipedia, but it has a greater effect on the dates: namely, the fact that people who died before 2000 have fewer images may just mean that the English Wikipedia rules allowed fair use more for them because Wikipedia photographers would not be able to shoot photos themselves.
Again on terminology, it is disappointing that Wikipedia’s article authors are called “page builders”, as if they were mechanical workers (with all due respect for mechanical workers). There is no reason to reserve the term “authors” for the professional writers who are the subjects of those articles. An artificial restriction of the pool of people who can claim to be “authors” is one of the main propaganda tools of the “pro-copyright” lobby.
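To make the nitpick about Special:Random more concrete, here is a minimal sketch of why a key-based selection scheme is not uniform. It assumes (this is not stated in the study or verified here) that each page stores a fixed random key between 0 and 1, and that a request draws a fresh number and returns the page with the smallest key at or above it, wrapping around to the first page. The page titles and key values are hypothetical.

```python
def selection_probabilities(page_keys):
    """Probability that each page is returned under the assumed scheme.

    page_keys maps a page title to its fixed key in [0, 1); keys are assumed distinct.
    A page is returned whenever the fresh draw falls between the previous key and
    its own key, so its probability equals the gap to its predecessor.
    """
    ordered = sorted(page_keys.items(), key=lambda item: item[1])
    probabilities = {}
    # Wrap-around: draws above the largest key fall through to the first page.
    previous_key = ordered[-1][1] - 1.0
    for title, key in ordered:
        probabilities[title] = key - previous_key
        previous_key = key
    return probabilities


# Hypothetical pages and keys.
pages = {"A": 0.10, "B": 0.15, "C": 0.60, "D": 0.90}
probs = selection_probabilities(pages)
```

In this example page C is about nine times more likely to be returned than page B, although a uniform draw would give each page one chance in four. Over millions of pages the gaps even out on average, but any individual page keeps its fixed advantage or disadvantage.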
Briefly
“Automatic Text Summarization of Wikipedia Articles”
The authors of this paper[3] built neural networks using different features to pick sentences for summarizing (English?) Wikipedia articles. They compared their results to the summaries produced by Microsoft Word 2007 and found that the results are very different.
Relationship between Google searches and Wikipedia edits
A student course paper[4] developed a model to look for a correlation between the number of Google searches resulting from increased public interest in a subject and the number of edits made to that subject’s Wikipedia page. Google Trends data from 2012 for “Barack Obama”, “Google” and “Mathematics” was compared with the revisions of the corresponding Wikipedia articles within the same period. Because the actual data was unavailable, the paper applied approximation techniques to estimate the number of Google searches and the number of Wikipedia edits during a given period. Except for a few instances of spikes matching up, no clear correlation between Google searches and Wikipedia edits was found. Similar results were observed when more graphs were generated for different topics. The model made no provision for disproving the existence of a correlation. These limitations leave the results of the study inconclusive.
How much of the Amazon rainforest would it take to print out Wikipedia?
Two students at the University of Leicester have produced a thought-provoking mathematical illustration[5] of the scope of the Internet by calculating how much of the Amazon rainforest would be consumed if the entire Internet were printed on standard A4-size sheets of paper. Their conclusion is about 2% for the entire Internet, and 2.1 × 10⁻⁶ % for the English Wikipedia, the size of which they used to extrapolate the size of the rest of the Internet. Their calculations are based on a random sample of only ten pages, the average size of which they multiplied by the number of Wikipedia articles, which at the time was 4.7 million. Given the wealth of quantitative data available about Wikipedia, and that Wikipedia articles vastly range in size from a sentence or two up to the 784K byte article List of law clerks of the Supreme Court of the United States, perhaps more accurate estimates could have been made.
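The weakness of a ten-page sample can be illustrated with a short sketch. The extrapolation itself (sample mean multiplied by 4.7 million articles) follows the description above; the ten page counts are invented for illustration and are not the students' data, and the standard error is added here only to show how uncertain such a small, skewed sample is.

```python
import math
import statistics

ARTICLE_COUNT = 4_700_000  # number of English Wikipedia articles cited in the paper

# Hypothetical printed length (in A4 sheets) of ten randomly chosen articles.
sample_sheets = [1, 2, 2, 3, 4, 5, 8, 12, 20, 43]

mean_sheets = statistics.mean(sample_sheets)
estimated_total_sheets = mean_sheets * ARTICLE_COUNT

# Standard error of the sample mean, and the same value relative to the mean.
standard_error = statistics.stdev(sample_sheets) / math.sqrt(len(sample_sheets))
relative_error = standard_error / mean_sheets
```

For this invented sample the standard error is roughly 40% of the mean, so the extrapolated total inherits an uncertainty of the same order. Using the actual size distribution of all articles, which is publicly available, would avoid the sampling step altogether.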
Perceptions of bot services
This study[6] looked at how Wikipedians perceive bots, to enhance our understanding of the relationship between human and bot editors. The authors find that the bots are perceived as either “servants” or “policemen”. Overall, the bots are well accepted by the community, which the authors attribute to the fact that most bots are clearly labelled and seen as extensions of human actors (tools used by advanced Wikipedians). The authors nonetheless observe that bots are most likely to attract criticism where they make a large number of minor edits. Still, the necessity for such labor (maintaining categories, templates and the like) is, according to the authors, a widely recognized and accepted element of Wikipedia’s life.
Other recent publications
A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.
“P2Pedia: a peer-to-peer wiki for decentralized collaboration”[7] (screencast demo; see also w:User:HaeB/Timeline_of_distributed_Wikipedia_proposals)
“Distributed wikis: a survey”[8] From the abstract: “We identify three classes of distributed wiki systems, each using a different collaboration model and distribution scheme for its pages: highly available wikis, decentralized social wikis and federated wikis.”
“Detecting speculations using active learning” (“Deteccion de Especulaciones utilizando Active Learning”)[9] (student thesis in Spanish, about the detection of weasel words on the English Wikipedia)
References
1. Morten Warncke-Wang, Vivek Ranjan, Loren Terveen, and Brent Hecht (2015). “Misalignment Between Supply and Demand of Quality Content in Peer Production Communities”. http://www-users.cs.umn.edu/~bhecht/publications/wikipedia_supplydemandquality_icwsm2015.pdf.
2. Heald, Paul J. and Erickson, Kris and Kretschmer, Martin, “The Valuation of Unprotected Works: A Case Study of Public Domain Photographs on Wikipedia” (February 4, 2015). DOI:10.2139/ssrn.2560572 Available at SSRN: http://ssrn.com/abstract=2560572
3. Hingu, Dharmendra; Shah, Deep; Udmale, Sandeep S. (January 2015). “Automatic text summarization of Wikipedia articles”. 2015 International Conference on Communication, Information Computing Technology (ICCICT). DOI:10.1109/ICCICT.2015.7045732.
4. Claire, Charron (2014). “Analysing Trends Between US Google Searches and English Wikipedia Page Edits”.
5. Harwood, George (2015). “How Much of the Amazon Would it Take to Print the Internet?”. Journal of Interdisciplinary Science Topics 4. Centre for Interdisciplinary Science, University of Leicester.
6. Clément, Maxime; Guitton, Matthieu J. (September 2015). “Interacting with bots online: Users’ reactions to actions of automated programs in Wikipedia”. Computers in Human Behavior 50: 66–75. doi:10.1016/j.chb.2015.03.078. ISSN 0747-5632.
7. Davoust, Alan; Alexander Craig, Babak Esfandiari, Vincent Kazmierski (2014-10-01). “P2Pedia: a peer-to-peer wiki for decentralized collaboration”. Concurrency and Computation: Practice and Experience. doi:10.1002/cpe.3420. ISSN 1532-0634.
8. Davoust, Alan; Hala Skaf-Molli, Pascal Molli, Babak Esfandiari, Khaled Aslan (2014-11-01). “Distributed wikis: a survey”. Concurrency and Computation: Practice and Experience. doi:10.1002/cpe.3439. ISSN 1532-0634.
9. Benjamín Machín Serna: “Deteccion de Especulaciones utilizando Active Learning”. Student thesis, Universidad de la República – Uruguay, 2013 (PDF)
Supplementary references and notes:
1. The most important of these initiatives is probably the 2009 Public Domain Manifesto. Some examples in the context of orphan works: Italian cultural heritage on the Wikimedia projects#Advocating for the public domain bibliography commented in an Italian paper by this reviewer. http://arxiv.org/abs/1411.6675
Wikimedia Research Newsletter
This newsletter is brought to you by the Wikimedia Research Committee and The Signpost