2015-08-06

Revision as of 16:35, 6 August 2015

=Review of a multilingual 500K sample=

== Introduction ==

In an effort to reduce the rate of Wikipedia search queries that produce no results (see the [[User:Deskana (WMF)/Proposals/Zerorate|Discovery team's proposal]]), I've undertaken a manual review of three batches of 500,000 full-text queries that returned no results. The queries were taken from the top 52 wikis, those with 100K+ articles. For future reference, at this time that's en, sv, de, nl, fr, war, ru, it, ceb, es, vi, pl, ja, pt, zh, uk, ca, fa, sh, no, ar, fi, id, ro, hu, cs, sr, ko, ms, tr, min, eo, kk, eu, da, sk, bg, hy, he, lt, hr, sl, et, uz, gl, nn, vo, la, simple, el, hi, and ce.

]

}</graph>


=Full manual review of a 1K enwiki sample=


==Introduction==

Looking for large recurring patterns of searches will not reveal the frequency of idiosyncratic errors (like typos, gibberish, and queries in foreign languages) that can't easily be recognized automatically. So I undertook a manual review of 1047 randomly sampled queries from one day's worth of logs of enwiki full-text searches, starting on 2015-07-29. This was a random sample (or as random as pseudorandom can be) rather than an every-N sample.
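The difference between a random sample and an every-N sample can be sketched roughly as follows. This is only an illustration, not the script used for the review; the function names and the toy <code>queries</code> list are assumptions.

```python
import random

def random_sample(queries, k, seed=None):
    """Pick k queries uniformly at random, without replacement."""
    rng = random.Random(seed)  # the seed only makes runs repeatable
    return rng.sample(queries, k)

def every_n_sample(queries, n):
    """Pick every n-th query, starting with the first one."""
    return queries[::n]

# Toy stand-in for one day's log of zero-result queries (hypothetical data).
queries = [f"query {i}" for i in range(10_000)]

picked_randomly = random_sample(queries, 1047, seed=0)
picked_every_n = every_n_sample(queries, 10)
```

An every-N sample can line up with periodic behavior in the log (for example, an automated client querying at regular intervals), which a random sample avoids.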


===Caveats===


All categories are at least somewhat subjective, and depend in part on my ability to recognize (or uncover) user intent. Many items could have been put in multiple categories, but I chose just one each time (some comments about this are included below). I know others would disagree on some categories, and I know that there are errors in categorization (i.e., I'd disagree with myself), but the overall trends are still illustrative.


==Requestors==


Overall, 67.2% were requested via API, 32.8% via web.


==Typos==


I broke typos up into two categories: apparent mistakes, and incomplete words or phrases. Note that the vast majority in the incomplete category come via API, hinting that some app may be sending incomplete queries.

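{| class="wikitable"
! Count !! Share of sample !! Category !! Requestor split
|-
| 153 || 14.6% || TYPO || 54.2% web, 45.8% api
|-
| 93 || 8.9% || INCOMP || 97.8% api, 2.2% web
|-
| colspan="4" | Total: 23.5%
|}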
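As a worked check of the arithmetic, the shares above are each category's count divided by the 1047 reviewed queries. A minimal sketch (the helper name <code>share</code> is made up for illustration):

```python
SAMPLE_SIZE = 1047  # number of manually reviewed queries

def share(count, total=SAMPLE_SIZE):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)

typo = share(153)
incomplete = share(93)
combined = share(153 + 93)
print(typo, incomplete, combined)  # 14.6 8.9 23.5
```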


===Typo autosuggestions===


I took a random sample of 40 of the queries in the TYPO category and searched for them in enwiki via the web. The results were oddly evenly distributed:

* 10/40 = 25.0% zero results
* 10/40 = 25.0% some results
* 20/40 = 50.0% correct results

Half had clearly correct suggestions, a quarter had no results, and a quarter had some non-zero results that were either wrong, or not clearly correct.


===Typo reverse index===


By manual inspection, 20 of the 153 TYPOs (13.1%) had an error in the first two characters of one of the search terms, and thus might benefit from a reverse index.
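To illustrate why a reverse index might help with these cases, here is a minimal sketch. It assumes a fuzzy matcher that only compares terms sharing an exact prefix (2 characters here, a hypothetical setting) and that allows one edit. The function names are illustrative and are not taken from the search backend.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(query, term, prefix_len=2, max_edits=1):
    """Match only if the first prefix_len characters agree exactly."""
    if query[:prefix_len] != term[:prefix_len]:
        return False
    return edit_distance(query, term) <= max_edits

def reverse_fuzzy_match(query, term, prefix_len=2, max_edits=1):
    """Apply the same matcher to reversed strings, as a reverse index would."""
    return fuzzy_match(query[::-1], term[::-1], prefix_len, max_edits)

# A typo in the first character defeats the forward matcher...
print(fuzzy_match("qikipedia", "wikipedia"))          # False
# ...but the reversed strings share a prefix, so the reverse matcher finds it.
print(reverse_fuzzy_match("qikipedia", "wikipedia"))  # True
```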


==Previously seen categories==


These are the previously discussed large categories of zero-return queries. The distributions are different from previous samples, sometimes drastically different. This could be attributed to the current small sample size, the time skew in the earlier sample, day-of-the-week effects, and random vagaries of millions of people searching—though that last can account for almost any variance!

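{| class="wikitable"
! Count !! Share of sample !! Category !! Requestor split !! Description
|-
| 125 || 11.9% || AND || 100.0% api || "Article_title" AND "title of link taken from article"
|-
| 91 || 8.7% || UNIX || 100.0% api || Unix timestamps
|-
| 29 || 2.8% || FILM || 100.0% web || TV episodes / movies: "..." film
|-
| 28 || 2.7% || QUOT || 100.0% api || quot
|-
| 20 || 1.9% || DOI || 100.0% api || DOI
|-
| 3 || 0.3% || TERM+ || 100.0% api || term+term+term / term+term+term country
|-
| colspan="5" | Total: 28.3%
|}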
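Several of these categories have a regular enough shape that they could be flagged automatically. The sketch below is an assumption-laden illustration: the regular expressions are rough guesses at each category's form based on the descriptions above, not the rules used for this review.

```python
import re

# Hypothetical patterns: rough guesses at each category's shape.
PATTERNS = [
    ("AND", re.compile(r'^"[^"]+" AND "[^"]+"$')),
    ("UNIX", re.compile(r"^\d{10}$")),           # 10-digit epoch seconds
    ("DOI", re.compile(r"^10\.\d{4,9}/\S+$")),   # DOI prefix/suffix shape
    ("TERM+", re.compile(r"^\w+(\+\w+)+( \w+)?$")),
]

def classify(query):
    """Return the first matching category label, or None."""
    q = query.strip()
    for label, pattern in PATTERNS:
        if pattern.match(q):
            return label
    return None

print(classify('"Some title" AND "Some link"'))  # AND
print(classify("1438128000"))                    # UNIX
```

Anything that returns <code>None</code> would still need manual review.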


==Foreign languages==

Though enwiki has many pages with titles in other languages, these searches didn't get any results. I didn't dig too deeply into most of them. However, numerous entries filed under MOVIE, MUSIC, and YOUTUBE are in Spanish, Portuguese, Turkish, or other languages.

{| class="wikitable"
! Count !! Share of sample !! Category !! Requestor split
|-
| 13 || 1.2% || CHINESE || 61.5% api, 38.5% web
|-
| 12 || 1.1% || ARABIC || 58.3% web, 41.7% api
|-
| 7 || 0.7% || CYRILLIC || 57.1% web, 42.9% api
|-
| 7 || 0.7% || TAGALOG || 85.7% web, 14.3% api
|-
| 6 || 0.6% || SPANISH || 83.3% api, 16.7% web
|-
| 5 || 0.5% || MALAY || 80.0% web, 20.0% api
|-
| 3 || 0.3% || GERMAN || 66.7% api, 33.3% web
|-
| 3 || 0.3% || DEVTRANS || 100.0% api
|-
| 2 || 0.2% || NORWEGIAN || 100.0% api
|-
| 2 || 0.2% || SWAHILI || 100.0% web
|-
| 1 each || 0.1% each || GREEK, THAI, TAMIL || 100% api
|-
| 1 each || 0.1% each || PORTUGUESE, DUTCH, FRENCH, LATIN, SWEDISH, ITALIAN, CROATIAN, HINDI, ESTONIAN, HMONG, KANNADA, FINNISH || 100% web
|-
| colspan="4" | Total: 7.2%
|}
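Queries in non-Latin scripts are the easiest of these to spot automatically. The following is a minimal sketch, assuming that the first word of each letter's Unicode character name is a good enough proxy for its script; the function name is made up for illustration. It identifies scripts only, so Latin-script languages such as Tagalog or Spanish would still need real language identification.

```python
import unicodedata
from collections import Counter

def dominant_script(query):
    """Guess the main script from the Unicode names of the letters."""
    scripts = Counter()
    for ch in query:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # First word of the name, e.g. "CYRILLIC", "ARABIC", "CJK".
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None

print(dominant_script("Москва"))  # CYRILLIC
print(dominant_script("中文"))     # CJK
```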


==Mystery queries==

These are ones that I just couldn't figure out. They weren't clearly junk. They could be typos, but often they are too ambiguous.

66 6.3% ??, 71.2% api, 28.8% web


==Not encyclopedic==


These categories are potentially problematic, since they may depend in part on what you think should or should not be in Wikipedia.


===Wrong website===


* PROD: These are queries that appear to be about general or specific products, including video games, clothing brands, drugs, decorations, laptop replacement parts, etc.
* QUESTION: These are queries that seem to be asking for non-encyclopedic information, including advice on romance, study habits, and home furnishing, as well as job searches, travel arrangements, celebrity facts, scholarly research, etc.
* URL: These look like they are or tried to be URLs.
* NEWS: Questions about current events.
* TWITTER: Actual tweets or parts of tweets.

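{| class="wikitable"
! Count !! Share of sample !! Category !! Requestor split
|-
| 57 || 5.4% || PROD || 66.7% api, 33.3% web
|-
| 45 || 4.3% || QUESTION || 57.8% web, 42.2% api
|-
| 32 || 3.1% || URL || 62.5% web, 37.5% api
|-
| 3 || 0.3% || NEWS || 66.7% web, 33.3% api
|-
| 2 || 0.2% || TWITTER || 100.0% api
|-
| colspan="4" | Total: 13.3%
|}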


===Content===

These appear to be searches for particular content, including particular songs, albums, music by a particular artist, movies, TV episodes, books, or scholarly articles. The YOUTUBE category consists of queries that exactly match the titles of individual YouTube videos. The ISBN category consists of plain ISBNs.

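{| class="wikitable"
! Count !! Share of sample !! Category !! Requestor split
|-
| 42 || 4.0% || MUSIC || 78.6% api, 21.4% web
|-
| 25 || 2.4% || YOUTUBE || 92.0% api, 8.0% web
|-
| 10 || 1.0% || MOVIE || 50.0% api, 50.0% web
|-
| 7 || 0.7% || ARTICLE || 85.7% web, 14.3% api
|-
| 7 || 0.7% || ISBN || 71.4% web, 28.6% api
|-
| colspan="4" | Total: 8.8%
|}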
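Plain ISBN queries are one content type that could be recognized mechanically, since ISBNs carry a check digit. Below is a minimal sketch of the standard ISBN-10 and ISBN-13 checksum tests. The function name is made up for illustration, and nothing here reflects how the ISBN category was actually identified in this review.

```python
def is_valid_isbn(text):
    """Check the ISBN-10 or ISBN-13 check digit, ignoring hyphens and spaces."""
    s = text.replace("-", "").replace(" ", "").upper()
    if not s.isascii():
        return False
    if len(s) == 10 and s[:9].isdigit() and (s[9].isdigit() or s[9] == "X"):
        digits = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
        # Weights run 10, 9, ..., 1; the weighted sum must be divisible by 11.
        return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0
    if len(s) == 13 and s.isdigit():
        # Weights alternate 1, 3, 1, 3, ...; the sum must be divisible by 10.
        return sum((3 if i % 2 else 1) * int(c) for i, c in enumerate(s)) % 10 == 0
    return False

print(is_valid_isbn("0-306-40615-2"))      # True
print(is_valid_isbn("978-3-16-148410-0"))  # True
```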


===People and places===

These are queries for particular people or places that are not in Wikipedia, including named individuals or parts of names (PERSON), online usernames (USER), addresses of businesses (ADDRESS; note that all are in Las Vegas), and email addresses (EMAIL).

The LINKED category consists of searches in this form:

 "SURNAME" "FIRST MIDDLE" "COMPANY" LINKEDIN

Both LINKEDIN and VIADEO (professional social networking sites) were used.

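{| class="wikitable"
! Count !! Share of sample !! Category !! Requestor split
|-
| 26 || 2.5% || PERSON || 53.8% web, 46.2% api
|-
| 16 || 1.5% || USER || 56.2% api, 43.8% web
|-
| 10 || 1.0% || ADDRESS || 100.0% api
|-
| 4 || 0.4% || LINKED || 100.0% api
|-
| 2 || 0.2% || EMAIL || 100.0% web
|-
| colspan="4" | Total: 5.6%
|}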


===Misc===

Stuff I could at least partly identify, but couldn't categorize elsewhere. One was just a number, one a slightly mangled Wikimedia Commons file name, and the other a Wikimedia Commons category name.

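{| class="wikitable"
! Count !! Share of sample !! Category !! Requestor split
|-
| 2 || 0.2% || COMMONS || 50.0% api, 50.0% web
|-
| 1 || 0.1% || NUMBER || 100.0% web
|-
| colspan="4" | Total: 0.3%
|}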


==Junk==


* JUNK includes snippets of larger texts, multiply repeated letters, keyboard banging, and the like.
* OCR covers "words" that seem to appear primarily as OCR errors in Google Books.
* ERROR includes "search_suggest_query" and the like.
* EMOJI covers strings of emoji characters.
* SPAM includes an actual advertisement and what looks like a hacking probe attempt.
* NODE is a pattern that's come up before: Iamlookingfor[...]node[...]

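{| class="wikitable"
! Count !! Share of sample !! Category !! Requestor split
|-
| 34 || 3.2% || JUNK || 85.3% web, 14.7% api
|-
| 9 || 0.9% || OCR || 55.6% api, 44.4% web
|-
| 3 || 0.3% || ERROR || 66.7% api, 33.3% web
|-
| 3 || 0.3% || EMOJI || 100.0% web
|-
| 2 || 0.2% || SPAM || 50.0% api, 50.0% web
|-
| 1 || 0.1% || NODE || 100.0% web
|-
| colspan="4" | Total: 5.0%
|}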


==Misses==

These are queries with real content for which there were no entries.

* DICT: mostly obscure words, but they are in Wiktionary
* SPECIES: Latin species names
* MISS: other misses that seem to describe reasonable things that are or could be in Wikipedia, but either aren't, or weren't found

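{| class="wikitable"
! Count !! Share of sample !! Category !! Requestor split
|-
| 12 || 1.1% || MISS || 50.0% api, 50.0% web
|-
| 5 || 0.5% || DICT || 80.0% api, 20.0% web
|-
| 4 || 0.4% || SPECIES || 75.0% web, 25.0% api
|-
| colspan="4" | Total: 2.0%
|}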
