Glinden.blogspot.com

Browsing behavior for web crawling

2011-12-01

A recent paper out of Yahoo, "Discovering URLs through User Feedback" (ACM), describes the value from using what pages people browse to and click on (which is in Yahoo's toolbar logs) to inform their web crawler about new pages to crawl and index.

From the paper:
Major commercial search engines provide a toolbar software that can be deployed on users' Web browsers. These toolbars provide additional functionality to users, such as quick search option, shortcuts to popular sites, and malware detection. However, from the perspective of the search engine companies, their main use is on branding and collecting marketing statistics. A typical toolbar tracks some of the actions that the user performs on the browser (e.g., typing a URL, clicking on a link) and reports these actions to the search engine, where they are stored in a log file.

A Web crawler continuously discovers new URLs and fetches their content ... to build an inverted index to serve [search] queries. Even though the basic mechanism of a crawler is simple, crawling efficiently and eff ectively is a difficult problem ... The crawler not only has to continuously enlarge its repository by expanding its frontier, but also needs to refresh previously fetched pages to incorporate in its index the changes on those pages. In practice, crawlers prioritize the pages to be fetched, taking into account various constraints: available network bandwidth, peak processing capacity of the backend system, and politeness constraints of Web servers ... The delay to discover a Web page can be quite long after its creation and some Web sites may be only partially crawled. Another important challenge is the discovery of hidden Web content ... often ... backed by a database.

Our work is the first to evaluate the benefits of using the URLs collected from a Web browser toolbar as a form of user feedback to the crawling process .... On average, URLs accessed by the users are more important than those found ... [by] the crawler ... The crawler has a significant delay in discovering URLs that are first accessed by the users ... Finally, we [show] that URL discovery via toolbar [has a] positive impact on search result quality, especially for queries seeking recently created content and tail content.
The paper goes on to quantify the surprisingly large number of URLs found by the toolbar that are useful, not private, and not excluded by robots.txt. Importantly, a lot of these are deep web pages, only visible by doing a query on a database, and hard to ferret out of that database any way but looking at the pages people actually look at.

Also interesting are the metrics on pages the toolbar data finds first. People often send links to new web pages by e-mail or text message. Eventually, those links might appear on the web, but eventually can be a long time, and many of the urls found first in the toolbar data ("more than 60%") are found way before the crawler manages to discover them ("at least 90 days earlier than the crawler").

Great paper out of Yahoo Research and a great example of how useful behavior data can be. It is using big data to help people help others find what they found.