2016-08-23

The Wayback Machine reveals two decades of web tracking and third-party requests

The number of third parties sending information to and receiving data from popular websites has increased dramatically in the past 20 years, which means that visitors to those sites may be more closely watched by major corporations and advertisers than ever before, according to a new analysis of web tracking.

A team from the University of Washington reviewed two decades of third-party requests using the Internet Archive’s Wayback Machine. They found a four-fold increase in the number of requests logged on the average website from 1996 to 2016, and say that companies may be using these requests to track the behavior of individual users more frequently than ever.

The authors, doctoral students Adam Lerner and Anna Kornfeld Simpson, along with collaborators Tadayoshi Kohno and Franziska Roesner, found that popular websites made an average of four third-party requests in 2016, up from less than one in 1996. Those figures are likely an underestimate of the prevalence of such requests, however, because of limitations of the data contained within the Wayback Machine. Roesner calls their findings “conservative.”

For comparison, a Princeton study of one million websites released in January and led by computer science researcher Arvind Narayanan found that top websites host an average of 25 to 30 trackers. Chris Jay Hoofnagle, a privacy and law scholar at UC Berkeley, says his own research has found that 36 of the 100 most popular sites send more than 150 requests each, with one site logging more than 300. Such comparisons are inexact, though: the definition of a tracker or a third-party request, and the methods used to identify them, vary between analyses.

“It’s not so much that I would invest a lot of confidence in the idea that there were X number of trackers on any given site,” Hoofnagle says of the University of Washington team’s results. “Rather, it’s the trend that’s important.”

Much of the tracking enabled by third-party requests happens through cookies, which are snippets of information stored in a user’s browser. Those snippets let users log in automatically or add items to a virtual shopping cart, but the third party that set a cookie can also read it back as the user navigates to other sites.

For example, a national news site called todaysnews.com might send a request to a local realtor’s ad server to load an advertisement on the news site’s home page. Along with the ad, the realtor can send a cookie containing a unique identifier for that user, and then read that cookie from the user’s browser when the user navigates to another site where the realtor also advertises.
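To make the mechanics concrete, here is a minimal sketch of such an ad server, written in TypeScript for Node.js. The domain, the cookie name (“uid”), and the identifier scheme are invented for illustration; they are not drawn from the study.

    // Hypothetical ad server for the realtor. The unique identifier travels
    // in a cookie: set on the first ad load, read back on every later one,
    // whichever site the ad appears on.
    import * as http from "http";
    import { randomUUID } from "crypto";

    const server = http.createServer((req, res) => {
      // Reuse the identifier the browser already holds, or mint a new one.
      const match = /(?:^|;\s*)uid=([^;]+)/.exec(req.headers.cookie ?? "");
      const uid = match ? match[1] : randomUUID();

      // The cookie rides along with the ad response; the browser returns it
      // automatically on future requests to this ad domain from other sites.
      res.setHeader("Set-Cookie", `uid=${uid}; Max-Age=31536000; Path=/`);
      res.setHeader("Content-Type", "text/html");
      res.end(`<img src="house-ad.png" alt="Realtor ad">`);
    });

    server.listen(8080);

The persistence is the point: with a year-long Max-Age, every site that loads an ad from this server quietly reveals the visit, and the visitor, to the realtor.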

In addition to following the evolution of third-party requests, the team also revealed the dominance of players such as Google Analytics, which by 2016 was present on nearly a third of the sites analyzed in the study. In the early 2000s, no third party appeared on more than 10 percent of sites, and only about 5 percent of sites sent five or more third-party requests. Today, nearly 40 percent do. But there’s good news, too: pop-ups seem to have peaked in the mid-2000s.

Narayanan says he has noticed another trend in his own work: consolidation within the tracking industry, with only a few entities such as Facebook or Google’s DoubleClick advertising service appearing across a high percentage of sites. “Maybe the world we’re heading toward is that there’s a relatively small number of trackers that are present on a majority of sites, and then a long tail,” he says.

Many privacy experts consider web tracking problematic because trackers can monitor a user’s behavior as they move from site to site. Combined with publicly available information from personal websites or social media profiles, this behavioral record can enable retailers or other entities to create identity profiles without a user’s permission.

“Because we don’t know what companies are doing on the server side with that information, for any entity that your browser talks to that you didn’t specifically ask it to talk to, you should be asking, ‘What are they doing?’” Roesner says.

But while every web tracker requires a third-party request, not every third-party request is a tracker. Sites that use Google Analytics (including IEEE Spectrum) make third-party requests to monitor how content is being used. Other news sites send requests to Facebook so the social media site can display its “Like” button next to articles and permit users to comment with their accounts. That means it’s hard to tell from this study whether tracking itself has increased, or if the number of third-party requests has simply gone up.

Modern ad blockers can prevent sites from installing cookies and have become popular with users in recent years. Perhaps due in part to this shift, the authors also found that the behaviors third parties exhibit have become more sophisticated and broader in scope. For example, a newer tactic avoids cookies altogether by recording a user’s device fingerprint: identifiable characteristics such as the screen size of their smartphone, laptop, or tablet.
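The idea can be sketched in a few lines of browser-side TypeScript. This toy version, written for illustration only, hashes a handful of traits into an identifier; commercial fingerprinting scripts collect far more signals:

    // Illustrative only: combine a few device traits that rarely change,
    // then hash them into a stable identifier. Real fingerprinting scripts
    // also harvest installed fonts, canvas rendering quirks, and more.
    async function deviceFingerprint(): Promise<string> {
      const traits = [
        screen.width,                   // display dimensions
        screen.height,
        screen.colorDepth,
        navigator.userAgent,            // browser and OS string
        navigator.language,
        new Date().getTimezoneOffset(), // coarse location hint
      ].join("|");

      // Hash the concatenated traits into a compact, opaque ID.
      const bytes = new TextEncoder().encode(traits);
      const digest = await crypto.subtle.digest("SHA-256", bytes);
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
    }

Because no cookie is stored, nothing for an ad blocker to delete: the same traits produce the same hash on every visit, from any site that runs the script.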

When they began their analysis, the University of Washington researchers were pleased to find that the Wayback Machine could be used to study cookies and device fingerprinting, because the archive stores each site’s original JavaScript code, which allowed them to determine which JavaScript APIs each website calls. Because that code is replayed, a user who is perusing the archived version of a site in the Wayback Machine winds up making all the same requests that the site was programmed to make at the time.
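One standard measurement technique, sketched here as an assumption about how such detection could work rather than as the researchers’ exact instrumentation, is to wrap a suspect API before page scripts run, so that every call to it gets logged:

    // Sketch: hook an API associated with fingerprinting so any call by the
    // page's scripts is recorded before the original behavior runs. Here we
    // wrap canvas-to-image conversion, a staple of canvas fingerprinting; a
    // real measurement tool would instrument many more APIs.
    const apiCalls: string[] = [];

    const originalToDataURL = HTMLCanvasElement.prototype.toDataURL;
    HTMLCanvasElement.prototype.toDataURL = function (
      this: HTMLCanvasElement,
      type?: string,
      quality?: number
    ): string {
      apiCalls.push(`toDataURL called on ${document.location.href}`);
      return originalToDataURL.call(this, type, quality);
    };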

The researchers embedded their tool, which they call TrackingExcavator, in a Chrome browser extension and configured it to allow pop-ups and cookies. They instructed the tool to inspect the 500 most popular sites, as ranked by Amazon’s web analytics subsidiary Alexa, for each year of the analysis. As it browsed the sites, the system recorded third-party requests and cookies, and the use of particular JavaScript APIs known to assist with device fingerprinting. The tool visited each site twice, once to “prime” the site and again to analyze whether requests were sent.
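The recording step might look roughly like the following background-script sketch, which uses Chrome’s standard chrome.webRequest API; TrackingExcavator’s actual code is more involved:

    // Log every request whose hostname differs from that of the page that
    // triggered it, i.e., a third-party request. Assumes the extension's
    // manifest grants the "webRequest" and host permissions. A real tool
    // would compare registrable domains (eTLD+1), not raw hostnames.
    type ThirdPartyHit = { page: string; target: string };
    const hits: ThirdPartyHit[] = [];

    chrome.webRequest.onBeforeRequest.addListener(
      (details) => {
        const target = new URL(details.url).hostname;
        // details.initiator names the origin that issued the request.
        const page = details.initiator
          ? new URL(details.initiator).hostname
          : null;
        if (page && target !== page) {
          hits.push({ page, target });
        }
      },
      { urls: ["<all_urls>"] }
    );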

The team says that, until now, no academic researchers had found a way to study web tracking as it was practiced before 2005. They presented their work at the USENIX Security Symposium in Austin, Texas, earlier this month.

Hoofnagle of UC Berkeley says the use of the Wayback Machine was a clever approach and could inspire other scholars to mine archival sites for other reasons. “I wish I had thought of this,” he says. “I’m totally kicking myself.”

Still, there are plenty of holes in the archive that limit its usefulness. For example, some sites bar automated crawlers, such as those used by the Wayback Machine, from indexing them.
