2015-11-16

Browsing the web with an ad blocker is certainly nothing new, but 2015 is the year that the practice is going mainstream. Apple not only made it easy for iOS 9 users to install ad blockers on their mobile devices, but also introduced the concept of ad blocking to a whole new audience.  How big is the audience that’s using ad blockers?

“Today, 34 percent of web users have installed some sort of ad blocker. Eighteen percent of tablet users have installed one, and almost a quarter of all mobile phone users—24 percent—have installed an ad blocker.”

VentureBeat, November 2015

34% is a large percentage, and it’s growing by the month!

Whatever your feelings on the ethics of ad blocking, its widespread adoption is a game changer for everyone on the web. Ad networks and sites that rely on ad revenue (Safari does not) are getting the most attention, but site usability is a concern for web developers and customers alike. Even if you don’t serve ads, ad blockers are throwing the practice of web analytics into a period of uncertainty and change.

Looking for some good news? I think those of us in the analytics space have gotten pretty lazy when it comes to how and what we measure. A little crisis is just what we need to rethink our practices and refocus on what matters. Tools like Google Analytics and Adobe Analytics have become so commonplace that the competitive advantage of relying on standard metrics has virtually disappeared.  Sure, we all have to track visits, page views, conversions and so on, but doing so is a basic need rather than a way to get ahead.

Understanding the effect of ad blockers and cookies

There are two major challenges that ad blockers (and some forms of “private browsing”) throw at traditional web analytics:

Most ad blockers disable the “tracking pixels” used by Google Analytics and other vendors. Third-party web analytics tools then cannot track users with blockers enabled unless the visitor whitelists your site (best of luck!).

Ad blockers give users the ability to block both first- and third-party cookies. Many sites require first-party cookies to function at all, so users disable them less often, but a segment of your traffic will nonetheless arrive with cookies blocked.

It’s not hard to imagine a day in the near future when half of your customers have an ad blocker installed. How many are going to whitelist GA tracking pixels on your site? Blocking third-party cookies has been on the rise for a number of years, so most analytics packages are equipped to operate with a combination of first-party cookies and tracking pixels. If those pixels are blocked, you’ll be in the dark.

Most commercial websites want to track users from entry to exit in a few high-level scenarios:

Non-logged-in site usage that never results in a “sale”. In this case a visitor browses the site but never creates an account, logs in, or spends any money. This is STILL an interesting interaction and we want to track and understand it. Here it’s possible for the visitor to refuse both first- and third-party cookies and block common analytics tracking pixels.

Non-logged-in site usage that results in a “sale”. In this case a human browses your site and ends up creating an account, logging in, or buying something. Once a customer is logged in, you’re guaranteed that they are accepting your first-party cookies, but it’s possible they did not prior to logging in. In addition, they may be blocking third-party cookies and common analytics tracking pixels.

A visitor is already logged in when they arrive (from a previous visit) and stay logged in the whole time. In this case you’re guaranteed that they are accepting your first-party cookies, but they may be blocking third-party cookies and common analytics tracking pixels.

In simpler terms:

    Scenario                     First-party cookies       Third-party cookies & pixels
    Browses, never logs in       May be blocked            Often blocked
    Logs in during the visit     Accepted once logged in   Often blocked
    Arrives already logged in    Accepted                  Often blocked

That’s not a pretty picture! While there has never been a perfect view of what’s happening on your site, the value of third-party tracking is diminishing at an alarming rate.  If you’re in the business of making money on the web, the data about behavior on your web properties is a competitive advantage. You need to own it.

Taking ownership of your site activity

The first step is to collect the data generated by user events on your site. Common events you’ll want to track include:

Page loads

API requests

Client-side events

Start with your web server logs. For high-volume sites like Safari, those logs come from your Content Delivery Network (CDN). Most CDNs can stream your logs to Amazon S3 or a local file share. If you don’t use a CDN, you can pull the logs right from your web host(s).
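
If your CDN is streaming logs to S3 and your warehouse is Redshift (as ours eventually was), getting a day of raw events loaded can be a single COPY. Here’s a minimal sketch; the table layout, bucket path, and tab-delimited format are assumptions that will vary with your CDN:

    -- Hypothetical staging table for raw CDN log lines; adjust the
    -- columns to match whatever your CDN actually emits.
    CREATE TABLE raw_cdn_logs (
        event_time    TIMESTAMP,
        client_ip     VARCHAR(45),
        method        VARCHAR(10),
        path          VARCHAR(2048),
        status        INTEGER,
        referrer      VARCHAR(2048),
        user_agent    VARCHAR(1024),
        cookie_header VARCHAR(4096)   -- holds first-party cookies, if logged
    );

    -- Redshift can load delimited log files straight from S3.
    COPY raw_cdn_logs
    FROM 's3://your-log-bucket/cdn/2015/11/16/'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    DELIMITER '\t'
    TIMEFORMAT 'auto'
    MAXERROR 100;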

Web server logs are a blessing and a curse. They typically give you lots of data, but not everything you need and plenty of information you don’t. You will certainly get page loads, but may need to explicitly include first-party cookies in your configuration (we did). We also realized that we had a lot of client-side events that didn’t show up in the web logs. That’s a harder problem to solve, but thankfully we could still infer most actions with a combination of page view records plus API requests. As for data we don’t always want?  Let’s just say that not every analyst wants to be bogged down with a record of each image request made on our site.
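
One way to spare analysts the image-request noise is a filtered view over the raw staging table, so the full detail stays available when engineering wants it. A sketch, using the hypothetical raw_cdn_logs table from above:

    -- Analysts query the view; the raw table keeps everything.
    CREATE VIEW log_events AS
    SELECT *
    FROM raw_cdn_logs
    WHERE path NOT LIKE '%.png'
      AND path NOT LIKE '%.jpg'
      AND path NOT LIKE '%.gif'
      AND path NOT LIKE '%.css'
      AND path NOT LIKE '%.js';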

Raw log files don’t do the business much good, so the next step is to get that useful data into a place where you can process it. We took a two-step approach. We knew the volume of data in our web logs would eventually make ingesting and processing it in a traditional relational database unrealistic, but in order to get off the ground quickly we began with a locally hosted Postgres database. My colleague, James Stevenson, wrote extensively about the process of ingesting the log data using Haskell, which has proven to be quite scalable, flexible, and, critically, independent of the final host datastore.

Creating a clickstream database

The concept of a clickstream database is nothing new. The goal is to tie each record in the web log (the “click”) to a visitor, and then be able to analyze what that visitor does. It really gets interesting when you have thousands, or millions, of visitors! This is essentially what Google Analytics and Adobe Analytics do behind the scenes, but now you own the data.
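
To make that concrete, the heart of such a database might look something like the sketch below. The columns are hypothetical rather than our actual schema, but the shape is typical: one wide row per event, with visitor and customer IDs filled in later by attribution.

    -- One row per log event. visitor_id and customer_id start out NULL
    -- and are populated by the attribution queries described below.
    CREATE TABLE clickstream (
        event_id    BIGINT IDENTITY(1,1),
        event_time  TIMESTAMP NOT NULL,
        event_type  VARCHAR(32),    -- page_load, api_request, client_event
        path        VARCHAR(2048),
        referrer    VARCHAR(2048),
        client_ip   VARCHAR(45),
        user_agent  VARCHAR(1024),
        cookie_id   VARCHAR(64),    -- first-party cookie value, when present
        visitor_id  BIGINT,         -- assigned during attribution
        customer_id BIGINT          -- known only once someone logs in
    );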

Once the logs are parsed and living in a place where they can be queried, it’s time to make sense of the information. This is where the volume of data can become problematic, and why we migrated our clickstream database from Postgres to Amazon Redshift. The main clickstream table contains a row for each log event and is both deep and wide. To make any sense of the data, you must run some pretty strenuous queries to do the following:

Attribute as many records to visitors (and visitors who become customers) as possible.

Determine which “visits” matter. In other words, calculate visit attribution.

Create summary tables for key metrics and events that you’d like to analyze and report on.

Accommodate ad-hoc queries to answer one-off questions, build statistical models, and explore the data.

The first one is the hardest, but it makes the other three possible. It’s also the one that throws third-party analytics tools for the biggest loop. Since we’re collecting the data ourselves on the server side, we don’t care if the visitor is blocking third-party cookies and tracking pixels. We’d love to have them accept first-party cookies, but it’s not a total loss if they don’t. Log events for visitors who never identify themselves by signing up and/or logging in can be strung together either by first-party cookies (if they are set) or by a combination of cookies, IP address, user agent, timestamps, and other log data. If a customer is accepting first-party cookies, or even better is logged in, the task is far easier.
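
The cookie-based half of that stitching is the easy part; here’s a simplified sketch against the hypothetical clickstream table above, assuming a visitors table keyed by first-party cookie. The fuzzier IP/user-agent matching is considerably more involved and only hinted at here:

    -- Easy case: every event carrying the same first-party cookie
    -- belongs to the same visitor.
    UPDATE clickstream
    SET visitor_id = v.visitor_id
    FROM visitors v
    WHERE clickstream.cookie_id = v.cookie_id
      AND clickstream.visitor_id IS NULL;

    -- Harder case (sketch only): cookieless events can be grouped by
    -- client_ip and user_agent, treating a gap of more than ~30 minutes
    -- between events as the start of a new visitor.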

Once that’s done, it’s up to you to decide what matters. Visit attribution is a discipline of its own, but now you’ve got the raw materials to get it right, even if your visitors are using an ad blocker. Because clickstream databases often live in something like Redshift or Hadoop, it’s necessary to build tables that summarize key metrics and events and send those back to your traditional data warehouse. A common use case is to create a summary table of the first and last visit prior to a transaction for each visitor, and use that much smaller table in your BI tools for reporting purposes.
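
As a rough illustration, that summary might be built along these lines; the visits and purchases tables are hypothetical stand-ins for whatever visit and transaction data you maintain:

    -- For each customer: their first visit ever, and the last visit
    -- before their first purchase. Small enough for BI reporting.
    CREATE TABLE visit_attribution_summary AS
    SELECT v.customer_id,
           MIN(v.visit_start) AS first_visit,
           MAX(CASE WHEN v.visit_start <= fp.first_purchase_time
                    THEN v.visit_start END) AS last_visit_before_purchase
    FROM visits v
    JOIN (SELECT customer_id, MIN(purchase_time) AS first_purchase_time
          FROM purchases
          GROUP BY customer_id) fp
      ON fp.customer_id = v.customer_id
    GROUP BY v.customer_id;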

One thing to remember is that you’ll never attribute all of your clickstream records to visitors, and even fewer to actual customers.  That’s OK! It’s not a perfect science, but if you do even a decent job you unlock the real power of your clickstream database: to empower your team to go beyond counting visits and basic visit attribution.



Benefits of clickstream beyond traditional web analytics

Ad blockers or not, having a clickstream database provides tremendous value for the data scientists, analysts, and engineers in your organization. Statistical and machine learning models are hungry for data, and what better place to get it than your clickstream database? Once you’ve done the hard work of attributing “clicks” to visitors, you can segment the data in a seemingly infinite number of ways. That’s great for your data science team as well as for more traditional BI analysts, who can now write SQL queries to answer questions like, “How many visitors from Canada came to the site yesterday?” or “Which of my customers are heavy readers of my blog posts?” In addition, you now have the raw materials to build more advanced customer segments, perform path analysis, and more.
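
That Canada question, for instance, reduces to a few lines of SQL once attribution is done. This sketch assumes a country column added during ingestion via a geo-IP lookup, which is not part of the raw logs themselves:

    -- Distinct attributed visitors from Canada with activity yesterday.
    SELECT COUNT(DISTINCT visitor_id)
    FROM clickstream
    WHERE country = 'CA'
      AND event_time::date = CURRENT_DATE - 1;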

Engineers and IT often use raw web logs to monitor issues with the website or to understand how heavy traffic is at certain times of day. Giving them a place where they can query the raw data or predefined aggregate tables in SQL opens up a new world of possibilities.
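
For example, an on-call engineer wondering about load patterns could run something like this against the same hypothetical clickstream table, no log-grepping required:

    -- Request volume by hour over the last day; handy for spotting
    -- traffic spikes or quiet windows.
    SELECT DATE_TRUNC('hour', event_time) AS hour,
           COUNT(*) AS requests
    FROM clickstream
    WHERE event_time >= CURRENT_DATE - 1
    GROUP BY 1
    ORDER BY 1;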

Complementary tools

I know, I just wrote all about the reasons you shouldn’t trust third-party tools like Google Analytics. Well, guess what: they still provide a ton of value! Most likely you’ll still be collecting enough data in those tools to get an idea of trends and an aggregate view of your business. Keep in mind that certain segments of your audience might be more likely to use ad blockers, thus skewing your data, but the value of quick access to the numerous reports and dashboards that come from something like GA is still immense. Just keep the limitations in mind.

Privacy matters

Those of us in Analytics can get caught up in the excitement of building something like a clickstream database, and in the challenge of understanding who’s doing what on our websites. It’s important to remember that there’s a reason for the popularity of ad blocking: it’s not just about blocking ads, but is also driven by concerns about privacy. Tracking customers across domains for the purpose of advertising to them is not what we’re interested in here. We care about OUR customers and how to optimize our product for them.

With that in mind, be thoughtful about what data you store in your clickstream database and what you do with it. We don’t store any personal information about our customers in our clickstream database, not just because it would be creepy to do so but also because we don’t need it. An analyst might want to know how a segment of similar customers uses the site/product, but doesn’t need to know who they are. Of course, because we did a good job of attributing clickstream records to customers who end up signing into the site, we can link key findings from the clickstream database to the limited customer information we store in our data warehouse. Just like anything else in Analytics and Marketing, think about the impact on the humans in that database before acting. Just because it’s legal doesn’t mean it’s within the ethical bounds set by your employer, or, even more importantly, those you’ve set for yourself.
