Thursday, January 6, 2011

Case study: tracking PDF documents with web analytics

The scenario

You send a newsletter to a select group of corporate clients. The email contains a link to a PDF document (call it P1.pdf) hosted on your web server. Within that document there is a link to another document called P2.pdf. This second document contains article summaries pointing to several external pages. Your boss asked you to provide the following information:
  1. How many people opened P1.pdf?
  2. How many people clicked through to P2.pdf?
  3. How many clicked on external links in P2.pdf?
To make things a little bit more complex, you are sitting in marketing and you know the IT release cycle takes some time. You also only have access to an older implementation of Webtrends and would prefer to use Google Analytics.

Solution path

Here we are facing a classic example: it's very common to find older Webtrends implementations, along with a tendency to see Google Analytics as the Holy Grail.

Because it works from server logs, Webtrends Analytics (not the On Demand service) has a built-in report that answers this need. But we can also instrument the email itself. Legitimate requests can become quite complex... so let's try to simplify that.
  • Email tracking:
    The starting point, and the easier part, is to make sure your email is properly tagged as a campaign (let's call it ABC). Email solution providers will tell you how many emails were sent, delivered, and opened, and which links within those emails were clicked. Some solutions will even allow you to tag each email individually so you'll know exactly who opened it and clicked. It won't give a perfectly accurate picture, but it will be precise enough, especially if you have a large subscriber base (which isn't the case in this scenario).
  • Opens or downloads? People or requests?
    In Webtrends you won't know how many people really opened P1.pdf or clicked to get P2.pdf. You will know how many devices requested a download from your server [unless you use the unique tagging technique outlined above or have some other kind of authentication before allowing the document to download]. If the document is passed around internally, you won't know about it. Also, because of network caching and the like, you will not get a perfectly accurate count of the downloads.
  • P2 downloaded from P1 or somewhere else?
    In P1.pdf, if you change the link to P2.pdf to something like P2.pdf?cid=ABC, you will be able to distinguish P2 downloads that came from your campaign from those that came from elsewhere. Again, the picture won't be perfect, but it will certainly be good enough.
  • The world beyond your control
    You are left with external links... Those are within P2.pdf, so page tags won't work. They go to external sites, so you can't just add ?cid=ABC to them... Here we have no choice but to use redirects. The concept involves creating a server-side script that relays the request to its final destination, capturing data at the same time [it's easy to find such scripts for various technologies, but you'll definitely need the help of IT to tweak it]. Instead of linking directly to the external page, you would link to the redirect script on your own server, with the destination passed as a parameter. Again, if you are using Webtrends, you will get a server-log request and it will be easy to create a custom report just for that.
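
To make the last two points more concrete, here is a minimal sketch in Python of the link tagging and the redirect logic. Everything named here is an assumption made for illustration: the /redir endpoint, the url and cid parameter names, and the example hosts are hypothetical, and a real script would live in whatever technology IT supports. The allow-list is there so the redirect cannot be abused to send visitors to arbitrary sites.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical values, for illustration only.
REDIRECT_ENDPOINT = "https://www.example.com/redir"
ALLOWED_HOSTS = {"news.example.org", "blog.example.net"}


def tag_url(url, cid):
    """Return url with a cid campaign parameter added to its query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("cid", cid))
    return urlunsplit(parts._replace(query=urlencode(query)))


def redirect_link(target, cid):
    """Build the link to put in P2.pdf in place of the external URL."""
    return REDIRECT_ENDPOINT + "?" + urlencode({"url": target, "cid": cid})


def resolve_redirect(request_url):
    """Return (destination, cid) for a redirect request, or None if refused.

    The server-side script would log the pair (or simply let the web
    server log the request) and then answer with an HTTP 302 to destination.
    """
    params = dict(parse_qsl(urlsplit(request_url).query))
    target = params.get("url", "")
    dest = urlsplit(target)
    # Only relay to known external sites over http(s).
    if dest.scheme not in ("http", "https") or dest.hostname not in ALLOWED_HOSTS:
        return None
    return target, params.get("cid", "")
```

With these helpers, tag_url("https://www.example.com/P2.pdf", "ABC") gives the link to place in P1.pdf, and redirect_link(...) gives the links to place in P2.pdf. Each click on a redirect link then shows up as a request in your own server logs.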

Additional notes

  • If you absolutely want to use GA, you could use the redirect solution for all links. However... I believe measurement should never get in the way of user experience. Customers first!
  • There are some PDF tracking solutions out there, but I have never personally used any of them, and my brief research turned up nothing very serious or reliable (I'll update this post if someone proves me wrong).

Optimal & realistic solution: making web analytics easy

As we've seen, this could become complex and take some time to implement. We want to make web analytics easy - work on what we can control and provide quick incremental ROI.

I would start by making sure emails are properly tagged. Then, if possible, I would use the unique ID technique to know who opened the files. Beyond that, the cost of acquiring the information might not justify the effort. We should also ask whether additional data would actually change any decisions.
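
For the unique ID technique, one possible approach is sketched below in Python. It is only an illustration under stated assumptions: the rid parameter name, the secret key, and the helper names are all hypothetical, and many email solution providers offer their own per-recipient tagging that would replace this entirely. The idea is to derive a short, non-guessable token per recipient, put it in the link to P1.pdf, and keep a lookup table to map tokens seen in the logs back to recipients.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Hypothetical secret; keep the real one private so tokens can't be forged.
SECRET_KEY = b"replace-with-a-private-key"


def recipient_token(email, campaign):
    """Derive a stable 16-character token for one recipient and campaign."""
    msg = f"{campaign}:{email.strip().lower()}".encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]


def personal_link(base_url, email, campaign):
    """Return the link to the PDF carrying the campaign and recipient IDs."""
    query = urlencode({"cid": campaign, "rid": recipient_token(email, campaign)})
    return f"{base_url}?{query}"


def build_lookup(emails, campaign):
    """Map each token back to its recipient, for reading the logs later."""
    return {recipient_token(e, campaign): e for e in emails}
```

Using a token rather than the email address itself keeps personal data out of URLs and server logs. With a small list of corporate clients, the lookup table stays tiny and easy to maintain by hand.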

What do you think? Do you see other alternatives or solutions?
(thanks to @jfmonfette for this great question!)