Tuesday, March 29, 2011

Web analytics ethic: from theory to practice

A week ago I published three short cases and invited readers to comment on whether each was legal, ethical and compliant with their web analytics vendor's Terms of Service (TOS). Inspired by my own experience, and after much talk about the WAA Code of Ethic, sessions at the recent eMetrics and discussions I had with some vendors, I thought participation would be much higher.

Here’s my point of view and some info from the brave souls who were up for the task! You should really read the previous post before continuing!

Photo: stock.xchng
Disclaimer: I'm neither a lawyer nor an ethics specialist - this information is provided as is... do your homework!

The majority of the 14 respondents were from the US and UK with some participants from Canada and other European countries. Unsurprisingly, most respondents said they were using Google Analytics.

Case #1: matching transaction id against back-end.

                                   Unsure   No     Yes
It is legal                        20%      0%     80%
It is acceptable based on my TOS   27%      20%    53%
It is ethical                      13%      13%    73%

In my opinion, this is perfectly legal – the data was collected with user consent in the context of a commercial relationship. It is also ethical – it is common and accepted to send a “thank you” email, along with the purchase details and some offers. The fact it is sent through traditional snail mail doesn’t matter – or does it? Since the transaction was done online, there is usually an expectation that communications will also be conducted online. As one of the respondents put it, “At the end of the day, 'ethical' depends more on your relationship with your customer than anything else”. As for the TOS, all serious tool vendors specifically prohibit sending Personally Identifiable Information (PII) to their systems.

A transaction id, which is clearly not PII, is typically set by your back-end system and stored in your web analytics service of choice. Since this piece of data comes from your own system and is only used to merge back against it, there is generally no TOS issue – except with the Google Analytics TOS! (emphasis mine)
7. PRIVACY . You will not (and will not allow any third party to) use the Service to track or collect personally identifiable information of Internet users, nor will You (or will You allow any third party to) associate any data gathered from Your website(s) (or such third parties' website(s)) with any personally identifying information from any source as part of Your use (or such third parties' use) of the Service. You will have and abide by an appropriate privacy policy and will comply with all applicable laws relating to the collection of information from visitors to Your websites. You must post a privacy policy and that policy must provide notice of your use of a cookie that collects anonymous traffic data.
Repeat: "You will not associate any data gathered from your website(s) with any personally identifiable information from any source as part of your use of Google Analytics". Essentially, if you use Google Analytics, you should not extract transaction ids to merge them back against your own system. This makes no sense to me, and I know of several organizations that are actually doing it – probably without realizing they are breaking the GA TOS. Let’s hope this will be revised.

Case #2: matching product id (SKU) against back-end

                                   Unsure   No     Yes
It is legal                        7%       0%     93%
It is acceptable based on my TOS   27%      0%     73%
It is ethical                      7%       0%     93%

Legal, ethical and no TOS issue. The key element here is that no PII is involved. From a business standpoint, what’s interesting is the ability to use behavioural data to correlate with sales in order to build a predictive model where we “know” which online behaviours are early indicators of upcoming sales and therefore, adjust inventories accordingly.
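To make the SKU merge concrete, here is a minimal sketch of joining web analytics data to a back-end inventory table. The field names (`sku`, `source`, `product_views`, `units_sold`, `in_stock`) and all values are made up for illustration; a real export would come from your vendor's API and your own database. Note that nothing in either data set is PII.

```python
# Hypothetical rows exported from a web analytics tool: SKU-level data only, no PII.
web_rows = [
    {"sku": "A100", "source": "email", "product_views": 120, "units_sold": 6},
    {"sku": "A100", "source": "search", "product_views": 300, "units_sold": 9},
    {"sku": "B200", "source": "search", "product_views": 80, "units_sold": 1},
]

# Hypothetical back-end inventory table, keyed by SKU.
inventory = {"A100": {"in_stock": 40}, "B200": {"in_stock": 25}}


def merge_by_sku(web_rows, inventory):
    """Aggregate the web data per SKU, then attach the inventory level."""
    merged = {}
    for row in web_rows:
        rec = merged.setdefault(row["sku"], {"product_views": 0, "units_sold": 0})
        rec["product_views"] += row["product_views"]
        rec["units_sold"] += row["units_sold"]
    for sku, rec in merged.items():
        # None when the SKU is unknown to the back-end
        rec["in_stock"] = inventory.get(sku, {}).get("in_stock")
    return merged


merged = merge_by_sku(web_rows, inventory)
for sku, rec in sorted(merged.items()):
    print(sku, rec)
```

The merged records are the kind of input a predictive model would start from: behavioural signals (views) next to sales and stock for each SKU.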

Case #3: key created from (potential) PII without user consent

                                   Unsure   No     Yes
It is legal                        33%      20%    47%
It is acceptable based on my TOS   53%      40%    7%
It is ethical                      33%      33%    33%

If I got it right, in the US a last name alone, a 5-digit zip code or the last digits of a phone number are not considered PII.

However, in California, OPPA specifies that what is typically non-PII becomes PII when combined with other data (such as having gender associated with a specific person). In Canada, the PIPEDA law stipulates that data must be collected with user consent and used only for the purpose it was collected for. In Europe, and especially Germany, a last name is PII (so are IP addresses and a whole bunch of other things!).

Is it ethical? In this specific case, the data is stored even if the transaction isn’t fully completed. Therefore, this practice is against the 3rd WAA Code of Ethic guideline: User Control. It is also against PIPEDA in Canada.

What about the TOS? In general, this wouldn’t be an issue and it doesn’t really matter if this string is further encoded to obfuscate it. However, Google Analytics TOS still doesn’t allow us to use this key to merge with any other data that could contain PII.
In airports, the stand by list typically shows first three letters of last name and first letter of first name

My take

While there are passionate arguments on "free vs paid" in the #measure tweet universe, I was sincerely disappointed that a topic like ethics and legality didn’t get much response. Is it because of a lack of interest? Fear of being wrong?

Either way, it makes me wonder if web analysts happily embrace the WAA Code of Ethic because it feels good and it's a worthy cause... or are just full of it! I guess what’s most important for now isn’t to know all there is to know about ethics, legislation and TOS, but to take action when inappropriate situations are uncovered.

I don't pretend to know more than anyone else, in fact, I'm willing to be wrong! If you have comments or additional useful references, I would love to hear from you!

Monday, March 21, 2011

Web analytics ethic trivia

While I'm still working on the 3rd post in my series on "the math behind web analytics", I thought we could play a little game related to the WAA Code of Ethic.

Read the three cases below and, for each one, consider whether this is something you might already be doing or would feel ok doing... or not. Then you'll be invited to vote and comment (anonymously).

  1. For an ecommerce site using your web analytics vendor of choice, the transaction ids, along with traffic source data (referrer, campaign, search keywords, etc.) and micro-conversions info (which other business valuable tasks were completed), are extracted using the API. The transaction ids are then looked up against your sales database in order to do further segmentation and build a customer list (name, address, purchase details, demographics, etc.) that will be used to send a "thank you" snail mail, along with a 50% discount on a future purchase. Clearly, there is a customer-vendor relationship in place and the information for the purchase was collected with user consent.

    Is this legal? Is this allowed by your vendor TOS? Is this ethical?

  2. Still for the same ecommerce website, you extract the product SKUs, along with item quantity sold and the same source data and micro-conversions info. The data is merged against the back-end inventory database using the SKU and predictive models are developed to know which stock levels are optimal for each SKU.

    Is this legal? Is this allowed by your vendor TOS? Is this ethical?

  3. A financial institution typically has several types of requests: credit card, mortgage and other financing inquiries, retirement simulator, insurance quotes, etc. Completed requests are stored in back-end systems for processing - those transactions are frequent targets of fraudulent behavior or are abandoned along the way. One way to generate a unique key is this: first 3 letters of last name + 1st letter of first name + 4 digits zip code (or last 3 characters of postal code) + last 4 digits of phone number. The result, for me, would be HAMS4C02637.

    Is this legal? Is this allowed by your vendor TOS? Is this ethical?

    Update: to make it clearer, this type of key could be used for lookups against other systems - and could be encrypted using MD5 to make it more obscure - but it is still built from input data even if the transaction isn't fully completed.
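To make case 3 concrete, here is a minimal sketch of how such a key could be built and then hashed. The function name `build_key` and the input values are hypothetical, chosen only so the result matches the sample key above. The sketch uses the postal code variant of the rule. Note that MD5 is a one-way hash rather than encryption, so "obscured" is the most that can be said of the result.

```python
import hashlib


def build_key(last_name, first_name, postal_code, phone):
    """First 3 letters of last name + 1st letter of first name
    + last 3 characters of postal code + last 4 digits of phone number."""
    postal = "".join(ch for ch in postal_code if ch.isalnum())
    digits = "".join(ch for ch in phone if ch.isdigit())
    return (last_name[:3] + first_name[:1] + postal[-3:] + digits[-4:]).upper()


# Hypothetical inputs, not a real person.
key = build_key("Hamilton", "Sam", "G1A 4C0", "555-010-2637")

# Hash the key with MD5 to make it more obscure, as described in the update.
hashed = hashlib.md5(key.encode("utf-8")).hexdigest()

print(key)
print(hashed)
```

Because the same inputs always give the same key and the same hash, the hashed value still works as a lookup key against other systems, which is exactly why hashing alone does not settle the consent question.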
Let's see what you think - I'll share my point of view a bit later (along with pointers to reference material). I certainly don't pretend to be a lawyer or an ethics professional, I'm just an analyst with some experience. Those three cases are inspired by real situations.

The question is... what would you do?

If you are up for it, head over here to vote and comment.

Monday, March 14, 2011

A major web analytics agency is born: Cardinal Path

Today at the eMetrics Marketing Optimization Summit, my good friends Alex Langshur, President and Founder of PublicInsite, John Hossack, President and CEO of VKI Studios, David Booth and Corey Koberg of WebShare announced they are joining forces to become one of the most significant players in the digital analytics world.

With offices in Ottawa, Vancouver, Boston, San Diego, Burlington, Phoenix, Mountain View and Chicago, Cardinal Path unites an impressive team of thought leaders across a range of disciplines, authors, speakers and top business consultants. Justin Cutroni of WebShare, a well known figure in our field comes to mind, but there are also Brian, Michael, Kent, Ken and Scott - about 35 of them!

I've known Alex & John for a long time and I have the utmost respect for both of them. Their professionalism and the expertise they have built through their respective agencies are outstanding. Their success stems from hard work, obviously, but I've found in both of them that undefinable human touch, sense of ethics and respect. Interestingly, as an independent consultant, I was lucky to have both of them be what I called "angel advisors" - sounding boards for all my crazy ideas, but also sharing and collaborating on anything "analytics" related.

While there have been many mergers on the vendor side, this is the first big one on the services side and the strengths of the various partners are very complementary:
  • private, public, non-profit, education sectors; with several flagship clients like NBC, Harvard University, Library of Congress, Electronic Arts, Virgin, etc.
  • deep ecommerce expertise with leading brands, and boatloads of knowledge for non-commerce, lead-gen and brand based sites;
  • "build for success" approach - the new firm has site design/capability that enables them to architect the success elements and ensure full end to end visibility.

I want to be among the first to wish them success in what will undoubtedly be a great adventure and bright future!

Thursday, March 10, 2011

The math behind web analytics: mean, trend, min-max, standard deviation

In the second installment of this series, we will leverage Excel to take over where Google Analytics left us.
  1. The math behind web analytics: the basics

Basic charting in Excel

The very first thing to do is to show the data as a simple line graph. For this post, I simply used visits to my blog in January of 2011. After some minor visual adjustments we end up with something like this:
Figure a: simple Excel charting
Figure b: time series (visits)
There are already some striking things: peaks & valleys corresponding to weekdays and weekends, and a week apparently performing better than others. Now we can easily apply some basic statistics on our time series.


The mean [wikipedia] is often referred to as the "average", which, in reality, is the "arithmetic mean". This is very simple math: add all the numbers and divide by the number of data points.

Look at Figure c - what can you tell about the red line crossing the whole graph? In a time series like daily visits for a month, honestly... we can't tell much! Yet, only averages are reported by most web analytics tools - so please, don't even bother saying "the average number of visits this month was X"!
Figure c: showing mean, trend, min & max and control limits.
Learning point: The average is rarely a good indicator in a time series such as those found in web analytics because it is influenced by extreme values (known as outliers [wikipedia]). At best, in the case above, one might want to calculate the mean for weekdays and the mean for weekends. As a rule of thumb, if you have fewer than 30 data points, use the median.
Figure d: descriptive statistics

Median and mode

The median [wikipedia] is the middle value. The mode [wikipedia], on the other hand, is the value appearing the most frequently. Again, in a time series, where the spread of values (the standard deviation explained below) is large, those descriptive statistics [wikipedia] (Figure d) are usually of little interest.
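For readers who want to reproduce these descriptive statistics outside Excel, here is a minimal sketch using Python's standard `statistics` module. The daily visit counts are made up for illustration (they are not the data behind the figures), and the series is assumed to start on Saturday, January 1st, 2011.

```python
import statistics
from datetime import date, timedelta

# Made-up daily visits for two weeks starting Saturday, January 1st, 2011.
start = date(2011, 1, 1)
visits = [93, 110, 240, 310, 305, 290, 260, 120, 115, 250, 310, 620, 300, 270]
days = [start + timedelta(days=i) for i in range(len(visits))]

# weekday() returns 0 for Monday through 6 for Sunday.
weekday_visits = [v for d, v in zip(days, visits) if d.weekday() < 5]
weekend_visits = [v for d, v in zip(days, visits) if d.weekday() >= 5]

overall_mean = statistics.mean(visits)
weekday_mean = statistics.mean(weekday_visits)
weekend_mean = statistics.mean(weekend_visits)
median = statistics.median(visits)
mode = statistics.mode(visits)

print("mean (all days):", round(overall_mean, 1))
print("mean (weekdays):", weekday_mean)
print("mean (weekends):", weekend_mean)
print("median:", median)
print("mode:", mode)
```

In this made-up series a single spike day pulls the overall mean up, while the separate weekday and weekend means describe two quite different regimes that the single average hides.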

Min & max

The min and max values are... well... the minimum and maximum values in a time series. Those could be qualified as "anecdotes" - we could be thrilled we got so much traffic on a single day, or disappointed by a poorly performing day, but knowing that has absolutely no value if we can't explain why.

In the time series used in this example, the min value is 93 visits on Saturday, January 1st. What can we tell about that? Obviously, people were busy doing something other than visiting my blog. What happened during the 4th week, around January 25? I shared my views about our little web analytics community and recounted my contributions. In both cases, we have very plausible explanations and the min & max values were useful only because they made us ask "why?".


To me, the linear trend [wikipedia] (shown as a dotted line in Figure b) is one of the interesting modeling stats because it marks the beginning of our regression analysis [wikipedia] capabilities - our ability to explain the why's and "this, therefore that". Basically, it can help us do some predictive analytics (albeit very simple). Remember y = mx + b? That is, the position of a point on the y axis (the visits) depends on a factor of x (the day) plus a starting baseline. I can tell, based on historical data, that I should get approximately 350 visits next Tuesday.
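As a rough illustration of y = mx + b, here is a minimal least-squares sketch in plain Python. The function name `linear_trend` and the visit counts are made up for illustration; in Excel the SLOPE and INTERCEPT functions give the same two coefficients.

```python
def linear_trend(ys):
    """Ordinary least-squares fit of y = m*x + b, with x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope: average change in visits per day
    b = mean_y - m * mean_x  # intercept: the starting baseline
    return m, b


# Made-up daily visits, one value per day.
visits = [210, 195, 230, 225, 250, 240, 265]
m, b = linear_trend(visits)

# Naive forecast for the next day (x = 7).
forecast = m * len(visits) + b
print(f"slope={m:.2f} visits/day, baseline={b:.2f}, next-day forecast={forecast:.0f}")
```

A straight line ignores the weekday/weekend pattern, so a forecast like this is only a first approximation.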

Standard deviation

If we do max - min we get the range [wikipedia], another descriptive statistic. Interesting at best. What's much more interesting is the standard deviation [wikipedia] - the variability of the data. As we've seen, the average isn't of much use because it is largely influenced by outliers. Standard deviation gives an appreciation of the spread of values around the mean, or if you prefer, the variation in a distribution of values.

Why is this important?

Figure e: control limits at +/- 1.5 sigma
First because standard deviation will be used to set control limits [wikipedia] (Figure e) - which in turn will be useful to define our tolerance and targets (covered in a later post). While control limits are typically set to +/- 3 times the standard deviation from the mean - I have found +/- 1.5 times (for a total of 3) to provide a better and easier indicator of values going below or above our historical track record (shown as the grayed area in Figure b). Basically, it gives us an easy way to set alerts when our metric might be going out of whack!
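The control limits described above can be sketched as follows. This is a minimal example on made-up visit counts. It assumes the sample standard deviation (`statistics.stdev`), since the post does not say which variant was used, and the helper name `control_limits` is hypothetical.

```python
import statistics


def control_limits(values, k=1.5):
    """Return (lower, upper) limits at mean -/+ k standard deviations."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)  # assumption: sample standard deviation
    return mean - k * sigma, mean + k * sigma


# Made-up daily visits, not the data behind the figures.
visits = [93, 110, 240, 310, 305, 290, 260, 120, 115, 250, 310, 620, 300, 270]

lower, upper = control_limits(visits)

# Flag every day whose value falls outside the limits.
alerts = [(day, v) for day, v in enumerate(visits) if v < lower or v > upper]

print(f"limits: {lower:.1f} to {upper:.1f}")
print("out of bounds:", alerts)
```

Raising `k` to 3 gives the conventional limits, which flag fewer days. With weekday and weekend traffic this different, computing separate limits for each group may be worth trying.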

Secondly, a large variation is an indication of an unstable process (think conversion rate), or low reproducibility (anecdotal campaign success), or if you prefer, a larger standard deviation reduces our ability to predict the value of Y given a certain X. Basically, as analysts, we want to explain the past, but we also want to provide insight on how to fix issues and seize opportunities - we eventually want to be able to predict outcomes of our recommendations.

Coming up: normal distribution, histogram and box-plots

In the next installment we'll look at what a histogram is, as well as the normal distribution and their impact on our analysis. Also, although nifty spinning 3D-shadowed-shiny-Flash graphs are impressive... we'll look at the box plot, an elegantly simple yet powerful and under-used visualization tool.

What do you think of this series so far? Are there topics or examples you would like to see discussed?

Monday, March 7, 2011

The math behind web analytics: the basics


I have tutored about 700 students enrolled in web analytics and business analysis classes at UBC and nearly a hundred in the new graduate-level class in online analytics I'm teaching at Laval University. Most students at ULaval are enrolled in an MBA specializing in ebusiness or marketing - one day, they will manage organizations and leverage analytics to make better business decisions. In the meantime, questions and assignments are an endless source of inspiration and challenges to solve.

This is the first post in a series entitled "the math behind web analytics". The idea stems from a question posted by a UBC student to the Yahoo! Web Analytics forum: "what mathematics does a web analyst need to know?" I was somewhat baffled by the replies: "plus, minus, min, max, average... not much practical use of it (mathematic/statistics) within web analytics" or "simple counts or averages" and of course, "percentage... because that's what you'll use most often, e.g. with KPIs, conversion rates, etc".


Assignment: basic analysis

One of the first assignments in the ULaval class is simply stated as "analyze the visits to website XYZ" and the students are provided a data set.

Learning point: When referring to a metric broken down by a time-based dimension, we refer to a "time series": a sequence of data points measured at uniform time intervals.

In this first post we address what appears to be easy and obvious: graphing the data.

All web analytics tools provide basic visualization functions, as in the example shown above from Google Analytics. This graph shows visits by month.

Learning point: Notice how I used thirteen months of data instead of twelve. This is especially important to be able to compare year-over-year and more easily spot seasonality. Basically, we should always include at least one additional period in our analysis. Here, we clearly see an upward trend and certainly some year-to-year progress.

However, a monthly breakdown hides some interesting elements. When the same data is shown by day, we can see something slightly different:

The trap

This is the extent of visualization you'll get in most tools. And most would-be analysts will report something like "there was X number of visits, and the average was Y visits/day" simply because this is what the tool says. Some will mention an upward trend but won't be able to quantify it. At best, a few will switch from the monthly to the daily view and mention the very common weekday/weekend pattern.

More importantly, I rarely see an explanation for what happened where we see spikes of traffic - which, in this case, are explained by marketing and external, business-related events.

Coming up: Excel to the rescue

In the next installment of "the math behind web analytics" we'll use Excel to do some basic analysis.