Thursday, 10 March, 2011

The math behind web analytics: mean, trend, min-max, standard deviation

In the second installment of this series, we will leverage Excel to take over where Google Analytics left us.

  1. The math behind web analytics: the basics

Basic charting in Excel

The very first thing to do is to show the data as a simple line graph. For this post, I simply used visits to my blog in January of 2011. After some minor visual adjustments we end up with something like this:
Figure a: simple Excel charting
Figure b: time series (visits)
There are already some striking things: peaks & valleys corresponding to weekdays and weekends, and a week apparently performing better than others. Now we can easily apply some basic statistics on our time series.

Mean

The mean [wikipedia] is often referred to as the "average", which, in reality, is the "arithmetic mean". This is very simple math: add all the numbers and divide by the number of data points.

Look at Figure c - what can you tell about the red line crossing the whole graph? In a time series like daily visits for a month, honestly... we can't tell much! Yet, only averages are reported by most web analytics tools - so please, don't even bother saying "the average number of visits this month was X"!
Figure c: showing mean, trend, min & max and control limits.
Learning point: The average is rarely a good indicator in a time series such as those found in web analytics because it is influenced by extreme values (known as outliers [wikipedia]). At best, in the case above, one might want to calculate the mean for weekdays and the mean for weekends. As a rule of thumb, if you have less than 30 data points, use the median.
Figure d: descriptive statistics

Median and mode

The median [wikipedia] is the middle value. The mode [wikipedia], on the other end, is the value appearing the most frequently. Again, in a time series, where the spread of values (the standard deviation explained below) is large, those descriptive statistics [wikipedia] (Figure d) are usually of little interest.

Min & max

The min and max values are... well.. the maximum and minimum values in a time series. Those could be qualified as "anecdotes" - we could be thrilled we've got so much traffic on a single day, or deceived by a poorly performing day, but knowing that has absolutely no value if we can't explain why.

In the time series used in this example, the min value is 93 visits on Saturday, January 1st. What can we tell about that? Obviously, people were busy doing something else than visiting my blog. What happened during the 4th week, around January 25? I shared my views about our little web analytics community and recounted my contributions. In both cases, we have very plausible explanations and the min & max values were useful only because they made us ask "why?".

Trend

To me, the linear trend [wikipedia] (shown as a dotted line in Figure b) is one of the interesting modeling stats because it marks the begining of our regression analysis [wikipedia] capabilities - our ability to explain the why's and "this, therefore that". Basically, it can help us do some predictive analytics (albeit very simple). Remember y = mx + b? That is, the position of a point on the y axis (the visits) depends on a factor of x (the day) plus a starting baseline. I can tell, based on historical data, that I should get approximately 350 visits next Tuesday.

Standard deviation

If we do max - min we get the range [wikipedia], another descriptive statistic. Interesting at best. What's much more interesting is the standard deviation [wikipedia] - the variability of the data. As we've seen, the average isn't of much use because it is largely influenced by outliers. Standard deviation gives an appreciation of the spread of values around the mean, or if you prefer, the variation in a distribution of values.

Why is this important?

Figure e: control limits at +/- 1.5 sigma
First because standard deviation will be used to set control limits [wikipedia] (Figure e) - which in turn will be useful to define our tolerance and targets (covered in a later post). While control limits are typically set to +/- 3 times the standard deviation from the mean - I have found +/- 1.5 times (for a total of 3) to provide a better and easier indicator of values going below or above our historical track record (shown as the grayed area in Figure b). Basically, it gives us an easy way to set alerts when our metric might be going out of whack!

Secondly, a large variation is an indication of an unstable process (think conversion rate), or low reproductibitiliy (anecdotal campaign success), or if you prefer, a larger standard deviation reduces our ability to predict the value of Y given a certain X. Basically, as analysts, we want to explain the past, but we also want to provide insight on how to fix issues and seize opportunities - we eventually want to be able to predict outcomes of our recommendations.

Coming up: normal distribution, histogram and box-plots

In the next installment we'll look at what is an histogram as well as normal distribution and their impact on our analysis. Also, although nifty spinning 3D-shadowed-shiny-Flash graphs are impressive... we'll look at box plots elegant simplicity yet powerful and under-used visualization tool.

What do you think of this series so far? What would you like to see discussed or any examples you would like to see?

8 comments:

I'm loving this series, Stephane! I've got a question about your standard deviation, though, and the hint that you're going to use it for HCL and LCL. Does that really work when you've got a general trend that is heading upward? I'm not, not, NOT a stats purist, and I've actually deployed reports where I used the standard deviation (1 or 1.5) to add offsets to a linear trend line. My argument was that, whether it was "correct" or not, it let me laser in on the points that were more outlier-like. In other words, if you applied an HCL and LCL line to the mean...and, overall, you were trending upward, the early data values and the later data values would be most likely to fall outside of those bounds.

Or...am I jumping ahead to a future post? :-)

Hi Stephane, I agree with Tim, also loving this post series as well. Have had numerous discussions with others (and with myself) over the value of statistics in web analytics and currently not in favour of it - although thinking there must be some use.

I really don't think standard deviations are relevant to most web analytics as metrics don't follow a normal distribution (guessing this is in the next post). I think there is likely value in medium and percentiles but haven't managed to apply yet.

One trick I developed long ago was to hack a daily seasonality pattern and use this to remove this daily seasonality. Then you can do a like for like comparison across different days of the week to judge popularity.

I also believe (although never really used in anger) that rather than using absolute numbers and applying statistics to understand if good/bad performance, to compare performance against a forecast and then apply statistics to the % difference. Assuming a good forecast, this % difference should follow a normal distribution.

Also currently writing a "How to Analyse Web Analytics data" on my blog at www.l3analytics.com/blog - need to complete with Part 3 still.

Looking forward to reading/learning more.

Peter

Tim: HCL and LCL vs trends and non-normal data distribution. Indeed, we have to be careful, but I will address that in a later post :)

Peter: I think you would agree that a basic understanding of descriptive statistics, as simple as min/max/avg/trend and stddev are starting point for analysis - but maybe more importantly, knowing the caveats of basic stats and when NOT to use them might be as essential as knowing WHEN to use them :)

Back to HCL and LCL, stddev & trends, maybe visits are not the perfect data set for this example. Maybe a process such as shopping cart to purchase would be a better example - and this would lead us closer to Lean SixSigma concepts (also to be addressed in a later post).

The other point I will bring is how to set targets. We insist on being data-driven but too often I see organizations setting goals to increase revenues/traffic/conversions/etc. by X% out of gut feel and stretched hypothesis. What if, instead, we could leverage our understanding of web analytics and stats to help set SMARTer objectives? :)

These comments are raising even more vaulable points, thank you.

I agree with Peter that taking daily 'seasonality' into account is very important. I see this a lot with things like checkout funnels where abandon rates in checkout funnels increase at the end of the week when people realise that they will not get something until after the weekend. If only things like GA made this easier to understand, rather than requiring us to re-work the data in Excel...

Tim

Hi Stephane, Thanks for writing this series of posts on the mathematics behind web analytics, it will be of great help to me and many more students that are just starting to learn about web analytics. I have been trying to find more detailed information on the subject since posting my question on the Yahoo Web Analytics forum, now you are writing not just one blog post but an entire series! I am enjoying the remainder of the UBC Achievement in Web Analytics modules and I look forward to the rest of this series of posts. Many Thanks,
Dan

Stephane,

I'm really enjoying this series, it's a definite help to getting some of the less... mathy... people in the office to understand what is going on with the analytics data. Please keep it up.

(I'm really looking forward to the multiple regression post! It's definitely one of my favourite tools...)

Cheers,

Peter - the sampling distribution converges to normal though - isn't this what Stephane is referring to? So the sampling error will tend to be normally distributed. I could see that in a time-series you could have Heteroskedastic errors (see ARCH/GARCH models) - so the error structure is not constant over the series.

Very nice series. I'm sure it can be done with Excel tools for GA, but using ReportBuilder and the dayPart plugin for SiteCatalyst you can build control charts for half-hour increments by day of week. Breaking it down by channel gives you a highly useful tool for your marketing managers. The limits are built off of a rolling set of metrics (visits, orders, conversion, etc.) for the same channel, same half-hour, same day of week. Use standard deviations if the data are normally distributed, use Tukey's Rule otherwise. Build a candlestick graph with each half-hour on the X axis, and then chart the actuals over that series. Schedule it being emailed to managers periodically. It's very easy to see intraday movement against norms and whether or not there's reason for a fire drill or celebration.