Prof. Gauvin from Laval University contacted me to share an interesting challenge for a new study he will be conducting. I'm sharing his inquiry because I think there could be some application in the web analytics field. It also relates, in a way, to my previous post about box and whisker plots, especially when I talked about outliers.
Detecting outliers
Let's say the chart above represents traffic to a site. Something is obviously wrong between time 24 and 28. If we were to do a box plot, the values over 700 would show up as outliers. As an analyst looking at the data, and depending on the context, we might, or might not, want to exclude those extreme data points from our analysis.
In this example, is the sudden spike a case of server misconfiguration or a software bug? An attack or a bot repeatedly hitting our web site, or maybe the effect of being Digg'ed? Again, the data needs to be put in context in order to provide a good story.
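As a rough illustration of the box plot rule mentioned above, here is a minimal sketch in Python. The traffic numbers are made up for the example (they are not the data behind the chart), and `boxplot_outliers` is a hypothetical helper name. It flags any value beyond 1.5 times the interquartile range from the quartiles, which is the usual whisker convention.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the box plot fences."""
    # Quartiles of the data; "inclusive" treats the values as the full population.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical traffic: steady between 300 and 340, with a spike at times 24 to 28.
traffic = [300 + (i % 5) * 10 for i in range(40)]
for i in range(24, 29):
    traffic[i] = 750 + 10 * (i - 24)

for index, value in boxplot_outliers(traffic):
    print(index, value)
```

The rule only tells us which points are statistically unusual; whether to exclude them is still the analyst's call.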
A more complex situation
The other example might be more realistic, but is also much more complex.
Now we have a situation where a trend is suddenly disrupted by something, making the whole base level shift from its regular "State A" to a new level at "State B". We also see some impulses affecting the trend.
The law of relativity
Now imagine we were to do daily analysis as data becomes available. When we reach the 16th data point, it will show as an outlier. But as we progress toward the 20th data point, those same values might become valid data. So depending on the data range we use, which points count as outliers might change dramatically, even when the method is statistically valid. And we don't know what to expect after the 71st data point and beyond... any guess?
The challenge
We're wondering if there is a mathematical way to detect state changes and impulses in a data set. And to make it even more complex, how could we use predictive analytics as we move along the data set? Any help would be appreciated.
P.S. I'm still negotiating with my boss to be able to register for the course on predictive analytics at the eMetrics Summit...
Named one of the most influential industry contributors by the Digital Analytics Association. With over twenty years’ experience empowering organizations to analyze and optimize their online channels, Stéphane has cemented his position as a leading voice for online analytics and optimization.


2 comments:
Interesting post.
I don't know that I would call the requirements "data cleansing" (at least not the way I use the term .. see my blog at
horizontal data dancing - it seems to me that since the "immediate goal is to isolate trend data for further analysis" it is more of a feature extraction problem.
There are many ways of going about this. For the artificial data of Example 1, and if we take a post hoc view (that is we have seen all the data before we want to extract the trend information), then a regression with dummy variables to isolate the shock effect should suit the situation.
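To make the dummy variable idea concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the series is synthetic (a linear trend plus a constant shock at times 24 to 28), and `solve` and `ols` are hypothetical helper names. The model is y = intercept + slope * t + shock_effect * D, where D is 1 during the shock and 0 otherwise.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic series (an assumption): trend of 2 per period, shock of 450 at times 24 to 28.
shock = range(24, 29)
y = [300 + 2 * t + (450 if t in shock else 0) for t in range(40)]

# Design matrix columns: constant, time, shock dummy.
rows = [[1.0, float(t), 1.0 if t in shock else 0.0] for t in range(40)]
intercept, slope, shock_effect = ols(rows, y)
print(intercept, slope, shock_effect)
```

On this noise-free series the fit recovers approximately 300, 2 and 450: the dummy absorbs the shock, so the intercept and slope describe the underlying trend alone. With real data you would need to decide which periods the dummy covers, which is exactly the post hoc part.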
Now, if you want an estimate of trend at every time period and you take the "sliding" viewpoint (that is, at any instant you know nothing about the future, and the shock comes as a complete shock to you) then a Bayesian approach might suit. Essentially you start with a prior for the mean and variance of the level and trend and update that as each new data point comes along. When an 'outlier' comes along, this increases (massively in this case) the instantaneous estimate of the variance of the level and the trend; at that point you become very uncertain about future forecasts.
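A very reduced sketch of that sliding update, in Python, might look like the following. It is not the BATS implementation: it tracks a level only (no trend term), assumes the observation variance is known, and handles a surprising point by multiplying the prior variance by an arbitrary factor. The function name, the threshold of 3 standard deviations and the inflation factor of 100 are all assumptions chosen for illustration.

```python
def local_level_filter(ys, m0, c0, obs_var, drift_var, threshold=3.0, inflate=100.0):
    """Sequentially update the mean m and variance c of the level.

    Returns one (mean, variance, surprised) tuple per observation.
    """
    m, c = m0, c0
    results = []
    for y in ys:
        r = c + drift_var        # prior variance of the level before seeing y
        q = r + obs_var          # variance of the one-step-ahead forecast
        error = y - m            # forecast error
        surprised = abs(error) > threshold * q ** 0.5
        if surprised:
            # Assumed intervention: admit we know much less about the level.
            r *= inflate
            q = r + obs_var
        gain = r / q             # weight given to the new observation
        m = m + gain * error     # posterior mean of the level
        c = r * obs_var / q      # posterior variance of the level
        results.append((m, c, surprised))
    return results

# Hypothetical series: level 100 for ten periods, then a jump to 200.
series = [100.0] * 10 + [200.0] * 10
states = local_level_filter(series, m0=100.0, c0=25.0, obs_var=25.0, drift_var=1.0)
for t, (mean, var, surprised) in enumerate(states):
    print(t, round(mean, 1), round(var, 1), surprised)
```

At the jump the variance of the level rises sharply, which is the "very uncertain about future forecasts" moment, and the estimate then settles toward the new level as more points arrive.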
You can find this sort of logic expounded in works by Harrison and West - google for BATS Bayesian Analysis of Time Series
For the second example, you might want to treat it the same way (Bayesian), or you might want (post hoc) to divide the series into epochs: pre and post period 16 are arguably different in kind, but post 16 looks like it could be one series with possibly increasing variance and some persistence of shocks, which could possibly be modelled by an AR (autoregressive) approach. There is some commercial software called AutoBox that will do some of this sort of work automagically for you.
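One simple way to pick the boundary between epochs post hoc is sketched below in Python. The choice of method is an assumption, not something prescribed above: it tries every possible split point and keeps the one that minimises the squared deviation of each segment from its own mean. The series is synthetic and `best_split` is a hypothetical helper name.

```python
def best_split(ys, min_len=3):
    """Return the index k such that ys[:k] and ys[k:] are best described by two separate means."""
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    candidates = range(min_len, len(ys) - min_len + 1)
    return min(candidates, key=lambda k: sse(ys[:k]) + sse(ys[k:]))

# Synthetic series: "State A" near 100 for the first 16 points, "State B" near 160 after.
series = [100 + (i % 3) for i in range(16)] + [160 + (i % 4) for i in range(16, 40)]
k = best_split(series)
print("epoch boundary at index", k)
print("State A mean:", sum(series[:k]) / k)
print("State B mean:", sum(series[k:]) / (len(series) - k))
```

This handles a single level shift only. Impulses, several shifts, or the increasing variance mentioned above would need something more elaborate, and each epoch could then be modelled on its own, for instance with an AR model.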
Which approach you want to take depends on your objectives. Do you want a single number ("trend") for each time series? Or do you want to extract another time series called "forward estimate of underlying trend"? Or yet another possibility, a time series called "smoothed estimate of trend" (which involves looking forwards and backwards).
Personally, each time I have come across such spikes in a web site analysis, I could explain the reason after cross-referencing it with other data. From experience, and of course this is a purely empirical statement, those were "events" that could not be the object of predictive modeling, since they occurred in a chaotic world (i.e. the world of business, where unpredictable things happen).