Detecting outliers

Let's say the chart above represents traffic to a site. Something is obviously wrong between time 24 and 28. If we drew a box plot, the values over 700 would be flagged as outliers. As analysts looking at the data, and depending on the context, we might or might not want to exclude those extreme data points from our analysis.
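To make the box-plot rule concrete, here is a minimal sketch of the usual 1.5 × IQR fence. The traffic numbers below are invented for illustration, not the actual chart data, and the `iqr_outliers` helper is hypothetical:

```python
# A sketch of the box-plot (1.5 * IQR) rule for flagging outliers.
# The traffic values below are made up for illustration.
def iqr_outliers(values, k=1.5):
    """Return values falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # simple linear-interpolation quantile
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

traffic = [310, 295, 320, 305, 315, 300, 290, 310, 850, 900, 780, 305, 298]
print(iqr_outliers(traffic))  # → [850, 900, 780]
```

Whether to actually exclude the flagged points is still a judgment call, not something the formula decides for us.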
In this example, is the sudden spike a case of server misconfiguration or a software bug? An attack, a bot repeatedly hitting our web site, or maybe the effect of being Digg'ed? Again, the data needs to be put in context in order to tell a good story.
A more complex situation

The other example might be more realistic, but it is also much more complex.
Now we have a situation where a trend is suddenly disrupted by something, shifting the whole base level from its regular "State A" to a new level at "State B". We also see some impulses affecting the trend.
The law of relativity

Now imagine we were doing daily analysis as the data became available. When we reach the 16th data point, it shows up as an outlier. But as we progress toward the 20th data point, those values might turn out to be valid data. So depending on the data range we use, even if it is statistically valid, the set of outliers might change dramatically. And we don't know what to expect at the 71st data point and beyond... any guess?
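Here is a toy sketch of that relativity effect, using invented values and a simple z-score rule (the `is_outlier` helper and the threshold are assumptions, not a standard method). The same data point flips from outlier to normal as the window grows to include the new state:

```python
# Illustrates how outlier status depends on the data range used.
# Values and the z-score threshold are invented for illustration.
def is_outlier(window, x, z=3.0):
    """Flag x if it lies more than z standard deviations from the window mean."""
    n = len(window)
    mean = sum(window) / n
    var = sum((v - mean) ** 2 for v in window) / n
    std = var ** 0.5
    return std > 0 and abs(x - mean) > z * std

state_a = [99, 101] * 7 + [100]         # 15 points hovering around 100
# Judged against "State A" alone, the 16th point (300) looks extreme:
print(is_outlier(state_a, 300))          # → True
# Once points 16-20 (the new level) are in the window, it no longer does:
print(is_outlier(state_a + [300] * 5, 300))  # → False
```

The verdict changed even though the point itself did not; only the range of data we judged it against did.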
The challenge

We're wondering if there is a mathematical way to detect state changes and impulses in a data set. And to make it even more complex, how could we use predictive analytics as we move along the data set? Any help would be appreciated.
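One candidate answer, offered as a sketch rather than a recommendation: a CUSUM-style detector, which accumulates deviations from a baseline mean and flags the point where the cumulative sum crosses a threshold. The function name, baseline length, drift, threshold, and test series below are all invented for illustration:

```python
# A sketch of a CUSUM-style level-shift detector. Baseline length,
# drift, threshold, and the sample series are invented for illustration.
def first_shift(series, baseline_n=10, threshold=5.0, drift=0.5):
    """Return the index of the first point where the cumulative deviation
    from the baseline mean crosses the threshold, or None if none does."""
    base = series[:baseline_n]
    mean = sum(base) / len(base)
    # Fall back to 1.0 for a perfectly flat baseline (std would be 0).
    std = (sum((v - mean) ** 2 for v in base) / len(base)) ** 0.5 or 1.0
    s_pos = s_neg = 0.0
    for i in range(baseline_n, len(series)):
        z = (series[i] - mean) / std
        s_pos = max(0.0, s_pos + z - drift)   # accumulates upward shifts
        s_neg = min(0.0, s_neg + z + drift)   # accumulates downward shifts
        if s_pos > threshold or -s_neg > threshold:
            return i
    return None

# A flat "State A" followed by a shift up to "State B":
print(first_shift([100] * 15 + [120] * 10))  # → 15
```

For the predictive side of the question, one common pattern is to run a rolling forecast (even a simple moving average) and treat points outside its prediction interval as impulses, re-baselining after a confirmed shift; that is a direction to explore, not a definitive answer.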
P.S. I'm still negotiating with my boss to be able to register for the course on predictive analytics at the eMetrics Summit...