2011-10-05 Update! A new version of this article and a FREE and enhanced Microsoft Excel 2007 example is available in my blog post at Cardinal Path: The math behind web analytics: box plot.
This articles aims to highlight the importance of box plots (or horizontal box, or Whisker plot) in analytics and provides a step by step example to make them in Microsoft Excel 2007.
What are box plots?
A box plot is a visualization method used in statistics. It shows, in a glimpse, several very important elements describing a data sample:- the smallest observation
- the lower quartile (25%)
- the median (50%)
- the higher quartile (75%)
- the largest observation
- abnormal data (outliers)
- average
How can they be used in web analytics?
Boxplots are a quick visualization approach for examining one or more data sets. Because they can easily reveal the limits of acceptable data and any extremes, it becomes very easy to explain trends and abnormalities and communicate the right information.We have to consider, however, than in term of data visualization, some ways of showing Probability Density Function can also reveal the same information, and more, in a very interesting way. However, drawing them in Excel might be very difficult...
How can I draw my own box plot?
For this example, we will use the fictitious data representing number of Visits to a Web site on a day by day basis, as shown below:- The first thing to do is to sort the values (the Visits column) in ascending order.
- Get the following statistical information from your data. Those are the essential ones to build the box plot:
- We now need to calculate the width of each box and the lower/higher non-outliers limits:
Select the 25th, 50th and 75th percentile values (yellow cells) and Insert a new 2-D Stacked Bar graph (or press Alt-F1 and then Change Chart Type). Then Switch Row/Colum (under Chart Tools/Design menu). You should get something like this:
- Some formatting will get us closer to our goal:
- Delete the Legend
- Delete the Horizontal (Value) Axis Major Gridlines
- Set the 1st box and set it to be totally transparent: right-click, Format Data Series..., Fill, No fill
- Set the 2nd and 3rd boxes to be transparent with a border: no fill as above, then Border Color, Solid Line
- Set Vertical axis to cross at maximum value: select the Horizontal (Value) Axis, right-click, Format Axis..., Vertical axis crosses: Maximum axis value
- Hide the vertical axis bar: select the Vertical (Category) Axis, right-click, Format Axis..., Line Color, No line
- Set the Category axis label: right-click in the chart, Select Data, pick the Horizontal (Category) Axis labels, click the Edit button and pick the cell containing your data label
- Set the lower limit and higher limit ranges
- Select the first box, click on Chart Tools, Layout. Select Error bars, More error bars options..., pick the Display Direction: Minus, indicate the Error Amount: Custom and click the Specify Value button. Leave the Positive Error Value as is and select the blue cell (Lower limit) for the Negative Error Value.
- Same as above for the 3rd box, but choose the Display Direction: Plus and pick the Higher limit (green cell) cell for the Positive Error Value.
- We're getting close, we will now add the average:
- Select the cells containing the average label and value (C3:D3). Do a copy (ctrl-c). Simply select the graph and paste this value (ctrl-v) as a new series.
- It's being added as a new box in our graph. Select it and Change Series Chart Type..., pick the X Y (Scatter) with only markers.
- This will show the marker, but it won't be positioned correctly in the graph. Remember we switched rows and columns earlier: right-click anywhere in the chart, Select Data... then pick the newly added series (Average). Edit and switch the X and Y values.
- Pick the left-side Secondary Vertical (Value) Axis and rick-click to Format Axis..., set the Minimum value to -1 and the Maximum to 1. Close the dialog. You can now delete this extra axis scale indicator.
- The average is now positioned correctly. You might want to change the indicator to a more visible symbol.
- Let's add the last touch! We want to show the outliers:
- Right-click the graph and Select Data...
- Click the Add button
- Set the Series name to "Low outliers"
- For the the Series X, select any value from the sorted data that is lower than the minimum valid range (D2)
- For the Series Y Values, put ={0} so the new data points are shown correctly on the vertical scale.
- Use the same technique for the "High outliers" with any value that is larger then the maximum valid range (D7)
- Some minor aesthetic adjustments and we're set!
Analysis
- We see that Visits ranged from about 125 to 250.
- We see the normal range to be between 165 and about 200.
- We spot two abnormal values: one too low, probably an indication there was a problem on that day, and one two high, for example a spider or web bot crawling our site.
Some references
- "Excellent Analytics Tip #9: Leverage Statistical Control Limits" by Avinash Kaushik
- "More on Excel in-cell graphing" on juiceanalytics
- "Horizontal box and Whisker plots" by John Peltier
- "Box plot" on Wikipedia




81 comments:
Stéphane: This is a wonderfully helpful post, every practical. I hope that everyone gets as much use out of it as I did.
Thanks for taking the time.
-Avinash.
Not sure I understand the min valid and max valid functions - do you have a reference or explanation please?
Min Valid is the smallest value that is not an outlier. Same thing for Max Valid.
An outlier, as Wikipedia explains it is "a single observation 'far away' from the rest of the data", or statistically more than 1,5 times the difference between the 3rd and the 1st quarter.
Thank you! This was exactly what I needed. Very clever!!
Excellent help! Thank you for this great tutorial!!
I've been looking for a way to make box-plots on excell for a long time, this page really helps, thanks a lot, this is exactly what I was looking for!
There's an add-in for Excel called Analyze-it at http://www.analyse-it.com/ that generates all kinds of box-whisker plots. I find it saves a lot of time compared to trying to create the plots yourself!
Thank you so much! I can't believe that Excel doesn't have this as a standard feature! You have certainly saved my life, and I'm sure my professor will be impressed.
Can someone please tell me how to make the box-whisker plot using the analyse it software? Please email me at gisszy@yahoo.com
Thank you- Giselle.
Awesome article
THANK YOU SO MUCH
thanks for these instructions.
One question: is there a way to print the values of the median, the quantiles and the other values on the horizontal axis instead of the normal values as seen in your pictures?
@codedrinker:
just use a "column chart" (Säulendiagramm) instead of a "bar chart".
i hope this helps you....
Outlier question! I tried to enter the formulas for the minimum valid and maximum valid points and was told I had a circular error. Am I reading the formulas incorrectly?
I read the formulas in column E as:
Min valid: =MAX(E1;E2-1.5*(E5-E2))
Max valid: =MIN(E6;E5+1.5*(E5-E2))
They are circular because their results are based on eachother (min valid is in cell E2, max valid is in cell E7).
Thanks to anyone who can solve this!
Leah,
Anything lying within a distance from the box closer than or max 1.5 times the heigth of the box is concidered not to be an outlier, so the formulas are
Max(D2;D3-1,5*(D6-D3)) and
Min(D8;D6+1,5*(D6-D3))
This is a great help to me thanks to the author!!!
Thanks for the information. To be honest, it was hard for me to understand because my computer has been installed in Dutch.
But again: thanks, finally I can give my students a nice looking boxplot!
Helllo,
I am trying to comupte the valid minimum and maximum, but I am not sure to understannd the concept as for me the answer would be MAX(D1;D3-1,5*(D6-D3)) This question as already been discussed, I would be greatful if you could give me your explanation on how to compute this?
Thank you
A much easier way to do Boxplots in Excel 2007 is described on this website:
http://www.coventry.ac.uk/ec/~nhunt/boxplot.htm
just scroll down to "Excel 2007" and you have a much faster method for creating them, involving Line charts instead of Bar charts.
The above referenced alternative method leaves out the mean values. Without them,it is not a box-and-whiskers plot, as originally conceived in 1977. Not very useful...
Hi! Thank you very much for dusch explanation.
Could you please explain how to set 2nd and 3rd boxes together. i can't hire line between them.
thank you in advance for help!
This is a very helpful post, as it seems that microsoft have complicated things greatly with this update of office. A much simpler way to do this is get a program called 'PAST'. It is mainly used for Palaeo data, but I have found it useful for my ecology data as well. Its a very simple program, so don't expect works of art ;)
Anonymous: About the 'PAST' tool, please post a link where we can find it.
Hey.. when I add the average and the outliers, I can't switch the X and Y. Whenever I do that, the rest of my box-plot (which I could easily make after reading your instructions, thank you!) jumps around and changes too. I can't find where I can switch only the average and the outliers, will someone please help me?
This post has been extremely useful, but it has taken me a while to figure out the min/max valid formula to use.
Basically, the length of the 'whiskers' is (1.5 * the interquartile range) away from the upper/lower quartile.
So in Excel, the formula for this lower value is Q1 - (IQR*1.5):
=PERCENTILE(Range,0.25) - (PERCENTILE(Range,0.75)-PERCENTILE(Range,0.25)*1.5)
Now, depending on the distribution of your data, this value might actually be lower than the MIN value, so you need to compare the min value with the value from the above formula, and take whichever is higher, giving a final formula of:
=MAX((PERCENTILE(Range,0.25) - (PERCENTILE(Range,0.75)-PERCENTILE(Range,0.25)*1.5)),MIN(Range))
Sorry if this is obvious, but I found it confusing from the example on this page. Note that you don't actually need to sort your data first as that's done behind the scenes by the formulae.
hi can anyone guide me on the building the spider web concept on excel...
Awesome!! Thanks. :)
when making the box plot the error bars come out all messed up, can u help me!!!!!
How about when you have multiple outliers? The Outlier series only shows the first value.
Sorry for the post, it problem was corrected by adding a dummy series for the Y axis value=0. Thanks
I am still confused about the min valid and max valid..is there any other way to draw the 'whiskers' using only mean, median, 1st quartile and 3rd quartile?Thanks!!
"Min valid" indicates what is the minimum axis value to draw the boxplot. It's either the minimum value found in the data, or the minimum value in the acceptable range (1.5 * the interquartile range). Without this, your boxblot could miss its left side (if the smallest value in higher than 1.5 * interquartile). Same concept for Max valid.
you are worthless!
Anonymous: I have decided to publish your comment because it is a profound mark of disrespect and I had to share my feelings.
You said you were confused about the use of "Min/Max Valid" and I tried to answer a first time.
You then replied saying it wasn't clear enough. So I tried to make it clearer.
Insulting me is really not the right way to get an answer. I have spent hours building this example and countless people found it useful.
If you spent just a couple of minutes trying different things with the values of "Min/Max valid" you would see how they are used.
So now, M. (or Miss) "Anonymous", go figure it out by yourself.
A different anonymous: Stephanie, thanks for your instructions. They are extremely helpful for first timers.When looking at the original instructions, I see the min/max valid using cells with an E value. These cells do not have data. What are the correct cells for your formula. Thanks.
Friendly Anonymous :)
Where it says "E" in the formulas replace them with "D" and that should do the magic :)
on 1/25/2008 Peter said...
the formulas are
Max(D2;D3-1,5*(D6-D3)) and
Min(D8;D6+1,5*(D6-D3))
I think there's a typo in the first one, and these should be:
D2=Max(D1;D3-1,5*(D6-D3))
D7=Min(D8;D6+1,5*(D6-D3))
Thanks to Gregory for clearing up my confusion. Definitely no need to apologize as it wasn't obvious to me! Great tutorial overall and much appreciated.
Thanks Stephane - I've been so frustrated trying to create these in Excel, and your post has helped enormously. Much appreciated!
Stephanie,
Is it possible to get the directions separate? The reason I ask is that when I print it out, some pages have writing on top of other writings.
Jeanne
Anonymous: if you have difficulties printing from your browser, copy & paste in Word and print from there...
I'm having a problem with the left whisker of the plot starting inside the box (where the two boxes meet) and not on the left boundary. I've followed all the steps and can't seem to figure out what i did wrong. Can anyone assist?
-M@x
Maxim, you need to select the 1st box (the one that will be hidden). Don't hide it straight away or you can't select it.
I tried this method with the boxes vertical (I prefer them that way) and I didn't need to swap the axis or use and XY chart, although I'm not sure if it really depends on the data.
I also had problems with the valid max an min calculation, in the other the examples I found they are not interdependent and the formula does not need to be recursive.
Great tutorial
Graeme Melia
thank you! That looks like a great resource
Changing to MAX(D1:D3-1,5*(D6-D3))
and MIN(D6:D8+1,5*(D6-D3)), still does not help. It returned a value of zero to me!!!!
dear thank u very much it is really helpful
I just wanted to ask if there is another way tto find min and max value regardless of the formulas we have. Can we calculate it on our own?
Great Post. I bet the majority of the confusion about the Min/Max Valid formula arise from the fact that the comma is being used as a decimal and the semicolon as a comma. The following formula worked for me.
Min Valid
=MAX(D1,D3-1.5*(D6-D3))
Max Valid
=MIN(D8,D6+1.5*(D6-D3))
I'm having problems with the Outliers. You specify the x and y values to put in but I can only work with the whole series value. Not with the individual x and y series.... Anyone can help please?
wow! thanks for another great blog post!
Customized application development
so this means Excel 2007 don't actually have a Box and whisker plots function? we have to do it manually?
FANTASTIC! Been struggling my head off trying to figure this out. With the help of the comments for min and max value and how to set the minus error by using the box that I "remove" afterwards made it all good! The rest is normal formatting for labels and such.
Thanks a 1000 times!
Thank you so much. I really needed this for a project in school. Thanks to you I might get an A. Don't you feel great about helping the future of the world?
Thanks a lot for sharing this Excel document and for taking the time to explain your process. Your time is really appreciated.
One of the thing I was struggling with, when trying to understand the box plot concept was the min valid and max valid functions (like @hubert). The thing is, since I had no abnormal values, Min valid and Max valid was equal to Min and Max. It drove me crazy as I was assuming something was wrong, when in fact all was good.
So I just want to point out : it is normal that Min valid = Min and Max valid = Max if there are no outliers.
Kinaze: Yes! Exactly: min-valid = min and max-valid = max if there are no outliers.
Hi could you have more boxes and whisker plots on the same graph. Basically I have three separate sets of data and would like the box and whisker plot to be shown individually, however all three should be on the same graph. I managed by following the instructions to create individual ones so far.
Thank you
Brilliant workaround, Stéphane! As an 8th grade teacher, I've always been frustrated that you can't do Box plots in Excel. I'm going to share this with my whole district!
Wow It takes a lot of steps but it really works. Your suggestions are amazing!! Thank yuo thank you thank you!!!
Excellent, but how do you add more than one boxplot to a chart?
I tried without success to add several box plots to the same chart.
But I solved the problem by making one chart for each boxplot. By putting the charts close to each other it was possible to make it look like one single chart after some formatting. Better than nothing...
Kepler
Great tutorial -- thanks very much!
To add more than one box plot to a chart, all you need to do is add another column of data (or a row if you're data is organized in rows), and then highlight all of the data before creating the chart. I just did it with three sets of data and got three box plots without any problems.
Excellent work! I've automated a very similar solution into an excel ribbon button and made it available to all under my open source "excel stats ribbon" project.
http://sourceforge.net/projects/excelstats/
I still have some work to do to make it more user friendly. So if anyone wants to pitch in on the coding, I'd love the help
thank you so much for this tutorial, very useful!
I find this can help if you have to manage just with positive values (i.e. time values like hours) and you have problems to set the min valid. Change the "Min valid" formula like this:
Min Valid=MIN(D1;ASS(D3-1,5*(D6-D3)))
So if you have as "minimum value" 0, this will be always also the "min valid value".
Laura C.
Very useful post, thank you Stephane.
One thing that didn't seem quite right to me though was the values used for the Min/Max Valid points. My understanding of these values (from wikipedia page: http://en.wikipedia.org/wiki/Box_plot) is that the values should be "the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile", and not just the calculated values of 1.5 IQR.
I have therefore edited my formulas as follows to ensure that they min and max (respectively) datums are chosen:
Min Valid:
{=MIN(IF((B2:B11<=(D6+(1.5*(D6-D3))))*(B2:B11>=(D3-(1.5*(D6-D3)))),B2:B11,""))}
Max Valid:
{=MAX(IF((B2:B11<=(D6+(1.5*(D6-D3))))*(B2:B11>=(D3-(1.5*(D6-D3)))),B2:B11,""))}
This changes the min valid value to be 158 and the max valid to be 201.
I am looking forward to using these box plots in my next presentations to my senior management team. Thank you again for creating this very useful reference site.
Barrie (UK)
Barrie: yes, indeed, you are right. Since that post I have worked on a newer version of the box plot which I should be able to release soon. The new version will be more robust and has some pretty cool features! (should be release in January 2011)
Hi Stephane, I look forward to seeing the new version in January. Will it be posted here or on a new page somewhere else? If it is somewhere else can you please post a link here?
The other changes that I made (for my own personal usage) was to extract the outliers (at the extremes) into seperate cells so that I could more easily plot them - the data that I tend to work with can be hundreds or thousands of entries long and a visual/manual search through the data is not feasible. If you would like to see how I have done that (you may already have this covered as one of your enhancements) then let me know and I will be more than happy to share?
Best regards
Barrie
Barrie: I will make sure to put a link from this post to the new version (I get a huge amount of traffic just from this single post... looks like a lot of people are seeking ways to do box plots in Excel!)
Other than a couple of bug fixes (as you mentioned), I have improved the visual and there is now a table showing the outliers. The min & max outliers are also shown on the graph. I think you'll like it!
I have seen some people selling such templates for 49$... so I have to decide if I will make it available for free or charge a small amount for all the time & effort I have put into it...
Stéphane
Very good directions, I would never have been able to do it without them
This was extremely helpful! Thank you for taking the time to help out with this:). I just have one question, is the semicolon used in the min valid number the same as a colon or comma? For some reason my Excel insists that something is wrong with the semicolon. I have the 2007 version if that makes any difference.
thank you sooo much
this was such a great help but i don't understnad how to calculate the 25th adn 50th things. anyway it was still such a great help
Great instructions, thanks!
cool post bro
Thanks for a clear, concise, and marvelously helpful post!
I did have an issue when I made the vertical axis cross at the maximum value--it made the error bars go in the wrong direction. For me, it worked better just to make the vertical axis invisible. Also, as some have noted above, be sure to add the error bars to the first box BEFORE you make it invisible, otherwise you won't be able to format it.
Great post! Thanks for figuring this out and sharing it with everyone!
i don't understand how you added min and max outliers...
Hi Stephan. First off, thank you so much for this blog. I have a question regarding Barrie changes (Thanks Barrie:)) with regards to the Min and max valid changes. I've tried this and i get an #value error.
I'm checking if Im reading the formula correct, it is e.g min valid =MIN(IF(((range of your data)<=(Quartile 3+(1.5*(q3-q1))))*(range of your data>=Q1-(1.5*(Q3-Q1)))), range of your data, "")). Have i understood this correctly and if so, what am i doing wrong. Is this new function necessary compared to your origional, as i've tried that and it has worked.
Many thanks
Great instructions, have been very handy.
Hi Stephan. Great post, really. Are you planning to provide the new release soon as mentioned in one of your latest comments?
Thank you so much Stéphane for your time and effort for that.
That help me understand clearly and could create own dataset.
I would like to put all outliers in the graph, could you please share your idea?
Regards,
KTK
Singapore
KTK: look for my next post on Cardinal Path blog where I will post the next article on the Math behing web analytics. There will be a new, super cool version of the box plot in Excel tool.
Thanks for ur awesome explanation on plots in Excel.They are really helpful and interesting.Ur blog presentation made much more interesting.
Thank you !
Thank you Stéphane. I'm trying to teach a little bit of statistics as part of my Physics course, and this will make my students' lives a lot easier
Anonymous: glad it helped! Make sure to see the new version at http://www.cardinalpath.com/math-analytics-box-whisker-plot-in-excel
Post a Comment