Box plot and whisker plots in Excel 2007


Bookmark and Share
This articles aims to highlight the importance of box plots (or horizontal box, or Whisker plot) in analytics and provides a step by step example to make them in Microsoft Excel 2007.

What are box plots?

A box plot is a visualization method used in statistics. It shows, in a glimpse, several very important elements describing a data sample:
  • the smallest observation
  • the lower quartile (25%)
  • the median (50%)
  • the higher quartile (75%)
  • the largest observation
In addition, a box plot can help visualize two other elements:
  • abnormal data (outliers)
  • average

How can they be used in web analytics?

Boxplots are a quick visualization approach for examining one or more data sets. Because they can easily reveal the limits of acceptable data and any extremes, it becomes very easy to explain trends and abnormalities and communicate the right information.

We have to consider, however, than in term of data visualization, some ways of showing Probability Density Function can also reveal the same information, and more, in a very interesting way. However, drawing them in Excel might be very difficult...

How can I draw my own box plot?

For this example, we will use the fictitious data representing number of Visits to a Web site on a day by day basis, as shown below:

  1. The first thing to do is to sort the values (the Visits column) in ascending order.
  2. Get the following statistical information from your data. Those are the essential ones to build the box plot:

  3. We now need to calculate the width of each box and the lower/higher non-outliers limits:
  4. Select the 25th, 50th and 75th percentile values (yellow cells) and Insert a new 2-D Stacked Bar graph (or press Alt-F1 and then Change Chart Type). Then Switch Row/Colum (under Chart Tools/Design menu). You should get something like this:

  5. Some formatting will get us closer to our goal:
    • Delete the Legend
    • Delete the Horizontal (Value) Axis Major Gridlines
    • Set the 1st box and set it to be totally transparent: right-click, Format Data Series..., Fill, No fill
    • Set the 2nd and 3rd boxes to be transparent with a border: no fill as above, then Border Color, Solid Line
    • Set Vertical axis to cross at maximum value: select the Horizontal (Value) Axis, right-click, Format Axis..., Vertical axis crosses: Maximum axis value
    • Hide the vertical axis bar: select the Vertical (Category) Axis, right-click, Format Axis..., Line Color, No line
    • Set the Category axis label: right-click in the chart, Select Data, pick the Horizontal (Category) Axis labels, click the Edit button and pick the cell containing your data label
  6. Set the lower limit and higher limit ranges
    • Select the first box, click on Chart Tools, Layout. Select Error bars, More error bars options..., pick the Display Direction: Minus, indicate the Error Amount: Custom and click the Specify Value button. Leave the Positive Error Value as is and select the blue cell (Lower limit) for the Negative Error Value.
    • Same as above for the 3rd box, but choose the Display Direction: Plus and pick the Higher limit (green cell) cell for the Positive Error Value.
  7. We're getting close, we will now add the average:
    • Select the cells containing the average label and value (C3:D3). Do a copy (ctrl-c). Simply select the graph and paste this value (ctrl-v) as a new series.
    • It's being added as a new box in our graph. Select it and Change Series Chart Type..., pick the X Y (Scatter) with only markers.
    • This will show the marker, but it won't be positioned correctly in the graph. Remember we switched rows and columns earlier: right-click anywhere in the chart, Select Data... then pick the newly added series (Average). Edit and switch the X and Y values.
    • Pick the left-side Secondary Vertical (Value) Axis and rick-click to Format Axis..., set the Minimum value to -1 and the Maximum to 1. Close the dialog. You can now delete this extra axis scale indicator.
    • The average is now positioned correctly. You might want to change the indicator to a more visible symbol.
  8. Let's add the last touch! We want to show the outliers:
    • Right-click the graph and Select Data...
    • Click the Add button
    • Set the Series name to "Low outliers"
    • For the the Series X, select any value from the sorted data that is lower than the minimum valid range (D2)
    • For the Series Y Values, put ={0} so the new data points are shown correctly on the vertical scale.
    • Use the same technique for the "High outliers" with any value that is larger then the maximum valid range (D7)
  9. Some minor aesthetic adjustments and we're set!

Analysis

  • We see that Visits ranged from about 125 to 250.
  • We see the normal range to be between 165 and about 200.
  • We spot two abnormal values: one too low, probably an indication there was a problem on that day, and one two high, for example a spider or web bot crawling our site.
Comments and suggestions are always appreciated!

Some references

54 comments:

Avinash Kaushik said...

Stéphane: This is a wonderfully helpful post, every practical. I hope that everyone gets as much use out of it as I did.

Thanks for taking the time.

-Avinash.

hubert said...

Not sure I understand the min valid and max valid functions - do you have a reference or explanation please?

S.Hamel said...

Min Valid is the smallest value that is not an outlier. Same thing for Max Valid.
An outlier, as Wikipedia explains it is "a single observation 'far away' from the rest of the data", or statistically more than 1,5 times the difference between the 3rd and the 1st quarter.

Anonymous said...

Thank you! This was exactly what I needed. Very clever!!

Oli said...

Excellent help! Thank you for this great tutorial!!

Pascale Chavatte-Palmer said...

I've been looking for a way to make box-plots on excell for a long time, this page really helps, thanks a lot, this is exactly what I was looking for!

Anonymous said...

There's an add-in for Excel called Analyze-it at http://www.analyse-it.com/ that generates all kinds of box-whisker plots. I find it saves a lot of time compared to trying to create the plots yourself!

Mike Kenyon said...

Thank you so much! I can't believe that Excel doesn't have this as a standard feature! You have certainly saved my life, and I'm sure my professor will be impressed.

Anonymous said...

Can someone please tell me how to make the box-whisker plot using the analyse it software? Please email me at gisszy@yahoo.com

Thank you- Giselle.

Anonymous said...

Awesome article
THANK YOU SO MUCH

cokedrinker said...

thanks for these instructions.

One question: is there a way to print the values of the median, the quantiles and the other values on the horizontal axis instead of the normal values as seen in your pictures?

zeck said...

@codedrinker:
just use a "column chart" (Säulendiagramm) instead of a "bar chart".
i hope this helps you....

Leah said...

Outlier question! I tried to enter the formulas for the minimum valid and maximum valid points and was told I had a circular error. Am I reading the formulas incorrectly?

I read the formulas in column E as:
Min valid: =MAX(E1;E2-1.5*(E5-E2))
Max valid: =MIN(E6;E5+1.5*(E5-E2))

They are circular because their results are based on eachother (min valid is in cell E2, max valid is in cell E7).

Thanks to anyone who can solve this!

Peter said...

Leah,
Anything lying within a distance from the box closer than or max 1.5 times the heigth of the box is concidered not to be an outlier, so the formulas are
Max(D2;D3-1,5*(D6-D3)) and
Min(D8;D6+1,5*(D6-D3))

This is a great help to me thanks to the author!!!

Charlotte said...

Thanks for the information. To be honest, it was hard for me to understand because my computer has been installed in Dutch.
But again: thanks, finally I can give my students a nice looking boxplot!

Anonymous said...

Helllo,


I am trying to comupte the valid minimum and maximum, but I am not sure to understannd the concept as for me the answer would be MAX(D1;D3-1,5*(D6-D3)) This question as already been discussed, I would be greatful if you could give me your explanation on how to compute this?

Thank you

imperator said...

A much easier way to do Boxplots in Excel 2007 is described on this website:

http://www.coventry.ac.uk/ec/~nhunt/boxplot.htm

just scroll down to "Excel 2007" and you have a much faster method for creating them, involving Line charts instead of Bar charts.

Anonymous said...

The above referenced alternative method leaves out the mean values. Without them,it is not a box-and-whiskers plot, as originally conceived in 1977. Not very useful...

Anonymous said...

Hi! Thank you very much for dusch explanation.
Could you please explain how to set 2nd and 3rd boxes together. i can't hire line between them.
thank you in advance for help!

Anonymous said...

This is a very helpful post, as it seems that microsoft have complicated things greatly with this update of office. A much simpler way to do this is get a program called 'PAST'. It is mainly used for Palaeo data, but I have found it useful for my ecology data as well. Its a very simple program, so don't expect works of art ;)

S.Hamel said...

Anonymous: About the 'PAST' tool, please post a link where we can find it.

Naomi said...

Hey.. when I add the average and the outliers, I can't switch the X and Y. Whenever I do that, the rest of my box-plot (which I could easily make after reading your instructions, thank you!) jumps around and changes too. I can't find where I can switch only the average and the outliers, will someone please help me?

Gregory said...

This post has been extremely useful, but it has taken me a while to figure out the min/max valid formula to use.

Basically, the length of the 'whiskers' is (1.5 * the interquartile range) away from the upper/lower quartile.

So in Excel, the formula for this lower value is Q1 - (IQR*1.5):
=PERCENTILE(Range,0.25) - (PERCENTILE(Range,0.75)-PERCENTILE(Range,0.25)*1.5)

Now, depending on the distribution of your data, this value might actually be lower than the MIN value, so you need to compare the min value with the value from the above formula, and take whichever is higher, giving a final formula of:
=MAX((PERCENTILE(Range,0.25) - (PERCENTILE(Range,0.75)-PERCENTILE(Range,0.25)*1.5)),MIN(Range))

Sorry if this is obvious, but I found it confusing from the example on this page. Note that you don't actually need to sort your data first as that's done behind the scenes by the formulae.

Anonymous said...

hi can anyone guide me on the building the spider web concept on excel...

Anonymous said...

Awesome!! Thanks. :)

AC21 said...

when making the box plot the error bars come out all messed up, can u help me!!!!!

Booboo. said...

How about when you have multiple outliers? The Outlier series only shows the first value.

Booboo. said...

Sorry for the post, it problem was corrected by adding a dummy series for the Y axis value=0. Thanks

Anonymous said...

I am still confused about the min valid and max valid..is there any other way to draw the 'whiskers' using only mean, median, 1st quartile and 3rd quartile?Thanks!!

Stephane Hamel at immeria.net said...
This post has been removed by the author.
Stephane Hamel at immeria.net said...

"Min valid" indicates what is the minimum axis value to draw the boxplot. It's either the minimum value found in the data, or the minimum value in the acceptable range (1.5 * the interquartile range). Without this, your boxblot could miss its left side (if the smallest value in higher than 1.5 * interquartile). Same concept for Max valid.

Anonymous said...

you are worthless!

Stephane Hamel at immeria.net said...

Anonymous: I have decided to publish your comment because it is a profound mark of disrespect and I had to share my feelings.

You said you were confused about the use of "Min/Max Valid" and I tried to answer a first time.

You then replied saying it wasn't clear enough. So I tried to make it clearer.

Insulting me is really not the right way to get an answer. I have spent hours building this example and countless people found it useful.

If you spent just a couple of minutes trying different things with the values of "Min/Max valid" you would see how they are used.

So now, M. (or Miss) "Anonymous", go figure it out by yourself.

Anonymous said...

A different anonymous: Stephanie, thanks for your instructions. They are extremely helpful for first timers.When looking at the original instructions, I see the min/max valid using cells with an E value. These cells do not have data. What are the correct cells for your formula. Thanks.

Stephane Hamel at immeria.net said...

Friendly Anonymous :)
Where it says "E" in the formulas replace them with "D" and that should do the magic :)

TimK said...

on 1/25/2008 Peter said...

the formulas are
Max(D2;D3-1,5*(D6-D3)) and
Min(D8;D6+1,5*(D6-D3))

I think there's a typo in the first one, and these should be:
D2=Max(D1;D3-1,5*(D6-D3))
D7=Min(D8;D6+1,5*(D6-D3))

jennifer.page82 said...

Thanks to Gregory for clearing up my confusion. Definitely no need to apologize as it wasn't obvious to me! Great tutorial overall and much appreciated.

Anonymous said...

Thanks Stephane - I've been so frustrated trying to create these in Excel, and your post has helped enormously. Much appreciated!

Anonymous said...

Stephanie,

Is it possible to get the directions separate? The reason I ask is that when I print it out, some pages have writing on top of other writings.

Jeanne

Stephane Hamel at immeria.net said...

Anonymous: if you have difficulties printing from your browser, copy & paste in Word and print from there...

Maxim said...

I'm having a problem with the left whisker of the plot starting inside the box (where the two boxes meet) and not on the left boundary. I've followed all the steps and can't seem to figure out what i did wrong. Can anyone assist?

-M@x

Anonymous said...

Maxim, you need to select the 1st box (the one that will be hidden). Don't hide it straight away or you can't select it.

I tried this method with the boxes vertical (I prefer them that way) and I didn't need to swap the axis or use and XY chart, although I'm not sure if it really depends on the data.

I also had problems with the valid max an min calculation, in the other the examples I found they are not interdependent and the formula does not need to be recursive.

Great tutorial
Graeme Melia

website design New York City said...

thank you! That looks like a great resource

Anonymous said...

Changing to MAX(D1:D3-1,5*(D6-D3))
and MIN(D6:D8+1,5*(D6-D3)), still does not help. It returned a value of zero to me!!!!

Anonymous said...

dear thank u very much it is really helpful
I just wanted to ask if there is another way tto find min and max value regardless of the formulas we have. Can we calculate it on our own?

Anonymous said...

Great Post. I bet the majority of the confusion about the Min/Max Valid formula arise from the fact that the comma is being used as a decimal and the semicolon as a comma. The following formula worked for me.

Min Valid
=MAX(D1,D3-1.5*(D6-D3))

Max Valid
=MIN(D8,D6+1.5*(D6-D3))

Anonymous said...

I'm having problems with the Outliers. You specify the x and y values to put in but I can only work with the whole series value. Not with the individual x and y series.... Anyone can help please?

Anonymous said...

so this means Excel 2007 don't actually have a Box and whisker plots function? we have to do it manually?

Anonymous said...

FANTASTIC! Been struggling my head off trying to figure this out. With the help of the comments for min and max value and how to set the minus error by using the box that I "remove" afterwards made it all good! The rest is normal formatting for labels and such.

Thanks a 1000 times!

Anonymous said...

Thank you so much. I really needed this for a project in school. Thanks to you I might get an A. Don't you feel great about helping the future of the world?

kinaze said...

Thanks a lot for sharing this Excel document and for taking the time to explain your process. Your time is really appreciated.

One of the thing I was struggling with, when trying to understand the box plot concept was the min valid and max valid functions (like @hubert). The thing is, since I had no abnormal values, Min valid and Max valid was equal to Min and Max. It drove me crazy as I was assuming something was wrong, when in fact all was good.

So I just want to point out : it is normal that Min valid = Min and Max valid = Max if there are no outliers.

Stephane Hamel at immeria.net said...

Kinaze: Yes! Exactly: min-valid = min and max-valid = max if there are no outliers.

Alexandru Atudosie said...

Hi could you have more boxes and whisker plots on the same graph. Basically I have three separate sets of data and would like the box and whisker plot to be shown individually, however all three should be on the same graph. I managed by following the instructions to create individual ones so far.
Thank you

Scott M said...

Brilliant workaround, Stéphane! As an 8th grade teacher, I've always been frustrated that you can't do Box plots in Excel. I'm going to share this with my whole district!