Sunday, June 12, 2011

Average isn't necessarily typical - medians and histograms

A common but very simplistic management practice is the use of an average as a summary of a particular process.

For example, a manager may be concerned at an increase in the average number of sick days taken by staff.

However, a simple and somewhat humorous example shows the folly of this.

If we consider human beings, most people have two legs, a small proportion of people have one leg and an even smaller proportion have no legs. So if we were to calculate the average number of legs that a human being has then we would find that it is slightly less than 2. But wait a minute: this means that most human beings have an above average number of legs! Yet typically a human being has 2 legs...
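The arithmetic behind the legs example can be checked with a few lines of Python. This is only a sketch: the population of 10,000 and the counts of people with one or no legs are invented for illustration, not real figures.

```python
from statistics import fmean, median, mode

# Hypothetical population of 10,000 people; the counts are invented for illustration.
legs = [2] * 9_980 + [1] * 15 + [0] * 5

mean_legs = fmean(legs)      # pulled slightly below 2 by the few low values
median_legs = median(legs)   # the middle value
mode_legs = mode(legs)       # the most common value

# Fraction of people whose leg count is strictly above the mean.
share_above_mean = sum(1 for n in legs if n > mean_legs) / len(legs)

print(f"mean   = {mean_legs:.4f}")
print(f"median = {median_legs}")
print(f"mode   = {mode_legs}")
print(f"share above the mean = {share_above_mean:.1%}")
```

With these made-up counts the mean comes out just under 2, while the median and the most common value are both 2, so nearly everyone is "above average".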

The problem is that an average doesn't tell you much by itself; what matters is the distribution of the values. People with a little more statistical training may make a second error: they may assume that the values of interest have a normal (bell-shaped) distribution. But this is generally wrong too. For most variables of interest, there is an absolute lower bound on what is possible (usually zero) and probably an upper bound, and the distribution has a long tail and isn't symmetrical. None of these things is true of a bell curve.

So what other measures do we have?

One useful measure is the median, the middle value of the data set. It is a value such that no more than 50% of values lie above it and no more than 50% of values lie below it. The median gives a better feel for what is typical since it isn't affected by excessively large or excessively small values. It is a significant improvement on the average.

However, even better is to put the data into a histogram and to simply look at the pattern of the data.

Returning to the example of sick leave, you may find that the average number of days taken is 7 but the median is only 4. When you examine the histogram, you will probably find that a small number of staff have taken far more than 7 days of sick leave, and so they have raised the average out of proportion to their actual numbers.

As a simple example, if you had 49 staff who averaged 4 days each and 1 person who took 154 days due to some serious illness, then the total would be 49 × 4 + 154 = 350 days and the average across all 50 staff would be 7 days, but the median would probably be less than 4. In such a situation, there is almost nothing that could effectively be done to reduce the average amount of sick leave, since almost everyone is already taking less than the average.
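To see this in numbers, here is a small Python sketch of the sick leave example. The individual day counts for the 49 staff are hypothetical, chosen only so that they total 196 days (an average of 4 each); the one person with 154 days is then added. It prints the mean, the median and a rough text histogram.

```python
from collections import Counter
from statistics import fmean, median

# Hypothetical day counts, {days taken: number of staff}, invented so that
# 49 staff total 196 days (an average of 4 each).
typical_staff = {0: 4, 1: 5, 2: 8, 3: 9, 4: 6, 5: 5, 6: 4, 7: 1, 8: 3, 10: 2, 12: 2}
sick_days = [days for days, staff in typical_staff.items() for _ in range(staff)]
sick_days.append(154)  # the one person with a serious illness

mean_days = fmean(sick_days)
median_days = median(sick_days)
below_mean = sum(1 for d in sick_days if d < mean_days)

print(f"staff = {len(sick_days)}, mean = {mean_days:.1f}, median = {median_days:.1f}")
print(f"{below_mean} of {len(sick_days)} staff took fewer days than the mean")

# Text histogram: one '#' per member of staff at each day count.
for days, staff in sorted(Counter(sick_days).items()):
    print(f"{days:>3} days | {'#' * staff}")
```

With this invented data the single long absence sits far out on its own line of the histogram, which makes it obvious at a glance why the mean is so much higher than the median.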

However, publicizing the average could itself increase the amount of sick leave taken, since it could establish a norm of what staff in general are allegedly taking. So a person who is taking only 6 days may feel very proud of themselves for taking less than the average amount, and some staff who may not have taken much sick leave in the past may start taking more sick leave.

The lesson here is that statistics can be dangerous in the hands of those who don't understand their limitations. It may take a bit more effort to look at the distribution of the data, but it is only by doing so that you can identify whether there is actually a problem or just the appearance of one, and better target your efforts to where they may do the most good.



USEFUL RULE

In How to Measure Anything, Douglas Hubbard provides the following useful rule for medians, which he calls the Rule of Five:
There is a 93% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.
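
The reasoning behind the rule can be sketched in Python. Assuming the five values are drawn independently from a continuous distribution, the sample range misses the median only when all five values fall on the same side of it, which happens with probability 2 × (1/2)^5. The simulation below is an illustration under assumptions of my own: it uses an exponential distribution (chosen because it is skewed and its median, ln 2, is known exactly), and the function name, seed and trial count are arbitrary.

```python
import math
import random

# Exact value, assuming five independent draws from a continuous distribution:
# the range misses the median only if all five land on the same side of it.
exact = 1 - 2 * 0.5 ** 5

def rule_of_five_hit_rate(trials: int, seed: int = 42) -> float:
    """Estimate how often the range of a sample of five contains the median."""
    rng = random.Random(seed)
    true_median = math.log(2)  # median of an exponential distribution with rate 1
    hits = 0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(5)]
        if min(sample) < true_median < max(sample):
            hits += 1
    return hits / trials

estimate = rule_of_five_hit_rate(100_000)
print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```

The exact calculation gives 93.75%, in line with the roughly 93% quoted above, and the simulated rate should land close to it even though the distribution is far from bell-shaped.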
