Tutorial: basic statistics

Mean, median, and mode

The most commonly used statistical metric is the mean, which is also referred to as the average. The mean is computed by adding together all data points and dividing by the number of data points. Given the example above, the mean would be (50+55+52+45+58+60+40+56+58+49=523) divided by 10, or 52.3 customers per day. This can be considered the number of customers to expect on a typical day.
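The arithmetic above can be reproduced in a few lines of Python. This is only a minimal sketch: the list name customers is an illustrative choice, and the values are the ten daily counts summed in the paragraph above.

```python
# Daily customer counts for the ten business days in the example.
customers = [50, 55, 52, 45, 58, 60, 40, 56, 58, 49]

total = sum(customers)          # add all data points together
mean = total / len(customers)   # divide by the number of data points

print(f"total = {total}, mean = {mean:.1f} customers per day")
```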

In this example, all of the data points are near the mean. However, what would happen if there were a few rogue data points that were very unusual, but not erroneous? Such data points are often referred to in statistics as outliers, and they can cause the mean to be of questionable value.

For example, what if there were two more days during which there were no customers at all? In this case we would have the same total number of customers, but 12 data points, for an average of about 43.6 (523 divided by 12) customers per day. If these two "bad" business days were just a fluke, then the average of 43.6 customers per day is probably not an adequate reflection of how many customers to expect on a typical day of business.

In such a case, statisticians often turn to another measure known as the median. To find the median of a set of data points, we simply list them in order and take the value in the middle. If there is an even number of data elements, then there will be no single value in the middle, so we take the average of the middle two.

In the previous example with two additional days without customers, the values listed in order are: 0, 0, 40, 45, 49, 50, 52, 55, 56, 58, 58, 60. The middle two numbers are 50 and 52, so the median is the average of these two numbers: 51 (50+52=102 divided by 2). Note that the median value of 51 is much closer to the original mean (before we added the two outlying zeroes to the data set). If we really believe that the two days of no customers are not representative of a typical 12-day span, then the median is a better indication of what to expect on a typical day of business than the mean.
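The sort-and-take-the-middle procedure can be sketched in Python as follows. The function name median and the variable names are hypothetical, chosen for illustration; Python's standard library also provides statistics.median, which follows the same rule.

```python
def median(values):
    """Return the middle value of the sorted data, or the average
    of the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Original ten days plus the two days without customers.
expanded = [50, 55, 52, 45, 58, 60, 40, 56, 58, 49, 0, 0]

mean_expanded = sum(expanded) / len(expanded)
median_expanded = median(expanded)

print(f"mean = {mean_expanded:.1f}, median = {median_expanded}")
```

Running this side by side with the mean makes the point of the example visible: the two zeroes pull the mean down far more than they move the median.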

One final measure whose importance is less obvious is called the mode. The mode is simply the most common value in the data set. In the original data set above, all of the numbers occur once except for 58, which occurs twice. The mode of the original data set is therefore 58. In the case of a tie, there are multiple modes; so, in the expanded example where we added two days with no customers, there would be two modes: 0 and 58.
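As a rough illustration, the modes of both data sets can be found with statistics.multimode from Python's standard library (available from Python 3.8), which returns every value that ties for most common. The variable names here are illustrative.

```python
from statistics import multimode

original = [50, 55, 52, 45, 58, 60, 40, 56, 58, 49]
expanded = original + [0, 0]   # two extra days with no customers

# multimode returns a list containing all values tied for most frequent;
# sorting just makes the output order predictable.
modes_original = sorted(multimode(original))
modes_expanded = sorted(multimode(expanded))

print("modes of original data:", modes_original)
print("modes of expanded data:", modes_expanded)
```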

Why would we ever care about the mode? In practice, the mode is not used very often, because for roughly symmetric, single-peaked data it tends to lie close to the mean. However, for some probability distributions, notably skewed ones, the mode can be quite different from the mean. In such cases, the mode is often the easiest parameter to visualize when describing a probability distribution, since it marks the peak of the distribution.

Histograms

Often it is important to understand the “spread” of your data, i.e., how much individual values tend to differ from the mean, median, and mode. The simplest way to see this is to create a graphical representation known as a histogram. To generate a histogram, you divide the range of data points into several smaller ranges of equal size, which are sometimes referred to as bins. You then count the number of data points in each range or bin. For example, the table below indicates one possible choice of dividing up the range for the original data set above.

As you can see, the histogram quickly shows how spread out the data is around the mean (52.3), median (53.5), and mode (58) of the original data set.
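The binning step can be sketched in Python as shown here. The bin width of 5 and the starting edge of 40 are assumptions made for illustration and are not necessarily the ranges used in the table; the function name histogram is likewise hypothetical.

```python
from collections import Counter

customers = [50, 55, 52, 45, 58, 60, 40, 56, 58, 49]

def histogram(data, start, width):
    """Count the values falling in each equal-size bin [lo, lo + width).

    Assumes every value is >= start. Returns (lo, hi, count) tuples.
    """
    counts = Counter((x - start) // width for x in data)
    n_bins = max(counts) + 1
    return [
        (start + i * width, start + (i + 1) * width, counts.get(i, 0))
        for i in range(n_bins)
    ]

bins = histogram(customers, start=40, width=5)

# Print a simple text histogram, one row per bin.
for lo, hi, count in bins:
    print(f"{lo}-{hi - 1}: {'#' * count}")
```

Changing start or width gives a different picture of the same data, which is why the choice of bins is described above as only one possible choice.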

Populations vs. samples

Before going further, it's important to address the distinction between a sample and a population. The data presented above, for example, is a sample. It contains information regarding 10 consecutive business days; however, data for other business days is not available. The corresponding population would be a much larger set of data, consisting of the number of customers arriving in the store on all business days for which it was (and ever will be) open.

Basically, the population is the set of all possible data points for some measure, whereas a sample is some smaller subset of data that we have knowledge about. Often the population, including for example data for future days of business, is not available. Customer surveys are typical examples of samples, since information about the entire population, i.e., all customers, is seldom available.

In such cases, we need to restrict our attention to a sample. We should pay attention, however, to the sample size: in general, the larger the sample, the better it describes the population. A common rule of thumb in statistics, usually justified by appeal to the central limit theorem, is that samples of fewer than 30 data points should be treated with caution. The details go beyond the scope of this statistics overview. In the following, we will assume that our sample is a good representation of the population.

Why bother with this distinction between population and sample? Because another way to present our data graphically is as the percentage of the sample that falls in each range, as shown below.

In the case of a normal distribution, roughly 68% of all the data lie between one standard deviation to the left of the mean and one standard deviation to the right of the mean. For example, the standard deviation of the data shown above (the 5 dice rolled) was calculated to be 3.83 and the mean was found to be 17.46. Since this distribution is approximately normal, about 68% of the time the sum of the 5 dice rolls was between 17.46-3.83 and 17.46+3.83, or between 13.63 and 21.29. This really means between 14 and 21, as sums of dice rolls are always whole numbers.
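The interval arithmetic can be checked with a short Python sketch. The helper names band and normal_coverage are hypothetical; the mean and standard deviation are the values quoted above, and statistics.NormalDist (Python 3.8 and later) is used only to compute the share of an ideal normal curve lying within k standard deviations of its mean.

```python
import math
from statistics import NormalDist

def band(mean, sd, k):
    """Return the interval from mean - k*sd to mean + k*sd, plus the
    smallest and largest whole numbers that lie inside it."""
    low, high = mean - k * sd, mean + k * sd
    return low, high, math.ceil(low), math.floor(high)

def normal_coverage(k):
    """Fraction of an ideal normal distribution within k SDs of its mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

low, high, first, last = band(17.46, 3.83, k=1)
print(f"{low:.2f} to {high:.2f}, i.e. sums {first} through {last}")
print(f"ideal normal coverage: {normal_coverage(1):.1%}")
```

The same helper accepts other values of k, so wider bands around the mean can be computed in the same way.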

Another useful rule is that roughly 95% of all the data for a normal distribution lies within two standard deviations of the mean. In this example, that would correspond to values between 9.8 and 25.1, or between 10 and 25 as the sums of dice rolls must be whole numbers. Below is the normal distribution graph from above with lines inserted at various standard deviations (SD) from the mean.