Shape, Center, and Spread of a Distribution

A population parameter is a characteristic or measure obtained by using all of the data values in a population.

A sample statistic is a characteristic or measure obtained by using data values from a sample.

The parameters and statistics with which we first concern ourselves attempt to quantify the "center" and "spread" (i.e., variability) of a data set. Note, there are several different measures of center and several different measures of spread that one can use -- one must be careful to use appropriate measures given the shape of the data's distribution, the presence of extreme values, and the nature and level of the data involved.

The Shape of a Distribution

We can characterize the shape of a data set by looking at its histogram.

First, if the data values seem to pile up into a single "mound", we say the distribution is unimodal. If there appear to be two "mounds", we say the distribution is bimodal. If there are more than two "mounds", we say the distribution is multimodal.

Second, we focus on whether the distribution is symmetric, or if it has a longer "tail" on one side or the other. In the case where there is a longer "tail", we say the distribution is skewed in the direction of the longer tail. In the case where the longer tail is associated with larger data values, we say the distribution is skewed right (or positively skewed). In the case where the longer tail is associated with smaller (or more negative) values, we say the distribution is skewed left (or negatively skewed).

If the distribution is symmetric, we will often need to check whether it is roughly bell-shaped or has some other shape. In the case of a distribution whose histogram has rectangles of roughly the same height, we say we have a uniform distribution.

The graphic below gives a few examples of the aforementioned distribution shapes.

Measures of Center

Measures of Spread

In addition to knowing where the center of a given distribution is, we often want to know how "spread out" the distribution is -- this gives us a measure of the variability of values taken from the distribution. The graphic below shows the general shape of three symmetric, unimodal distributions with identical measures of center but very different amounts of "spread".
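As a concrete illustration, the following sketch computes three common measures of spread -- the range, the interquartile range (IQR), and the sample standard deviation -- for a small, hypothetical data set. Note that Python's `statistics.quantiles` defaults to the "exclusive" method, which may differ slightly from the quartile convention used in other texts.

```python
import statistics

# A small, hypothetical data set used purely for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: the distance between the largest and smallest values
data_range = max(data) - min(data)

# Interquartile range (IQR): the span of the central half of the
# data, from Q1 to Q3 (quantiles() with n=4 returns Q1, Q2, Q3)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample standard deviation
s = statistics.stdev(data)

print(data_range, iqr, round(s, 4))
```

All three values get larger as the data become more spread out, but as discussed above, which measure is appropriate depends on the shape of the distribution and the presence of extreme values.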

Determining Significant Skewness

Note, the presence of skewness (or outliers) can affect where the measures of center fall relative to one another, as the graphic below suggests.

As can be seen, when significant skewness is present, the mean and median end up in different places. Turning this around: if the mean and median are far enough apart, we can conclude that the observed skewness is significant.

To this end, Pearson's Skewness Index, $I$, is defined as $$I = \frac{3(\overline{x} - Q_2)}{s}$$ where $Q_2$ is the median. As for whether or not the mean and median are far enough apart (relative to the spread of the distribution), we say that if $|I| \ge 1$, then the data set is significantly skewed.
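The index is straightforward to compute. The sketch below applies the formula above to a small, hypothetical right-skewed data set; the data values are illustrative only.

```python
import statistics

def pearson_skewness_index(data):
    """Pearson's Skewness Index: I = 3(mean - median) / s."""
    xbar = statistics.mean(data)
    q2 = statistics.median(data)   # the median is Q2
    s = statistics.stdev(data)
    return 3 * (xbar - q2) / s

# Hypothetical right-skewed data: mostly small values, one large one
data = [1, 1, 2, 2, 3, 20]

i = pearson_skewness_index(data)
print(round(i, 4), abs(i) >= 1)
```

Here the long right tail pulls the mean well above the median, so $I$ comes out positive and at least $1$ in magnitude, flagging the skewness as significant.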

Identifying Outliers

An outlier is a data value far removed from the main body of a data set. Recall that in calculating the IQR we measure the span of the central half of a data set, from $Q_1$ to $Q_3$. It stands to reason that if a data value is too far removed from this interval, we should call it an outlier. Of course, we expect values to be farther away from the center (here, $Q_2$) when the spread (here, the IQR) is large, and closer to the center when the spread is small. With this in mind, we say any value outside of the following interval is an outlier: $$(Q_1 - 1.5 \times IQR, \ Q_3 + 1.5 \times IQR)$$
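This fence test can be sketched in a few lines. Again, different texts compute quartiles slightly differently; `statistics.quantiles` defaults to the "exclusive" method, so the exact fences can vary a bit by convention.

```python
import statistics

def outlier_fences(data):
    """Return the (lower, upper) fences; values outside are outliers."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data with one value far from the main body
data = [3, 4, 5, 5, 6, 6, 7, 8, 30]

low, high = outlier_fences(data)
outliers = [x for x in data if x < low or x > high]
print(low, high, outliers)
```

For this data set the value $30$ falls above the upper fence and is flagged, while the rest of the data set is left alone.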

One might wonder where the $1.5$ in the above interval comes from. Paul Velleman, a statistician at Cornell University and a student of John Tukey (who invented this test for outliers), wondered the same thing. When he asked Tukey, "Why 1.5?", Tukey answered, "Because 1 is too small and 2 is too large."

Chebyshev's Theorem

Amazingly, even when it is inappropriate to use the mean and the standard deviation as the measures of center and spread, there is an algebraic relationship between them that holds for any distribution.

This relationship is described by Chebyshev's Theorem, which states that the proportion of values from any data set that lie within $k$ standard deviations of the mean is at least $$1 - \frac{1}{k^2} \quad \textrm{ where } k \gt 1$$

As an example, for any data set, at least 75% of the data will lie in the interval $(\overline{x} - 2s, \overline{x} + 2s)$.
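One can verify Chebyshev's guarantee empirically. The sketch below checks, for a hypothetical data set and several values of $k$, that the observed proportion of values within $k$ standard deviations of the mean meets the bound $1 - 1/k^2$.

```python
import statistics

def chebyshev_bound(k):
    """Minimum proportion within k standard deviations (k > 1)."""
    return 1 - 1 / k**2

# An arbitrary, hypothetical data set -- the theorem makes no
# assumption about its shape
data = [1, 3, 5, 5, 6, 7, 8, 9, 12, 40]

xbar = statistics.mean(data)
s = statistics.stdev(data)

for k in (1.5, 2, 3):
    within = sum(1 for x in data if abs(x - xbar) < k * s) / len(data)
    # The observed proportion must meet or exceed the guaranteed bound
    assert within >= chebyshev_bound(k)
    print(k, chebyshev_bound(k), within)
```

For $k = 2$ the bound is $0.75$, matching the 75% claim above; the observed proportions are typically well above the bound, which is a worst-case guarantee.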