## Shape, Center, and Spread of a Distribution

A population parameter is a characteristic or measure obtained by using all of the data values in a population.

A sample statistic is a characteristic or measure obtained by using data values from a sample.

The parameters and statistics with which we first concern ourselves attempt to quantify the "center" and "spread" (i.e., variability) of a data set. Note, there are several different measures of center and several different measures of spread that one can use -- one must be careful to use appropriate measures given the shape of the data's distribution, the presence of extreme values, and the nature and level of the data involved.

### The Shape of a Distribution

We can characterize the shape of a data set by looking at its histogram.

First, if the data values seem to pile up into a single "mound", we say the distribution is unimodal. If there appear to be two "mounds", we say the distribution is bimodal. If there are more than two "mounds", we say the distribution is multimodal.

Second, we focus on whether the distribution is symmetric, or if it has a longer "tail" on one side or the other. In the case where there is a longer "tail", we say the distribution is skewed in the direction of the longer tail. In the case where the longer tail is associated with larger data values, we say the distribution is skewed right (or positively skewed). In the case where the longer tail is associated with smaller (or more negative) values, we say the distribution is skewed left (or negatively skewed).

If the distribution is symmetric, we will often need to check if it is roughly bell-shaped, or has a different shape. In the case of a distribution where each rectangle is roughly the same height, we say we have a uniform distribution.
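As a rough illustration of how the rectangle heights of a histogram arise, the following sketch counts how many values fall into each of several equal-width classes. The function name `histogram_counts`, the choice of equal-width classes, and the data values are assumptions made for this example only.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall in each of num_bins equal-width classes."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; classes have zero width")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value would otherwise land one class too far right,
        # so it is placed in the last class.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # made-up values
counts = histogram_counts(data, 5)

# Print a crude text histogram, one row per class.
for count in counts:
    print("#" * count)
```

Here the counts rise to a single peak and fall off evenly on both sides, which is the pattern described above as unimodal and symmetric.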

The below graphic gives a few examples of the aforementioned distribution shapes.

### Measures of Center

• For interval or ratio level data, one measure of center is the mean. A population mean is denoted by $\mu$, while the sample mean is denoted by $\overline{x}$. Assuming the population has size $N$, a sample has size $n$, and $x$ spans across all available data values in the population or sample, as appropriate, we have $$\mu = \frac{\sum x}{N} \quad \textrm{ and } \quad \overline{x} = \frac{\sum x}{n}$$

• The median, denoted by $Q_2$ (or med), is the middle value of a data set when it is written in order. In the case of an even number of data values (and thus no exact middle), it is the average of the middle two data values. Unlike the mean, it can be used for ordinal data, and it is largely unaffected by the presence of extreme values in the data set.

• The mode is the most frequent data value in the population or sample. There can be more than one mode, although in the case where there are no repeated data values, we say there is no mode. Modes can be used even for nominal data.

• The midrange is just the average of the highest and lowest data values. While easily understood, it is strongly affected by extreme values in the data set, and does not reliably find the center of a distribution.
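The four measures of center above can be sketched with Python's standard `statistics` module. The helper names `midrange` and `modes` and the data values are made up for illustration; `modes` follows the convention stated above that a data set with no repeated values has no mode.

```python
from statistics import mean, median, multimode


def midrange(values):
    """Average of the highest and lowest data values."""
    return (min(values) + max(values)) / 2


def modes(values):
    """List of most frequent values, or an empty list if nothing repeats."""
    if len(set(values)) == len(values):
        return []  # no repeated values: "no mode" under the convention above
    return multimode(values)


data = [2, 3, 3, 5, 7, 8, 40]  # made-up values with one extreme value

print("mean    :", mean(data))      # pulled upward by the value 40
print("median  :", median(data))    # middle value of the sorted data
print("mode(s) :", modes(data))
print("midrange:", midrange(data))  # depends only on the two extremes
```

With these values the median is 5 and the mode is 3, while the mean (about 9.7) and the midrange (21) are both dragged toward the extreme value 40.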

### Measures of Spread

In addition to knowing where the center is for a given distribution, we often want to know how "spread out" the distribution is -- this gives us a measure of the variability of values taken from this distribution. The below graphic shows the general shape of three symmetric unimodal distributions with identical measures of center, but very different amounts of "spread".

• The range is technically the difference between the highest and lowest values of a distribution, although it is often reported by simply listing the minimum and maximum values seen. It is strongly affected by extreme values present in the distribution.

• When the mean is the most appropriate measure of center, then the most appropriate measure of spread is the standard deviation. This measurement is obtained by taking the square root of the variance -- which is essentially the average squared distance between the data values and the mean.

As such, the population variance, $\sigma^2$, and population standard deviation, $\sigma$, are given by $$\sigma^2 = \frac{\sum (x-\mu)^2}{N} \quad \textrm{ and } \quad \sigma = \sqrt{\frac{\sum (x-\mu)^2}{N}}$$ When dealing with a sample, a slight alteration to the denominators in these formulas must be made (dividing by $n-1$ rather than by the number of data values) so that the sample variance, $s^2$, is an unbiased estimate of the population variance, $\sigma^2$. The sample standard deviation, $s$, is then the square root of the sample variance, as seen below. $$s^2 = \frac{\sum (x-\overline{x})^2}{n-1} \quad \textrm{ and } \quad s = \sqrt{\frac{\sum (x-\overline{x})^2}{n-1}}$$

• When the median is the most appropriate measure of center, then the interquartile range (or IQR) is the most appropriate measure of spread. When the data are sorted, the IQR is simply the range of the middle half of the data. If the data have quartiles $Q_1, Q_2, Q_3, Q_4$ (noting that $Q_2$ is the median and $Q_4$ is the maximum value), then $$IQR = Q_3 - Q_1$$ Unlike the range itself, the IQR is not easily affected by the presence of extreme data values.
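The measures of spread above can be computed with the standard `statistics` module, as in the sketch below. The data values are made up, and the quartile method is an assumption: textbooks and software use several slightly different conventions for locating $Q_1$ and $Q_3$, so the IQR found here may differ a little from one found by hand.

```python
from statistics import pstdev, pvariance, quantiles, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

data_range = max(data) - min(data)  # highest value minus lowest value

# Treating the values as an entire population (divide by N).
pop_variance = pvariance(data)
pop_std_dev = pstdev(data)

# Treating the values as a sample (divide by n - 1).
sample_variance = variance(data)
sample_std_dev = stdev(data)

# Assumption: the "inclusive" quartile method; other conventions exist.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("range:", data_range)
print("population variance and standard deviation:", pop_variance, pop_std_dev)
print("sample variance and standard deviation:", sample_variance, sample_std_dev)
print("Q1, Q2, Q3:", q1, q2, q3, " IQR:", iqr)
```

For these values the squared deviations from the mean sum to 32, so the population variance is $32/8 = 4$ while the sample variance is $32/7$, which shows the effect of the $n-1$ denominator.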

### Determining Significant Skewness

Note, the presence of skewness (or outliers) can affect where the measures of center are located relative to one another, as the below graphic suggests.

As can be seen, when significant skewness is present, the mean and median end up in different places. Turning this around, if the mean and median are far enough apart, we can determine if an observed skewness is significant.

To this end, Pearson's Skewness Index, $I$, is defined as $$I = \frac{3(\overline{x} - Q_2)}{s}$$ As for whether or not the mean and median are far enough apart (relative to the spread of the distribution), we say that if $|I| \ge 1$, then the data set is significantly skewed.
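A minimal sketch of this calculation follows, treating the data as a sample so that $\overline{x}$ and $s$ apply. The function names and the data values are hypothetical.

```python
from statistics import mean, median, stdev


def pearson_skewness_index(sample):
    """I = 3 * (sample mean - median) / sample standard deviation."""
    return 3 * (mean(sample) - median(sample)) / stdev(sample)


def is_significantly_skewed(sample):
    """Apply the |I| >= 1 rule of thumb."""
    return abs(pearson_skewness_index(sample)) >= 1


skewed = [2, 3, 3, 5, 7, 8, 40]  # made-up values with a long right tail
symmetric = [1, 2, 3, 4, 5]      # made-up values, mean equals median

print(pearson_skewness_index(skewed), is_significantly_skewed(skewed))
print(pearson_skewness_index(symmetric), is_significantly_skewed(symmetric))
```

A positive index corresponds to a mean above the median (skewed right), and a negative index to a mean below the median (skewed left).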

### Identifying Outliers

An outlier is a data value significantly far removed from the main body of a data set. Recall that in calculating the IQR we measure the span of the central half of a data set, from $Q_1$ to $Q_3$. It stands to reason that if a data value is too far removed from this interval, we should call it an outlier. Of course, we expect values to be farther away from the center (here, $Q_2$) when the spread (here, the IQR) is large, and closer to the center when the spread is small. With this in mind, we say any value outside of the following interval is an outlier. $$(Q_1 - 1.5 \times IQR, Q_3 + 1.5 \times IQR)$$
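The test can be sketched as follows. The function name `find_outliers` and the data are made up, and the quartile method is again an assumption, since different conventions for $Q_1$ and $Q_3$ can shift the interval slightly.

```python
from statistics import quantiles


def find_outliers(values, multiplier=1.5):
    """Return the values lying outside (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    # Assumption: the "inclusive" quartile method; other conventions exist.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # The interval is open, so a value exactly on an endpoint is outside it.
    return [x for x in values if not (lower < x < upper)]


data = [2, 3, 3, 5, 7, 8, 40]  # made-up values
print(find_outliers(data))
```

With this quartile method, $Q_1 = 3$ and $Q_3 = 7.5$, so the IQR is $4.5$ and the interval is $(-3.75, 14.25)$; only the value 40 falls outside it.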

One might wonder where the $1.5$ in the above interval comes from -- Paul Velleman, a statistician at Cornell University, was a student of John Tukey, who invented this test for outliers. He wondered the same thing. When he asked Tukey, "Why 1.5?", Tukey answered, "Because 1 is too small and 2 is too large."

### Chebyshev's Theorem

Amazingly, even if it is inappropriate to use the mean and the standard deviation as the measures of center and spread, there is an algebraic relationship between them that can be exploited in any distribution.

This relationship is described by Chebyshev's Theorem, which concludes that the proportion of values from any data set that lie within $k$ standard deviations of the mean is at least $$1 - \frac{1}{k^2} \quad \textrm{ where } k \gt 1$$

As an example, for any data set, at least 75% of the data will lie in the interval $(\overline{x} - 2s, \overline{x} + 2s)$.
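The guarantee can be checked numerically on any data set. The sketch below compares the bound $1 - 1/k^2$ with the proportion of values actually found within $k$ sample standard deviations of the mean; the function names and the data are hypothetical.

```python
from statistics import mean, stdev


def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k**2


def proportion_within(sample, k):
    """Observed proportion of values strictly inside (mean - k*s, mean + k*s)."""
    center, spread = mean(sample), stdev(sample)
    inside = [x for x in sample if abs(x - center) < k * spread]
    return len(inside) / len(sample)


data = [2, 3, 3, 5, 7, 8, 40]  # made-up, strongly skewed values

for k in (1.5, 2, 3):
    print(k, chebyshev_bound(k), proportion_within(data, k))
```

Even for this skewed data set, 6 of the 7 values (about 86%) lie within two standard deviations of the mean, which is above the guaranteed 75%. The bound is a minimum, so the observed proportion is often considerably larger.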