The Normal Distribution

A normal distribution will be our first, and arguably most important, example of a continuous probability distribution, so let's take a moment to describe what that is first...

Continuous Probability Distributions

A continuous probability distribution describes the probabilities of the possible values of a continuous random variable. Recall that a continuous random variable is a random variable whose set of possible values is infinite and uncountable. As a result, a continuous probability distribution cannot be expressed in tabular form. Instead, we must use an equation or formula (i.e., the probability distribution function -- in this context, also called a probability density function, or PDF) to describe a continuous probability distribution.

Probabilities for a continuous random variable $X$ are defined as areas under the curve of its PDF. That is to say, $P(a \lt X \lt b)$ is the area under the curve between $x=a$ and $x=b$. Thus, only ranges of values can have a nonzero probability. The probability that a continuous random variable equals any particular value is always zero.
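For readers who would like to see this numerically, the following sketch (in Python, using SciPy's quad integrator -- choices of convenience, not anything prescribed by these notes) computes such an area for a simple example PDF, $f(x) = e^{-x}$ for $x \ge 0$:

    # Sketch: P(a < X < b) as an area under a PDF, found by numerical
    # integration. The PDF here (exponential, rate 1) is just an example.
    import math
    from scipy.integrate import quad

    def pdf(x):
        return math.exp(-x)            # f(x) = e^(-x) for x >= 0

    area, _ = quad(pdf, 1.0, 2.0)      # P(1 < X < 2)
    point, _ = quad(pdf, 1.5, 1.5)     # an interval of zero width...

    print(area)    # ~0.2325 (exactly e^(-1) - e^(-2))
    print(point)   # 0.0 -- P(X = 1.5) is exactly zero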

All probability density functions $f(x)$ satisfy the following conditions:

  1. $f(x) \ge 0$ for every $x$ (the curve never dips below the horizontal axis); and

  2. the total area under the curve of $f(x)$ equals $1$.

For those familiar with calculus, we can also define a mean and variance for a continuous probability distribution with a range of $[a,b]$, as the following (although, we won't use these definitions beyond their mention here). $$\mu = \int_a^b xf(x)\,dx \quad \textrm{ and } \quad \sigma^2 = \int_a^b (x - \mu)^2 f(x)\,dx$$ The standard deviation for a continuous probability distribution is still defined as the square root of the variance.
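As a quick sanity check of these definitions, here is a small sketch (again Python with SciPy, as an assumed toolchain) that computes the integrals numerically for the uniform distribution on $[0,1]$, whose PDF is $f(x) = 1$:

    # Sketch: the mean/variance integrals above, for f(x) = 1 on [0, 1].
    # Known answers: mean = 1/2, variance = 1/12.
    from scipy.integrate import quad

    f = lambda x: 1.0                  # uniform PDF on [0, 1]
    a, b = 0.0, 1.0

    mu, _  = quad(lambda x: x * f(x), a, b)
    var, _ = quad(lambda x: (x - mu) ** 2 * f(x), a, b)

    print(mu)          # 0.5
    print(var)         # ~0.0833 (= 1/12)
    print(var ** 0.5)  # the standard deviation, sqrt(variance)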

The Standard Normal Curve

The standard normal curve is given by the following formula:

$$y=\frac{e^{-\frac{1}{2}x^2}}{\sqrt{2\pi}}$$

The graph of this curve is shown below:

Note, it has the following properties: it is "bell-shaped", attaining its maximum at $x = 0$ and decreasing as one moves away from $0$ in either direction; it is symmetric about the $y$-axis; it approaches, but never touches, the $x$-axis as $x \to \pm\infty$; and the total area under the curve is $1$.

Using the definitions for mean and variance as they relate to continuous probability density functions, we can show that the standard normal curve has a mean of $0$ and a standard deviation of $1$.
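One can also confirm these claims numerically. The sketch below (Python with SciPy, assumed purely for illustration) integrates the standard normal curve to check that its area is $1$, its mean is $0$, and its standard deviation is $1$:

    # Sketch: checking area, mean, and standard deviation of the
    # standard normal curve y = e^(-x^2/2) / sqrt(2*pi).
    import math
    from scipy.integrate import quad

    phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

    area, _ = quad(phi, -math.inf, math.inf)
    mean, _ = quad(lambda x: x * phi(x), -math.inf, math.inf)
    var, _  = quad(lambda x: x * x * phi(x), -math.inf, math.inf)  # mean is 0

    print(area)         # ~1.0
    print(mean)         # ~0.0
    print(var ** 0.5)   # ~1.0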

Some additional properties of interest: the curve has inflection points at $x = -1$ and $x = 1$ (i.e., one standard deviation to either side of the mean), and its height decreases toward $0$ extremely quickly -- essentially all of its area lies between $x = -4$ and $x = 4$.

By altering the formula for the standard normal curve slightly, we can build the entire family of normal curves...

Consider the following,

$$y = \frac{e^{-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2}}{\sigma\sqrt{2\pi}}$$

We have replaced $x$ with $\frac{x - \mu}{\sigma}$, which shifts the original curve to the right by $\mu$ units and horizontally stretches it by a factor of $\sigma$.

The presence of another $\sigma$ in the denominator then vertically compresses the resulting graph by a factor of $\sigma$.

The net effect preserves the overall "bell shape" of the original graph while moving the mean to $\mu$ and changing the standard deviation to $\sigma$, leaving the area under the curve unchanged. (The area is still $1$, since the curve was horizontally stretched and vertically compressed by the same factor.)

We associate this curve with a Normal distribution of mean $\mu$ and standard deviation $\sigma$.
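The following sketch (Python with SciPy, an assumed setup) checks both claims above: the curve's area stays $1$ for any choice of $\mu$ and $\sigma$, and each such curve is just the standard normal curve shifted by $\mu$ and stretched/compressed by $\sigma$:

    # Sketch: the general normal curve integrates to 1 for any (mu, sigma),
    # and equals the standard normal curve after shifting and scaling.
    import math
    from scipy.integrate import quad

    def normal_pdf(x, mu, sigma):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

    phi = lambda t: normal_pdf(t, 0, 1)

    for mu, sigma in [(0, 1), (10, 2), (-3, 0.5)]:
        area, _ = quad(normal_pdf, -math.inf, math.inf, args=(mu, sigma))
        print(mu, sigma, round(area, 6))            # area is 1 every time

    # shift right by mu, stretch by sigma, compress height by sigma:
    mu, sigma, t = 10, 2, 1.3
    print(normal_pdf(mu + sigma * t, mu, sigma), phi(t) / sigma)  # equal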

The graphic below shows a few normal distributions with various means and standard deviations.

The Empirical Rule

For any normal distribution:

  1. approximately 68% of the distribution lies within 1 standard deviation of the mean;

  2. approximately 95% of the distribution lies within 2 standard deviations of the mean; and

  3. approximately 99.7% of the distribution lies within 3 standard deviations of the mean.
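These Empirical Rule figures are rounded. If more precise values are wanted, the normal CDF gives them directly; here is a quick sketch using scipy.stats.norm (an assumption of convenience, not part of these notes):

    # Sketch: exact proportions within k standard deviations of the mean.
    from scipy.stats import norm

    for k in (1, 2, 3):
        p = norm.cdf(k) - norm.cdf(-k)   # P(mu - k*sigma < X < mu + k*sigma)
        print(k, round(p, 4))            # 0.6827, 0.9545, 0.9973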

These approximations can, of course, be used in conjunction with the symmetry of a normal distribution to approximate a few more proportions (such as the proportion of the distribution in the yellow, red, and blue vertical bands shown below).

An Outlier Test

If, upon looking at a distribution, it appears to be unimodal, symmetric, and shaped in such a way that it visually appears "normally distributed", we can use the Empirical Rule to develop a test that can be applied to any data value in that distribution to determine whether or not it should be treated as an outlier.

Recall that, according to the Empirical Rule, any normal distribution has the vast majority of its data (i.e., 99.7%) in the interval $(\mu - 3\sigma,\mu + 3\sigma)$. As such, approximating $\mu$ by $\overline{x}$ and $\sigma$ by $s$ in a sample, we are suspicious of any data value outside of $$(\overline{x} - 3s,\overline{x} + 3s)$$ and consequently declare data values outside this interval to be outliers.
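Here is a minimal sketch of this outlier test in Python; the data set is invented purely for illustration:

    # Sketch: flag values outside (xbar - 3s, xbar + 3s) as outliers.
    import statistics

    data = [13, 14, 15, 14, 13, 16, 15, 14, 13, 15,
            14, 16, 15, 13, 14, 15, 14, 16, 13, 60]

    xbar = statistics.mean(data)
    s = statistics.stdev(data)                     # sample standard deviation

    lo, hi = xbar - 3 * s, xbar + 3 * s
    print([x for x in data if x < lo or x > hi])   # [60]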

Z-scores

Incredibly important in statistics is the ability to compare how rare or unlikely two data values from two different distributions might be. For example, which is more unlikely: a 400 lb sumo wrestler or a 7.5 foot basketball player? It may seem like we are comparing apples and oranges here -- and in a sense, we are. However, when the distributions involved are both normal distributions, there is a way to make this comparison in a quantitative way.

Consider the following consequences of the Empirical Rule:

Data values more than 1 standard deviation away from the mean are relatively common, occurring with a probability of about $0.32$. Values more than 2 standard deviations away from the mean are less common, occurring with a probability of only about $0.05$. Values more than 3 standard deviations away from the mean are very unlikely (so much so that we mark their occurrence by labeling them as outliers), occurring with a probability of about $0.003$.
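For the curious, the exact two-tail probabilities behind these rounded figures can be pulled from the standard normal distribution; a quick sketch, assuming SciPy:

    # Sketch: P(|Z| > k) for k = 1, 2, 3 standard deviations.
    from scipy.stats import norm

    for k in (1, 2, 3):
        print(k, round(2 * norm.sf(k), 4))   # 0.3173, 0.0455, 0.0027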

Of course, we need not limit ourselves to distances measured as integer multiples of the standard deviation. With a bit of calculus, one can estimate the probability of seeing values more than $k$ standard deviations away from the mean for any positive real value $k$. In doing this, one finds (not surprisingly, given the shape of the distribution) that the probability continues to decrease as $k$ increases. As such, we can compare the rarity of two values -- even two values coming from two different distributions -- by comparing how many of their respective standard deviations they are from their respective means.

This measure -- the number of standard deviations, $\sigma$, some value $x$ is from the mean, $\mu$, in a normal distribution -- is called the $z$-score for $x$. The manner of its calculation is straightforward: $$z = \frac{x - \mu}{\sigma}$$ (Note, for folks who haven't had calculus, one can look up probabilities associated with a normal distribution using a normal probability table and $z$-scores. One can also use the normalcdf(a,b) distribution function on many TI calculators to find the probability of falling between $a$ and $b$ in a standard normal distribution.)
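To make the sumo-wrestler-versus-basketball-player comparison concrete, here is a sketch. Be warned: the means and standard deviations below are invented for illustration, not real figures for either population:

    # Sketch: comparing rarity across two different normal distributions
    # via z-scores. All parameters below are hypothetical.
    from scipy.stats import norm

    def z_score(x, mu, sigma):
        return (x - mu) / sigma

    z_sumo  = z_score(400, mu=300, sigma=40)  # hypothetical: weights ~ N(300, 40)
    z_hoops = z_score(90, mu=78, sigma=3.5)   # hypothetical: heights (in) ~ N(78, 3.5)

    # A larger |z| means a rarer value; norm.sf(z) = P(Z > z).
    print(z_sumo, norm.sf(z_sumo))     # 2.5,   ~0.0062
    print(z_hoops, norm.sf(z_hoops))   # ~3.43, ~0.0003 -- rarer, given these numbers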

Determining the Normality of a Data Set

Some of the inferences we will draw about a population given sample data depend upon the underlying distribution being a normal distribution. There are multiple ways to check how "normal" a given data set is, but for now it will suffice to

  1. Draw a histogram. If the histogram departs dramatically from a "bell-shaped curve", one should not assume the underlying distribution is normal.

  2. Check for outliers. Carefully investigate any that are found. If an outlier is the result of a legitimate error (e.g., a recording or data-entry mistake), it may be safe to throw out that value and proceed. If not, or if there is more than one outlier, the distribution might not be normal.

  3. Check for skewness. If the histogram indicates a strong skew to the right or left, the underlying distribution is likely not normal.

  4. Check to see if the percentages of data within 1, 2, and 3 standard deviations of the mean are similar to those predicted by the Empirical Rule. If they aren't, your distribution may not be normal. Alternatively, if you have a larger data set, you should consider making a normal quantile plot. If your data is normally distributed, you can expect this plot to show a pattern of points that is reasonably close to a straight line. (Note: if your plot shows some systematic pattern that is clearly not a straight-line pattern, you should not assume your distribution is normal.) Normal quantile plots are tedious to do by hand, but certain computer programs can produce them quickly and easily. (See the qqnorm() function in R, or the sketch below.)
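The notes above mention qqnorm() in R; for those working in Python instead, here is a sketch of checks 4 using simulated data (scipy.stats.probplot plays the role of the normal quantile plot here -- the tooling, like the data, is an assumption for illustration):

    # Sketch: Empirical-Rule percentage check plus a normal quantile plot.
    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = rng.normal(loc=50, scale=10, size=200)   # simulated sample

    xbar, s = data.mean(), data.std(ddof=1)
    for k in (1, 2, 3):
        frac = np.mean(np.abs(data - xbar) < k * s)
        print(k, round(frac, 3))                    # compare with 0.68, 0.95, 0.997

    stats.probplot(data, dist="norm", plot=plt)     # normal quantile plot
    plt.show()   # points close to a straight line suggest normality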