Often, one is interested in comparing means from two different populations. For example, suppose one wanted to characterize the difference, if one exists, between the mean heights of men and women. Towards this end, one might consider the mean heights seen in two simple random samples -- one of 50 men and the other of 50 women.

When the samples involved are taken independently of one another, we can mirror to a large extent what one does when dealing with proportions in two populations.

The null hypothesis is $\mu_1 = \mu_2$, while the alternative hypothesis is either $\mu_1 \neq \mu_2$, $\mu_1 \gt \mu_2$, or $\mu_1 \lt \mu_2$.

However, if we rewrite these so that the null hypothesis is $\mu_1 - \mu_2 = 0$ and the alternative hypothesis is either $\mu_1 - \mu_2 \neq 0$, $\mu_1 - \mu_2 \gt 0$, or $\mu_1 - \mu_2 \lt 0$, we can focus on a single distribution in our analysis -- the distribution of differences of sample means $\overline{x}_1 - \overline{x}_2$.

As seen with the difference of sample proportions, recall that if $X$ and $Y$ are normally distributed random variables, then $X-Y$ is also normally distributed, with a mean and variance given by $\mu_{X-Y} = \mu_X - \mu_Y$ and $\sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y$.

Recalling the Central Limit Theorem, one way to be reasonably assured we are dealing with two normal distributions is to be looking at distributions of sample means $\overline{x}_1$ and $\overline{x}_2$ where the corresponding sample sizes, $n_1$ and $n_2$, are both greater than $30$.

The Central Limit Theorem also tells us that the standard deviations of the sample means $\overline{x}_1$ and $\overline{x}_2$ are given by $$SD(\overline{x}_1) = \frac{\sigma_1}{\sqrt{n_1}} \quad \textrm{ and } \quad SD(\overline{x}_2) = \frac{\sigma_2}{\sqrt{n_2}}.$$ Thus, since the variances of independent random variables add, the standard deviation of the difference of sample means $\overline{x}_1 - \overline{x}_2$ is given by $$SD(\overline{x}_1 - \overline{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

At this point, if we knew all of the parameters involved, we could calculate a $z$-score for the difference in sample means for the samples actually taken, and proceed with the rest of the hypothesis test.

If, however (as is often the case), we don't know the values of $\sigma_1$ and $\sigma_2$, we can use the standard error instead, where these values are approximated by the sample standard deviations, $s_1$ and $s_2$. This yields a test statistic of the form $$z = \frac{(\overline{x}_1 - \overline{x}_2) - 0}{\displaystyle{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}}$$
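As a minimal sketch of the large-sample test statistic above, the following Python computes $z$ and a two-sided $p$-value from summary statistics. The function name `two_sample_z` and the numbers passed to it are hypothetical, chosen only for illustration; they are not data from this page.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(xbar1, s1, n1, xbar2, s2, n2):
    """Return (z, two-sided p-value) for H0: mu1 - mu2 = 0.

    Assumes independent samples that are both large enough for the
    normal approximation to be reasonable.
    """
    # Standard error of xbar1 - xbar2, with s1 and s2 standing in for sigma1 and sigma2
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    z = ((xbar1 - xbar2) - 0) / se
    # Area in both tails beyond |z| under the standard normal curve
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Hypothetical summary statistics (heights in inches), for illustration only
z, p = two_sample_z(xbar1=69.5, s1=3.0, n1=50, xbar2=64.0, s2=2.5, n2=50)
print(z, p)
```

For a one-sided alternative, one would use the area in a single tail instead of doubling it.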

Similarly, one can find a confidence interval for $\mu_1 - \mu_2$ with bounds given by $$(\overline{x}_1 - \overline{x}_2) \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Technically, when we start approximating $\sigma_i$ by $s_i$ we are now looking at the difference of two random variables that each follow a $t$ distribution.

When the sample sizes are large enough, these $t$ distributions are very close to normal distributions, and the difference of two normal distributions is itself a normal distribution, so everything discussed still gives good results.
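For large samples, the confidence interval bounds for $\mu_1 - \mu_2$ given earlier might be computed along these lines. This is only a sketch: the function name `two_sample_z_interval`, its `confidence` parameter, and the sample numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_interval(xbar1, s1, n1, xbar2, s2, n2, confidence=0.95):
    """Return (lower, upper) bounds of a large-sample interval for mu1 - mu2."""
    alpha = 1 - confidence
    # z_{alpha/2}: the value leaving alpha/2 in the right tail of the standard normal
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * sqrt(s1**2 / n1 + s2**2 / n2)
    diff = xbar1 - xbar2
    return diff - margin, diff + margin


# Hypothetical summary statistics (heights in inches), for illustration only
lower, upper = two_sample_z_interval(69.5, 3.0, 50, 64.0, 2.5, 50)
print(lower, upper)
```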

However, when the sample sizes are small (i.e., when either $n_1 \lt 30$ or $n_2 \lt 30$), the fact that we are really dealing with $t$-distributions and not normal distributions makes enough of a difference that we can no longer ignore it.

What's worse -- unlike the difference of two normal distributions which itself must be normal, the difference of two $t$-distributions is NOT actually a $t$-distribution.

The good news is that by using a special, adjusted degrees of freedom value, we can make the resulting distribution so close to a $t$-distribution that nobody will be able to tell the difference. The formula in question is $$df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2} {\frac{1}{n_1-1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{s_2^2}{n_2}\right)^2}$$
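Since the formula is easy to mistype, here is a small sketch of it in Python. This adjustment is commonly called the Welch-Satterthwaite approximation; the function name `adjusted_df` and the example values are hypothetical.

```python
def adjusted_df(s1, n1, s2, n2):
    """Adjusted degrees of freedom for the difference of two sample means."""
    v1 = s1**2 / n1  # estimated variance of xbar1
    v2 = s2**2 / n2  # estimated variance of xbar2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


# Hypothetical sample standard deviations and sizes, for illustration only
df = adjusted_df(s1=3.0, n1=12, s2=1.0, n2=20)
print(df)
```

The result is generally not a whole number. When a table is used to find critical values, it is common (and conservative) to round it down.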

Yes -- this can be a bit cumbersome to calculate. Some texts (like Triola's) make the observation that this number of degrees of freedom must always be at least the smaller of $n_1-1$ and $n_2-1$, and at most $n_1 + n_2 - 2$. To be conservative, you should always use the lower value (i.e., the one associated with the most variance). So we could take the degrees of freedom to be the smaller of $n_1 - 1$ and $n_2 - 1$. Granted, that approximation can be a poor choice as it can give you less than half the degrees of freedom to which you are entitled from the correct formula.

If the variances of the two samples disagree, some statisticians feel that using a $t$-distribution is not even appropriate, and that instead a Wilcoxon test for independent samples (a nonparametric measure) would be a better choice.

If we are willing to assume that their variances are equal, though, we can pool the data from the two groups to estimate the common variance and make that complicated formula for the degrees of freedom given above much simpler.

The common variance approximation is given by $$s^2_{p} = \frac{(n_1 - 1)s^2_1 + (n_2 - 1)s^2_2}{(n_1 - 1) + (n_2 - 1)}$$ Computing the associated pooled standard error, we have $$SE_{p}(\overline{x}_1 - \overline{x}_2) = \sqrt{s^2_p \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

With these modifications, the test statistic $$t = \frac{(\overline{x}_1 - \overline{x}_2) - 0}{\displaystyle{\sqrt{s^2_p \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}}$$ then follows a $t$-distribution with $n_1 + n_2 - 2$ degrees of freedom.
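The pooled calculation can be sketched as follows, assuming equal population variances as described above. The function name `pooled_t` and the summary statistics are hypothetical and serve only to illustrate the arithmetic.

```python
from math import sqrt


def pooled_t(xbar1, s1, n1, xbar2, s2, n2):
    """Return (t, df) for H0: mu1 - mu2 = 0 using a pooled variance estimate."""
    df = (n1 - 1) + (n2 - 1)
    # Pooled variance: the sample variances weighted by their degrees of freedom
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se_pooled = sqrt(sp2 * (1 / n1 + 1 / n2))
    t = ((xbar1 - xbar2) - 0) / se_pooled
    return t, df


# Hypothetical summary statistics, for illustration only
t, df = pooled_t(xbar1=10.0, s1=2.0, n1=10, xbar2=8.0, s2=2.0, n2=10)
print(t, df)
```

The resulting $t$ would then be compared against a $t$-distribution with `df` degrees of freedom, using a table or statistical software.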

So one might naturally ask, how do we know the variances are equal?

Suppose one is conducting an experiment. We might start by creating treatment groups as different random samples from the same population, in which case each treatment group begins with the same population variance. In this case, assuming equal variances is equivalent to assuming that the treatment doesn't change the variance.

When we test the difference between means, the null hypothesis is almost always that the true means are equal. If we are willing to stretch that idea to say that the treatments made no difference at all (for example, that the treatment is no different from the placebo offered as a control), then we might be willing to assume that the variances have remained equal.^{†}

We should test *that* hypothesis, however. Yes, that means that in the course of testing the hypothesis that two means are equal, we might have a step where we must run a secondary hypothesis test that two variances are equal!

Here, our null hypothesis is $\sigma^2_1 = \sigma^2_2$, and our alternative hypothesis is $\sigma^2_1 \neq \sigma^2_2$.

Under the assumption that our samples were drawn independently and that each sample comes from a population that is approximately normally distributed, we can look at the test statistic of $$F = \frac{s^2_1}{s^2_2} \textrm{ where } s_1 \textrm{ is chosen so that } F \ge 1 \quad (\textrm{i.e., } s_1 \ge s_2)$$ which follows an $F$-distribution with $n_1 - 1$ degrees of freedom associated with the numerator and $n_2 - 1$ degrees of freedom associated with the denominator.

The selection of $s_1$ so that it forces $F \ge 1$ allows us to only concern ourselves with the right tail of the distribution (i.e., we only worry about exceeding a single critical value in the right tail). Depending on how we are determining that critical value, however, be aware that we may need to consequently halve the $\alpha$-level to compensate for the nicety of only having to deal with a right-tailed test.

As a specific example, if we have an $\alpha$-level of $0.05$, we will want to find the critical value so that $0.025$ is in the right tail of the related $F$ distribution.
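The bookkeeping for this test statistic, in particular keeping the degrees of freedom matched to whichever sample ends up in the numerator, might look like the sketch below. The function name `variance_ratio_f` and the inputs are hypothetical. Finding the critical value itself is left to a table or statistical software.

```python
def variance_ratio_f(s_a, n_a, s_b, n_b):
    """Return (F, df_numerator, df_denominator) with F >= 1.

    The sample with the larger standard deviation plays the role of
    sample 1, so its variance goes in the numerator.
    """
    if s_a >= s_b:
        return s_a**2 / s_b**2, n_a - 1, n_b - 1
    # Swap roles so the larger variance is on top
    return s_b**2 / s_a**2, n_b - 1, n_a - 1


# Hypothetical sample standard deviations and sizes, for illustration only
F, df_num, df_den = variance_ratio_f(s_a=2.0, n_a=15, s_b=3.0, n_b=21)
print(F, df_num, df_den)
```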

Slightly modifying the initial example given at the top of this page, what if we wanted to know something about the difference in heights between men and *their spouses*? Now the samples taken are no longer independent. Knowing something about the height of a man present in one of the samples may tell us something about a woman in the other sample -- specifically, it may tell us something about the woman to whom he is married. (It may be the case, for example, that men generally prefer to marry women that are slightly shorter than they are.)

When there is a natural pairing that can be made between samples like this, we say we have **paired data** and consequently, **dependent samples**. (It goes without saying that if the data is paired, each sample must have the same size, $n$.)

In this case, we construct null and alternative hypotheses that say something about the mean of the differences seen in each pair. Specifically, letting $\mu_d$ denote the mean difference between paired elements, we have a null hypothesis of $\mu_d = 0$ and alternative hypotheses of either $\mu_d \neq 0$, $\mu_d \gt 0$, or $\mu_d \lt 0$.

Importantly, our hypotheses say something about the *single* set of differences seen. That means that once we have calculated all of these individual differences, the rest of the test (or the construction of the related confidence interval) will be no different than a one-sample hypothesis test for a mean (or a one-sample confidence interval for a mean). Rejoice! These are problems with which we have already dealt!

Under the assumption that $n \ge 30$ or the distribution of differences is approximately normal, the test statistic $$t = \frac{\overline{d} - \mu_d}{\displaystyle{\frac{s_d}{\sqrt{n}}}}$$ follows a $t$-distribution with $n-1$ degrees of freedom. Similarly, the confidence interval for $\mu_d$ has bounds $$\overline{d} \pm t_{\alpha/2} \frac{s_d}{\sqrt{n}}$$
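To illustrate how the paired test reduces to a one-sample calculation on the differences, here is a minimal sketch. The function name `paired_t` and the heights are made up for illustration. With only five pairs, one would have to assume the differences are approximately normally distributed.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second, mu_d=0.0):
    """Return (t, df) for paired data, testing H0: mean difference = mu_d."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same size")
    # One difference per pair; everything after this is a one-sample t-test
    d = [a - b for a, b in zip(first, second)]
    n = len(d)
    t = (mean(d) - mu_d) / (stdev(d) / sqrt(n))  # stdev is the sample SD, s_d
    return t, n - 1


# Hypothetical heights (inches) of five married couples, for illustration only
husbands = [70, 72, 68, 71, 69]
wives = [66, 67, 65, 66, 66]
t, df = paired_t(husbands, wives)
print(t, df)
```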

† : This discussion of experimental design is a very lightly modified version of the same found in