Often, one is interested in comparing means from two different populations. For example, suppose one wanted to characterize the difference, if one exists, between the mean heights of men and women. Toward this end, one might consider the mean heights seen in two simple random samples -- one of 50 men and the other of 50 women.

When the samples involved are taken independently of one another, we can mirror to a large extent what one does when dealing with proportions in two populations.

The null hypothesis is $\mu_1 = \mu_2$, while the alternative hypothesis is either $\mu_1 \neq \mu_2$, $\mu_1 \gt \mu_2$, or $\mu_1 \lt \mu_2$.

However, if we rewrite these so that the null hypothesis is $\mu_1 - \mu_2 = 0$ and the alternative hypothesis is either $\mu_1 - \mu_2 \neq 0$, $\mu_1 - \mu_2 \gt 0$, or $\mu_1 - \mu_2 \lt 0$, we can focus on a single distribution in our analysis -- the distribution of differences of sample means $\overline{x}_1 - \overline{x}_2$.

As seen with the difference of sample proportions, recall that if $X$ and $Y$ are independent, normally distributed random variables, then $X-Y$ is also normally distributed, with a mean and variance given by $\mu_{X-Y} = \mu_X - \mu_Y$ and $\sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y$.

Recalling the Central Limit Theorem, one way to be reasonably assured we are dealing with two normal distributions is to be looking at distributions of sample means $\overline{x}_1$ and $\overline{x}_2$ where the corresponding sample sizes, $n_1$ and $n_2$, are both greater than $30$ -- another is to know that the underlying populations are normally distributed.

The Central Limit Theorem also tells us that the standard deviations of the sample means $\overline{x}_1$ and $\overline{x}_2$ are given by $$SD(\overline{x}_1) = \frac{\sigma_1}{\sqrt{n_1}} \quad \textrm{ and } \quad SD(\overline{x}_2) = \frac{\sigma_2}{\sqrt{n_2}}.$$ Thus, when the samples are independent, the standard deviation of the difference of sample means $\overline{x}_1 - \overline{x}_2$ is given by $$SD(\overline{x}_1 - \overline{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
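The step from the two individual standard deviations to the standard deviation of the difference goes through the variances, which add when the samples are independent: $$\sigma^2_{\overline{x}_1 - \overline{x}_2} = \sigma^2_{\overline{x}_1} + \sigma^2_{\overline{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$ As a purely hypothetical illustration (these values are assumptions chosen for the arithmetic, not data from any study), taking $\sigma_1 = 3$, $\sigma_2 = 2.5$, and $n_1 = n_2 = 50$ gives $$SD(\overline{x}_1 - \overline{x}_2) = \sqrt{\frac{9}{50} + \frac{6.25}{50}} = \sqrt{0.305} \approx 0.552$$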

At this point, if we knew all of the parameters involved, we could calculate a $z$-score for the difference in sample means for the samples actually taken, and proceed with the rest of the hypothesis test.

If, however (as is often the case), we don't know the values of $\sigma_1$ and $\sigma_2$, we can use the standard error instead -- provided the sample standard deviations $s_1$ and $s_2$ approximate the values of $\sigma_1$ and $\sigma_2$ well, respectively. This yields a test statistic of the form $$z = \frac{(\overline{x}_1 - \overline{x}_2) - 0}{\displaystyle{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}}$$
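To make the large-sample test statistic concrete, here is a minimal Python sketch that uses only the standard library. The function name `two_sample_z` and the summary statistics (heights in inches) are hypothetical, chosen only for illustration. The two-sided $p$-value is one of the three possible alternatives.

```python
import math
from statistics import NormalDist

def two_sample_z(xbar1, s1, n1, xbar2, s2, n2):
    """Return (z, two-sided p-value) for H0: mu1 - mu2 = 0.

    Assumes independent samples, both large enough that s1 and s2
    stand in well for the unknown sigma1 and sigma2.
    """
    standard_error = math.sqrt(s1**2 / n1 + s2**2 / n2)
    z = ((xbar1 - xbar2) - 0) / standard_error
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Hypothetical summary statistics: 50 men and 50 women.
z, p = two_sample_z(69.5, 3.0, 50, 64.0, 2.5, 50)
print(f"z = {z:.3f}, two-sided p = {p:.4g}")
```

For a one-sided alternative, one would instead use the single tail area on the appropriate side of $z$.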

Similarly, one can find a confidence interval for $\mu_1 - \mu_2$ with bounds given by $$(\overline{x}_1 - \overline{x}_2) \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

So the above works well when $s_1$ and $s_2$ well-approximate the values of $\sigma_1$ and $\sigma_2$ -- something that can be reasonably relied upon when $n_1$ and $n_2$ are large enough (i.e., both greater than 30). But what about when the sample size is small?
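Before turning to small samples, the large-sample interval just described can be sketched as follows. The function name `two_sample_z_interval` and the input values are hypothetical. The critical value $z_{\alpha/2}$ comes from the standard library's `NormalDist().inv_cdf`.

```python
import math
from statistics import NormalDist

def two_sample_z_interval(xbar1, s1, n1, xbar2, s2, n2, confidence=0.95):
    """Return (low, high) bounds of a large-sample interval for mu1 - mu2."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    margin = z_crit * math.sqrt(s1**2 / n1 + s2**2 / n2)
    diff = xbar1 - xbar2
    return diff - margin, diff + margin

# Hypothetical summary statistics: 50 men and 50 women (inches).
low, high = two_sample_z_interval(69.5, 3.0, 50, 64.0, 2.5, 50)
print(f"95% interval for mu1 - mu2: ({low:.2f}, {high:.2f})")
```

If the resulting interval excludes $0$, that agrees with rejecting the null hypothesis in the corresponding two-sided test at the same $\alpha$-level.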

When at least one sample size is small (i.e., when $n_1 \lt 30$ or $n_2 \lt 30$, or both), the error introduced by estimating $\sigma_i$ with $s_i$ makes enough of a difference that we can no longer ignore it.

The good news is that by using a special, adjusted degrees of freedom value, the distribution of $$t = \frac{(\overline{x}_1 - \overline{x}_2) - 0}{\displaystyle{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}}$$ is so close to a $t$-distribution that nobody will be able to tell the difference. The formula in question is $$df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2} {\frac{1}{n_1-1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{s_2^2}{n_2}\right)^2}$$

Yes, this can be a bit cumbersome to calculate. So instead, one might observe that this number of degrees of freedom must always be at least the smaller of $n_1-1$ and $n_2-1$, and at most $n_1 + n_2 - 2$. To be conservative, you should always use the lower value (i.e., the one associated with the most variance). Consequently, some statisticians suggest that one take the degrees of freedom to be the smaller of $n_1 - 1$ and $n_2 - 1$.^{†}
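As a rough illustration of how the adjusted degrees of freedom compare with the conservative choice, the sketch below computes both. The function names `adjusted_df` and `conservative_df` and the sample values are hypothetical.

```python
def adjusted_df(s1, n1, s2, n2):
    """Degrees of freedom from the adjusted formula given above."""
    v1 = s1**2 / n1
    v2 = s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

def conservative_df(n1, n2):
    """The smaller of n1 - 1 and n2 - 1."""
    return min(n1 - 1, n2 - 1)

# Hypothetical small samples: s1 = 3.0 with n1 = 12, s2 = 2.5 with n2 = 15.
s1, n1, s2, n2 = 3.0, 12, 2.5, 15
df_full = adjusted_df(s1, n1, s2, n2)
df_safe = conservative_df(n1, n2)
df_max = n1 + n2 - 2

print(f"conservative: {df_safe}, adjusted: {df_full:.2f}, upper bound: {df_max}")
```

The adjusted value is generally not a whole number. Software typically uses it as is, while a printed $t$-table would require rounding down.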

Importantly, though -- when conducting a means test on two samples that are not both large, **if the variances of two samples disagree, other statisticians (like us) feel that using a $t$-distribution is not even appropriate**, and that instead a Wilcoxon test for independent samples (a nonparametric test) would be a better choice.

If we are willing to assume that their variances are equal, though, we can pool the data from the two groups to both estimate the common variance and make that complicated formula for the degrees of freedom given above much simpler.

The common variance approximation is given by $$s^2_{p} = \frac{(n_1 - 1)s^2_1 + (n_2 - 1)s^2_2}{(n_1 - 1) + (n_2 - 1)}$$ Computing the associated pooled standard error, we have $$SE_{p}(\overline{x}_1 - \overline{x}_2) = \sqrt{s^2_p \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

With these modifications, the test statistic $$t = \frac{(\overline{x}_1 - \overline{x}_2) - 0}{\displaystyle{\sqrt{s^2_p \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}}$$ then follows a $t$-distribution with $n_1 + n_2 - 2$ degrees of freedom.
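The pooled calculation can be sketched in a few lines. The function name `pooled_t` and the summary statistics are hypothetical, and the sketch assumes the equal-variance assumption has already been judged reasonable.

```python
import math

def pooled_t(xbar1, s1, n1, xbar2, s2, n2):
    """Return (t, df, pooled variance) for H0: mu1 - mu2 = 0,
    assuming the two populations share a common variance."""
    df = (n1 - 1) + (n2 - 1)
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    pooled_se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = ((xbar1 - xbar2) - 0) / pooled_se
    return t, df, pooled_var

# Hypothetical small samples (heights in inches).
t_stat, df, sp2 = pooled_t(69.5, 3.0, 12, 64.0, 2.5, 15)
print(f"s_p^2 = {sp2:.2f}, t = {t_stat:.3f}, df = {df}")
```

The resulting `t_stat` would then be compared against a $t$-distribution with `df` degrees of freedom, using a table or statistical software.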

So one might naturally ask, how do we know the variances are equal?

Suppose one is conducting an experiment. We might start by creating treatment groups as different random samples from the same population, in which case each treatment group begins with the same population variance. In this case, assuming equal variances is equivalent to assuming that the treatment doesn't change the variance.

When we test the difference between means, the null hypothesis is almost always that the true means are equal. If we are willing to stretch that idea to say that the treatments made no difference at all (for example, that the treatment is no different from the placebo offered as a control), then we might be willing to assume that the variances have remained equal.^{††}

We should test *that* hypothesis, however. Yes, that means that in the course of testing the hypothesis that two means are equal, we might have a step where we must run a secondary hypothesis test that two variances are equal!

Here, our null hypothesis is $\sigma^2_1 = \sigma^2_2$, and our alternative hypothesis is $\sigma^2_1 \neq \sigma^2_2$.

Under the assumption that our samples were drawn independently and that each underlying population is approximately normal, we can look at the test statistic of $$F = \frac{s^2_1}{s^2_2} \textrm{ where } s_1 \textrm{ is chosen so that } F \ge 1 \quad (\textrm{i.e., } s_1 \ge s_2)$$ which follows an $F$-distribution with $n_1 - 1$ degrees of freedom associated with the numerator and $n_2 - 1$ degrees of freedom associated with the denominator.

The selection of $s_1$ so that it forces $F \ge 1$ allows us to only concern ourselves with the right tail of the distribution (i.e., we only worry about exceeding a single critical value in the right tail). Depending on how we are determining that critical value, however, be aware that we may need to consequently halve the $\alpha$-level to compensate for the nicety of only having to deal with a right-tailed test.

As a specific example, if we have an $\alpha$-level of $0.05$, we will want to find the critical value so that $0.025$ is in the right tail of the related $F$ distribution.
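The bookkeeping for this variance test, in particular making sure the larger sample variance lands in the numerator along with its degrees of freedom, might be sketched as below. The function name `variance_ratio_f` and the sample values are hypothetical. Python's standard library has no $F$-distribution, so the sketch stops at the statistic and the halved tail area.

```python
def variance_ratio_f(s_a, n_a, s_b, n_b):
    """Return (F, df_numerator, df_denominator) with the larger sample
    variance placed in the numerator so that F >= 1."""
    if s_a >= s_b:
        return s_a**2 / s_b**2, n_a - 1, n_b - 1
    return s_b**2 / s_a**2, n_b - 1, n_a - 1

# Hypothetical samples: the second one has the larger standard deviation,
# so it supplies the numerator and the numerator degrees of freedom.
F, df_num, df_den = variance_ratio_f(2.5, 15, 3.0, 12)

alpha = 0.05
right_tail_area = alpha / 2  # two-sided test handled in the right tail only

print(f"F = {F:.2f} with ({df_num}, {df_den}) df; right-tail area = {right_tail_area}")
```

The critical value leaving `right_tail_area` in the right tail would then come from an $F$-table or from software. If SciPy happens to be available, `scipy.stats.f.ppf(1 - right_tail_area, df_num, df_den)` is one way to obtain it.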

Slightly modifying the initial example given at the top of this page, what if we wanted to know something about the difference in heights between men and *their spouses*? Now the samples taken are no longer independent. Knowing something about the height of a man present in one of the samples may tell us something about a woman in the other sample -- specifically, it may tell us something about the woman to whom he is married. (It may be the case, for example, that men generally prefer to marry women that are slightly shorter than they are.)

When there is a natural pairing that can be made between samples like this, we say we have **paired data** and consequently, **dependent samples**. (It should go without saying that if the data is paired, each sample must have the same size, $n$.)

In this case, we construct null and alternative hypotheses that say something about the mean of the differences seen in each pair. Specifically, letting $\mu_d$ denote the mean difference between paired elements, we have a null hypothesis of $\mu_d = 0$ and alternative hypotheses of either $\mu_d \neq 0$, $\mu_d \gt 0$, or $\mu_d \lt 0$.

Importantly, our hypotheses say something about the *single* set of differences seen. That means that once we have calculated all of these individual differences, the rest of the test (or the construction of the related confidence interval) will be no different than a one-sample hypothesis test for a mean (or a one-sample confidence interval for a mean). Rejoice! These are problems with which we have already dealt!

Under the assumption that $n \ge 30$ or the distribution of differences is approximately normal, the test statistic $$t = \frac{\overline{d} - \mu_d}{\displaystyle{\frac{s_d}{\sqrt{n}}}}$$ follows a $t$-distribution with $n-1$ degrees of freedom. Similarly, the confidence interval for $\mu_d$ has bounds $$\overline{d} \pm t_{\alpha/2} \frac{s_d}{\sqrt{n}}$$
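A minimal sketch of the paired calculation follows. The function name `paired_t` and the six made-up couples are hypothetical. With so few pairs, the sketch additionally assumes the differences are approximately normal.

```python
import math
from statistics import mean, stdev

def paired_t(first, second, mu_d=0.0):
    """Return (t, df) for the mean of the paired differences first - second."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same size")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    d_bar = mean(diffs)
    s_d = stdev(diffs)  # sample standard deviation (divides by n - 1)
    t = (d_bar - mu_d) / (s_d / math.sqrt(n))
    return t, n - 1

# Hypothetical heights (inches) of six married couples, listed in matching order.
husbands = [70, 68, 72, 66, 71, 69]
wives = [65, 66, 67, 64, 68, 63]

t_paired, df_paired = paired_t(husbands, wives)
print(f"t = {t_paired:.3f} with {df_paired} degrees of freedom")
```

Once the differences are formed, everything else is the familiar one-sample procedure, which is why the function never looks at the two original lists again.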

† : Granted, this can leave one with less than half of the degrees of freedom to which one is entitled under the correct formula.

†† : This discussion of experimental design is a very lightly modified version of the same found in