R:
Suppose you have the following CSV file, tires.csv, which gives the brand of tire and the mileage for 60 cars, and you are curious whether there is a significant difference in the average mileage a car will get based on the brand of tire it has. So you wish to do an ANOVA test on the tire mileages by brand.
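As a first step, a minimal sketch of the import (assuming tires.csv is in your working directory) might look like the following, producing the tires data frame used below:

tires = read.csv("tires.csv")   # columns "Brands" and "Mileage", as used below
head(tires)                     # peek at the first few rows to confirm the import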
Upon importing the data, it is a good idea to check if the independent variable is a factor (and hence, a categorical variable) -- this is required for the ANOVA test in R. The is.factor()
function can help here. If the result is FALSE, we can create the necessary factor:
> is.factor(tires$Brands)
[1] FALSE
> brandsF = factor(tires$Brands)
You will want, of course, to check the assumptions of the test. Recall that you can check whether the populations from which the samples were obtained are normal by inspecting histograms of the samples, checking for outliers, checking skewness, and creating QQ-plots. Recall also that you can check for homogeneity of variances with the var.test() function.
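By way of illustration, a minimal sketch of these checks (assuming the tires data frame and the brandsF factor created above) might look like the following. Note that var.test() compares the variances of two samples at a time, so below it is applied to the first two brands:

# histograms and QQ-plots for each brand's sample of mileages
par(mfrow = c(2, 2))
for (b in levels(brandsF)) {
  hist(tires$Mileage[brandsF == b], main = b, xlab = "Mileage")
}
par(mfrow = c(2, 2))
for (b in levels(brandsF)) {
  qqnorm(tires$Mileage[brandsF == b], main = b)
  qqline(tires$Mileage[brandsF == b])
}

# compare variances for two brands at a time (var.test() works on pairs)
var.test(tires$Mileage[brandsF == levels(brandsF)[1]],
         tires$Mileage[brandsF == levels(brandsF)[2]])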
If everything looks good, now simply execute the following command to see the results of the ANOVA test:
> summary(aov(tires$Mileage ~ brandsF))
            Df Sum Sq Mean Sq F value   Pr(>F)
brandsF      3  256.3   85.43   17.94 2.78e-08 ***
Residuals   56  266.6    4.76
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The "F value
" of 17.94
above provides the test statistic for the test. Also, note the "Pr(>F)
" value of 2.78e-08
-- this is the $p$-value for the test. The 3 asterisks by this value indicate that the $p$-value is highly significant (i.e., less than $\alpha = 0.001$).
We conclude that there is a highly significant difference in average mileage associated with the brand of tire a car has.
Sometimes, however, you don't have the data in such a convenient form.
As an example, suppose you also want to see if there is a difference in mileage as it relates to the size of the vehicle. This time, the data to which you have access is presented in a different way:
$$\begin{array}{c|c|c} \textrm{Small} & \textrm{Midsize} & \textrm{Large}\\\hline 44 & 36 & 29\\ 39 & 53 & 42\\ 37 & 43 & 38\\ 54 & 42 & 35\\ 39 & 52 & 34\\ 44 & 49 & 35\\ 42 & 41 & 30 \end{array}$$
We can reconstruct the size-mileage pairs (represented through a size factor and a mileage vector) with the following:
sizeF = factor(rep(c("small","midsize","large"), each=7))  # remember, the independent variable must be a factor
small.mileage = c(44,39,37,54,39,44,42)
midsize.mileage = c(36,53,43,42,52,49,41)
large.mileage = c(29,42,38,35,34,35,30)
mileage = c(small.mileage, midsize.mileage, large.mileage)
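As a quick (optional) sanity check on this reconstruction, you might look at the mean mileage within each size group before running the test:

tapply(mileage, sizeF, mean)   # mean mileage for each vehicle size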
Finally, we conduct the ANOVA test with:
> results = summary(aov(mileage ~ sizeF))
> results
            Df Sum Sq Mean Sq F value  Pr(>F)
sizeF        2  416.9  208.43   6.825 0.00622 **
Residuals   18  549.7   30.54
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note: despite its "pretty printing" above, the value results is actually a list of lists -- its first element, results[[1]], is a data frame (itself a list) whose fifth column is named "Pr(>F)" and contains the $p$-value found. So we could access the $p$-value with either of the following:

results[[1]][["Pr(>F)"]][1]
results[[1]][[5]][1]
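This can be handy if you want to use the $p$-value programmatically. As a minimal sketch, one might compare it against a chosen significance level:

p.value = results[[1]][["Pr(>F)"]][1]   # extract the p-value, as described above
alpha = 0.05
p.value < alpha                         # TRUE when the result is significant at level alpha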
The follow-up Scheffe tests can be conducted using functions in the "DescTools" package. If you don't have this package installed, you can install it with install.packages("DescTools").
(Note: if you are asked during the installation of the above package whether or not you wish to "install from sources the package which needs compilation?", respond with "no".)
To continue with the size vs. mileage example above, we can then conduct the follow-up Scheffe test with:
> library(DescTools)
> ScheffeTest(aov(mileage ~ sizeF))

  Posthoc multiple comparisons of means : Scheffe Test
    95% family-wise confidence level

$sizeF
                    diff     lwr.ci    upr.ci   pval
midsize-large  10.428571   2.552566 18.304576 0.0088 **
small-large     8.000000   0.123995 15.876005 0.0461 *
small-midsize  -2.428571 -10.304576  5.447434 0.7176

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The $p$-values for each pair tested are in the column labeled pval. From this column, it is clear that at the $\alpha = 0.05$ significance level, there is a significant difference between the large category and the other two categories.
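As an aside, the value returned by ScheffeTest() can also be inspected programmatically. Judging from the $sizeF label in the printed output, it is a list with one matrix of results per factor, so the pairwise $p$-values can presumably be extracted with something like:

scheffe.results = ScheffeTest(aov(mileage ~ sizeF))
scheffe.results$sizeF[, "pval"]   # p-values for each pairwise comparison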