R:
Suppose you have the following CSV file, tires.csv, which gives the brand of tire and the mileage for 60 cars, and you are curious whether there is a significant difference in the average mileage a car will get based on the brand of tire it has. So you wish to do an ANOVA test on the tire mileages by brand.
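As a first step, a minimal sketch of the import (assuming tires.csv is in your working directory) might look like the following, producing the tires data frame used below:

tires = read.csv("tires.csv")   # columns "Brands" and "Mileage", as used below
head(tires)                     # peek at the first few rows to confirm the import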
Upon importing the data, it is a good idea to check if the independent variable is a factor (and hence, a categorical variable) -- this is required for the ANOVA test in R. The is.factor()
function can help here. If the result is FALSE, we can create the necessary factor:
> is.factor(tires$Brands)
[1] FALSE
> brandsF = factor(tires$Brands)
You will want, of course, to check the assumptions of the test. Recall that you can check whether the populations from which the samples were obtained are normal by inspecting histograms of the samples, checking for outliers, checking skewness, and creating QQ-plots. Recall also that you can check for homogeneity of variances with the var.test() function.
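By way of illustration, a minimal sketch of these checks (assuming the tires data frame and the brandsF factor created above) might look like the following. Note that var.test() compares the variances of two samples at a time, so below it is applied to the first two brands:

# histograms and QQ-plots for each brand's sample of mileages
par(mfrow = c(2, 2))
for (b in levels(brandsF)) {
  hist(tires$Mileage[brandsF == b], main = b, xlab = "Mileage")
}
par(mfrow = c(2, 2))
for (b in levels(brandsF)) {
  qqnorm(tires$Mileage[brandsF == b], main = b)
  qqline(tires$Mileage[brandsF == b])
}

# compare variances for two brands at a time (var.test() works on pairs)
var.test(tires$Mileage[brandsF == levels(brandsF)[1]],
         tires$Mileage[brandsF == levels(brandsF)[2]])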
If everything looks good, now simply execute the following command to see the results of the ANOVA test:
> summary(aov(tires$Mileage ~ brandsF))
            Df Sum Sq Mean Sq F value   Pr(>F)
brandsF      3  256.3   85.43   17.94 2.78e-08 ***
Residuals   56  266.6    4.76
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The "F value
" of 17.94
above provides the test statistic for the test. Also, note the "Pr(>F)
" value of 2.78e-08
-- this is the $p$-value for the test. The 3 asterisks by this value indicate that the $p$-value is highly significant (i.e., less than $\alpha = 0.001$).
We conclude that there is a highly significant difference in average mileage associated with the brand of tire a car has.
Sometimes, however, you don't have the data in such a convenient form.
As an example, suppose you also want to see if there is a difference in mileage as it relates to the size of the vehicle. This time, the data to which you have access is presented in a different way:
$$\begin{array}{c|c|c} \textrm{Small} & \textrm{Midsize} & \textrm{Large}\\\hline 44 & 36 & 29\\ 39 & 53 & 42\\ 37 & 43 & 38\\ 54 & 42 & 35\\ 39 & 52 & 34\\ 44 & 49 & 35\\ 42 & 41 & 30 \end{array}$$
We can reconstruct the size-mileage pairs (represented through a size factor and a mileage vector) with the following:
sizeF = factor(rep(c("small","midsize","large"), each=7))  # remember, the independent variable must be a factor
small.mileage = c(44,39,37,54,39,44,42)
midsize.mileage = c(36,53,43,42,52,49,41)
large.mileage = c(29,42,38,35,34,35,30)
mileage = c(small.mileage, midsize.mileage, large.mileage)
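As a quick (optional) sanity check on this reconstruction, you might look at the mean mileage within each size group before running the test:

tapply(mileage, sizeF, mean)   # mean mileage for each vehicle size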
Finally, we conduct the ANOVA test with:
> results = summary(aov(mileage ~ sizeF))
> results
            Df Sum Sq Mean Sq F value  Pr(>F)
sizeF        2  416.9  208.43   6.825 0.00622 **
Residuals   18  549.7   30.54
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note: despite its "pretty printing" above, the value results is actually a list of lists -- its first element, results[[1]], is a data frame (itself a list) whose fifth column is named "Pr(>F)" and contains the $p$-value found. So we could access the $p$-value with either of the following:

results[[1]][["Pr(>F)"]][1]
results[[1]][[5]][1]
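This can be handy if you want to use the $p$-value programmatically. As a minimal sketch, one might compare it against a chosen significance level:

p.value = results[[1]][["Pr(>F)"]][1]   # extract the p-value, as described above
alpha = 0.05
p.value < alpha                         # TRUE when the result is significant at level alpha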
The follow-up Scheffe tests can be conducted using functions in the "DescTools" package. If you don't have this package installed, you can install it with install.packages("DescTools").
(Note: if you are asked during the installation of the above package whether or not you wish to "install from sources the package which needs compilation?", respond with "no".)
To continue with the size vs. mileage example above, we can then conduct the follow-up Scheffe test with:
> library(DescTools)
> ScheffeTest(aov(mileage ~ sizeF))

  Posthoc multiple comparisons of means : Scheffe Test
    95% family-wise confidence level

$sizeF
                    diff     lwr.ci    upr.ci   pval
midsize-large  10.428571   2.552566 18.304576 0.0088 **
small-large     8.000000   0.123995 15.876005 0.0461 *
small-midsize  -2.428571 -10.304576  5.447434 0.7176

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The $p$-values for each pair tested are in the column labeled pval. From this column, it is clear that at the $\alpha = 0.05$ significance level, there is a significant difference between the large category and the other two categories.
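As an aside, the value returned by ScheffeTest() can also be inspected programmatically. Judging from the $sizeF label in the printed output, it is a list with one matrix of results per factor, so the pairwise $p$-values can presumably be extracted with something like:

scheffe.results = ScheffeTest(aov(mileage ~ sizeF))
scheffe.results$sizeF[, "pval"]   # p-values for each pairwise comparison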