One often has to deal with categorical variables in statistics (i.e., variables at the nominal or ordinal level of measurement). In R, these are best dealt with through the use of factors.
For example, fertilizers typically have three main ingredients: nitrogen (N), phosphorus (P), and potassium (K). Perhaps one is conducting an experiment to determine which of these ingredients best promotes root development, and has four treatment groups (one for each ingredient, and a control group that receives none of the ingredients).
Plants numbered 1 through 12 are randomly assigned to one of the four treatment groups so that each group ends up with 3 members. We could represent this process with the vector named f, as shown below -- where the treatment given to plant $i$ corresponds to the $i^{th}$ element of the vector:
f = c("K","K","none","N","P","P","N","N","none","P","K","none")
To make R aware that the values listed are values associated with a categorical variable (which are called levels in R), we convert this vector into a factor with the factor() function:
fertilizer = factor(f)
Asking R to show the contents of f and fertilizer suggests there is a subtle difference between the two variables, as shown below:
> f
 [1] "K" "K" "none" "N" "P" "P" "N" "N" "none" "P" "K" "none"
> fertilizer
 [1] K K none N P P N N none P K none
Levels: K N none P
First, it is clear that R is no longer considering the elements of the factor as strings of characters, given the absence of double-quotes. Second (and more importantly), additional information in the form of "Levels: K N none P" is given. The levels shown correspond to the unique values seen in the vector $f$ (i.e., the categories that represent the treatment groups).
There are other differences between a vector and a factor, which we can see if we use the str(x) function. This function in R displays a compact representation of the internal structure of any R variable $x$. Let's see what happens when we apply it to both f and fertilizer:
> str(f)
 chr [1:12] "K" "K" "none" "N" "P" "P" "N" "N" "none" "P" "K" "none"
> str(fertilizer)
 Factor w/ 4 levels "K","N","none",..: 1 1 3 2 4 4 2 2 3 4 ...
Note how in the factor fertilizer, the levels "K", "N", "none", and "P" are replaced by numbers 1, 2, 3, and 4, respectively. So internally, R only stores the numbers (indicating the level of each vector element) and (separately) the names of each unique level. (Interestingly, even if the vector's elements had been numerical, the levels are stored as strings of text.)
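To look at these two pieces separately, one can ask R for the integer codes and the level names directly. The following is a small sketch using the base R functions as.integer() and levels(); the names codes and lev are introduced here only for illustration:

```r
f = c("K","K","none","N","P","P","N","N","none","P","K","none")
fertilizer = factor(f)

# the integer codes R stores, one per element of the factor
codes = as.integer(fertilizer)
codes

# the level names, stored only once (and always as strings)
lev = levels(fertilizer)
lev

# indexing the level names by the codes recovers the original vector
identical(lev[codes], f)   # TRUE
```

Note that the order of the levels (and hence the particular codes) follows the sort order of the values, which can depend on the locale in which R is running.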
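The way R internally stores factors is important when we want to combine them. Consider the following failed attempt to combine factors a.fac and b.fac (the output shown comes from a version of R older than 4.1.0; since R 4.1.0, applying c() to factors returns a factor whose levels are the union of the original levels):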
> a.fac = factor(c("X","Y","Z","X"))
> b.fac = factor(c("X","X","Y","Y","Z"))
> factor(c(a.fac,b.fac))
[1] 1 2 3 1 1 1 2 2 3
Levels: 1 2 3
Notice how we lost the names associated with the different levels. There is a way to restore them -- but it would be better not to lose them in the first place! The as.character() function can help here. This function can be used to force a factor back into a vector whose elements are the corresponding strings of text associated with its levels. For example, as.character(factor(c("X","Y"))) returns a vector equivalent to c("X","Y").
To combine two factors (with the same levels), we force them both back to vectors in the way just described, combine the vectors with c(), and then convert the result back into a factor -- as shown below:
a.fac = factor(c("X","Y","Z","X"))
b.fac = factor(c("X","X","Y","Y","Z"))
factor(c(as.character(a.fac),as.character(b.fac)))
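To confirm that the level names survive this round trip, one can store the combined factor and inspect it. This is only a sketch; the name ab.fac is hypothetical and does not appear elsewhere in these notes:

```r
a.fac = factor(c("X","Y","Z","X"))
b.fac = factor(c("X","X","Y","Y","Z"))

# convert each factor to its character labels, combine, then re-factor
ab.fac = factor(c(as.character(a.fac), as.character(b.fac)))

ab.fac           # the elements are X Y Z X X X Y Y Z
levels(ab.fac)   # "X" "Y" "Z" -- the level names are preserved
table(ab.fac)    # counts per level: X 4, Y 3, Z 2
```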
You can, of course, also change the levels associated with a factor using levels(), as the following suggests.
> a.fac = factor(c("X","Y","Z","X"))
> a.fac
[1] X Y Z X
Levels: X Y Z
> levels(a.fac) = c("A","B","C")
> a.fac
[1] A B C A
Levels: A B C
How does the addition of factors as a data type in R help us do statistical work, you ask? Well, let us continue with the fertilizer example above as we attempt to answer that question. Suppose that the increase in root growth (measured in millimeters) for each plant is recorded after 3 weeks of treatment. These lengths are recorded in a vector named growth:
growth = c(10,12,8,13,18,19,11,11,9,21,10,10)
Suppose we are interested in the mean growth in each of our treatment groups.
Armed with only vectors f and growth, we would need to create four additional vectors of positions in $f$ that matched each string "K", "N", "none", and "P", and then use these to create four more vectors of growth values for these categories via subsetting. Then, we would need to find the mean of each of these vectors. Sounds like a lot of work, right? Well, factors -- and one of the "apply" functions -- make this process simple in the extreme...
We can use the tapply(x,f,g) function in R to apply a function to the values of a vector belonging to different categories, as determined by a factor. Here, $x$ is the vector, $f$ is the factor, and $g$ is the function to be applied. In other words, all we need to type is:
> tapply(growth,fertilizer,mean)
       K        N     none        P
10.66667 11.66667  9.00000 19.33333
Voila! It looks like phosphorus (P) really helps with root growth!
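For comparison, here is a rough sketch of the manual, vectors-only approach described earlier, shown for the "P" group and then repeated over all categories. The names pos.P, growth.P, and manual are hypothetical, and sapply() is used only to avoid writing the same steps four times:

```r
f = c("K","K","none","N","P","P","N","N","none","P","K","none")
growth = c(10,12,8,13,18,19,11,11,9,21,10,10)

# positions in f holding "P", then the growth values at those positions
pos.P = which(f == "P")
growth.P = growth[pos.P]
mean(growth.P)

# the same steps repeated for every category
manual = sapply(sort(unique(f)), function(lev) mean(growth[f == lev]))
manual

# tapply() produces the same means in a single call
all.equal(as.vector(manual), as.vector(tapply(growth, factor(f), mean)))
```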
If we wanted to stop short of finding the means associated with the four fertilizer levels, and instead simply split up the growth vector into four vectors, each consisting only of the values in growth associated with a specific fertilizer level, we can use the split() function:
> split.data = split(growth,fertilizer)
Then, we can access the vectors associated with each level of fertilizer by following the variable name split.data with a dollar sign ($) and the level name (P, N, none, or K), as shown below:
> split.data$P
[1] 18 19 21
> split.data$N
[1] 13 11 11
> split.data$none
[1]  8  9 10
> split.data$K
[1] 10 12 10
Another useful feature of factors is that one can impose an order on the levels when the factor is created. This is frequently useful when creating a factor to represent an ordinal variable. An example is shown below:
> p = c("Bears","Bears","Tigers","Bears","Lion","Tigers","Lion")
> prizes = factor(p,levels=c("Lion","Tigers","Bears"),ordered=TRUE)
> prizes
[1] Bears  Bears  Tigers Bears  Lion   Tigers Lion
Levels: Lion < Tigers < Bears
> sort(prizes)   # among other things, we can now sort the prizes factor in
                 # accordance with this explicit order (instead of the default
                 # alphabetical order)
[1] Lion   Lion   Tigers Tigers Bears  Bears  Bears
Levels: Lion < Tigers < Bears
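Sorting is not the only operation an ordered factor supports. As a brief sketch (reusing the prizes factor defined above), the comparison operators and the min() and max() functions also respect the declared order:

```r
p = c("Bears","Bears","Tigers","Bears","Lion","Tigers","Lion")
prizes = factor(p, levels=c("Lion","Tigers","Bears"), ordered=TRUE)

# comparisons follow the declared order Lion < Tigers < Bears
prizes[1] > prizes[3]    # Bears vs. Tigers: TRUE

# comparing against a level name gives one TRUE/FALSE per element
prizes >= "Tigers"

# the highest-ranked level present in the factor
max(prizes)
```

Attempting these comparisons on an unordered factor instead produces a warning and NA values, which is one practical reason to declare the order when the variable is ordinal.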
Finally, as one more useful function to consider (although there are many others) before transitioning to a discussion about tables in R, the cut() function allows us to create factors from numerical data by cutting up the continuum containing the data into different "bins", much like the breaks argument of the hist() function is used to establish the various rectangular bars/bins shown in a histogram. Indeed, the corresponding argument to the cut() function is also called breaks. Here's an example:
> xs = runif(10)
> xs
 [1] 0.8675517 0.2721003 0.5774774 0.3887704 0.8033977 0.3176221 0.7910806
 [8] 0.6419176 0.8728865 0.4013302
> bin.ids = cut(xs,breaks=seq(from=0,to=1,by=0.1),labels=FALSE)
> bin.ids   # i.e., which bin did each x in xs fall into?
 [1] 9 3 6 4 9 4 8 7 9 5
Note, the labels=FALSE parameter above makes R return a vector of simple integer codes that can then be turned into a factor.
If instead a vector of names v is supplied through the labels=v parameter of the cut() function, the result is a factor whose levels are those names, as seen below:
> xs = runif(10)
> xs
[1] 0.8675517 0.2721003 0.5774774 0.3887704 0.8033977 0.3176221 0.7910806
[8] 0.6419176 0.8728865 0.4013302
> bin.ids = cut(xs,breaks=seq(from=0,to=1,by=0.1),
+ labels=c("A","B","C","D","E","F","G","H","I","J"))
> bin.ids
[1] I C F D I D H G I E
Levels: A B C D E F G H I J
If one wishes the result of cut() to have an implicit order (like the "Lions, Tigers, and Bears" example above), one merely needs to add the argument ordered_result = TRUE when it is called.
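A short sketch of this follows. It assumes the same ten values shown above, typed in directly rather than generated by runif() so that the result is reproducible; the name bins is hypothetical:

```r
xs = c(0.8675517, 0.2721003, 0.5774774, 0.3887704, 0.8033977,
       0.3176221, 0.7910806, 0.6419176, 0.8728865, 0.4013302)

bins = cut(xs, breaks=seq(from=0,to=1,by=0.1),
           labels=c("A","B","C","D","E","F","G","H","I","J"),
           ordered_result=TRUE)

bins                 # levels now print as A < B < ... < J
is.ordered(bins)     # TRUE
bins[2] < bins[1]    # bin C lies below bin I, so TRUE
```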
The result of applying cut() can then be turned into a table showing the frequency of data values in each bin. "What's a table?", you say -- funny you should ask...
Factors can also be used to create tables in R, another important data type in terms of its relationship to statistics.
As an example, suppose that a sample of 7 people are asked the following questions in a study of workplace risk of tetanus infections:
Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))
Thinking that there might be a relationship between these two variables, we wish to construct a contingency table -- where the levels of one variable form the column headers and the levels of the other variable form the row headers, with the body of the table indicating how many subjects were associated with each possible pair of levels.
To create such a table in R, we simply use the table() command, as shown below:
> t = table(Q1,Q2)
> t
           Q2
Q1          Maybe No Yes
  Always        1  1   0
  Never         0  0   1
  Sometimes     2  1   1
Tables can be made from 1, 2, or many more factors. Recalling the example used in the previous section to show what the cut() function does, note how a table made from the single factor that results gives the frequency count for each level of this factor:
> xs
 [1] 0.8675517 0.2721003 0.5774774 0.3887704 0.8033977 0.3176221 0.7910806
 [8] 0.6419176 0.8728865 0.4013302
> bin.ids = cut(xs,breaks=seq(from=0,to=1,by=0.1),labels=FALSE)
> bin.ids
 [1] 9 3 6 4 9 4 8 7 9 5
> table(factor(bin.ids))

3 4 5 6 7 8 9
1 2 1 1 1 1 3
Getting back to the two-dimensional table t resulting from the answers to questions Q1 and Q2, let us explore some more ways tables can be used:
First -- and very similar to vectors -- one can extract individual values (or subsets of values) from a table. As an example, note that t[3,1] gives one the value $2$, located in the 3rd row, 1st column.
If one wishes to extract the entire 1st column, one simply leaves out the row number (but still uses the comma):
> t[,1]
   Always     Never Sometimes
        1         0         2
If one desires instead to extract (as a new table) columns 2 and 3 of t, one can use
> t[,2:3]
           Q2
Q1          No Yes
  Always     1   0
  Never      0   1
  Sometimes  1   1
If only the 3rd row is wanted, one simply leaves out the column number, and so on...
> t[3,]
Maybe    No   Yes
    2     1     1
Note, the results above for t[,1] and t[3,] are actually vectors -- the extra words that are shown result because the vector elements have been given names. This is not something peculiar to tables, however -- any vector can have its elements given names using the names() function, as the following suggests:
> x = c(1,2,3)
> x
[1] 1 2 3

# executing the names() function tells us that x
# currently has no names attached to it
> names(x)
NULL

# the following gives the elements of x names "a", "b", and "c"
# which we can see here results in x being displayed differently
> names(x) = c("a","b","c")
> x
a b c
1 2 3

# executing the names() function after giving x names,
# reveals the names given to it
> names(x)
[1] "a" "b" "c"

# a vector that has names allows one to subset not by
# numerical position, but by name
> x["b"]
b
2

# one can remove the names using a NULL assignment
> names(x) = NULL
> x
[1] 1 2 3
> names(x)
NULL
One can also produce new tables from existing ones. For example, suppose we wanted to see a table of relative frequencies instead of counts. Much like one might do with a vector, we simply divide the table by the sum of its elements:
> t/sum(t)
           Q2
Q1              Maybe        No       Yes
  Always    0.1428571 0.1428571 0.0000000
  Never     0.0000000 0.0000000 0.1428571
  Sometimes 0.2857143 0.1428571 0.1428571
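Base R also provides prop.table(), which performs the same division and can additionally compute proportions within each row or column. The following is a sketch only; the name row.props is hypothetical:

```r
Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))
t = table(Q1,Q2)

# with no margin given, prop.table() matches t/sum(t)
all.equal(prop.table(t), t/sum(t))

# with margin=1, each row is divided by its own row total,
# so the proportions in every row sum to 1
row.props = prop.table(t, margin=1)
row.props
rowSums(row.props)
```

Row-wise proportions like these are often what one wants when asking how the answers to Q2 are distributed within each answer to Q1.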
When working with contingency tables we often have need of marginal totals (i.e., either row or column sums in a two-dimensional table). One way to accomplish this is through the use of the apply() function, which allows us to apply any given function (here the sum() function) to the values in the table associated with each value of a given variable.
> apply(t,1,sum)
   Always     Never Sometimes
        2         1         4
Note, the second parameter being a 1 above tells R to find the sums of the values in the table associated with each value of the first table variable, Q1. That is to say, when the second parameter is a 1, R finds the row totals. Had we used a 2 instead, we would see the column totals:
> apply(t,2,sum)
Maybe    No   Yes
    3     2     2
However, R supplies another function, called addmargins(), that can find both of these vectors (and the grand total) in one command:
> addmargins(t)
           Q2
Q1          Maybe No Yes Sum
  Always        1  1   0   2
  Never         0  0   1   1
  Sometimes     2  1   1   4
  Sum           3  2   2   7
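For completeness, base R offers a few other routes to the same marginal totals. The sketch below uses rowSums(), colSums(), and margin.table(), and also pulls the grand total out of the addmargins() result by name; the variable am is hypothetical:

```r
Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))
t = table(Q1,Q2)

rowSums(t)            # same totals as apply(t,1,sum)
colSums(t)            # same totals as apply(t,2,sum)
margin.table(t, 1)    # row totals again, but returned as a table

# the result of addmargins() can be subset by name like any table
am = addmargins(t)
am["Sum","Sum"]       # the grand total, 7
am["Sometimes","Sum"] # the total for the "Sometimes" row, 4
```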
As one final useful function, note that as.vector() can collapse a table or factor into a vector. In the case of factors or tables with only one row, the result is obvious:
> f = factor(c("bob","fred","bob","bob","alice"))
> as.vector(f)
[1] "bob"   "fred"  "bob"   "bob"   "alice"
> Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
> table(Q1)
Q1
   Always     Never Sometimes
        2         1         4
> as.vector(table(Q1))
[1] 2 1 4
In the case that a table has two dimensions (i.e., several rows and columns), the columns are concatenated together to form one long vector, as seen below:
> Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
> Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))
> table(Q1,Q2)
           Q2
Q1          Maybe No Yes
  Always        1  1   0
  Never         0  0   1
  Sometimes     2  1   1
> as.vector(table(Q1,Q2))
[1] 1 0 2 1 0 1 0 1 1
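One way to convince oneself of this column-by-column ordering is to rebuild the vector from the individual columns, as in the sketch below. The names tab and v are hypothetical (tab is used instead of t so that the transpose function t() can be called without confusion):

```r
Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))
tab = table(Q1,Q2)
v = as.vector(tab)

# stacking the three columns end to end gives the same values as v
all(v == c(tab[,1], tab[,2], tab[,3]))   # TRUE

# to read the counts row by row instead, transpose the table first
as.vector(t(tab))                        # 1 1 0 0 0 1 2 1 1

# the counts can be put back into shape with matrix()
matrix(v, nrow=nrow(tab), dimnames=dimnames(tab))
```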