To find the height at $x$ of the graph of the $\chi^2$ density function with $df$ degrees of freedom (the ordinary central $\chi^2$ distribution, whose noncentrality parameter is $0$),
R: use the function
dchisq(x,df)
Excel: use the function
CHISQ.DIST(x,df,FALSE)
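As a quick sanity check of the height function in R, the sketch below compares dchisq with the closed-form density for 2 degrees of freedom, $f(x) = \tfrac{1}{2}e^{-x/2}$. The values $x = 2$ and $df = 2$ are chosen here purely for illustration.

```r
# Height of the chi-square density with 2 degrees of freedom at x = 2
height <- dchisq(2, df = 2)

# For df = 2 the chi-square density reduces to (1/2) * exp(-x/2)
closed_form <- 0.5 * exp(-2 / 2)

print(height)
print(isTRUE(all.equal(height, closed_form)))  # TRUE if the two agree
```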
To find the probability that a random variable following a $\chi^2$ distribution with $df$ degrees of freedom takes a value less than $x$ (i.e., the area under the chi-square density curve to the left of $x$),
R: use the function
pchisq(x,df)
As an example, to find the probability that a random variable following a chi-squared distribution with $7$ degrees of freedom is less than $14.06714$, one can use:
> pchisq(14.06714,df=7)
[1] 0.95
Excel: use the function
CHISQ.DIST(x,df,TRUE)
The last argument in the function above, when TRUE, indicates that an area under the curve (a cumulative probability) should be returned instead of a height.
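Hypothesis tests usually need the area to the right of $x$ rather than to the left. A minimal sketch in R, reusing the value $14.06714$ and $7$ degrees of freedom from the example above: the right-tail area can be obtained either by subtracting the left-tail area from 1 or through the lower.tail argument of pchisq.

```r
x <- 14.06714

# Area to the left of x (the default behavior of pchisq)
left_area <- pchisq(x, df = 7)

# Area to the right of x, computed two equivalent ways
right_area     <- pchisq(x, df = 7, lower.tail = FALSE)
right_area_alt <- 1 - left_area

print(c(left = left_area, right = right_area, right_alt = right_area_alt))
```

For areas very far out in the tail, the lower.tail = FALSE form is generally the safer choice, since subtracting a number close to 1 from 1 loses precision.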
To find the $x$ value such that a random variable following a chi-squared distribution with $df$ degrees of freedom is less than $x$ with a given probability $p$ (equivalently, the $x$ value that has an area of $p$ to its left under the corresponding chi-squared density curve),
R: use the function
qchisq(p,df)
As an example, to find the 95th percentile of the $\chi^2$ distribution with 7 degrees of freedom, one can use:
> qchisq(0.95,df=7)
[1] 14.06714
Excel: use the function
CHISQ.INV(p,df)
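Because qchisq inverts pchisq, feeding a percentile back into pchisq should recover the original probability. The sketch below illustrates this in R for the 95th percentile with 7 degrees of freedom, and also shows how the same cutoff can be requested as an upper-tail critical value for $\alpha = 0.05$. The variable names are illustrative only.

```r
p        <- 0.95
deg_free <- 7

# 95th percentile: area of 0.95 lies to the left of x_crit
x_crit <- qchisq(p, df = deg_free)

# Round trip: the area to the left of x_crit should be p again
round_trip <- pchisq(x_crit, df = deg_free)

# Same cutoff, requested as the point with area 0.05 to its right
x_crit_upper <- qchisq(0.05, df = deg_free, lower.tail = FALSE)

print(c(x_crit = x_crit, round_trip = round_trip, x_crit_upper = x_crit_upper))
```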
To conduct a goodness-of-fit test,
R: use the function
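chisq.test(x,p=p)
where $x$ is a vector of observed counts and $p$ is a same-length vector of expected proportions. The argument $p$ must be passed by name, since the second positional argument of chisq.test is a different parameter ($y$).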
As an example, suppose in some random sample of collected wild tulips, 81 are red, 50 are yellow, and 27 are white. One wishes to test the claim that these colored tulips are present in equal proportions in the wild.
If these three colors of tulips were present in equal proportions, one third would be red, one third would be yellow, and one third would be white. Of the 158 tulips in our sample, we would then expect $158/3 \approx 52.67$ tulips of each color. Since each expected count is at least $5$, the chi-square approximation is reasonable, and we can proceed with the following:
> tulip = c(81,50,27)
> chisq.test(tulip,p=c(1/3,1/3,1/3))

	Chi-squared test for given probabilities

data:  tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07
Seeing such a small $p$-value, which is less than any reasonable significance level (e.g., $0.05$), we reject the claim that these colored tulips are present in equal proportions in the wild.
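To see where the reported numbers come from, the test statistic $\sum (O - E)^2 / E$ and its $p$-value can be reproduced by hand. This is a sketch of the standard goodness-of-fit calculation using the tulip counts above. The variable names are illustrative and are not part of the output of chisq.test.

```r
# Observed counts from the tulip sample
observed <- c(red = 81, yellow = 50, white = 27)

# Expected counts under the claim of equal proportions
expected <- sum(observed) * rep(1 / 3, 3)

# Chi-square statistic: sum of (observed - expected)^2 / expected
statistic <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus 1
deg_free <- length(observed) - 1

# p-value: area to the right of the statistic
p_value <- pchisq(statistic, df = deg_free, lower.tail = FALSE)

print(c(statistic = statistic, df = deg_free, p_value = p_value))
```

The values obtained this way should agree with the X-squared, df, and p-value lines reported by chisq.test.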
To conduct a test for independence or homogeneity of proportions,
R:
To conduct a test for independence in R, it is useful to have our data in the form of a table.
As a concrete example, suppose we have polled 356 people with regard to their status as a smoker (i.e., "current", "former", or "never") and their socio-economic status (i.e., "low", "middle", or "high"). The results have been compiled in a CSV ("comma-separated values") file named smoker.csv. One can load this data into an R data frame and display a table of counts for each possible combination of values of the two variables by placing this file in one's working directory (to find out where this is, type getwd() at the R prompt) and executing the following:
> smokerData = read.csv(file='smoker.csv',sep=',',header=T)
> smokeTable = table(smokerData$Smoke,smokerData$SES)
> smokeTable

          High Low Middle
  current   51  43     22
  former    92  28     21
  never     68  22      9
If you are curious, the sep argument tells R which character is being used as the separating delimiter, and the header argument tells R to expect a header row naming the categorical variables at the top of the file. In this example, they are named "Smoke" and "SES".
Alternatively, one can create a table manually (without counting any raw data), by doing something similar to the following:
> smoke = matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
> colnames(smoke) = c("High","Low","Middle")
> rownames(smoke) = c("current","former","never")
> smokeTable = as.table(smoke)
> smokeTable
        High Low Middle
current   51  43     22
former    92  28     21
never     68  22      9
Once you have the table constructed -- either from the raw data or through manual construction -- performing a test for independence is easy. Just ask R to give you a summary of the table, as shown below.
> summary(smokeTable)
Number of cases in table: 356
Number of factors: 2
Test for independence of all factors:
	Chisq = 18.51, df = 4, p-value = 0.0009808
Here, the $p$-value is approximately $0.001$, well below traditional significance levels such as $\alpha = 0.05$. We therefore reject the null hypothesis that one's socio-economic status and one's status as a smoker are independent.
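The statistic in the summary can also be reproduced by hand, which makes the expected counts visible. Under independence, the expected count in each cell is (row total)(column total)/(grand total). The sketch below applies that formula to the smoker counts. The variable names are illustrative.

```r
# Observed counts: rows are smoking status, columns are socio-economic status
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE,
                dimnames = list(c("current", "former", "never"),
                                c("High", "Low", "Middle")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(smoke), colSums(smoke)) / sum(smoke)

# Chi-square statistic and degrees of freedom (rows - 1) * (columns - 1)
statistic <- sum((smoke - expected)^2 / expected)
deg_free  <- (nrow(smoke) - 1) * (ncol(smoke) - 1)

# p-value: area to the right of the statistic
p_value <- pchisq(statistic, df = deg_free, lower.tail = FALSE)

print(round(expected, 2))
print(c(statistic = statistic, df = deg_free, p_value = p_value))
```

Printing the expected counts also lets one confirm that every cell has an expected count of at least 5, the usual rule of thumb for relying on the chi-square approximation.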
Alternatively, if one wants to get to the $p$-value for the test more quickly (i.e., without constructing a table with named rows and columns), one can also just apply the chisq.test() function to the corresponding matrix, as shown below:
> chisq.test(matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE))

	Pearson's Chi-squared test

data:  matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
X-squared = 18.51, df = 4, p-value = 0.0009808
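If the individual pieces of the result are needed for further work, the object returned by chisq.test can be stored and its components extracted by name. A minimal sketch, using the same matrix of counts (the name result is illustrative):

```r
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)

# Store the test result instead of only printing it
result <- chisq.test(smoke)

print(result$statistic)  # the X-squared value
print(result$parameter)  # the degrees of freedom
print(result$p.value)    # the p-value
print(result$expected)   # expected counts under independence
```

Note that for $2 \times 2$ tables, chisq.test applies Yates' continuity correction by default (correct = TRUE), so its output there can differ from a hand calculation of the uncorrected statistic. For larger tables such as this one, no correction is applied.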