Tech Tips: Chi-Square Distributions

Calculating Heights of Points on the Graph of a $\chi^2$ Function

To find the height at $x$ of the graph of the $\chi^2$ density function with $df$ degrees of freedom (the usual central $\chi^2$ distribution, i.e., with non-centrality parameter $0$),

R: use the function
```
dchisq(x,df)
```
Excel: use the function
```
CHISQ.DIST(x,df,FALSE)
```
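As a quick sanity check in R, the height returned by dchisq() can be compared with a closed-form density. The sketch below assumes $df = 2$, a case in which the $\chi^2$ density reduces to $\tfrac{1}{2}e^{-x/2}$; the variable names are only illustrative.

```r
# Height of the chi-square density with 2 degrees of freedom at x = 2
x <- 2
height <- dchisq(x, df = 2)

# For df = 2 the density simplifies to exp(-x/2) / 2
manual <- exp(-x / 2) / 2

print(height)                              # about 0.1839
print(isTRUE(all.equal(height, manual)))   # TRUE
```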

Calculating Probabilities Associated with the $\chi^2$ Distribution

To find the probability that a random variable following a $\chi^2$ distribution with $df$ degrees of freedom results in a value less than $x$ (i.e., the area under the chi-square density curve to the left of $x$),

R: use the function
```
pchisq(x,df)
```
As an example, to find the probability that a random variable following a chi-squared distribution with $7$ degrees of freedom is less than $14.06714$, one can use:
```
> pchisq(14.06714,df=7)
[1] 0.95
```
Excel: use the function
```
CHISQ.DIST(x,df,TRUE)
```
The last argument in the function above, when TRUE, indicates that an area under the curve (a cumulative probability) should be returned instead of a height.
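Areas other than left tails follow from the same left-tail function. The sketch below, in R, reuses the 7-degrees-of-freedom example; lower.tail is a standard argument of pchisq(), while the variable names and the lower endpoint $2$ are chosen only for illustration.

```r
df <- 7
x  <- 14.06714

# Area to the right of x: ask for the upper tail directly
right_area <- pchisq(x, df, lower.tail = FALSE)

# Area between two values: difference of two left-tail areas
between <- pchisq(x, df) - pchisq(2, df)

print(round(right_area, 4))   # 0.05
```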

Inverse Calculations Related to the $\chi^2$ Distribution

To find the $x$ value for which a random variable following a chi-squared distribution with $df$ degrees of freedom produces an outcome less than $x$ with a given probability $p$ (equivalently, the $x$ value that has an area of $p$ to its left under the corresponding chi-squared density curve),

R: use the function
```
qchisq(p,df)
```
As an example, to find the 95th percentile of the $\chi^2$ distribution with 7 degrees of freedom, one can use:
```
> qchisq(0.95,df=7)
[1] 14.06714
```
Excel: use the function
```
CHISQ.INV(p,df)
```
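Because qchisq() inverts pchisq(), the two can be checked against each other in R. The sketch below also shows one way a right-tail critical value might be obtained; the significance level alpha = 0.05 and the variable names are assumptions made for illustration.

```r
df    <- 7
alpha <- 0.05   # illustrative significance level

# Critical value with area alpha to its RIGHT (area 1 - alpha to its left)
crit <- qchisq(1 - alpha, df)

# Round trip: the area to the left of the critical value should be 1 - alpha
left_area <- pchisq(crit, df)

print(round(crit, 5))   # 14.06714
print(left_area)        # 0.95
```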

Goodness of Fit Tests

To conduct a goodness-of-fit test,

R: use the function
```
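chisq.test(x,p=p)
```
where $x$ is a vector of observed counts and $p$ is a same-length vector of expected proportions. The proportions must be passed by name (p=...), because the second positional argument of chisq.test() is y, not p.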

As an example, suppose in some random sample of collected wild tulips, 81 are red, 50 are yellow, and 27 are white. One wishes to test the claim that these colored tulips are present in equal proportions in the wild.

If these three colors of tulips were present in equal proportions, one third would be red, one third yellow, and one third white. Of the 158 tulips in our sample, we would then expect $158/3 \approx 52.67$ tulips of each color. Since every expected count is at least $5$, the chi-square approximation is reasonable, and we can proceed with the following:
```
> tulip = c(81,50,27)
> chisq.test(tulip,p=c(1/3,1/3,1/3))

    Chi-squared test for given probabilities

data:  tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07
```
Seeing such a small $p$-value, which is less than any reasonable significance level (e.g., $0.05$), we reject the claim that these colored tulips are present in equal proportions in the wild.
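To see where these numbers come from, the test statistic $\sum (O - E)^2 / E$ and its $p$-value can be computed by hand. This is a minimal sketch of the standard calculation, not necessarily how chisq.test() proceeds internally; the variable names are illustrative.

```r
tulip <- c(81, 50, 27)                   # observed counts
expected <- sum(tulip) * rep(1/3, 3)     # expected counts under equal proportions

# Pearson statistic: sum of (observed - expected)^2 / expected
stat <- sum((tulip - expected)^2 / expected)

# Degrees of freedom: number of categories minus 1
df <- length(tulip) - 1

# The p-value is the area to the right of the statistic
p_value <- pchisq(stat, df, lower.tail = FALSE)

print(round(stat, 3))       # 27.886
print(signif(p_value, 4))   # 8.803e-07
```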

Tests of Independence / Homogeneity of Proportions

To conduct a test for independence or homogeneity of proportions,

R:

To conduct a test for independence in R, it is useful to have our data in the form of a table.

As a concrete example, suppose we have polled 356 people with regard to their status as a smoker (i.e., "current", "former", or "never") and their socio-economic status (i.e., "low", "middle", or "high"). The results have been compiled in a CSV ("comma-separated-values") file named smoker.csv. To load this data into R and display a table of counts for each possible combination of values of the two variables, place this file in your working directory (to find out where this is, type getwd() at the R prompt) and execute the following:
```
> smokerData = read.csv(file='smoker.csv',sep=',',header=TRUE)
> smokeTable = table(smokerData$Smoke,smokerData$SES)
> smokeTable

          High Low Middle
  current   51  43     22
  former    92  28     21
  never     68  22      9
```
If you are curious, the sep argument tells R what character is being used as a "separating delimiter", and the header argument tells R to expect headers naming the categorical variables involved at the top of the file. In this example, they are named "Smoke" and "SES".

Alternatively, one can create a table manually (without counting any raw data), by doing something similar to the following:
```
> smoke = matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
> colnames(smoke) = c("High","Low","Middle")
> rownames(smoke) = c("current","former","never")
> smokeTable = as.table(smoke)
> smokeTable

          High Low Middle
  current   51  43     22
  former    92  28     21
  never     68  22      9
```
Once you have the table constructed (either from the raw data or through manual construction), performing a test for independence is easy. Just ask R to give you a summary of the table, as shown below.
```
> summary(smokeTable)
Number of cases in table: 356
Number of factors: 2
Test for independence of all factors:
    Chisq = 18.51, df = 4, p-value = 0.0009808
```
Here, we can see that the $p$-value (about $0.001$) is very small. As such, we reject the null hypothesis (at traditional levels of significance, like $\alpha = 0.05$) that one's socio-economic status and one's status as a smoker are independent.

Alternatively, to get the $p$-value for the test more quickly (i.e., without constructing a table with named rows and columns), one can apply the chisq.test() function directly to the corresponding matrix, as shown below:
```
> chisq.test(matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE))

      Pearson's Chi-squared test

data:  matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
X-squared = 18.51, df = 4, p-value = 0.0009808
```