## Tech Tips: Hypergeometric Distributions

### Calculating Hypergeometric Probabilities

To find the hypergeometric probability of seeing exactly $x$ white balls when drawing $k$ balls without replacement from an urn containing $m$ white balls and $n$ black balls, namely $$P(x) = \frac{{}_m C_x \cdot {}_n C_{k-x}}{{}_{m+n} C_k},$$ use one of the following:

• R: use the function

dhyper(x, m, n, k)


As an example, a jury of 12 is usually drawn from a pool of 50 potential jurors. Suppose that this pool of 50 has 15 females and 35 males, and that the jurors are selected at random. To find the probability that the jury will be made up of 4 females and 8 males, one could use the following:

> dhyper(4, 15, 35, 12)
[1] 0.2646333
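As a quick sanity check, the value above can be reproduced directly from the formula using R's built-in choose() function. This is only a sketch, and the variable names are chosen here for illustration:

```r
# Jury example: 15 females (white), 35 males (black), 12 drawn, 4 females wanted
x <- 4; m <- 15; n <- 35; k <- 12

# Probability computed from the counting formula
by_formula <- choose(m, x) * choose(n, k - x) / choose(m + n, k)

# Probability computed by the built-in function
by_dhyper <- dhyper(x, m, n, k)

# The two values should agree up to floating-point error
all.equal(by_formula, by_dhyper)
```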


• Excel: use the function

HYPGEOM.DIST(x, k, m, N, FALSE)


Unlike the corresponding R function above, this function takes $N$, the total number of balls (i.e., $N = m + n$), rather than the number of black balls $n$.

The last argument for this function, when FALSE, indicates that the probability returned should not be cumulative (i.e., it returns only $P(x)$, not $P(0) + P(1) + \cdots + P(x)$).

### Calculating Cumulative Hypergeometric Probabilities

Suppose one wishes to find the cumulative hypergeometric probability of seeing $x$ or fewer white balls when drawing $k$ balls from an urn containing $m$ white balls and $n$ black balls, or equivalently $$P(X \le x) = P(0) + P(1) + P(2) + \cdots + P(x) = \sum_{i=0}^{x} \frac{{}_m C_i \cdot {}_n C_{k-i}}{{}_{m+n} C_k}$$

• R: use the function

phyper(x, m, n, k)


As an example, in a New York State Lotto game, a bettor selects $6$ numbers from $1$ to $59$ (without repetition), and a winning $6$-number combination is later randomly selected. To find the probability that a \$1 ticket with a single $6$-number combination matches more than $2$ of the winning numbers, one could use the following:

> 1 - phyper(2, 6, 53, 6)
[1] 0.0108641
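The same upper-tail probability can be reached in more than one way. The sketch below compares the complement used above with two alternatives: the lower.tail argument of phyper(), and a sum of dhyper() values over the outcomes of interest. The variable names are illustrative only.

```r
# Lotto example: 6 winning numbers (white), 53 other numbers (black), 6 chosen
m <- 6; n <- 53; k <- 6

# P(X > 2) as the complement of the cumulative probability P(X <= 2)
p_complement <- 1 - phyper(2, m, n, k)

# The same upper tail requested directly from phyper()
p_upper <- phyper(2, m, n, k, lower.tail = FALSE)

# The same probability as the sum P(3) + P(4) + P(5) + P(6)
p_sum <- sum(dhyper(3:6, m, n, k))

# All three entries should match
c(p_complement, p_upper, p_sum)
```

Requesting the upper tail directly can be preferable when the tail probability is very small, since subtracting from 1 may lose precision.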


• Excel: use the function

HYPGEOM.DIST(x, k, m, N, TRUE)

Here again, the value of $N$ used in this function represents the total number of balls, which differs from the $n$ (the number of black balls) used in its R-based counterpart discussed above.

The last argument for this function, when TRUE, indicates the probability returned should be cumulative. That is to say, it gives the sum $P(0) + P(1) + \cdots + P(x)$.
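Because the two argument conventions are easy to confuse, it may help to see them side by side. The following R sketch defines a hypothetical helper, hypgeom_dist(), that accepts arguments in Excel's order and converts $N$ into the $n = N - m$ that R expects. The helper is not part of R or Excel; it is an assumption made purely for illustration.

```r
# Hypothetical helper (not a built-in) taking Excel-style arguments:
#   x = white balls seen, k = balls drawn,
#   m = white balls in the urn, N = total balls in the urn
hypgeom_dist <- function(x, k, m, N, cumulative) {
  n <- N - m  # R expects the number of black balls, not the total
  if (cumulative) phyper(x, m, n, k) else dhyper(x, m, n, k)
}

# Jury example, written in Excel's argument order
hypgeom_dist(4, 12, 15, 50, FALSE)  # matches dhyper(4, 15, 35, 12)
hypgeom_dist(4, 12, 15, 50, TRUE)   # matches phyper(4, 15, 35, 12)
```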

### Simulating Random Variables Following a Hypergeometric Distribution

To simulate numbers randomly chosen from a hypergeometric distribution, such as the count of white balls seen when drawing $k$ balls without replacement from an urn containing $m$ white balls and $n$ black balls ...

• R: use the function

rhyper(nn, m, n, k)


Note that the value $nn$ above indicates how many random numbers to generate.

As an example, suppose 20% of a batch of 30 integrated circuit chips are defective, so that 6 chips are defective and 24 are not. To simulate the number of defective chips found in each of 10 random samples of size 8, one could use the following:

> rhyper(10, 6, 24, 8)
[1] 2 1 1 0 1 2 1 2 0 4
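Since rhyper() draws random values, repeated calls will generally produce output different from that shown above. A minimal sketch follows of making the draws reproducible with set.seed(), and of comparing a larger simulation against the theoretical mean $k \cdot m/(m+n)$. The seed and the number of draws are arbitrary choices made here for illustration.

```r
set.seed(1)  # arbitrary seed so the same draws can be reproduced

m <- 6; n <- 24; k <- 8  # 6 defective, 24 non-defective, samples of size 8

# Simulate many samples to see the long-run behavior
draws <- rhyper(10000, m, n, k)

mean(draws)       # simulated average number of defective chips per sample
k * m / (m + n)   # theoretical mean, which is 1.6 for these values
table(draws)      # how often each defect count occurred
```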


• Excel: There is no built-in hypergeometric analog to BINOM.INV(), so random numbers following a hypergeometric distribution cannot be generated in the same way.
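For context, simulating with an inverse function such as BINOM.INV() typically means feeding it a uniform random number, an approach often called the inverse-CDF method. The R sketch below illustrates that same idea for the hypergeometric distribution using qhyper() and runif(). It is shown only to clarify the technique and is not an Excel workaround; the seed and parameter values are assumptions carried over from the chip example.

```r
set.seed(1)  # arbitrary seed for reproducibility

m <- 6; n <- 24; k <- 8  # chip example: 6 defective, 24 non-defective, 8 drawn

u <- runif(10)             # ten uniform random numbers between 0 and 1
sim <- qhyper(u, m, n, k)  # hypergeometric quantile at each uniform value

sim  # ten simulated counts of defective chips
```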