Simulations with R and Excel

A Game With Unusual Dice

You and your friend are bored and decide to play a game of dice. The game of craps comes to mind, but is quickly discarded since you are both familiar with the probabilities involved. Then your friend pulls out a set of three colored dice (one red, one yellow, and one green) from his pocket. These dice are unusual in that they are not numbered in the normal manner. Instead, the numbers on their six sides are as shown in the table below:

$$\begin{array}{|c|c|c|c|c|c|c|}\hline \textrm{Red} & 3 & 3 & 3 & 3 & 3 & 6\\\hline \textrm{Yellow} & 2 & 2 & 2 & 5 & 5 & 5\\\hline \textrm{Green} & 1 & 4 & 4 & 4 & 4 & 4\\\hline \end{array}$$

Your friend tells you that they came from an old board game of his -- and he doesn't recall why they are numbered this way. He also doesn't remember the rules of the original game, but suggests the following rules instead: each of you picks a die and rolls it, and the larger number wins that roll.

You are suspicious of the apparent simplicity of your friend's game, and wonder if one die rolls higher numbers than the others on average. Your friend senses your distrust and assures you that no die is any better than any other. Doing a quick calculation in your head, you acknowledge that the average of the numbers on each die is 3.5. To calm any lingering fears you have about the matter, your friend offers to let you always choose your die first, after which he will pick from the remaining two dice.

Let $c_1$ and $c_2$ be two different colors chosen from red, yellow, and green. Let $P(c_1,c_2)$ be the probability of your winning a roll if you roll the die with color $c_1$ and your friend rolls the die with color $c_2$. For example, $P(red,green)$ is the probability of your winning a roll if you roll red and your friend rolls green.

  1. Use a single statement in R to approximate $P(red,green)$ by simulating 1000 rolls -- then do the same for the other 5 possible color combinations.

  2. Use Excel to approximate these same probabilities, again using 1000 rolls for each color combination.

  3. According to your simulations, which is the "best die" for your friend to choose if you choose to roll red, green, or yellow, respectively?

  4. Does there appear to be a "best die" for you to choose to roll first? Calculate the actual probabilities that were only approximated by your simulations to confirm your answer.

  5. Is this a fair game? Explain.

  6. Suppose instead that you each rolled two dice of the same color, with the larger total winning the roll. Assuming that you still pick your color first, and your friend chooses his color from the remaining two -- is this new game fair? Back up your conclusion with simulations in either R or Excel (your choice) and actual calculations of the probabilities involved. In the case that the game is not fair, how should the player with the advantage choose which color to roll?
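The exercises above ask for R and Excel, but it can help to have an independent cross-check in a third tool. The following is a minimal sketch in Python, not a model solution: the function names `simulate_win_rate` and `exact_win_probability` and the `n_dice` parameter are illustrative choices of this sketch, and a roll counts as a win only when your total is strictly larger than your friend's.

```python
import random
from fractions import Fraction
from itertools import product

# Faces copied from the table of dice above.
DICE = {
    "red": [3, 3, 3, 3, 3, 6],
    "yellow": [2, 2, 2, 5, 5, 5],
    "green": [1, 4, 4, 4, 4, 4],
}


def simulate_win_rate(mine, theirs, n_dice=1, trials=1000, rng=None):
    """Estimate P(mine, theirs) by simulation.

    Each player rolls n_dice dice of their color and totals them.
    Assumption of this sketch: only a strictly larger total is a win.
    """
    if rng is None:
        rng = random.Random()
    wins = 0
    for _ in range(trials):
        my_total = sum(rng.choice(DICE[mine]) for _ in range(n_dice))
        their_total = sum(rng.choice(DICE[theirs]) for _ in range(n_dice))
        wins += my_total > their_total
    return wins / trials


def exact_win_probability(mine, theirs, n_dice=1):
    """Enumerate every equally likely combination of faces and count wins."""
    my_rolls = list(product(DICE[mine], repeat=n_dice))
    their_rolls = list(product(DICE[theirs], repeat=n_dice))
    wins = sum(sum(a) > sum(b) for a in my_rolls for b in their_rolls)
    return Fraction(wins, len(my_rolls) * len(their_rolls))


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so a run can be repeated
    for mine, theirs in product(DICE, repeat=2):
        if mine != theirs:
            estimate = simulate_win_rate(mine, theirs, rng=rng)
            exact = exact_win_probability(mine, theirs)
            print(f"P({mine},{theirs}): simulated {estimate:.3f}, exact {exact}")
```

Passing `n_dice=2` to either function gives the two-dice variant from the last exercise, so the same code can be used to compare both versions of the game against your R or Excel results.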

Matching Birthdays

  1. Use a single statement in R to approximate the probability of seeing two or more people with the same birthday in a group of 50, by simulating the situation 1000 times. (Assume there are no leap days.)

  2. Now construct a similar simulation in Excel.

  3. Calculate the true probability of seeing two or more people with the same birthday in a group of 50. How far off were your two approximations?
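As a cross-check on the R and Excel work, the simulation and the exact calculation can be sketched in Python. This is only one possible arrangement: the function names are made up for this sketch, and it assumes 365 equally likely birthdays, which goes slightly beyond the "no leap days" assumption stated above.

```python
import random


def has_shared_birthday(group_size, rng, days=365):
    """Draw one random group and report whether any birthday repeats."""
    birthdays = [rng.randrange(days) for _ in range(group_size)]
    return len(set(birthdays)) < group_size


def simulate_match_rate(group_size=50, trials=1000, rng=None):
    """Fraction of simulated groups containing at least one shared birthday."""
    if rng is None:
        rng = random.Random()
    matches = sum(has_shared_birthday(group_size, rng) for _ in range(trials))
    return matches / trials


def exact_match_probability(group_size=50, days=365):
    """1 minus the probability that all birthdays in the group are distinct."""
    p_all_distinct = 1.0
    for k in range(group_size):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    estimate = simulate_match_rate(rng=random.Random(2024))
    exact = exact_match_probability()
    print(f"simulated: {estimate:.4f}")
    print(f"exact:     {exact:.4f}")
    print(f"difference: {abs(estimate - exact):.4f}")
```

The exact calculation uses the complement: it multiplies the chances that each successive person avoids all earlier birthdays, then subtracts the product from 1.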

Simulating Cancer Occurrences

A close relative of yours is fighting prostate cancer. You believe their cancer may be connected to their living close to a paper and pulp mill. Working for Athena Health, a company that specializes in digital medical records, you have access to the medical histories of millions of people. Violating both company policy and probably some federal laws, you decide to examine some of these files to test your theory. For each of the 17 pulp and paper mills in Georgia, you download the medical histories of 15 randomly selected deceased people who lived all of their lives within a 5-mile radius of the mill. You then find the proportion of each sample that was ever diagnosed with prostate cancer. You have heard that nationally, the probability of a randomly selected person developing prostate cancer in their lifetime is 13.97%, and believe your suspicions will be validated if any of your samples suggests the proportion is significantly higher than this -- say more than 25%.

  1. Use R to simulate the proportions of these 17 samples under an assumption that there is actually no connection between living close to a paper and pulp mill and developing prostate cancer.

  2. Now use Excel to accomplish the same task.

  3. In each case, had the results of your simulation been the real data examined, what conclusions would have been drawn from that data? Is this surprising?

  4. Explain how a simulation like the one you conducted could suggest that there was a connection between paper and pulp mills and developing cancer, when you made the explicit assumption that there was no connection between these two things. What could be changed to reduce the chance of this type of "false positive" result?
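The setup above can also be sketched in Python as a cross-check on the R and Excel simulations. This is a rough sketch under stated assumptions rather than a definitive analysis: the function names are hypothetical, every person is treated as an independent draw with the national rate of 13.97%, and the 17 samples are treated as independent of one another.

```python
import random
from math import comb


def simulate_sample_proportions(n_samples=17, sample_size=15, p=0.1397, rng=None):
    """Simulate the diagnosed proportion in each sample when the mill has no effect."""
    if rng is None:
        rng = random.Random()
    proportions = []
    for _ in range(n_samples):
        cases = sum(rng.random() < p for _ in range(sample_size))
        proportions.append(cases / sample_size)
    return proportions


def count_over_threshold(proportions, threshold=0.25):
    """Number of samples whose proportion exceeds the 'suspicious' threshold."""
    return sum(prop > threshold for prop in proportions)


def prob_sample_exceeds(threshold=0.25, sample_size=15, p=0.1397):
    """Binomial probability that a single sample's proportion exceeds the threshold."""
    return sum(
        comb(sample_size, k) * p**k * (1 - p) ** (sample_size - k)
        for k in range(sample_size + 1)
        if k / sample_size > threshold
    )


def prob_any_sample_exceeds(n_samples=17, threshold=0.25, sample_size=15, p=0.1397):
    """Chance that at least one of n_samples independent samples exceeds the threshold."""
    return 1 - (1 - prob_sample_exceeds(threshold, sample_size, p)) ** n_samples


if __name__ == "__main__":
    proportions = simulate_sample_proportions(rng=random.Random(2024))
    print([round(prop, 3) for prop in proportions])
    print("samples over 25%:", count_over_threshold(proportions))
    print("P(one given sample over 25%):", round(prob_sample_exceeds(), 4))
    print("P(at least one of 17 over 25%):", round(prob_any_sample_exceeds(), 4))
```

Changing `n_samples`, `sample_size`, or `threshold` and rerunning is one way to explore the last question, since each of those choices affects how often a sample crosses the threshold purely by chance.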