## Review Exercises B1

1. Consider the following data set: $$\begin{array}{ccccc} 35 & 39 & 43 & 43 & 43\\ 44 & 46 & 46 & 46 & 48\\ 48 & 49 & 50 & 52 & 53\\ 54 & 54 & 55 & 56 & 60\\ 62 & 63 & 64 & 66 & 78 \end{array}$$

1. Construct a frequency distribution for the data with $6$ classes showing class limits and class boundaries. Use this to draw a frequency histogram for the data. Correctly label your graph.
2. What percentage of the data lies within 1.7 standard deviations of the mean? Show that this result is consistent with Chebyshev's theorem.
3. Find the mode(s) of the above data.
4. Determine whether there are any outliers in the data set.
5. If there are at most two outliers, remove them from the data. Is the distribution of the remaining data significantly skewed? Show your reasoning. Also, is the data approximately normal? Explain.
1. Frequency distribution:

| Limits | Boundaries | Frequency |
|---|---|---|
| 35 - 42 | 34.5 - 42.5 | 2 |
| 43 - 50 | 42.5 - 50.5 | 11 |
| 51 - 58 | 50.5 - 58.5 | 6 |
| 59 - 66 | 58.5 - 66.5 | 5 |
| 67 - 74 | 66.5 - 74.5 | 0 |
| 75 - 82 | 74.5 - 82.5 | 1 |


2. $\overline{x} = 51.9$; $s = 9.7$; $(51.9-1.7(9.7),51.9+1.7(9.7)) = (35.41,68.39)$; $23/25 = 92\%$ within $1.7$ standard deviations; Chebyshev claims at least $1-1/1.7^2 \doteq 65.4\% \lt 92\%$, so this is consistent.

3. Modes: $43, 46$

4. $Q_1 - 1.5 \cdot IQR = 25.5$; $Q_3 + 1.5 \cdot IQR = 77.5$; $78$ is an outlier.

5. After removing outlier $78$, we have $\overline{x} = 50.8$; $s = 8.2$; $Q_2 = 49.5$. So $I = 3(\overline{x} - Q_2)/s = 0.475$ whose absolute value is less than $1$. Hence, no significant skew. Yes, the distribution is approximately normal (after removal of the outlier).
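The answers to exercise 1 can be checked numerically. The following is a minimal Python sketch using only the standard library, not a prescribed method. The helper `quartiles` is a hypothetical name, and it assumes the median-of-halves convention for $Q_1$ and $Q_3$ (other conventions can give slightly different fences). The skewness index is taken to be $I = 3(\overline{x} - Q_2)/s$, as in the answer above.

```python
import statistics

data = [35, 39, 43, 43, 43, 44, 46, 46, 46, 48, 48, 49, 50,
        52, 53, 54, 54, 55, 56, 60, 62, 63, 64, 66, 78]

def quartiles(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as medians of the lower and
    upper halves (an assumed convention; other textbooks differ)."""
    xs = sorted(values)
    half = len(xs) // 2
    return (statistics.median(xs[:half]),
            statistics.median(xs),
            statistics.median(xs[len(xs) - half:]))

# Part 1: class frequencies for lower limits 35, 43, ..., 75 (width 8).
width = 8
freq = [sum(lo <= x < lo + width for x in data) for lo in range(35, 83, width)]

# Part 2: share of data within k sample standard deviations of the mean.
mean, s, k = statistics.mean(data), statistics.stdev(data), 1.7
share = sum(abs(x - mean) <= k * s for x in data) / len(data)
chebyshev_bound = 1 - 1 / k**2

# Part 3: all most-frequent values.
modes = statistics.multimode(data)

# Part 4: 1.5 * IQR fences.
q1, _, q3 = quartiles(data)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Part 5: skewness index I = 3(mean - median)/s after removing outliers.
kept = [x for x in data if x not in outliers]
index = 3 * (statistics.mean(kept) - statistics.median(kept)) / statistics.stdev(kept)

print(freq, round(share, 2), round(chebyshev_bound, 3))
print(modes, (low_fence, high_fence), outliers, round(index, 2))
```

Because the sketch works from unrounded values of $\overline{x}$ and $s$, the skewness index comes out near $0.47$, slightly below the hand-computed value that uses rounded inputs. The conclusion (no significant skew) is the same.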

2. A nursery grows plants of a particular species from seed. After 6 months, the seedlings average $22.6$ cm in height with a standard deviation of $2.5$ cm. Assume that the heights are normally distributed.

1. Find the probability that a randomly selected seedling has height less than 18 cm.

2. Find the height interval for the middle $80\%$ of seedlings.

3. Sample $A$, which consists of 25 seedlings, has a mean height of at least $23.7$ cm. Find the probability of this occurring for a random sample of such seedlings.

4. Sample $B$, which also consists of 25 seedlings, has a mean height of at least $23.1$ cm. Find the probability of this occurring for a random sample of such seedlings.

5. Samples $A$ and $B$ above were grown using fertilizers $A$ and $B$, respectively. Use the probabilities above to determine the effectiveness of each fertilizer. Explain your reasoning.

1. $P(x \lt 18) \doteq P(z \lt -1.84) \doteq 0.0329$

2. $z$-scores for middle $80\%$ are $\pm1.28$, which correspond to $x$-values of $19.4$ and $25.8$.

3. CLT applies. The distribution of sample means is normal because the original population of heights is normally distributed. Noting that $z = (23.7-22.6)/(2.5/\sqrt{25})$, $P(\overline{x} \ge 23.7) = P(z \ge 2.20) = 1 - 0.9861 = 0.0139$.

4. CLT applies. The distribution of sample means is normal because the original population of heights is normally distributed. Noting that $z = (23.1-22.6)/(2.5/\sqrt{25})$, $P(\overline{x} \ge 23.1) = P(z \ge 1.00) = 1 - 0.8413 = 0.1587$.

5. Fertilizer $A$ seems to be effective. The probability of a random sample of unfertilized seedlings having a mean height of at least 23.7 cm is less than 5% ($0.0139 \lt 0.05$), so the greater height is likely due to the fertilizer. We can't tell whether fertilizer $B$ is effective or not. The greater height of this sample could be due to sampling error: it is not unusual to get a mean height of at least 23.1 cm ($0.1587 \gt 0.05$).
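As a cross-check on the table lookups in exercise 2, here is a short sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The variable names are illustrative only, and the 5% cutoff is the threshold used in the answer above, not something the code derives.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 22.6, 2.5, 25

# Parts 1 and 2: individual seedling heights.
heights = NormalDist(mu, sigma)
p_short = heights.cdf(18)                                  # P(x < 18)
low, high = heights.inv_cdf(0.10), heights.inv_cdf(0.90)   # middle 80%

# Parts 3 and 4: means of samples of size n have standard error sigma/sqrt(n).
sample_means = NormalDist(mu, sigma / sqrt(n))
p_a = 1 - sample_means.cdf(23.7)   # P(mean >= 23.7)
p_b = 1 - sample_means.cdf(23.1)   # P(mean >= 23.1)

# Part 5: compare each probability with the 5% threshold.
for name, p in (("A", p_a), ("B", p_b)):
    verdict = "unusual" if p < 0.05 else "not unusual"
    print(f"Sample {name}: p = {p:.4f} ({verdict} for unfertilized seedlings)")

print(f"P(x < 18) = {p_short:.4f}; middle 80%: ({low:.1f}, {high:.1f})")
```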

3. Fill in the blank:

1. Gender is an example of $\underline{\hspace{1in}}$ level data.

2. Temperature (in $^{\circ}C$) is an example of $\underline{\hspace{1in}}$ level data.

3. An example of ordinal level data is $\underline{\hspace{1in}}$.

4. For ratio level data that is distributed symmetrically, we use the $\underline{\hspace{1in}}$ for the measure of center and the $\underline{\hspace{1in}}$ to measure variation.

5. For skewed interval level data, we usually use the $\underline{\hspace{1in}}$ for the measure of center and the $\underline{\hspace{1in}}$ to measure variation.

6. For nominal level data, use the $\underline{\hspace{1in}}$ for the measure of center.

1. nominal
2. interval
3. (answers vary) examples include letter grades or survey ratings such as poor/fair/good
4. mean; variance or standard deviation
5. median; interquartile range (IQR)
6. mode

4. Explain the difference between an observational study and an experiment.

In an experiment, we apply some treatment to the subjects of the study and observe the results. In an observational study, we only observe characteristics present in the subjects -- we never treat/modify the subjects in any way.

5. Describe an example from history where a sample may not have been random. Discuss the problem with the sampling and describe the consequences.

(Answers vary.) Good examples include the Literary Digest poll (1936), the Chicago Tribune (1948), and the draft lottery (1970).

6. A middle school principal wants to sample 30 of his students, where each grade (6 through 8) and gender is equally represented in the sample. Describe how he might generate this sample. Assume he has access to records of all the students enrolled at the school. What type of sampling is this?

This is stratified sampling. Use school records to randomly choose 5 students from each of the 6 categories: 6th grade male, 7th grade male, 8th grade male, 6th grade female, 7th grade female, and 8th grade female.
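The procedure can be sketched in Python as follows. This is only an illustration: the roster is fabricated placeholder data (40 students per grade and gender), and `stratified_sample` is a hypothetical helper name. Real school records would replace the roster.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed only so the sketch is reproducible

# Hypothetical roster standing in for the school's records.
roster = []
student_id = 0
for grade in (6, 7, 8):
    for gender in ("F", "M"):
        for _ in range(40):
            roster.append({"id": student_id, "grade": grade, "gender": gender})
            student_id += 1

def stratified_sample(records, per_stratum):
    """Group records by (grade, gender), then draw a simple random
    sample of the same size from each group."""
    strata = {}
    for rec in records:
        strata.setdefault((rec["grade"], rec["gender"]), []).append(rec)
    sample = []
    for key in sorted(strata):
        sample.extend(random.sample(strata[key], per_stratum))
    return sample

sample = stratified_sample(roster, per_stratum=5)
counts = Counter((rec["grade"], rec["gender"]) for rec in sample)
print(len(sample), dict(counts))
```

Sampling within each stratum separately is what guarantees equal representation; a single simple random sample of 30 from the whole roster would not.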

7. Cholesterol levels in men of a certain age follow a normal distribution with mean $178.1$ mg/100 mL and standard deviation $40.7$ mg/100 mL.

1. For this population, find the probability that a randomly selected man has a cholesterol level greater than $260$ mg/100 mL.

2. For this population, find the probability that a randomly selected man has a cholesterol level between $170$ and $200$ mg/100 mL.

3. Find the probability that the average cholesterol level of 9 randomly selected men from this population is between $170$ and $200$ mg/100 mL.

4. The highest 3% of cholesterol levels (but no more than that) are above what cholesterol level?

1. $\displaystyle{z_{260} = \frac{260 - 178.1}{40.7} \doteq 2.012}$, so we find $P(z \gt 2.012) \doteq 0.0221$

2. $z_{170} \doteq -0.1990$ and $z_{200} \doteq 0.5381$, so we find $P(-0.1990 \lt z \lt 0.5381) \doteq 0.2836$

3. The Central Limit Theorem applies with $n=9$. $\mu_{\overline{x}} = 178.1$, while $\sigma_{\overline{x}} = \displaystyle{\frac{40.7}{\sqrt{9}} \doteq 13.5667}$. So $z_{170} \doteq -0.5971$ and $z_{200} \doteq 1.6143$. Hence, we find $$P(-0.5971 \lt z \lt 1.6143) \doteq 0.6715$$

4. We need the $z$-score with $0.03$ in area to its right, which is approximately $1.8808$. So the value we seek is this many standard deviations above the mean. Hence, the cholesterol level we want is $x = 178.1 + 1.8808 \cdot 40.7 \doteq 254.6486$
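The four cholesterol computations can be reproduced without a $z$-table. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative. It prints values to compare against the hand computations and works from unrounded $z$-scores, so the last digit may differ slightly.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 178.1, 40.7

# Parts 1, 2, and 4 concern one randomly selected man.
level = NormalDist(mu, sigma)
p_above_260 = 1 - level.cdf(260)
p_between = level.cdf(200) - level.cdf(170)
cutoff_top_3pct = level.inv_cdf(0.97)   # 3% of the area lies above this level

# Part 3 concerns the mean of n = 9 men, so the spread shrinks to sigma/sqrt(n).
n = 9
mean_level = NormalDist(mu, sigma / sqrt(n))
p_mean_between = mean_level.cdf(200) - mean_level.cdf(170)

print(f"P(x > 260)          = {p_above_260:.4f}")
print(f"P(170 < x < 200)    = {p_between:.4f}")
print(f"P(170 < mean < 200) = {p_mean_between:.4f}")
print(f"top 3% cutoff       = {cutoff_top_3pct:.2f}")
```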

8. Find the indicated $z$-scores:

1. one where there is $0.67$ in area to its left

2. one where there is $0.996$ in area to its right

1. $0.4399$
2. $-2.6521$
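Both $z$-scores come from the inverse of the standard normal cumulative distribution function. A minimal sketch with `statistics.NormalDist` from the Python standard library (variable names are illustrative):

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

# Area 0.67 to the left: invert the CDF directly.
z_left = standard.inv_cdf(0.67)

# Area 0.996 to the right is the same as area 1 - 0.996 to the left.
z_right = standard.inv_cdf(1 - 0.996)

print(round(z_left, 4), round(z_right, 4))
```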

9. Men's weights follow a normal distribution with a mean of 172 pounds and a standard deviation of 29 pounds.

1. What is the probability that a randomly selected man and the 20 lb bag he is carrying together weigh more than 195 lbs?

2. If an airplane is full of 213 men (and no women or children), each with a 20 lb bag, what is the probability that the total weight is greater than 41535 lbs (the weight limit for the airplane)?

1. With the bag the mean weight is $\mu = 192$. The standard deviation remains the same. $z_{195} \doteq 0.1034$. So $P(z \gt 0.1034) \doteq 0.4588$

2. If the total weight is 41535 lbs, the average weight of the 213 men (with bags) is 195 lbs. The Central Limit Theorem applies: $\mu_{\overline{x}} = 192$ and $\displaystyle{\sigma_{\overline{x}} = \frac{29}{\sqrt{213}} \doteq 1.9870}$. Thus $z_{195} = (195-192)/1.9870 \doteq 1.5098$. So the probability of exceeding the weight limit is $P(z \gt 1.5098) \doteq 0.0655$.
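The key step in part 2 is converting the total weight limit into a limit on the sample mean, then using the standard error $29/\sqrt{213}$. A short sketch with `statistics.NormalDist` from the Python standard library shows both parts; the variable names are illustrative, and the code prints the probabilities from unrounded values for comparison with the hand computations.

```python
from math import sqrt
from statistics import NormalDist

mu_with_bag = 172 + 20   # the bag shifts the mean; the spread is unchanged
sigma = 29

# Part 1: one man plus bag.
one_man = NormalDist(mu_with_bag, sigma)
p_single = 1 - one_man.cdf(195)

# Part 2: the total exceeds the limit exactly when the mean exceeds limit / n.
n, limit = 213, 41535
mean_limit = limit / n
plane_mean = NormalDist(mu_with_bag, sigma / sqrt(n))
z = (mean_limit - mu_with_bag) / (sigma / sqrt(n))
p_over = 1 - plane_mean.cdf(mean_limit)

print(f"P(one man + bag > 195) = {p_single:.4f}")
print(f"mean limit = {mean_limit}, z = {z:.4f}, P(over limit) = {p_over:.4f}")
```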