  ## Exercises - Correlation

1. A medical researcher wishes to see if there is a relationship between prescription drug prices for identical drugs and identical dosages that are prescribed for humans and for animals. Assume the assumptions for the parametric test are met. Make a scatter plot and draw the regression line. Is there a relationship between the prices, at significance level 0.05?

Predict the animal price for a drug that costs 0.75 for humans.

$$\begin{array}{r|cccccccc} \hbox{Drug prices for humans (x)}&0.67&0.64&1.20&0.51&0.87&0.74&0.50&1.22\\ \hbox{Drug prices for animals (y)}&0.13&0.18&0.42&0.25&0.57&0.57&0.49&1.28 \end{array}$$

Regression line: $\hat y = -.179+.838 x$

Null hypothesis: $\rho=0$

Test statistic: $t=2.11\ (r=.653)$

Critical value: $\pm 2.447$

Fail to reject the null hypothesis. Test statistic is not in the rejection region.

There is not enough evidence to support the claim that there is a relationship between human and animal drug prices.

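Prediction: $\bar y=.49$. Because the correlation is not significant, the regression line is not used for prediction; the best predicted value is the sample mean of $y$.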
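The numbers in the answer to Exercise 1 can be reproduced with a short script. This is a minimal sketch in plain Python, not the method the exercise requires: the function name `linear_correlation` and the constant `T_CRITICAL` are illustrative assumptions, and the critical value is copied from the answer above rather than computed.

```python
import math

# Data from Exercise 1: drug prices for humans (x) and animals (y).
human = [0.67, 0.64, 1.20, 0.51, 0.87, 0.74, 0.50, 1.22]
animal = [0.13, 0.18, 0.42, 0.25, 0.57, 0.57, 0.49, 1.28]


def linear_correlation(x, y):
    """Return intercept a, slope b, correlation r, and t statistic (df = n - 2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx                      # slope of the least-squares line
    a = y_bar - b * x_bar                # intercept
    r = s_xy / math.sqrt(s_xx * s_yy)    # sample correlation coefficient
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return a, b, r, t


a, b, r, t = linear_correlation(human, animal)

# Two-tailed critical value for alpha = 0.05 and df = 6, copied from the
# answer above (a t table); it is an input here, not something this script derives.
T_CRITICAL = 2.447
significant = abs(t) > T_CRITICAL

# Convention used in the answer: without a significant correlation,
# predict with the mean of y instead of the regression line.
y_bar = sum(animal) / len(animal)
prediction = a + b * 0.75 if significant else y_bar

print(f"y-hat = {a:.3f} + {b:.3f} x, r = {r:.3f}, t = {t:.2f}")
print(f"significant: {significant}, prediction: {prediction:.2f}")
```

The same function can be applied to the raw data in Exercises 2 through 6, provided the critical value is replaced with the one for the appropriate degrees of freedom.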

2. A nationwide department store chain wants to determine whether there is a correlation between the amount spent on local advertising by a store and the number of customers who shopped in that store on Black Friday (the day after Thanksgiving Day). The assumptions are met for the parametric test. Use significance level 0.05.

Predict the number of customers for a store with a $\$1500$ advertising budget.

$$\begin{array}{r|ccccccccc} \hbox{Advertising budget in \$ (x)}&1300&2000&4000&3500&6500&5000&4400&3500&4000\cr \hbox{Number of customers (y)}&235&315&450&436&483&575&387&243&412 \end{array}$$

Regression line: $\hat y = 183+.0552 x$

Null hypothesis: $\rho=0$

Test statistic: $t=3.07\ (r=.757)$

Critical value: $\pm 2.365$

Reject the null hypothesis. Test statistic is in the rejection region.

There is a significant correlation between advertising budget and the number of customers.

Prediction: $\hat y=266$

3. Use the data below to test the claim at $\alpha=0.05$ that there is a correlation between pulse rates and HDL cholesterol level.

$$\begin{array}{r|cccccccccccc} \hbox{Subject} &\hbox{A}&\hbox{B}&\hbox{C}&\hbox{D}&\hbox{E}&\hbox{F}&\hbox{G}&\hbox{H}&\hbox{I}&\hbox{J}&\hbox{K}&\hbox{L}\cr \hbox{Pulse (beats/minute)}&60&74&86&54&90&80&66&68&68&56&80&62\cr \hbox{HDL (mg/dL)}&44&41&71&41&57&50&60&47&44&33&48&46 \end{array}$$

   1. Draw a scatterplot and graph the regression line.
   2. What are the assumptions for the parametric test? (Assume they are met.)
   3. Test the claim.
   4. Predict the HDL cholesterol for a subject with a pulse rate of 70 beats per minute.
   5. Find the coefficient of determination.
   6. Find the unexplained deviation for Subject A.

Answers:

   1. Regression line: $\hat y = 7.6222 + 0.5812 x$
   2. Paired, interval/ratio data. Linear relationship. Bivariate normal.
   3. Null hypothesis: $\rho=0$. Test statistic: $t=2.905\ (r=.6765)$. Critical value: $\pm 2.228$. Reject the null hypothesis. Test statistic is in the rejection region. There is a significant correlation between pulse rates and HDL cholesterol level.
   4. Prediction: $\hat y=48.3$
   5. $r^2=.4577$
   6. $y-\hat y=44-42.5=1.5$

4. A study was conducted to determine whether there is a relationship between strength and speed. A sample of 20-year-old males was selected. Each was asked to do push-ups and to run a specific course.
The number of push-ups and the time it took to run the course (in seconds) are given below.

$$\begin{array}{r|cccccccc} \hbox{Push-ups (x)}&5&8&10&10&11&15&18&23\cr \hbox{Time (y)}&61&65&45&56&62&48&49&50\cr \end{array}$$

   1. Make a scatter plot and draw the regression line.
   2. Determine whether there is a significant relationship between the number of push-ups and the course time at 0.05 significance level. Assume the assumptions for the parametric test are met.
   3. Predict the course time of a 20-year-old male who can do 18 push-ups.
   4. For the point $(8,65)$, find the explained deviation and the unexplained deviation.
   5. Given that the total variation equals 394, find the coefficient of determination, the explained variation, and the unexplained variation.

Answers:

   1. Regression line: $\hat y = 64.0-.761 x$
   2. Null hypothesis: $\rho=0$. Test statistic: $t=-1.79\ (r=-.591)$. Critical value: $\pm 2.447$. Fail to reject the null hypothesis. Test statistic is not in the rejection region. There is not a significant relationship between the number of push-ups and the course time.
   3. Prediction: $\bar y=54.5$
   4. $\hat y-\bar y=3.4,\ y-\hat y=7.1$
   5. $r^2=.349,\ \hbox{explained variation}=137.5,\ \hbox{unexplained variation}=256.5$

5. The number of calories and the number of milligrams of cholesterol for a random sample of fast-food chicken sandwiches from seven restaurants are shown below.

$$\begin{array}{r|ccccccc} \hbox{Restaurant} &\hbox{A}&\hbox{B}&\hbox{C}&\hbox{D}&\hbox{E}&\hbox{F}&\hbox{G}\cr \hbox{Calories (x)}&390&510&720&300&430&500&440\cr \hbox{Cholesterol (y)}&43&45&80&50&55&52&60 \end{array}$$

   1. Use the parametric test at $\alpha=.05$ on the claim that there is a relationship between the variables. (Do not draw the scatterplot yet.) Predict the amount of cholesterol in a 325 calorie chicken sandwich.
   2. Draw a scatter plot. Identify the influential point in your scatter plot.
      Remove this point from the data set and recalculate the test statistic and critical value for the hypothesis test.
   3. Test the claim that there is a relationship between the variables with the influential point removed. Predict the amount of cholesterol in a 325 calorie chicken sandwich.
   4. Compare the results. What is the effect of the influential point? Should it be included in or excluded from the data set? Why?

Answers:

   1. Null hypothesis: $\rho=0$. Test statistic: $t=2.612\ (r=.7597)$. Critical value: $\pm 2.571$. Reject the null hypothesis. Test statistic is in the rejection region. There is a significant correlation between calories and cholesterol. Prediction: $\hat y=45$
   2. Influential point: $(720,80)$
   3. Null hypothesis: $\rho=0$. Test statistic: $t=0.105\ (r=.0526)$. Critical value: $\pm 2.776$. Fail to reject the null hypothesis. Test statistic is not in the rejection region. There is not a significant correlation between calories and cholesterol. Prediction: $\bar y=51$
   4. The influential point makes there appear to be a correlation between the variables when a correlation does not exist. The point should be excluded because it is far from the other data points, so it has too strong an influence on the results.

6. The data shown below was obtained in a study on the number of absences and the final grades of seven randomly selected students from a statistics class.

$$\begin{array}{r|ccccccc} \hbox{Number of absences}&6&2&15&9&12&5&8\\ \hbox{Final grade}&82&86&43&74&58&90&78 \end{array}$$

   1. Draw a scatter diagram and the regression line.
   2. List the assumptions for the parametric test.
   3. What does "bivariate normal" mean?
   4. Inspect the scatter diagram before continuing with the test. Describe two things you should look for.
   5. Test for a relationship between the number of absences and the final grade at significance level 0.05.
   6. Estimate the grade for a student who has missed 25 classes. Explain your reasoning.
   7. Estimate the grade for a student who has missed 10 classes.
      Explain your reasoning.

Answers:

   1. Regression line: $\hat y=102.5-3.6x$
   2. Paired, interval/ratio data. Linear relationship. Bivariate normal.
   3. Bivariate normal: For any fixed $x$ value, the associated $y$ values are normally distributed, and vice versa.
   4. Check for influential points. Check for a linear relationship (i.e., determine that there is no apparent non-linear relationship).
   5. Null hypothesis: $\rho=0$. Test statistic: $t=-6.41\ (r=-.944)$. Critical value: $\pm 2.57$. Reject the null hypothesis. Test statistic is in the rejection region. There is a significant relationship between the number of absences and the final grade.
   6. 25 is outside the range of data. We should not use the regression line to make (extrapolate) a prediction.
   7. Prediction: $\hat y=66.5$

7. A football fan wishes to see how the number of pass attempts (not completions) relates to the number of yards gained for quarterbacks in past NFL season playoff games. The scatter plot was checked and the data was entered into a calculator with the results given below.

$$t=2.48,\ \ df=3,\ \ a=468.0,\ \ b=4.20,\ \ r=.8193,\ \ \bar x=97.6,\ \ \bar y=877.4$$

Determine whether there is a significant relationship between the number of pass attempts ($x$) and the yards gained ($y$) at 0.05 significance level. Predict the yards gained for 100 pass attempts.

Null hypothesis: $\rho=0$

Test statistic: $t=2.48$

Critical value: $\pm 3.182$

Fail to reject the null hypothesis. Test statistic is not in the rejection region.

There is not a significant correlation between the number of pass attempts and the yards gained.

Prediction: $\bar y=877.4$

8. A random sample of Hall of Fame pitchers' career wins and their total number of strikeouts was used to get the results below. (Assume the scatter plot has been checked.)
$$t=4.53,\ \ df=8,\ \ a=-2013,\ \ b=16.6,\ \ r=.8481,\ \ \bar x=266.4,\ \ \bar y=2414.5$$

Determine whether there is a significant relationship between the number of wins ($x$) and the number of strikeouts ($y$) at 0.05 significance level. Predict the number of strikeouts for a pitcher with 200 wins. For the sample point (284, 3192), find the unexplained deviation and the explained deviation.

Null hypothesis: $\rho=0$

Test statistic: $t=4.53$

Critical value: $\pm 2.306$

Reject the null hypothesis. Test statistic is in the rejection region.

There is a significant correlation between the number of wins and the number of strikeouts.

Regression line: $\hat y = -2013+16.6x$

Prediction: $\hat y=1307$

Unexplained deviation: $490.6$, explained deviation: $286.9$
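
When an exercise supplies only calculator output, as in Exercises 7 and 8, the test statistic and the deviations can be checked directly from the summary values. The sketch below uses the Exercise 8 figures; the helper names `t_from_r` and `deviations` are illustrative assumptions, not part of any calculator or library.

```python
import math

# Calculator output quoted in Exercise 8 (wins x, strikeouts y).
a, b = -2013, 16.6        # intercept and slope of the regression line
r, df = 0.8481, 8         # correlation coefficient and degrees of freedom (n - 2)
y_bar = 2414.5            # sample mean of y


def t_from_r(r, df):
    """Test statistic for H0: rho = 0, computed from r and df = n - 2."""
    return r * math.sqrt(df / (1 - r ** 2))


def deviations(x, y):
    """Return (explained, unexplained) = (y_hat - y_bar, y - y_hat) for one point."""
    y_hat = a + b * x
    return y_hat - y_bar, y - y_hat


t = t_from_r(r, df)
explained, unexplained = deviations(284, 3192)

print(f"t = {t:.2f}")
print(f"prediction at 200 wins = {a + b * 200:.0f}")
print(f"explained = {explained:.1f}, unexplained = {unexplained:.1f}")
```

The two deviations add up to the total deviation $y-\bar y$ for the point, which gives a quick consistency check on hand calculations.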