## Exercises - Simple Linear Regression and One-Way Analysis of Variance

1. Bivariate data relating the number of classroom absences and the number of points earned prior to the final exam in the course was gathered from a random sampling of students. Plot a scatter diagram. Find the correlation coefficient and the simple regression line. Place this line on the scatter diagram. Is the relationship significant? A student with 7 absences would probably have how many points in the course? A student with 20 absences would probably have how many points?

Notes: the covariance is $-625.62857$, the variance of $x$ is $16.114286$, the variance of $y$ is $25905.171$, the mean of $x$ is $4.4$, and the mean of $y$ is $432.8$. The correlation coefficient is $r = -0.9683$. The regression line is $y = -38.8x + 603.6$. $H_0 : \rho = 0$ was rejected: the test statistic is $t = -13.976$, and $|t|$ exceeds the critical value $t(13) = 3.012$ for $\alpha = 0.01$. The relationship is a highly significant inverse one; i.e., the more days missed, the lower the points earned for the grade. A student with 7 absences is predicted to have about 332 points (substitute the value 7 for $x$ in the prediction equation). The points for a student with 20 absences should not be predicted from this line, since 20 absences is outside the range of the sample data.
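The quantities in these notes all follow from the summary statistics, and a short script can confirm them. This is a minimal sketch, not the method used to produce the notes; the sample size `n = 15` is an assumption inferred from the 13 degrees of freedom, and the variable names are illustrative.

```python
import math

# Summary statistics quoted in the notes for exercise 1.
cov_xy = -625.62857
var_x = 16.114286
var_y = 25905.171
mean_x, mean_y = 4.4, 432.8
n = 15  # assumption: inferred from df = n - 2 = 13

# Correlation coefficient: covariance scaled by both standard deviations.
r = cov_xy / math.sqrt(var_x * var_y)

# Least-squares slope and intercept; the line passes through the means.
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

# Test statistic for H0: rho = 0.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def predict(x):
    """Predicted points for x absences (only sensible inside the data range)."""
    return slope * x + intercept

print(round(r, 4), round(slope, 1), round(intercept, 1), round(t_stat, 2))
print(round(predict(7)))
```

Small differences in the last digit of $t$ can appear because the inputs here are already rounded.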

2. Physicians feel that there is a relationship between a person's age and the number of days they are sick that year. To investigate this, a random sample of 20 people was obtained, recording each person's age and days sick. Plot a scatter diagram. Find the correlation coefficient and give alpha levels for significance. Find the regression line and place it on the scatter diagram. How many days in a year would you predict that a 60 year old person will be sick?

The correlation coefficient is $r = 0.685$. The regression line is $y = 0.257x + 5.973$. The null hypothesis, $H_0 : \rho = 0$, was rejected since the test statistic $t = 3.99$ exceeds the critical value $t(18) = 2.878$ for $\alpha = 0.01$. The relationship is a highly significant direct one; i.e., the older a person is, the more days the person is sick. Substituting $x = 60$ into the regression line gives $0.257(60) + 5.973 \approx 21.4$, so a 60 year old person would probably be sick about 21 days.
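As a check on the test statistic, it can be recovered from $r$ and the sample size alone. The worked step below uses the standard formula for testing $H_0 : \rho = 0$ with $n = 20$ and the rounded value $r = 0.685$:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.685\sqrt{18}}{\sqrt{1-0.685^2}} = \frac{0.685 \times 4.2426}{\sqrt{0.530775}} \approx \frac{2.9062}{0.7285} \approx 3.99$$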

3. Two different teaching methods are being compared to the traditional (lecture) method of teaching calculus. Method one (computer) uses the computer for homework, exploratory projects, drill on concepts, and testing (by random computer generated tests). Method two (project) uses a graphing calculator, weekly projects to guide students through concepts, class discussion, and student presentations. Random samples of student exam scores from each of the three groups (lecture, computer, and project) are to be compared. Assume the following:

• These groups of students had never been exposed to calculus previously;
• All three groups of students were equivalent initially so that any difference among groups could be attributed to the teaching strategy employed;
• The sample came from normally distributed populations with homogeneity of variance already met; and
• The same final exam was given and grades from this final represent at least interval data.

According to the final exam scores seen in the three samples, which approach, if any, does the best job of teaching calculus? Construct an ANOVA table, state the null hypothesis, give the test statistic clearly, give your conclusion, and interpret. Remember to follow your ANOVA with appropriate tests if there is a significant difference.

ANOVA Table: $$\begin{array}{l|r|r|r|r|} & \textrm{Df} & \textrm{Sum Sq} & \textrm{Mean Sq} & \textrm{F value}\\\hline \textrm{Group (or Factor)} & 2 & 488.601 & 244.301 & 1.186\\\hline \textrm{Error (or Residual)} & 18 & 3709.208 & 206.067 & \\\hline \textrm{Total} & 20 & 4197.809 & & \\\hline \end{array}$$ The critical value is $F(2,18) = 3.5546$ for $\alpha = 0.05$. Since $1.186 < 3.5546$, fail to reject the null hypothesis $H_0 : \mu_1 = \mu_2 = \mu_3$. There is not a significant difference among the three approaches to teaching calculus. No additional tests are needed since no significant difference was found.
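Every entry in a one-way ANOVA table follows from the two sums of squares, the number of groups, and the total sample size, so the table is easy to verify. The helper below is an illustrative sketch (the function name `anova_table` is hypothetical); the total sample size of 21 is inferred from the total degrees of freedom, and the critical value is taken from the text rather than computed.

```python
def anova_table(ss_between, ss_within, k, n_total):
    """Fill in a one-way ANOVA table from its sums of squares."""
    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df": (df_between, df_within, n_total - 1),
        "ss_total": ss_between + ss_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
    }

# Exercise 3: three teaching methods, 21 students in total (inferred).
table = anova_table(488.601, 3709.208, k=3, n_total=21)
f_crit = 3.5546  # F(2, 18) at alpha = 0.05, as quoted in the text
reject = table["f"] > f_crit

print(table)
print("reject H0:", reject)
```

The total sum of squares is simply the sum of the group and error rows, which gives a quick way to catch transcription slips in a table.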

4. An experiment was conducted to compare the wearing qualities of three types of paint. Ten paint specimens were tested for each paint type and the number of hours until visible abrasion was apparent was recorded. Assume that the variances are not significantly different, that the distributions are approximately normal, and that the measures are numerical. Is there evidence to indicate a difference in the three paint types? Give all appropriate information. Each group has 10 readings with the following statistics: $$\begin{array}{l|c|c} & s & \bar{x}\\\hline \textrm{Type 1} & 158.196 & 229.6\\\hline \textrm{Type 2} & 147.874 & 309.9\\\hline \textrm{Type 3} & 196.818 & 427.8\\\hline \end{array}$$ $$SS_{\textrm{between}} = 198772 \quad SS_{\textrm{within}} = 770671$$

ANOVA Table: $$\begin{array}{l|r|r|r|r|} & \textrm{Df} & \textrm{Sum Sq} & \textrm{Mean Sq} & \textrm{F value}\\\hline \textrm{Group (or Factor)} & 2 & 198772 & 99386 & 3.4819\\\hline \textrm{Error (or Residual)} & 27 & 770671 & 28543.37 & \\\hline \textrm{Total} & 29 & 969443 & & \\\hline \end{array}$$ The critical value is $F(2,27) = 3.3541$ for $\alpha = 0.05$. Since $3.4819 > 3.3541$, reject the null hypothesis that there are no differences among the three types of paint; i.e., reject $H_0 : \mu_1 = \mu_2 = \mu_3$. Since there is at least one significant difference, follow with Scheffé tests. There is a significant difference between paint type 1 and paint type 3. Paint type 3 takes significantly longer (in hours) to show abrasion than paint type 1. One should probably purchase type 3.
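The sums of squares and the Scheffé follow-up can both be reproduced from the group means and standard deviations alone. The sketch below assumes the common form of the Scheffé test, in which the pairwise statistic $(\bar{x}_i - \bar{x}_j)^2 / \left(MS_{\textrm{within}}(1/n_i + 1/n_j)\right)$ is compared with $(k-1)$ times the ANOVA critical value; the variable and function names are illustrative.

```python
# Summary statistics from the exercise (hours until visible abrasion).
means = [229.6, 309.9, 427.8]
sds = [158.196, 147.874, 196.818]
n = 10           # specimens per paint type
k = len(means)   # number of paint types
f_crit = 3.3541  # F(2, 27) at alpha = 0.05, as quoted in the text

# With equal group sizes the grand mean is the plain average of the means.
grand_mean = sum(means) / k
ss_between = n * sum((m - grand_mean) ** 2 for m in means)
ss_within = (n - 1) * sum(s ** 2 for s in sds)
ms_within = ss_within / (k * n - k)
f_stat = (ss_between / (k - 1)) / ms_within

def scheffe(i, j):
    """Scheffe statistic comparing groups i and j (0-based indices)."""
    return (means[i] - means[j]) ** 2 / (ms_within * (1 / n + 1 / n))

# Scheffe criterion: (k - 1) times the ANOVA critical value.
scheffe_crit = (k - 1) * f_crit
significant_pairs = [
    (i + 1, j + 1)
    for i in range(k)
    for j in range(i + 1, k)
    if scheffe(i, j) > scheffe_crit
]

print(round(ss_between), round(ss_within), round(f_stat, 4))
print(significant_pairs)
```

Under these assumptions only the comparison of type 1 with type 3 exceeds the Scheffé criterion, which agrees with the conclusion stated above.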

5. It is believed that there is a relationship between intelligence, as measured by IQ scores on the Otis-Lennon Test (OLT), with a population mean of 100 and a standard deviation of 15, and achievement, as measured by the PSAT, with a population mean of 95 and a standard deviation of 15. Bivariate data relating OLT and PSAT scores was obtained from a random sample of 10th graders taking the PSAT at a local high school.

1. Make a scatter diagram
2. Find the correlation coefficient
3. Find the regression line and place the line on the scatter diagram
4. Determine the significance by hypothesis testing techniques
5. A person with a PSAT score of 60 should have an IQ of approximately what? Is this a reasonable model to use for this prediction? Explain clearly.
6. A person with a PSAT score of 95 (the mean) should have an IQ score of approximately what? Is this a reasonable model to use for this prediction? Explain clearly.

(1) For the scatter diagram, keep the $x$-axis between 70 and 115 and the $y$-axis between 90 and 128. (2) $r = 0.88$ (approx.), giving $r^2 = 0.78$ as the coefficient of determination. (3) $y = 0.86x + 26.42$. (4) $H_0 : \rho = 0$ with test statistic $t = 5.2759$. The critical value is $t(8) = 3.355$ at $\alpha = 0.01$. Reject $H_0$. The correlation and regression model is significant, with $p = 0.00075$. (5) A PSAT score of 60 is outside the boundaries of the data set, so this data cannot be used to predict for a PSAT score of 60. (6) This is a significant model and the PSAT score of 95 is within the boundaries of the sample data set, so it is reasonable to use the model to predict: $y = 0.86(95) + 26.42 = 108$ (approximately). Note that the value of 108 is above the mean for the IQ test; this type of result can happen.
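The reported test statistic and coefficient of determination can be checked against each other. Solving $t = r\sqrt{n-2}/\sqrt{1-r^2}$ for $r^2$ gives the identity below; with $n - 2 = 8$ degrees of freedom and $t = 5.2759$ it reproduces the rounded values quoted in the answer:

$$r^2 = \frac{t^2}{t^2 + (n-2)} = \frac{5.2759^2}{5.2759^2 + 8} \approx \frac{27.835}{35.835} \approx 0.777, \qquad r \approx \sqrt{0.777} \approx 0.88$$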

6. You have been asked to determine if one brand of fertilizer is better than another. Seedlings are obtained. Each grouping of seedlings has the same conditions except for the type of fertilizer used. Growth (in inches) for fertilizer A, fertilizer B, and fertilizer C is recorded after 2 months. Answer using the following statistical model:

1. Create an ANOVA table, giving all important information (null hypotheses, test statistic, critical value, conclusion, interpretation). Remember to check for outliers. If there is an outlier, discard it prior to creating the ANOVA table since including an outlier could hinder evaluation. Make sure that the variances are not significantly different.
2. If there is at least one significant difference between fertilizers, perform the appropriate difference of means tests for independent samples.
3. Which fertilizer would you suggest using?

For Type A there is an outlier of 3.0 that needs to be discarded and the mean and standard deviation recalculated. Type A (with outlier): mean is $5.233$, $s = 0.723$. Type A (without outlier): mean is $5.393$, $s = 0.389$. Type B: mean is $5.827$, $s = 0.406$. Type C: mean is $4.233$, $s = 0.385$. Variances are not significantly different. (Check this with the $F$-ratio; remember to square the standard deviations and place the larger variance in the numerator.) ANOVA Table: $$\begin{array}{l|r|r|r|r|} & \textrm{Df} & \textrm{Sum Sq} & \textrm{Mean Sq} & \textrm{F value}\\\hline \textrm{Group (or Factor)} & 2 & 20.297 & 10.149 & 65.506^* \\\hline \textrm{Error (or Residual)} & 41 & 6.352 & 0.155 & \\\hline \textrm{Total} & 43 & 26.649 & & \\\hline \end{array}$$ ${}^*p = 1.7 \times 10^{-13}$, highly significant
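The outlier adjustment, the variance-ratio check, and the error row of the table can be reproduced from the summary values. This is a rough sketch under stated assumptions: the group sizes (15 seedlings per fertilizer before the outlier is discarded) are inferred from the 43 total degrees of freedom and are not given in the exercise, and the critical value for the variance ratio is not computed here.

```python
# Type A summary with the outlier included, as quoted in the answer.
n_with = 15      # assumption: 15 seedlings per group before discarding
mean_with = 5.233
outlier = 3.0

# Dropping one value lowers the sum by that value and the count by one.
mean_without = (n_with * mean_with - outlier) / (n_with - 1)

# Standard deviations and (assumed) group sizes after discarding the outlier.
sds = {"A": 0.389, "B": 0.406, "C": 0.385}
sizes = {"A": 14, "B": 15, "C": 15}

# F-ratio for equal variances: largest variance over smallest variance.
variances = {g: s ** 2 for g, s in sds.items()}
f_ratio = max(variances.values()) / min(variances.values())

# Error (within-group) sum of squares and mean square for the ANOVA table.
ss_within = sum((sizes[g] - 1) * variances[g] for g in sds)
df_within = sum(sizes.values()) - len(sizes)
ms_within = ss_within / df_within

print(round(mean_without, 3), round(f_ratio, 3))
print(df_within, round(ss_within, 3), round(ms_within, 3))
```

The variance ratio comes out only slightly above 1, which is in line with the statement that the variances are not significantly different; the small gap between the computed and tabulated error sum of squares reflects rounding in the quoted standard deviations.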

7. For a random sample of 12 people, their initial weights and weight lost from a diet for one month (both in pounds) are recorded. (a) Draw a scatter diagram. Give the equation for the regression line and draw the regression line on the scatter diagram. (b) Is there a significant linear correlation between the two variables? Give the critical value for the test. (c) What is the best predicted weight loss for an individual with an initial weight of 165 pounds?

$\hat{y} = -18.15 + 0.2285x$; $H_0 : \rho = 0$; $t = 2.19$; the critical values for $\alpha = 0.05$ are $t(10) = \pm 2.228$. Since $|t| = 2.19 < 2.228$, fail to reject the null hypothesis. There is not a significant linear correlation, so do not use the equation to predict. The best prediction is the mean of $y$, which is $22.67$ pounds.
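The decision rule used in this answer (predict from the line only when the correlation is significant, otherwise fall back to the mean of $y$) can be written as a small function. This is an illustrative sketch; the name `best_prediction` and its parameters are hypothetical.

```python
def best_prediction(x, slope, intercept, t_stat, t_crit, mean_y):
    """Use the regression line only if the correlation is significant."""
    if abs(t_stat) > t_crit:
        return intercept + slope * x
    # Not significant: the sample mean of y is the best available prediction.
    return mean_y

# Exercise 7: initial weight of 165 pounds.
prediction = best_prediction(
    x=165, slope=0.2285, intercept=-18.15,
    t_stat=2.19, t_crit=2.228, mean_y=22.67,
)
print(prediction)
```

Had the test statistic exceeded the critical value, the same call would have returned the value on the regression line at $x = 165$ instead of the mean.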