Explained Variation and the Coefficient of Determination

As we consider the total variation of the $y$ values seen in a scatterplot, given by

$$\sum_i (y_i - \overline{y})^2$$

we should realize that when the correlation is significant, the stronger the correlation between our $x$ and $y$, the more of the "spread" of our $y$ values we can account for simply by acknowledging that they lie close to the best-fit line, which, by virtue of the significant correlation, has a non-zero slope and thus "spreads the $y$-values out".

We consequently might wonder what fraction of the total variation might be explained by the linear model itself, and what fraction is still unexplained.

This "explained variation" is the variation of the predicted values $\widehat{y}_i$ about the mean $\overline{y}$:

$$\sum_i (\widehat{y}_i - \overline{y})^2$$

To find a nice, tight expression for the "unexplained variation", consider the following:

Adding and subtracting $\widehat{y}_i$ on the right shows that the following identity holds for every $i$:

$$(y_i - \overline{y}) = (y_i - \widehat{y}_i) + (\widehat{y}_i - \overline{y})$$

Squaring both sides and summing over all $i$, we then have:

$$\sum_i (y_i-\overline{y})^2 = \sum_i (y_i-\widehat{y}_i)^2 + \sum_i (\widehat{y}_i - \overline{y})^2 + \sum_i 2(\widehat{y}_i - \overline{y})(y_i - \widehat{y}_i)$$

We claim the last term is zero. Here's why:


Recall that the best-fit line passes through $(\overline{x}, \overline{y})$, so

$$\widehat{y}_i = mx_i + b, \quad \overline{y} = m\overline{x} + b, \quad \textrm{and} \quad m = \displaystyle{\frac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{\sum_i (x_i - \overline{x})^2}}$$

Subtracting the second equation from the first gives

$$\widehat{y}_i - \overline{y} = m(x_i - \overline{x})$$

and thus

$$y_i - \widehat{y}_i = (y_i - \overline{y}) - (\widehat{y}_i - \overline{y}) = (y_i - \overline{y}) - m(x_i - \overline{x})$$


Substituting these into the cross term and then replacing $m$ inside the parentheses by its defining quotient:

$$\begin{array}{rcl} \displaystyle{\sum_i 2(\widehat{y}_i - \overline{y})(y_i - \widehat{y}_i)} &=& \displaystyle{2m \sum_i (x_i-\overline{x})(y_i-\widehat{y}_i)}\\\\ &=& \displaystyle{2m \sum_i (x_i-\overline{x})\left((y_i - \overline{y}) - m(x_i - \overline{x})\right)}\\\\ &=& \displaystyle{2m \left( \sum_i (x_i-\overline{x})(y_i - \overline{y}) - \sum_i (x_i - \overline{x})^2 \cdot \frac{\sum_j (x_j - \overline{x})(y_j - \overline{y})}{\sum_j (x_j - \overline{x})^2} \right)}\\\\ &=& 2m(0)\\\\ &=& 0 \end{array}$$

With our claim shown, we can now say that

$$\sum_i (y_i-\overline{y})^2 = \sum_i (y_i-\widehat{y}_i)^2 + \sum_i (\widehat{y}_i - \overline{y})^2$$

Certainly the total variation should be the sum of the explained variation and the unexplained variation. Since the second term on the right is the explained variation, the first term on the right, $\sum_i (y_i-\widehat{y}_i)^2$, is referred to as the unexplained variation.
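The decomposition can be checked numerically. The following is a minimal sketch in Python, assuming a small made-up data set (the values in `xs` and `ys` are hypothetical and chosen only for illustration). It fits the least-squares line with the slope and intercept formulas used above, then compares the total variation with the sum of the unexplained and explained variations, and also computes the cross term.

```python
# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept, using the formulas from the text.
m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
b = y_bar - m * x_bar

# Predicted values on the best-fit line.
y_hats = [m * x + b for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)                       # total variation
unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))   # about the line
explained = sum((yh - y_bar) ** 2 for yh in y_hats)             # line about the mean
cross = sum(2 * (yh - y_bar) * (y - yh) for y, yh in zip(ys, y_hats))

print("total:", total)
print("unexplained + explained:", unexplained + explained)
print("cross term:", cross)
```

For this data the total, explained, and unexplained variations come out to about 6, 3.6, and 2.4, and the cross term is zero up to floating-point rounding.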

The Coefficient of Determination

Given the arguments above, we can find the proportion of variation explained by the linear model (i.e., the regression line) by finding the quotient:

$$\frac{\textrm{explained variation}}{\textrm{total variation}} = \frac{\sum_i (\widehat{y}_i-\overline{y})^2}{\sum_i (y_i-\overline{y})^2}$$

but this has a much simpler (and more amazing) form...

Recall that the best-fit line is given by $\widehat{y} = mx + b$, where $m = \displaystyle{\frac{s_{xy}}{s_x^2} = r \frac{s_y}{s_x}}$ and $b = \overline{y} - m\overline{x}$. With sample variances, $\sum_i (x_i - \overline{x})^2 = (n-1)s_x^2$ and $\sum_i (y_i - \overline{y})^2 = (n-1)s_y^2$, so the factors of $n-1$ cancel in the quotient:

$$\begin{array}{rcl} \displaystyle{\frac{\sum_i (\widehat{y}_i - \overline{y})^2}{\sum_i (y_i - \overline{y})^2}} &=& \displaystyle{\frac{\sum_i ((mx_i + b) - (m\overline{x} + b))^2}{\sum_i (y_i - \overline{y})^2}}\\\\ &=& \displaystyle{\frac{\sum_i (mx_i - m\overline{x})^2}{\sum_i (y_i - \overline{y})^2}}\\\\ &=& \displaystyle{\frac{m^2\sum_i (x_i - \overline{x})^2}{\sum_i (y_i - \overline{y})^2}}\\\\ &=& \displaystyle{\frac{m^2 (n-1) s_x^2}{(n-1) s_y^2}}\\\\ &=& \displaystyle{\left( r \frac{s_y}{s_x} \right)^2 \cdot \frac{s_x^2}{s_y^2}}\\\\ &=& r^2 \end{array}$$

Thus $r^2$, which we call the coefficient of determination, gives the fraction of the total variation in $y$ that is explained by the linear model!
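As a sanity check on this result, here is a short Python sketch that computes the quotient of explained to total variation directly and compares it with the square of the correlation coefficient. The helper names `explained_fraction` and `correlation` and the data values are hypothetical, introduced only for this illustration, and the sample (that is, $n-1$) convention is assumed for $s_x$, $s_y$, and $s_{xy}$.

```python
import statistics

def explained_fraction(xs, ys):
    """Explained variation divided by total variation for the least-squares line."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    explained = sum(((m * x + b) - y_bar) ** 2 for x in xs)
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained / total

def correlation(xs, ys):
    """Sample correlation r = s_xy / (s_x * s_y), assuming n - 1 denominators."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return s_xy / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(xs, ys)
print("r^2:", r ** 2)
print("explained / total:", explained_fraction(xs, ys))
```

Both printed values should agree up to floating-point rounding (about 0.6 for this data), which says that roughly 60% of the variation in these $y$ values is accounted for by the regression line.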