As we consider the total **variation** of the $y$ values seen in a scatterplot, given by

$$\sum_i (y_i - \overline{y})^2$$

we might wonder what fraction of this total variation is explained by the linear model itself, and what fraction remains unexplained.

This "explained variation" is quickly seen to be

$$\sum_i (\widehat{y}_i - \overline{y})^2$$

To find a nice, tight expression for the "unexplained variation", consider the following:

Each deviation from the mean splits into a residual plus the deviation of the predicted value from the mean:

$$(y_i - \overline{y}) = (y_i - \widehat{y}_i) + (\widehat{y}_i - \overline{y})$$

Squaring both sides and summing over all $i$, we then have:

$$\sum_i (y_i-\overline{y})^2 = \sum_i (y_i-\widehat{y}_i)^2 + \sum_i (\widehat{y}_i - \overline{y})^2 + \sum_i 2(\widehat{y}_i - \overline{y})(y_i - \widehat{y}_i)$$

We claim the last term is zero. Here's why:

Recall that

$$\widehat{y}_i = mx_i + b, \quad \overline{y} = m\overline{x} + b, \quad \textrm{and} \quad m = \displaystyle{\frac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{\sum_i (x_i - \overline{x})^2}}$$

Subtracting the second equation from the first gives

$$\widehat{y}_i - \overline{y} = m(x_i - \overline{x}),$$

and thus

$$y_i - \widehat{y}_i = (y_i - \overline{y}) - (\widehat{y}_i - \overline{y}) = (y_i - \overline{y}) - m(x_i - \overline{x})$$

Consequently,

$$\begin{array}{rcl} \displaystyle{\sum_i 2(\widehat{y}_i - \overline{y})(y_i - \widehat{y}_i)} &=& \displaystyle{2m \sum_i (x_i-\overline{x})(y_i-\widehat{y}_i)}\\\\ &=& \displaystyle{2m \sum_i (x_i-\overline{x})\left((y_i - \overline{y}) - m(x_i - \overline{x})\right)}\\\\ &=& \displaystyle{2m \left( \sum_i (x_i-\overline{x})(y_i - \overline{y}) - \sum_i (x_i - \overline{x})^2 \cdot \frac{\sum_j (x_j - \overline{x})(y_j - \overline{y})}{\sum_j (x_j - \overline{x})^2} \right)}\\\\ &=& 2m(0)\\\\ &=& 0 \end{array}$$

With our claim shown, we can now say that

$$\sum_i (y_i-\overline{y})^2 = \sum_i (y_i-\widehat{y}_i)^2 + \sum_i (\widehat{y}_i - \overline{y})^2$$

Certainly the total variation should be the sum of the explained variation and the unexplained variation, so the first term on the right is referred to as the **unexplained variation**.
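As a quick numerical sanity check of the decomposition above, here is a minimal Python sketch. The data values and the helper names `fit_line` and `variation_parts` are hypothetical, chosen only for illustration; the formulas for $m$ and $b$ are the ones used in the text.

```python
# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    b = y_bar - m * x_bar
    return m, b


def variation_parts(xs, ys):
    """Return (total, explained, unexplained, cross_term) for the fitted line."""
    m, b = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    y_hat = [m * x + b for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    # The cross term from squaring; the argument above says it vanishes.
    cross = sum(2 * (yh - y_bar) * (y - yh) for y, yh in zip(ys, y_hat))
    return total, explained, unexplained, cross


total, explained, unexplained, cross = variation_parts(xs, ys)
print("total variation:        ", total)
print("explained + unexplained:", explained + unexplained)
print("cross term:             ", cross)
```

Up to floating-point rounding, the first two printed values should agree and the cross term should be zero.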

Given the arguments above, we can find the proportion of variation explained by the linear model (i.e., the regression line) by finding the quotient:

$$\frac{\textrm{explained variation}}{\textrm{total variation}} = \frac{\sum_i (\widehat{y}_i-\overline{y})^2}{\sum_i (y_i-\overline{y})^2}$$

but this has a much simpler (and more amazing) form...

Recall that the best-fit line is given by $\widehat{y} = mx + b$, where $m = \displaystyle{\frac{s_{xy}}{s_x^2} = r \frac{s_y}{s_x}}$ and $b = \overline{y} - m\overline{x}$. Recall also that, under the usual sample-variance convention, $s_x^2$ and $s_y^2$ are the sums $\sum_i (x_i - \overline{x})^2$ and $\sum_i (y_i - \overline{y})^2$, each divided by the same constant $n-1$. Then

$$\begin{array}{rcl} \displaystyle{\frac{\sum_i (\widehat{y}_i - \overline{y})^2}{\sum_i (y_i - \overline{y})^2}} &=& \displaystyle{\frac{\sum_i ((mx_i + b) - (m\overline{x} + b))^2}{\sum_i (y_i - \overline{y})^2}}\\\\ &=& \displaystyle{\frac{\sum_i (mx_i - m\overline{x})^2}{\sum_i (y_i - \overline{y})^2}}\\\\ &=& \displaystyle{\frac{m^2\sum_i (x_i - \overline{x})^2}{\sum_i (y_i - \overline{y})^2}}\\\\ &=& \displaystyle{\frac{m^2 (n-1) s_x^2}{(n-1) s_y^2}}\\\\ &=& \displaystyle{\left( r \frac{s_y}{s_x} \right)^2 \cdot \frac{s_x^2}{s_y^2}}\\\\ &=& r^2 \end{array}$$

Thus $r^2$, which we call the **coefficient of determination**, gives the fraction of the total variation explained by the linear model!
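The identity can also be checked numerically. The following is a minimal Python sketch, assuming the sample ($n-1$) convention for $s_x$, $s_y$, and $s_{xy}$; the data and the function names `correlation` and `explained_fraction` are hypothetical and serve only as an illustration.

```python
import math

# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def mean(values):
    return sum(values) / len(values)


def correlation(xs, ys):
    """Correlation coefficient r = s_xy / (s_x * s_y), using n - 1 divisors."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


def explained_fraction(xs, ys):
    """Explained variation divided by total variation for the best-fit line."""
    x_bar, y_bar = mean(xs), mean(ys)
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    explained = sum((m * x + b - y_bar) ** 2 for x in xs)
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained / total


r = correlation(xs, ys)
fraction = explained_fraction(xs, ys)
print("r squared:         ", r ** 2)
print("explained fraction:", fraction)
```

The two printed values should match up to floating-point rounding, whatever data set is substituted (as long as the $x$ values and the $y$ values are not all identical).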