Descriptive Statistics in R and Excel

Measures of Center

In R, finding means and medians can be found with similarly named functions, as seen below.

> data = c(1,2,2,20,4,5,7,3)
> mean(data)
[1] 5.5
> median(data)
[1] 3.5

There is no built-in function for identifying the mode, although one can write a custom function to this end. Importantly, avoid naming this custom function "mode" (although "Mode" with a capital "M" is ok, as R is case-sensitive). This is due to the fact that there is another function named "mode" that is used for something else in R. An example of such a function is given below, along with an example of its application:

> data = c(1,2,2,20,4,5,7,3)
> Mode = function(x) {}
> Mode = function(x) {
+   ux = unique(x)
+   ux[which.max(tabulate(match(x,ux)))]
+ }
> Mode(data)
[1] 2

In Excel, the mean, median, and mode of a given range $R$ can be quickly calculated with the following functions, respectively: $$AVERAGE(R), \quad MEDIAN(R), \quad \textrm{and} \quad MODE(R)$$

Measures of Spread

In R, one can calculate the variance, standard deviation, and interquartile range (IQR) with functions var(), sd(), and IQR(), as seen in the examples below:

> data = c(1,2,2,20,4,5,7,3)
> var(data)
[1] 38
> sd(data)
[1] 6.164414
> IQR(data)
[1] 3.5

Of course, if one needed the population variance or population standard deviation, one could multiply the variance by the appropriate constant:

> data = c(1,2,2,20,4,5,7,3)
> n = length(data)
> n
[1] 8
> pop.var = var(data) * (n-1) / n
> pop.var
[1] 33.25
> pop.sd = sqrt(pop.var)
> pop.sd
[1] 5.766281

The range of values in a data set can also be easily found with the min(), max(), and range(), as shown below:

> data = c(1,2,2,20,4,5,7,3)
> min(data)
[1] 1
> max(data)
[1] 20
> range(data)
[1]  1 20

In Excel, similar statistics related to the spread of data in a given range $R$ are given by the following functions or expressions:

StatisticFunction
Population VarianceVAR.P(R)
Sample VarianceVAR.S(R)
Population Standard DeviationSTDEV.P(R)
Sample Standard DeviationSTDEV.S(R)
Inter-quartile Range (IQR)QUARTILE(R,3) - QUARTILE(R,1)
MaximumMAX(R)
MinimumMIN(R)



Identifying Outliers

To identify any data that fall more than 1.5 IQR's below $Q_1$ or above $Q_3$, one can use the quantile() function, which gives the cutoff data value under which some percentage of the data lie.

As an example of how this function works, suppose we have the following variable defined:

> grades = c(86,92,100,93,89,95,79,98,68,62,71,75,88,86,93,81,100,86,96,52,15)
We can find $Q_1$ with quantile(grades,0.25), since 25% of the data lies below $Q_1$.

Similarly, quantile(grades,0.75) will give $Q_3$.

Using these, we can quickly identify any outliers with the following:

> lower.bound = quantile(grades,0.25)-1.5*IQR(grades)
> upper.bound = quantile(grades,0.75)+1.5*IQR(grades)
> c(lower.bound,upper.bound)
25% 75% 
 48 120 
> grades[grades < lower.bound | grades > upper.bound]
[1] 15
Thus, $15$ is an outlier as it falls below $Q_1 = 48$.

Of course, one can also identify outliers under appropriate circumstances by finding those data values that fall outside of 3 standard deviations from the mean. This is done in a straight-forward way, as shown in the example below:

> grades = c(86,92,100,93,89,95,79,98,68,62,71,75,88,86,93,81,100,86,96,52,15)
> lower.bound = mean(grades) - 3*sd(grades)
> upper.bound = mean(grades) + 3*sd(grades)
> c(lower.bound,upper.bound)
[1]  21.58463 140.79632
> grades[grades < lower.bound | grades > upper.bound]
[1] 15