The Central Limit Theorem tells us that the distribution of sample means $\overline{x}$, of samples of size $n$ taken from any given population (with finite variance):

- becomes more "normal" in shape as $n$ increases;
- has a mean that agrees with the population mean, $\mu$; and
- has a standard deviation equal to $\sigma/\sqrt{n}$, where $\sigma$ is the standard deviation of the population.
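As a quick numerical check of the last two claims, the following sketch simulates sample means from a skewed population. The exponential population, the seed, the sample size of 30, and the 5000 repetitions are all illustrative assumptions, not part of the project.

```r
set.seed(1)                                   # assumption: fixed seed, only for repeatability
population = rexp(10000, rate = 1)            # illustrative skewed (non-normal) population
mu = mean(population)                         # population mean
sigma = sqrt(mean((population - mu)^2))       # population sd (divides by N, not N - 1)

n = 30                                        # sample size
# 5000 sample means, each from a sample of size n drawn with replacement
xbars = replicate(5000, mean(sample(population, n, replace = TRUE)))

# The two pairs printed below should each be close to one another
print(c(mean.of.means = mean(xbars), mu = mu))
print(c(sd.of.means = sd(xbars), theory = sigma / sqrt(n)))
```

Calling `hist(xbars)` afterwards gives a rough visual check of the first claim as well.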

In this project, we will construct a population and then approximate the distributions of sample means for various sample sizes through repeated sampling, so that we can "see" this theorem in action through a sequence of histograms, as suggested by the graphic below.

First, we need a population with which to work. Ideally, it will be "far from normal" so that we can see the transition from a non-normal distribution when $n$ is small to a much more normal one when $n$ is large.

For convenience, one can use the following code to construct a population of 10000 values from 1 to 100 that follows a non-normal distribution.

    rprobs = sample(1:5,5)
    rprobs = rep(16*rprobs,each=20)
    rprobs = rprobs + 20*runif(100)
    rprobs = rprobs / sum(rprobs)
    pop = sample(1:100,size=10000,replace=TRUE,prob=rprobs)
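Since the overlay described later needs the population's mean and standard deviation, it may help to compute them right after building `pop`. The sketch below repeats the construction so that it runs on its own. The seed and the names `mu` and `sigma` are assumptions made for illustration. Note that R's built-in `sd()` divides by $N-1$, so it is not exactly the population standard deviation, although the difference is tiny for 10000 values.

```r
set.seed(42)                                  # assumption: fixed seed, only for repeatability
rprobs = sample(1:5, 5)                       # random ordering of five weight levels
rprobs = rep(16 * rprobs, each = 20)          # five plateaus, 20 values wide each
rprobs = rprobs + 20 * runif(100)             # add noise to each of the 100 weights
rprobs = rprobs / sum(rprobs)                 # normalize so the weights sum to 1
pop = sample(1:100, size = 10000, replace = TRUE, prob = rprobs)

mu = mean(pop)                                # population mean
sigma = sqrt(mean((pop - mu)^2))              # population sd (divides by N)
print(c(mu = mu, sigma = sigma, sd.builtin = sd(pop)))
```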

Next, write a function `population.hist(pop)` that displays a histogram of the population represented by the vector `pop`, which consists of some number of values, each between 1 and 100, inclusive.

Additionally:

- Classes for the histogram should start at $x=0.5$ and be $5$ units wide.
- The bars of the histogram should be colored "skyblue".
- The title of the histogram should be "Population".
- There should be no label on the $x$-axis.
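One possible shape for such a function is sketched below; it is a minimal attempt at the listed requirements, not the only way to meet them. The variable `class.breaks` and the stand-in population used to exercise the function are assumptions made for illustration.

```r
population.hist = function(pop) {
  # Class boundaries 0.5, 5.5, ..., 100.5 give 20 classes, each 5 units wide
  class.breaks = seq(0.5, 100.5, by = 5)
  hist(pop, breaks = class.breaks, col = "skyblue",
       main = "Population", xlab = "")
}

# Stand-in population (an assumption), used only to exercise the function
set.seed(7)
pop = sample(1:100, size = 10000, replace = TRUE)
h = population.hist(pop)                      # hist() also returns the histogram object
```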

Then, create a second function `sample.means(pop,sample.size,n,title,show.overlay)` that draws `n` samples of size `sample.size` from the population `pop`, computes their means, and displays a histogram of these means.

Additionally:

- Classes for the histogram should start at $x=0.5$ and be 5 units wide.
- The bars of the histogram should be colored "green".
- The title of this histogram should be taken from the `title` argument.
- There should be no label on the $x$-axis.
- If the `show.overlay` argument supplied is `TRUE`, a normal curve centered at the population mean $\mu$ with standard deviation $\sigma/\sqrt{n}$ should be drawn on top of the histogram (scaled vertically, so as to approximate the histogram's shape). Here $\sigma$ is the population's standard deviation and $n$ is the sample size, that is, the `sample.size` argument rather than the argument named `n`. Further, a blue vertical line that extends from the bottom to the top of the histogram should mark the position $x = \mu$, and a red horizontal line segment that extends from the aforementioned blue line to the right, with length $\sigma/\sqrt{n}$, should be drawn to indicate the spread of the normal curve. The height of this red line segment should be such that it terminates at a point on the normal curve.
- If the `show.overlay` argument is `FALSE`, the normal curve and the blue and red lines mentioned above should not be displayed.
- When the overlay of the normal curve is drawn, the top of the curve should always be visible.
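The requirements above could be approached along the following lines. This is a hedged sketch, not a reference solution. It assumes samples are drawn with replacement, which the text does not specify. It also scales the normal density by (number of samples) × (class width), one common way to turn a density into expected counts per class. The local names `class.breaks`, `se`, `count.scale`, and `peak` are made up for illustration, as is the stand-in population at the end.

```r
sample.means = function(pop, sample.size, n, title, show.overlay) {
  # n sample means, each from a sample of size sample.size
  # (assumption: sampling with replacement)
  means = replicate(n, mean(sample(pop, sample.size, replace = TRUE)))
  class.breaks = seq(0.5, 100.5, by = 5)
  h = hist(means, breaks = class.breaks, plot = FALSE)

  mu = mean(pop)                                       # population mean
  se = sqrt(mean((pop - mu)^2)) / sqrt(sample.size)    # sigma / sqrt(sample size)
  count.scale = n * 5                                  # samples * class width
  peak = count.scale * dnorm(mu, mu, se)               # top of the scaled curve

  # Leave room for the curve's peak whenever the overlay is drawn
  y.max = max(h$counts, if (show.overlay) peak else 0)
  plot(h, col = "green", main = title, xlab = "", ylim = c(0, 1.05 * y.max))

  if (show.overlay) {
    curve(count.scale * dnorm(x, mu, se), add = TRUE, n = 501)
    abline(v = mu, col = "blue")                       # spans the full plot height
    y.seg = count.scale * dnorm(mu + se, mu, se)       # curve height at mu + se
    segments(mu, y.seg, mu + se, y.seg, col = "red")   # ends on the curve
  }
  invisible(means)                                     # returned for inspection only
}

# Stand-in population (an assumption), used only to exercise the function
set.seed(3)
pop = sample(1:100, size = 10000, replace = TRUE)
xbars = sample.means(pop, 4, 1000, "Sample Means (n=4)", TRUE)
```

Computing the histogram with `plot=FALSE` first lets the tallest bar be compared with the curve's peak before anything is drawn, which is how this sketch keeps the top of the curve visible.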

The following observations may help in doing the above:

- The $y$-range for the plot can be specified by the `ylim` parameter. For example, if `ylim=c(0,1000)` is used as an argument to the `hist()` function, $y$-coordinates from 0 to 1000 can be seen in the plot.
- Plots and histograms need not be displayed. One can suppress their display with an argument of `plot=FALSE`.
- Histograms can be assigned to variables, and a vector of the frequency counts associated with their bars can be accessed, as suggested by the following:

      h = hist(v); bar.counts = h$counts;
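These three observations combine naturally: compute the histogram without drawing it, read off the tallest bar, and then draw it with a `ylim` chosen from that height. The small vector `v`, the variable `tallest`, and the 20% headroom below are illustrative assumptions.

```r
v = c(1, 2, 2, 3, 3, 3, 8)                    # tiny illustrative data set
h = hist(v, breaks = seq(0.5, 10.5, by = 5), plot = FALSE)   # computed, not drawn
bar.counts = h$counts                         # 6 values in the first class, 1 in the second
tallest = max(bar.counts)

# Draw the stored histogram with 20% headroom above the tallest bar
plot(h, ylim = c(0, 1.2 * tallest))
```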

Finally, the `par()` function in R can be used to show multiple plots/histograms simultaneously in a grid. Use this, as shown in the code below, to test your work. An image similar to the one shown at the top of this page should result.

    par(mfcol=c(3,3))
    population.hist(pop)
    sample.means(pop,2,1000,"Sample Means (n=2)",FALSE)
    sample.means(pop,2,1000,"Sample Means (n=2)",TRUE)
    population.hist(pop)
    sample.means(pop,4,1000,"Sample Means (n=4)",FALSE)
    sample.means(pop,4,1000,"Sample Means (n=4)",TRUE)
    population.hist(pop)
    sample.means(pop,15,1000,"Sample Means (n=15)",FALSE)
    sample.means(pop,15,1000,"Sample Means (n=15)",TRUE)