Data is a collection of values realized by a random variable.
Statistics is the science of planning and conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
A population is the set of all subjects under study.
A census is the collection of data from every member of the population.
When a population is very large, conducting a census can be difficult, costly, and impractical. As such, we often study a subset of members selected from a population, called a sample.
In statistics one often hopes to generalize from knowledge of sample data to knowledge of a larger population.
As such, we hope that the sample we take is representative of the population. When this doesn't happen, we say that our sample is biased. For example, suppose we were interested in the proportion of Americans who believe abortion is morally wrong. We probably would not want to use 100 people selected from the member list of a local Catholic church as the sample. Such a sample would most likely be strongly biased, as it would fail to represent the views of the entire population.
Bias is the bane of statisticians. We do everything we can to avoid it. Our most powerful weapon to combat bias is randomization.
Consider the following excerpt from Intro Stats by De Veaux, Velleman, and Bock, regarding cooking a large pot of soup:
Suppose you add some salt to the pot. If you sample it from the top before stirring, what will happen? With the salt sitting on top, you'll get the misleading idea that the whole pot is salty. If you sample from the bottom, you'll get an equally misleading idea that the whole pot is bland. By stirring, you randomize the amount of salt throughout the pot, making each taste more typical in terms of the amount of salt in the whole pot.
Randomization can protect you against factors that you know are in the data. It can also help protect against factors that you aren't even aware of. Suppose, while you weren't looking, a friend added a handful of peas to the soup. They are down at the bottom of the pot, mixing with the other vegetables. If you don't randomize the soup by stirring, your test spoonful from the top won't have any peas. By stirring in the salt, you also randomize the peas throughout the pot, making your sample taste more typical of the overall pot even though you didn't know the peas were there. So randomizing protects us by giving us a representative sample even over effects we were unaware of...
...It does that by making sure that on average the sample looks like the rest of the population.
Now let's get down to the nitty-gritty details of how we use randomization to create a sample. There are many approaches and some potential pitfalls, as detailed below:
A simple random sample is one where every combination (of a given size) of members of a given population has an equal likelihood of being chosen as the sample.
Be careful here -- simply having each member of the population with an equal likelihood of being selected for the sample is not enough! While one can call this a random sample, there is still a risk of bias. For example, suppose we have a population of 100 people, 50 men and 50 women. Suppose we create our sample by flipping a coin: if it's heads, we choose 10 men at random, while if it's tails, we choose 10 women at random. Every member of the population would then have an equal likelihood of being in the sample, but every sample produced this way would be incredibly biased, since it would consist of only one gender.
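As a minimal sketch in Python (the labels M1, W1, etc. are made up purely for illustration), here is the contrast between a genuine simple random sample and the flawed coin-flip scheme just described:

```python
import random

# Hypothetical population: 50 men and 50 women, labeled for illustration.
men = [f"M{i}" for i in range(1, 51)]
women = [f"W{i}" for i in range(1, 51)]
population = men + women

# Simple random sample: every possible group of 10 is equally likely.
srs = random.sample(population, 10)

# The flawed coin-flip scheme: each individual has the same chance of being
# selected, but any one sample is all men or all women -- badly biased.
flawed = random.sample(men if random.random() < 0.5 else women, 10)

print("simple random sample:", srs)
print("coin-flip sample:    ", flawed)
```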
There are several specific ways to create a good (unbiased) sample:
A systematic sample is one where the members of the population being sampled are put into some order (with no correlation to what this sample will be used to investigate). A starting position in this ordering is chosen at random and then every kth member in the ordering is selected until one has a sample of an appropriate size.
(I always think of a story I heard once regarding the origin of the word "decimate", when I think of systematic sampling... Supposedly, the word comes from a disciplinary method used by the Romans. When an army performed poorly during battle, the soldiers would be lined up, and then every tenth soldier would be killed! Definitely encourages people to fight harder, don't you think?)
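To see the mechanics, here is a short sketch of systematic sampling, assuming a made-up ordered population of 200 numbered members and a step size of k = 10:

```python
import random

# Hypothetical ordered population of 200 members.
population = list(range(1, 201))
k = 10

# Choose a random starting position among the first k members,
# then take every k-th member from there on.
start = random.randrange(k)
sample = population[start::k]

print("starting position:", start)
print("systematic sample:", sample)
```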
Using a stratified sample is a good idea when the variable you are interested in studying is clearly affected by some other (categorical) variable.
For example, suppose a manufacturing company employs 16 managers and 200 laborers. This company wants to ask a sample of 54 of its employees about whether or not they would support a raise in the managers' salaries. Clearly, we want a fair representation of managers in our sample. Too many, and the laborers might think the sample chosen "loaded the dice", so to speak. Too few, and the managers might similarly complain. We want to keep things fair, so 16 managers out of a total of 216 employees is about 7.4%. 7.4% of the sample of 54 should then be managers. This means we need exactly 4 managers and 50 laborers. So we randomly select 4 managers from the group of 16 managers, and 50 laborers from the group of 200 laborers and then put them together to form our sample.
In this example, the variable we were interested in was "whether or not they supported a raise in the managers' salaries". The (categorical) variable we suspected played a strong role in the answer to that question was "whether they were a manager or laborer". So we split the population into strata, managers and laborers. Then we randomly selected the correct number of individuals from each stratum so that their proportions, when combined together into a single sample, would be representative of the population examined.
In other words, we forced a population representative balance between managers and laborers. This is what stratified sampling is all about, forcing a population representative balance with regard to a variable that we see as affecting the overall outcome of our investigation.
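The manufacturing-company example above might look like the following sketch in Python (the employee names are invented for illustration):

```python
import random

# Hypothetical employee lists matching the example: 16 managers, 200 laborers.
managers = [f"manager_{i}" for i in range(1, 17)]
laborers = [f"laborer_{i}" for i in range(1, 201)]

sample_size = 54
total = len(managers) + len(laborers)                      # 216 employees

# Allocate the sample in proportion to each stratum: 16/216 of 54 is 4.
n_managers = round(sample_size * len(managers) / total)   # 4 managers
n_laborers = sample_size - n_managers                      # 50 laborers

# Randomly select within each stratum, then combine into one sample.
sample = random.sample(managers, n_managers) + random.sample(laborers, n_laborers)
print(len(sample), "employees:", n_managers, "managers and", n_laborers, "laborers")
```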
When the population is already broken up into groups (or clusters) that are each representative of the population, randomly selecting some number of these clusters should provide a good sample. This is called cluster sampling.
For example, suppose we wish to determine which candidate in an upcoming national election is more favored by Southern Baptists in the downtown Atlanta area. If we believe all of the Southern Baptist churches in the downtown Atlanta area are relatively similar to one another, then we can simply select two or three churches' members to use as our sample. Each church plays the role of a "cluster", in this case.
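A rough sketch of cluster sampling, with made-up churches and membership lists standing in for the clusters, might look like this:

```python
import random

# Hypothetical clusters: each church's membership list is one cluster.
clusters = {
    "church_A": ["A1", "A2", "A3"],
    "church_B": ["B1", "B2", "B3", "B4"],
    "church_C": ["C1", "C2"],
    "church_D": ["D1", "D2", "D3"],
    "church_E": ["E1", "E2", "E3", "E4", "E5"],
}

# Randomly pick two clusters and take every member of each as the sample.
chosen = random.sample(list(clusters), 2)
sample = [member for church in chosen for member in clusters[church]]

print("chosen clusters:", chosen)
print("sample:", sample)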
Sequence sampling, used frequently in quality control, involves taking a randomly chosen sequence of successive units from a production line as the sample.
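A small sketch, assuming a made-up production run of 500 serial-numbered units and a run length of 20:

```python
import random

# Hypothetical production line output: 500 units, identified by serial number.
production_run = [f"unit_{i:03d}" for i in range(1, 501)]
run_length = 20

# Pick a random starting point, then take that many successive units.
start = random.randrange(len(production_run) - run_length + 1)
sample = production_run[start:start + run_length]

print("sequence sample:", sample[0], "through", sample[-1])
```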
Another method that gets used a lot, but probably shouldn't, is convenience sampling. This involves just grabbing members of the population that are convenient. Think about the risks involved here, however. There is a serious chance of introducing bias. Consider the researcher interested in the average view of citizens with regard to what is an acceptable number of drinks someone can have and still drive a car effectively. He decides he will set up a stand on the sidewalk and ask passersby what they think -- that's pretty easy and convenient. Only after he collects his data does he turn around and realize his stand is set up right outside of a bar! Do you think his sample is going to be representative? Probably not.
That said, sometimes a convenience sample is the best we can do.
Descriptive statistics uses data to provide descriptions of a population, through numerical calculations, graphs, or tables.
Inferential statistics makes inferences and predictions about populations based on samples of data taken from those populations.
Keep in mind that samples almost never behave exactly like the population they are meant to represent. So there will almost always be some differences between them.
When these differences are due to the natural variability inherent in choosing different random samples, this is called sampling error. This type of error is not only expected, but actually becomes very useful in making statistical inferences.
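To make sampling error concrete, here is a small simulation, assuming a made-up population of 10,000 values: each random sample has a slightly different mean, and that variation is the sampling error.

```python
import random

# Hypothetical population of 10,000 values.
random.seed(1)
population = [random.gauss(100, 15) for _ in range(10_000)]
pop_mean = sum(population) / len(population)

# Different random samples give slightly different means -- that natural
# variation is sampling error, and it is expected.
for _ in range(5):
    sample = random.sample(population, 50)
    sample_mean = sum(sample) / len(sample)
    print(f"sample mean: {sample_mean:6.2f}   (population mean: {pop_mean:6.2f})")
```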
When these differences are due to sample data being collected, recorded, or analyzed incorrectly (e.g., selecting a biased sample, using a defective measuring instrument, or transcription errors), we call this nonsampling error. This type of error is much worse -- it can completely invalidate the related statistical results.
Importantly, because we expect (for good reasons or bad) that samples will never match populations exactly, we can never prove any result about a population by analyzing only samples taken from it.