**Data** is a collection of values realized by a random variable.

**Statistics** is the science of planning and conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.

A **population** is the set of all subjects under study.

A **census** is the collection of data from every member of the population.

When a population is very large, conducting a census can be difficult, costly, and impractical. As such, we often study a subset of members selected from a population, called a **sample**.

In statistics one often hopes to generalize from knowledge of sample data to knowledge of a larger population.

As such, we hope that the sample we take is representative of the population. When it is not, we say that our sample is **biased**. For example, suppose we were interested in the proportion of Americans who believe abortion is morally wrong. We probably would not want to use 100 people selected from the member list of a local Catholic church as the sample. Such a sample would most likely be strongly biased, as it would fail to represent the views of the entire population.

Bias is the bane of statisticians. We do everything we can to avoid it. Our most powerful weapon to combat bias is randomization.

Consider the following excerpt from *Intro Stats* by De Veaux, Velleman,
and Bock, regarding cooking a large pot of soup:

> Suppose you add some salt to the pot. If you sample it from the top before stirring, what will happen? With the salt sitting on top, you'll get the misleading idea that the whole pot is salty. If you sample from the bottom, you'll get an equally misleading idea that the whole pot is bland. By stirring, you *randomize* the amount of salt throughout the pot, making each taste more typical in terms of the amount of salt in the whole pot.
>
> Randomization can protect you against factors that you know are in the data. It can also help protect against factors that you aren't even aware of. Suppose, while you weren't looking, a friend added a handful of peas to the soup. They are down at the bottom of the pot, mixing with the other vegetables. If you don't randomize the soup by stirring, your test spoonful from the top won't have any peas. By stirring in the salt, you also randomize the peas throughout the pot, making your sample taste more typical of the overall pot *even though you didn't know the peas were there*. So randomizing protects us by giving us a representative sample *even over effects we were unaware of*...
>
> ...It does that by making sure that *on average* the sample looks like the rest of the population.

Now let's get down to the nitty-gritty details of how we use randomization to create a sample. There are many approaches and some potential pitfalls, as detailed below.

A **simple random sample** is one where every combination (of a given size) of
members of a given population has an equal likelihood of being chosen as the
sample.
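As a minimal sketch of this definition, the following Python snippet draws a simple random sample using the standard library. The function name `simple_random_sample`, the numbered population, and the seed are all illustrative assumptions, not part of the definition itself.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every size-n subset is equally likely.

    Hypothetical helper for illustration; `seed` only makes the draw repeatable.
    """
    rng = random.Random(seed)
    # random.Random.sample selects without replacement.
    return rng.sample(list(population), n)

# Assumed toy population: people labeled 1 through 100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=0)
print(sample)
```

Because the selection is made without replacement from the whole population at once, no particular group of 10 is favored over any other.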

Be careful here: simply giving each member of the population an equal likelihood of being selected for the sample is not enough! While one can call this a random sample, there is a risk of bias. For example, suppose we have a population of 100 people, 50 men and 50 women. Suppose we create our sample by flipping a coin: if it's heads, we choose 10 men at random, while if it's tails, we choose 10 women at random. Every member of the population would then have an equal likelihood (1/2 × 10/50 = 1/10) of being in the sample, but every sample produced this way would be severely biased, since it would consist of only one gender.
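The coin-flip scheme can be checked with a short simulation. This is only a sketch under assumed names (`men`, `women`, `coin_flip_sample` are hypothetical); it computes each person's chance of selection exactly and then confirms that the simulated samples all contain a single gender.

```python
import random
from fractions import Fraction

# Assumed labels: "M0".."M49" for the men, "W0".."W49" for the women.
men = [f"M{i}" for i in range(50)]
women = [f"W{i}" for i in range(50)]

def coin_flip_sample(rng):
    """Flip a fair coin, then pick 10 people at random from one gender only."""
    group = men if rng.random() < 0.5 else women
    return rng.sample(group, 10)

# Chance that any one person is selected under each scheme.
p_coin_flip = Fraction(1, 2) * Fraction(10, 50)  # coin flip, then 10 of 50
p_srs = Fraction(10, 100)                        # simple random sample of 10 from 100

rng = random.Random(0)
samples = [coin_flip_sample(rng) for _ in range(1000)]
# The first character of each label is the gender marker.
all_single_gender = all(len({person[0] for person in s}) == 1 for s in samples)

print(p_coin_flip, p_srs, all_single_gender)
```

Both schemes give each individual the same chance of selection, yet only the simple random sample can produce a mixed group, which is why equal individual probabilities alone do not guarantee a representative sample.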

There are several specific ways to create a good (unbiased) sample:

A **systematic sample** is one where the members of the population being
sampled are put into some order (with no correlation to what this sample
will be used to investigate). A starting position in this ordering is
chosen at random and then every kth member in the ordering is selected
until one has a sample of an appropriate size.
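A minimal sketch of systematic sampling follows. It assumes one common variant in which the random start falls among the first `k` positions; the function name `systematic_sample` and the toy population are hypothetical.

```python
import random

def systematic_sample(ordered_population, k, n, seed=None):
    """Pick a random start among the first k positions, then take every kth member.

    Assumes the ordering is unrelated to the variable being studied.
    """
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting index in 0..k-1
    # Step through the ordering k at a time and keep at most n members.
    return ordered_population[start::k][:n]

# Assumed toy population: 100 members already placed in some order.
ordered = list(range(1, 101))
picked = systematic_sample(ordered, k=10, n=10, seed=1)
print(picked)
```

Only the starting position is random here, so the quality of the sample depends heavily on the ordering having no pattern related to what is being measured.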

*(Whenever I think of systematic sampling, I recall a story I once heard regarding the origin of the word "decimate". Supposedly, the word comes from a disciplinary method used by the Romans. When an army performed poorly during battle, the soldiers would be lined up, and then every tenth soldier would be killed! Definitely encourages people to fight harder, don't you think?)*

Using a **stratified sample** is a good idea when the variable you are
interested in studying is clearly affected by some other (categorical)
variable.

For example, suppose a manufacturing company employs 16 managers and 200
laborers. This company wants to ask a sample of 54 of its employees about
whether or not they would support a raise in the managers' salaries.
Clearly, we want a *fair* representation of managers in our sample. Too
many, and the laborers might think the sample chosen "loaded the dice", so
to speak. Too few, and the managers might similarly complain. We want to
keep things fair, so 16 managers out of a total of 216 employees is about
7.4%. 7.4% of the sample of 54 should then be managers. This means we
need exactly 4 managers and 50 laborers. So we randomly select 4 managers
from the group of 16 managers, and 50 laborers from the group of 200
laborers and then put them together to form our sample.

In this example, the variable we were interested in was "whether or not
they supported a raise in the managers' salaries". The (categorical)
variable we suspected played a strong role in the answer to that question
was "whether they were a manager or laborer". So we split the population
into **strata**, managers and laborers. Then we randomly selected the
correct number of individuals from each stratum so that their proportions,
when combined into a single sample, would be representative of the
population examined.

In other words, we forced a population-representative balance between managers and laborers. This is what stratified sampling is all about: forcing a population-representative balance with regard to a variable that we see as affecting the overall outcome of our investigation.
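The manager and laborer example can be sketched in Python as follows. The helper `proportional_allocation` and the employee labels are hypothetical, and simple rounding is assumed for the allocation; in general, rounded counts may not add up to the desired sample size and would need adjusting.

```python
import random

def proportional_allocation(strata_sizes, sample_size):
    """Give each stratum a share of the sample matching its share of the population.

    Uses plain rounding, which happens to be exact for the example below.
    """
    total = sum(strata_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}

# Strata from the example: 16 managers and 200 laborers, sample of 54.
strata = {
    "managers": [f"manager_{i}" for i in range(16)],
    "laborers": [f"laborer_{i}" for i in range(200)],
}
sizes = {name: len(members) for name, members in strata.items()}
allocation = proportional_allocation(sizes, 54)

rng = random.Random(0)
stratified = []
for name, members in strata.items():
    # Simple random sample within each stratum, then combine.
    stratified.extend(rng.sample(members, allocation[name]))

print(allocation, len(stratified))
```

The allocation step fixes how many people come from each stratum, and randomness is used only within each stratum.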

When the population is already broken up into groups (or clusters) that
are each representative of the population, randomly selecting some number of these
clusters should provide a good sample. This is called **cluster sampling**.

For example, suppose we wish to determine which candidate in an upcoming national election is more favored by Southern Baptists in the downtown Atlanta area. If we believe all of the Southern Baptist churches in the downtown Atlanta area are relatively similar to one another, then we can simply select two or three churches' members to use as our sample. Each church plays the role of a "cluster", in this case.
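Cluster sampling can be sketched in the same way. The church names, member labels, and the function `cluster_sample` below are made up for illustration; the key assumption, as in the text, is that each cluster resembles the population as a whole.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly choose whole clusters and use all of their members as the sample."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    members = [person for name in chosen for person in clusters[name]]
    return chosen, members

# Hypothetical clusters: eight churches with 40 members each.
churches = {
    f"church_{c}": [f"church_{c}_member_{i}" for i in range(40)]
    for c in "ABCDEFGH"
}
chosen_churches, cluster_members = cluster_sample(churches, 3, seed=0)
print(chosen_churches, len(cluster_members))
```

Here the random selection happens at the level of clusters rather than individuals, which is what distinguishes this approach from stratified sampling.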

**Sequence sampling**, used frequently in quality control, involves
randomly selecting a sequence of successive units from a production
line as the sample.
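One way to read this definition, sketched below, is to pick a random starting point on the line and take the next several units in a row. The function `sequence_sample` and the numbered units are assumptions for illustration only.

```python
import random

def sequence_sample(units, length, seed=None):
    """Choose a random starting unit, then take `length` successive units."""
    rng = random.Random(seed)
    # The start must leave room for a full run of `length` units.
    start = rng.randrange(len(units) - length + 1)
    return units[start:start + length]

# Assumed production run: units numbered in the order they came off the line.
production_line = list(range(1, 501))
run = sequence_sample(production_line, 20, seed=3)
print(run[0], run[-1])
```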

Another method that gets used a lot, but probably shouldn't be, is **convenience sampling**.
This involves just grabbing members of the population that
are convenient. Think about the risks involved here, however. There is a
serious chance of introducing bias. Consider the researcher interested in
the average view of citizens with regard to what is an acceptable number
of drinks someone can have and still drive a car effectively. He decides
he will set up a stand on the sidewalk and ask passersby what they think
-- that's pretty easy and convenient. Only after he collects his data does he
turn around and realize that his stand is set up right outside of a bar! Do
you think his sample is going to be representative? Probably not.

That said, sometimes a convenience sample is the best we can do.

**Descriptive statistics** uses data to provide descriptions of a population, either through numerical calculations or graphs or tables.

**Inferential statistics** makes inferences and predictions about populations based on samples of data taken from those populations.

Keep in mind that samples almost never behave exactly like the population they are meant to represent. So there will almost always be some differences between them.

When these differences are due to the natural variability inherent in choosing different random samples, this is called **sampling error**. This type of error is not only expected, but actually becomes very useful in making statistical inferences.
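Sampling error is easy to see in a simulation. The sketch below uses an invented population (10,000 values drawn from a normal distribution with mean 50 and standard deviation 10) and repeatedly takes simple random samples of size 100; all of these numbers are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(0)

# Invented population of 10,000 measurements.
population = [rng.gauss(50, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Take 200 different random samples of size 100 and record each sample mean.
sample_means = [statistics.mean(rng.sample(population, 100)) for _ in range(200)]

# Sampling error: how far each sample mean lands from the population mean.
errors = [m - population_mean for m in sample_means]
spread = statistics.stdev(sample_means)

print(round(population_mean, 2), round(min(errors), 2), round(max(errors), 2))
print(round(spread, 2))
```

No single sample mean matches the population mean exactly, but the sample means cluster around it, and the size of that spread is the kind of information inferential statistics puts to use.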

When these differences are due to sample data being collected, recorded, or analyzed incorrectly (e.g., selecting a biased sample, using a defective measuring instrument, or transcription errors), we call this **nonsampling error**. This type of error is much worse -- it can completely invalidate the related statistical results.

Importantly -- as we expect (for good reasons or bad) that samples will never match populations exactly, we consequently can never *prove* any result about a population by only analyzing samples taken from it.

Used well, statistics lets us:

- Organize, summarize, and present data
- Generalize from knowledge of sample data to (probably accurate) knowledge of a larger population
- Test hypotheses
- Determine relationships between variables
- Make predictions from existing data

Statistics can also be misused. Common abuses include:

- Quoting statistics based on non-representative samples
- Choosing the "average" value for a sample which most lends itself to your position, when a different "average" value would be more appropriate
- Speaking of changes in a variable in terms of actual values or percentages to either inflate or deflate their importance psychologically. (How happy would you be if your net worth increased by $10,000,000? What if that represented only a 0.3% increase?)
- Using detached statistics like "1/3 fewer carbs" (fewer than what?)
- Implying causal connections between variables without a well-designed experiment to back them up (e.g., "Doctors say that taking lipotrim twice a day *may reduce* your weight by up to 30 lbs in the first 2 weeks!")
- Formatting graphs to mislead the eye
- Designing questions to be used on a survey that will bias the results