AP Statistics - Unit 3: Collecting Data

3.1 Introducing Statistics: Do the Data We Collected Tell the Truth?

Units 1 and 2 focused on analyzing data. But where does data come from? How we collect data determines what conclusions we can draw. Poor data collection can lead to misleading or completely wrong conclusions.

The Big Question: Can we trust our data to tell us the truth about the population we're interested in?

The answer depends on how the data was collected.

Two Ways to Collect Data

Observational Study

Definition: Researchers observe and record data without attempting to influence the responses.

  • No treatment is imposed
  • Researchers simply watch and measure
  • Can show association but NOT causation

Example: Surveying people about their exercise habits and health outcomes.

Experiment

Definition: Researchers deliberately impose treatments on subjects to observe their effects.

  • A treatment is actively applied
  • Researchers manipulate variables
  • CAN establish causation (if well-designed)

Example: Randomly assigning people to exercise programs and measuring health outcomes.

The Golden Rule of Statistics

Only a well-designed EXPERIMENT can establish CAUSE and EFFECT.

Observational studies can only show association; there may be lurking variables!

Why Can't Observational Studies Show Causation?

Confounding Variable (Lurking Variable): A variable that is associated with both the explanatory variable and the response variable, making it impossible to determine which is actually causing the effect.

Classic Example: Studies show that people who drink wine have better heart health than those who don't.

Can we conclude wine CAUSES better heart health? NO!

Confounding variables:

  • Income (wine drinkers may be wealthier → better healthcare)
  • Diet (wine drinkers may eat healthier overall)
  • Lifestyle (wine drinkers may exercise more)

Without randomly assigning people to drink wine or not, we can't separate the effect of wine from these other factors.
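A small simulation can make the wine example concrete. The sketch below is an illustration under made-up assumptions, not real data: wealth adds 10 points to a hypothetical health score, wine has no effect at all, and wealthy people are simply more likely to drink wine. The function name wine_gap and every number in it are invented for this example.

```python
import random
from statistics import mean

def wine_gap(randomized, n=20000, seed=1):
    """Mean health score of wine drinkers minus non-drinkers.

    Assumed (made-up) model: wealth adds 10 points; wine adds nothing.
    """
    rng = random.Random(seed)
    drinkers, non_drinkers = [], []
    for _ in range(n):
        wealthy = rng.random() < 0.5
        if randomized:
            # Experiment: a coin flip decides who drinks, regardless of wealth.
            drinks = rng.random() < 0.5
        else:
            # Observational study: wealthy people choose wine more often.
            drinks = rng.random() < (0.7 if wealthy else 0.3)
        health = 50 + (10 if wealthy else 0) + rng.gauss(0, 5)
        (drinkers if drinks else non_drinkers).append(health)
    return mean(drinkers) - mean(non_drinkers)

observational_gap = wine_gap(randomized=False)
experimental_gap = wine_gap(randomized=True)
print(f"Observational study gap:   {observational_gap:.2f}")
print(f"Randomized experiment gap: {experimental_gap:.2f}")
```

Under this model the observational gap should come out at roughly 4 points even though wine does nothing, while the randomized gap should sit near 0, because random assignment balances wealth across the two groups.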

Exam Tip: When asked "Can we conclude that X causes Y?", always check: Is this an experiment or an observational study? If observational, the answer is NO; mention confounding variables!

3.2 Introduction to Planning a Study

Before collecting data, you need a clear plan. Good study design ensures your results are valid and useful.

Key Questions in Planning

Question | Why It Matters
What is the population? | Defines who/what you want to learn about. Your conclusions only apply to this group.
What is the sample? | The subset you actually collect data from. Must represent the population well.
What are the variables? | What characteristics are you measuring? Are they categorical or quantitative?
How will data be collected? | Survey? Observation? Experiment? The method affects what conclusions you can draw.

Population vs. Sample

[Figure: A SAMPLE (the subset we study) is drawn from the POPULATION (the entire group of interest). Inference: generalize from the sample to the population.]

Census vs. Sample Survey

Census

Collects data from every individual in the population.

Pros: Complete information, no sampling error

Cons: Expensive, time-consuming, often impractical or impossible

Example: U.S. Census (every 10 years)

Sample Survey

Collects data from a subset of the population.

Pros: Faster, cheaper, practical for large populations

Cons: Sampling error (results vary from sample to sample)

Example: Political polls (samples of about 1,000 voters)

Key Insight: A well-designed sample can give accurate results for the whole population! The key is random selection, not sample size alone.

Exam Tip: If the population is small enough and accessible, a census is ideal. But for most real-world situations, a well-designed random sample is more practical and still provides reliable results.

3.3 Random Sampling and Data Collection

The key to good sampling is randomness. Random sampling ensures every individual has a known chance of being selected, which allows us to make valid inferences about the population.

Why Random Sampling?

Random Sampling: A sampling method where every member of the population has a known, non-zero probability of being selected.

Benefits:

  • Eliminates selection bias
  • Allows generalization to the population
  • Enables calculation of sampling error

Types of Random Sampling

Method | How It Works | When to Use
Simple Random Sample (SRS) | Every possible sample of size n has an equal chance of being selected. Every individual has an equal chance of being in the sample. | When the population is fairly homogeneous and you have a complete list (sampling frame).
Stratified Random Sample | Divide the population into homogeneous groups (strata), then take an SRS from each stratum. | When the population has distinct subgroups and you want to ensure representation of each group.
Cluster Sample | Divide the population into groups (clusters), randomly select some entire clusters, then sample ALL individuals in the chosen clusters. | When the population is spread out geographically or a complete list is unavailable.
Systematic Sample | Select every kth individual from a list after a random starting point. | When you have an ordered list and no periodic pattern exists in the data.

Stratified vs. Cluster Sampling

[Figure: Stratified vs. cluster sampling. STRATIFIED: sample FROM each group (individuals are drawn from both Stratum A and Stratum B); more precise estimates, but requires a list of all individuals. CLUSTER: sample ENTIRE groups (all individuals from the selected cluster only); more practical and cheaper, but more variability in results.]

Memory Trick:

Stratified: Strata are homogeneous (similar within); sample from ALL strata

Cluster: Clusters are heterogeneous (diverse within); sample ENTIRE clusters

Think: Stratified = "a little from each group" | Cluster = "all from some groups"

Example: Surveying high school students about lunch preferences

SRS: Number all 2000 students, randomly select 100.

Stratified: Divide by grade (Fresh/Soph/Jr/Sr), randomly select 25 from each grade.

Cluster: Randomly select 5 homeroom classes, survey ALL students in those classes.

Systematic: Get alphabetical list, randomly pick starting point, select every 20th student.
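The four methods in this example can be sketched with Python's standard random module. This is a minimal illustration, not a prescribed procedure: the roster is hypothetical, and it assumes 500 students per grade and 80 homerooms of 25 students each, neither of which is stated in the example. The list is ordered by ID number here rather than alphabetically.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable

GRADES = ["Fresh", "Soph", "Jr", "Sr"]
# Hypothetical roster: 2000 students, 500 per grade, 80 homerooms of 25.
students = [
    {"id": i, "grade": GRADES[i // 500], "homeroom": i // 25}
    for i in range(2000)
]

# SRS: every possible group of 100 students is equally likely.
srs = rng.sample(students, 100)

# Stratified: an SRS of 25 from each grade (stratum).
stratified = []
for grade in GRADES:
    stratum = [s for s in students if s["grade"] == grade]
    stratified.extend(rng.sample(stratum, 25))

# Cluster: pick 5 homerooms at random, then take EVERY student in them.
chosen_homerooms = set(rng.sample(range(80), 5))
cluster = [s for s in students if s["homeroom"] in chosen_homerooms]

# Systematic: random start among the first 20, then every 20th student.
start = rng.randrange(20)
systematic = students[start::20]

print(len(srs), len(stratified), len(cluster), len(systematic))
print(Counter(s["grade"] for s in stratified))
```

Note that only the stratified sample guarantees exactly 25 students per grade, and that the cluster sample's size (125 here) depends on how big the chosen homerooms happen to be.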

Exam Tip: The AP exam loves asking you to identify sampling methods and explain WHY a particular method is appropriate. Know the differences!

3.4 Potential Problems with Sampling

Even with the best intentions, sampling can go wrong. Understanding sources of bias helps you design better studies and critically evaluate others' research.

What is Bias?

Bias: A systematic tendency to over- or under-estimate the true population parameter. Bias causes results to consistently miss the target in the same direction.

Key Point: Bias is NOT about sample size. A biased method produces biased results no matter how large the sample!

Types of Bias in Sampling

Type of Bias | Description | Example
Undercoverage | Some groups in the population are left out of the sampling frame (the list used to select the sample). | Phone surveys miss people without phones; online surveys miss people without internet access.
Nonresponse Bias | Selected individuals can't be contacted or refuse to participate, and they differ systematically from responders. | People with strong opinions may be more likely to respond to surveys about controversial topics.
Response Bias | Respondents give inaccurate answers due to question wording, interviewer influence, or social desirability. | People underreport unhealthy behaviors; leading questions push respondents toward certain answers.
Voluntary Response Bias | People choose whether to participate, so those with strong opinions are overrepresented. | Online polls, call-in surveys, and product reviews tend to attract extreme opinions.
Convenience Sampling | Selecting individuals who are easy to reach rather than using random selection. | Surveying only students in your class, interviewing people at a mall.

Bias vs. Variability

[Figure: Bias vs. variability (target analogy). Three targets: high bias with low variability; low bias with high variability; low bias with low variability (the goal).]

Bias

Systematic error: results consistently miss the true value in ONE direction.

Fix: Better sampling method (random selection)

Cannot be fixed by larger sample size!

Variability

Random error: results scatter around the true value (some high, some low).

Fix: Increase sample size

Larger samples reduce variability!
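The difference between the two fixes can be checked with a simulation. The sketch below uses an invented population and an invented undercoverage rule (everyone scoring 45 or below is missing from the sampling frame), so the specific numbers are assumptions. It compares a random sample with a biased one at n = 25 and n = 400.

```python
import random
from statistics import mean, stdev

rng = random.Random(7)

# Hypothetical population of 10,000 values centered near 50.
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = mean(population)

# Undercoverage: this sampling frame leaves out everyone at or below 45.
biased_frame = [x for x in population if x > 45]

def sample_means(frame, n, reps=200):
    """Means of `reps` simple random samples of size n drawn from frame."""
    return [mean(rng.sample(frame, n)) for _ in range(reps)]

results = {}
for n in (25, 400):
    for label, frame in (("random", population), ("biased", biased_frame)):
        means = sample_means(frame, n)
        # bias = how far off the estimates are on average;
        # spread = how much they vary from sample to sample.
        results[(label, n)] = (mean(means) - true_mean, stdev(means))
        bias, spread = results[(label, n)]
        print(f"{label:>6} n={n:<3} bias={bias:+.2f} spread={spread:.2f}")
```

In a run of this sketch, the spread should shrink by roughly a factor of four for both methods when n goes from 25 to 400, while the biased frame's estimates should stay about 5 points too high at either sample size.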

Sources of Response Bias:

Leading questions: "Don't you agree that..." pushes toward a particular answer

Social desirability: People give answers they think are socially acceptable

Interviewer effect: Characteristics of the interviewer influence responses

Question order: Earlier questions can influence later responses

Recall bias: People may not accurately remember past events

Exam Tip: When identifying bias, be SPECIFIC. Don't just say "biased"; name the TYPE of bias and explain HOW it affects results (over/underestimate what?).

3.5 Introduction to Experimental Design

Experiments are the gold standard for establishing cause and effect. A well-designed experiment isolates the effect of a treatment by controlling for other variables.

Key Vocabulary

Term | Definition
Experimental Units | The individuals (people, animals, objects) on which the experiment is performed. Called subjects when they are people.
Treatment | A specific condition applied to the experimental units. Experiments compare two or more treatments.
Factor | An explanatory variable that is manipulated in the experiment. Each factor can have multiple levels.
Response Variable | The outcome measured to assess the effect of the treatment.
Control Group | A group that receives no treatment (or a placebo/standard treatment) for comparison.
Placebo | A fake treatment (like a sugar pill) that looks like the real treatment but has no active effect.

Example: Testing whether a new fertilizer increases plant growth

Experimental units: 100 tomato plants

Factor: Fertilizer type (2 levels: new fertilizer, standard fertilizer)

Treatments: (1) New fertilizer, (2) Standard fertilizer (control)

Response variable: Plant height after 6 weeks

The Three Principles of Experimental Design

C.R.R.: The Three Principles

  • C: Control
  • R: Randomization
  • R: Replication

Principle | What It Means | Why It Matters
Control | Keep all variables other than the treatment the same for all groups. Use a control group for comparison. | Ensures that differences in the response are due to the treatment, not other factors.
Randomization | Randomly assign experimental units to treatment groups. | Creates groups that are roughly equivalent, eliminating systematic bias and balancing unknown confounders.
Replication | Use enough experimental units in each group to detect real differences. | Reduces the effect of chance variation and increases the reliability of results.

Blinding

Single-blind: Either the subjects OR the evaluators don't know which treatment each subject received.

Double-blind: NEITHER the subjects NOR the evaluators know which treatment each subject received.

Purpose: Prevents bias from expectations affecting results (placebo effect, observer bias).

Why Use a Placebo?

The placebo effect is a real phenomenon where people improve simply because they believe they're receiving treatment. A placebo allows us to separate the psychological effect of "being treated" from the actual effect of the treatment.

Exam Tip: When describing an experimental design, always address ALL THREE principles: How will you control variables? How will you randomly assign? How many units in each group?

3.6 Selecting an Experimental Design

Different experimental situations call for different designs. The right design maximizes your ability to detect treatment effects while controlling for variability.

Completely Randomized Design (CRD)

Completely Randomized Design: All experimental units are randomly assigned to treatments with no restrictions or grouping.

All Subjects → Random Assignment → Treatment Groups → Compare Outcomes

[Figure: Completely randomized design. All 60 subjects are randomly assigned to Group 1 (n = 30, Treatment A) or Group 2 (n = 30, Treatment B); the response variable is then compared.]
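One common way to carry out the random assignment is to shuffle the list of subjects and split it in half. The sketch below assumes 60 subjects labeled 1 to 60 (the labels are hypothetical) and two treatments with 30 subjects each, as in the design above.

```python
import random

rng = random.Random(3)  # fixed seed so the assignment can be reproduced

# Hypothetical subject IDs 1-60.
subjects = list(range(1, 61))

# Shuffle once, then split: the first 30 get Treatment A, the rest Treatment B.
shuffled = subjects[:]
rng.shuffle(shuffled)
group_a = sorted(shuffled[:30])
group_b = sorted(shuffled[30:])

print("Treatment A:", group_a)
print("Treatment B:", group_b)
```

Because chance alone decides the split, neither the researcher nor the subjects can steer particular individuals into a particular treatment.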

Randomized Block Design

Randomized Block Design: Subjects are first divided into blocks (groups of similar individuals), then randomly assigned to treatments within each block.

Purpose: Controls for a variable that might affect the response (like gender, age, or location).

[Figure: Randomized block design (blocking by gender). Block 1: males (n = 30); Block 2: females (n = 30). Within each block, subjects are randomly assigned to Treatment A (15) or Treatment B (15). Outcomes are then compared for all of Treatment A (15 M + 15 F) versus all of Treatment B (15 M + 15 F).]
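Blocking changes only where the shuffle happens: subjects are randomized separately inside each block. The sketch below assumes the 30 males and 30 females described for this design; the labels M1 to M30 and F1 to F30 are hypothetical.

```python
import random

rng = random.Random(5)

# Hypothetical roster: 30 males (M1-M30) and 30 females (F1-F30).
blocks = {
    "Male": [f"M{i}" for i in range(1, 31)],
    "Female": [f"F{i}" for i in range(1, 31)],
}

assignment = {"A": [], "B": []}
for block_name, members in blocks.items():
    shuffled = members[:]
    rng.shuffle(shuffled)  # randomize separately inside each block
    half = len(shuffled) // 2
    assignment["A"].extend(shuffled[:half])
    assignment["B"].extend(shuffled[half:])

for treatment, units in assignment.items():
    males = sum(u.startswith("M") for u in units)
    print(f"Treatment {treatment}: {males} males, {len(units) - males} females")
```

Each treatment ends up with exactly 15 males and 15 females, which a completely randomized design would only achieve approximately.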

Blocking vs. Stratifying:

Stratified sampling (observational): Divide population into strata, take SRS from each → better representation

Blocking (experiments): Divide subjects into blocks, randomly assign within each → controls for block variable

Same idea, different contexts! Both group similar individuals together.

Matched Pairs Design

Matched Pairs Design: A special type of block design where each block contains only 2 units matched on important characteristics. One unit gets each treatment.

Alternative: Each subject serves as their own control (receives both treatments in random order).

Example 1 (Matched pairs): Testing a new running shoe

Each runner wears the new shoe on one foot and standard shoe on the other (randomly assigned). Compare comfort ratings for each runner.

Example 2 (Self as control): Testing a sleep aid

Each subject takes the sleep aid for one week and placebo for one week (order randomized). Compare sleep quality for each subject.
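For the sleep aid example, the randomization is a coin flip per subject that decides which treatment comes first, and the analysis looks at each subject's own difference. The sketch below uses six hypothetical subjects, and the sleep-quality scores are made up purely to show the calculation.

```python
import random
from statistics import mean

rng = random.Random(11)

subjects = [f"S{i}" for i in range(1, 7)]  # hypothetical subject labels

# For each subject, a coin flip decides which treatment comes first.
order = {
    s: ("sleep aid", "placebo") if rng.random() < 0.5 else ("placebo", "sleep aid")
    for s in subjects
}

# Made-up sleep-quality scores (0-10), for illustration only.
scores = {
    "S1": {"sleep aid": 7, "placebo": 6},
    "S2": {"sleep aid": 5, "placebo": 5},
    "S3": {"sleep aid": 8, "placebo": 6},
    "S4": {"sleep aid": 6, "placebo": 7},
    "S5": {"sleep aid": 7, "placebo": 5},
    "S6": {"sleep aid": 6, "placebo": 5},
}

# Analyze within-subject differences, not two separate group means.
differences = [scores[s]["sleep aid"] - scores[s]["placebo"] for s in subjects]

for s in subjects:
    print(s, "week 1:", order[s][0], "| week 2:", order[s][1])
print("Differences (sleep aid - placebo):", differences)
print("Mean difference:", round(mean(differences), 2))
```

Working with differences removes subject-to-subject variation, since each person is compared only with themselves.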

Choosing a Design

Design | When to Use
Completely Randomized | When experimental units are fairly homogeneous, or no obvious blocking variable exists.
Randomized Block | When there's a known variable (gender, age, location) that might affect the response and you want to control for it.
Matched Pairs | When you can pair similar subjects or when each subject can receive both treatments.

Exam Tip: When designing an experiment, clearly state: (1) the experimental units, (2) how you'll randomly assign to groups, (3) what treatments you'll compare, and (4) what response you'll measure.

3.7 Inference and Experiments

The goal of experiments is to determine if an observed effect is real (caused by the treatment) or just due to chance. This section bridges data collection and statistical inference.

Statistical Significance

Statistically Significant: An observed effect is statistically significant if it is unlikely to have occurred by chance alone.

In other words: the difference between groups is too large to be explained by random variation.

Example: Testing a new drug vs. placebo

  • Drug group: 75% improvement

  • Placebo group: 72% improvement

Question: Is this 3-percentage-point difference real, or could it be due to random chance in who was assigned to each group?

Statistical tests (covered in later units) help us answer this!
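Even before formal tests, simulation gives a feel for the question. The sketch below re-does the random assignment many times under the assumption that the drug has no effect, and counts how often chance alone produces a gap of at least 3 percentage points. The group sizes (100 per group) are an assumption, since the example does not state them, and this is one possible simulation approach rather than the specific test taught later.

```python
import random

rng = random.Random(2024)

# Assumed group sizes (not given in the example): 100 subjects per group.
n_drug, n_placebo = 100, 100
improved_drug, improved_placebo = 75, 72  # 75% and 72% of each group
observed_diff = improved_drug / n_drug - improved_placebo / n_placebo

# If the drug did nothing, the same 147 people would have improved anyway,
# and only the random assignment decided which group they landed in.
total_improved = improved_drug + improved_placebo
outcomes = [1] * total_improved + [0] * (n_drug + n_placebo - total_improved)

reps = 5000
at_least_as_large = 0
for _ in range(reps):
    rng.shuffle(outcomes)  # re-do the random assignment
    sim_drug = sum(outcomes[:n_drug]) / n_drug
    sim_placebo = sum(outcomes[n_drug:]) / n_placebo
    # Small tolerance guards against floating-point rounding in the comparison.
    if sim_drug - sim_placebo >= observed_diff - 1e-9:
        at_least_as_large += 1

p_estimate = at_least_as_large / reps
print(f"Observed difference: {observed_diff:.2f}")
print(f"Share of re-randomizations with a gap this large: {p_estimate:.3f}")
```

With these assumed group sizes, a gap of this size should turn up in a substantial share of the re-randomizations (on the order of a third), so it would not count as statistically significant. With much larger groups, the same gap could be.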

What Conclusions Can We Draw?

Two Types of Conclusions

Cause & Effect

Requires: Randomized experiment

"The treatment CAUSED the effect."

Generalization

Requires: Random sample from population

"Results apply to the whole population."

Study Design | Can Conclude Causation? | Can Generalize to Population?
Randomized experiment with random sample from population | YES | YES
Randomized experiment with volunteers | YES | NO (only to similar individuals)
Observational study with random sample | NO (confounding) | YES
Observational study with convenience sample | NO | NO

The Scope of Inference

Scope of Inference Matrix

Random Selection? | Random Assignment: YES | Random Assignment: NO
YES | Causation: yes. Generalize: yes. Best scenario! | Causation: no. Generalize: yes. Most surveys.
NO | Causation: yes. Generalize: no. Most experiments. | Causation: no. Generalize: no. Limited conclusions.

Common Mistake:

Students often confuse random assignment (experiments) with random selection (sampling).

Random assignment → Creates equivalent treatment groups → Allows causal conclusions

Random selection → Creates representative sample → Allows generalization to population

Exam Tip: The AP exam frequently asks "What conclusions can be drawn?" Always consider BOTH: (1) Was there random assignment? (causation) and (2) Was there random selection? (generalization)
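The two-question check can be written as a tiny decision function. This is only a study aid. The function name scope_of_inference and the wording of its output are invented here.

```python
def scope_of_inference(random_assignment: bool, random_selection: bool) -> str:
    """Summarize what a study design allows us to conclude."""
    causation = (
        "cause and effect can be concluded"
        if random_assignment
        else "only association can be concluded (possible confounding)"
    )
    generalization = (
        "results generalize to the population"
        if random_selection
        else "results apply only to individuals like those studied"
    )
    return f"{causation}; {generalization}"

# Walk through all four combinations.
for assigned in (True, False):
    for selected in (True, False):
        print(f"assignment={assigned}, selection={selected}: "
              f"{scope_of_inference(assigned, selected)}")
```

The two answers are independent: a volunteer-based experiment supports causal claims without generalization, and a random-sample survey supports generalization without causal claims.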

Unit 3 Key Takeaways

Observational studies show ASSOCIATION; Experiments can show CAUSATION

Random sampling methods: SRS, Stratified, Cluster, Systematic

Bias types: Undercoverage, Nonresponse, Response, Voluntary Response, Convenience Sampling

Experimental principles: Control, Randomization, Replication (CRR)

Designs: Completely Randomized, Randomized Block, Matched Pairs

Scope of inference depends on random assignment AND random selection

Random Assignment → Causation | Random Selection → Generalization

End of Unit 3 Study Guide.
