AP Statistics – Unit 2: Exploring Two-Variable Data

2.1 Introducing Statistics: Are Variables Related?

In Unit 1, we explored one variable at a time. Now we ask a more interesting question: Are two variables related? This is the foundation of two-variable (bivariate) statistics.

Bivariate Data: Data that involves two variables measured on the same individuals. We're interested in whether knowing one variable helps us predict or understand the other.

Types of Two-Variable Relationships

• Categorical + Categorical → Display: two-way table (contingency table). Analysis: conditional distributions, segmented bar graphs.

• Quantitative + Quantitative → Display: scatterplot. Analysis: correlation, linear regression.

• Categorical + Quantitative → Display: side-by-side boxplots, parallel dotplots. Analysis: compare distributions (covered in Unit 1).

Explanatory and Response Variables

Explanatory Variable (x): The variable we think might explain or predict changes in another variable. Also called the independent variable or predictor.

Response Variable (y): The variable we think might be affected or predicted by the explanatory variable. Also called the dependent variable or outcome.

Examples:

• Studying hours (x) → Test score (y)

• Age (x) → Blood pressure (y)

• Fertilizer amount (x) → Plant height (y)

The explanatory variable goes on the x-axis; the response variable goes on the y-axis.

⚠️ Important Distinction:

Identifying explanatory and response variables does NOT mean one causes the other! It simply indicates which variable we're using to predict or explain the other. Causation requires a controlled experiment, not just an observed relationship.

Exam Tip: The AP exam often asks you to identify explanatory and response variables. Ask yourself: "Which variable do I think might influence the other?" That's the explanatory variable.

2.2 Representing Two Categorical Variables

When both variables are categorical, we use a two-way table (also called a contingency table) to display the data. This table shows the frequency for every combination of categories.

Two-Way Tables

Two-Way Table: A table that displays the count (frequency) of individuals falling into each combination of categories for two categorical variables.

Example: Survey of 400 students on phone type and grade level

             iPhone   Android   Other   Total
Freshman       45       35       20      100
Sophomore      55       30       15      100
Junior         60       28       12      100
Senior         70       22        8      100
Total         230      115       55      400

Key Terminology

• Joint Frequency: the count in a single cell — individuals in both categories. Example: 45 freshmen with iPhones.

• Marginal Frequency: a row or column total — the count for one variable, ignoring the other. Examples: 230 total iPhone users; 100 total freshmen.

• Grand Total: the total count of all individuals (the bottom-right cell). Example: 400 students total.

Segmented (Stacked) Bar Graphs

Segmented Bar Graph: A bar graph where each bar represents one category of a variable, and the bar is divided into segments showing the breakdown by a second variable. Can show counts or percentages.

[Figure: segmented bar graph of phone type (iPhone, Android, Other) by grade level, with each bar scaled from 0% to 100%]

🎯 When to use segmented bar graphs:
• Use percentages (not counts) when comparing groups of different sizes
• All bars should reach 100% when using percentages
• Makes it easy to compare the distribution of one variable across categories of another

Exam Tip: Segmented bar graphs should use the SAME scale for all bars. If using percentages, each bar should go to 100%. If comparing groups of different sizes, always use percentages.

2.3 Statistics for Two Categorical Variables

To determine if two categorical variables are associated, we compare conditional distributions. If the distributions differ across categories, the variables may be related.

Types of Distributions

Marginal Distribution

Definition: The distribution of one variable alone, ignoring the other variable.

Found in the margins (row/column totals) of a two-way table.

Example: "What percent of all students use iPhones?" → 230/400 = 57.5%

Conditional Distribution

Definition: The distribution of one variable given a specific category of the other variable.

Calculated within a row or column of the table.

Example: "What percent of seniors use iPhones?" → 70/100 = 70%

Calculating Conditional Distributions

Conditional Distribution of Y given X:

P(Y | X) = Joint frequency (cell count) / Marginal frequency (row or column total)

Example: Conditional distribution of phone type for SENIORS:

Phone Type   Count   Conditional %
iPhone         70    70/100 = 70%
Android        22    22/100 = 22%
Other           8     8/100 = 8%
Total         100    100%
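The same division can be sketched in Python. This uses the senior-row counts from the two-way table above; the key point is that each count is divided by the row's marginal frequency, and the resulting percentages always sum to 100%:

```python
# Conditional distribution of phone type, given that the student
# is a senior (counts from the two-way table in this section).
seniors = {"iPhone": 70, "Android": 22, "Other": 8}
total = sum(seniors.values())  # marginal frequency for the Senior row: 100

conditional = {phone: count / total for phone, count in seniors.items()}
print(conditional)  # {'iPhone': 0.7, 'Android': 0.22, 'Other': 0.08}

# A conditional distribution always sums to 1 (i.e., 100%)
assert abs(sum(conditional.values()) - 1.0) < 1e-9
```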

Association Between Categorical Variables

Association: Two categorical variables are associated if the conditional distribution of one variable differs across categories of the other variable.

No Association (Independence): The conditional distributions are approximately the same across all categories — knowing one variable doesn't help predict the other.

Testing for Association

Compare the conditional distributions:

Different conditional distributions

→ Variables ARE associated

Same conditional distributions

→ Variables are NOT associated

Is grade level associated with phone type?

Compare conditional distributions:

• Freshmen: 45% iPhone, 35% Android, 20% Other

• Seniors: 70% iPhone, 22% Android, 8% Other

Conclusion: The conditional distributions are different (iPhone use increases with grade level), so there IS an association between grade level and phone type.

⚠️ Association ≠ Causation:

Finding an association between two categorical variables does NOT mean one causes the other. There could be lurking variables or the relationship could be coincidental.

Exam Tip: When describing association, always compare SPECIFIC percentages from conditional distributions. Don't just say "there's a difference" — quantify it!

2.4 Representing the Relationship Between Two Quantitative Variables

When both variables are quantitative, we use a scatterplot to visualize the relationship. Scatterplots reveal patterns, trends, and unusual observations.

Scatterplots

Scatterplot: A graph that displays the relationship between two quantitative variables. Each point represents one individual, with coordinates (x, y).

x-axis: Explanatory variable
y-axis: Response variable

[Figure: scatterplot of hours studied (0-10) vs. test score (50-90)]

Describing a Scatterplot: D.O.F.S.

D.O.F.S.: The 4 Things to Describe

Direction — Outliers — Form — Strength

Always describe ALL FOUR when analyzing a scatterplot!

Direction: the overall trend of the relationship. Positive (as x increases, y increases), negative (as x increases, y decreases), or no clear trend.

Outliers: points that fall outside the overall pattern. Identify any unusual points that don't fit the trend.

Form: the shape of the relationship. Linear (points follow a straight line) or nonlinear (curved pattern: quadratic, exponential, etc.).

Strength: how closely the points follow the pattern. Strong (points cluster tightly), moderate (some scatter), or weak (lots of scatter).

Complete Scatterplot Description:

"The scatterplot shows a strong, positive, linear relationship between hours studied and test score. As students study more hours, their test scores tend to increase. There are no apparent outliers."

✓ Includes direction (positive), form (linear), strength (strong), outliers (none), and context!

🎯 Remember: Always describe the relationship in context. Don't just say "positive linear" — say "as hours studied increases, test scores tend to increase."

Exam Tip: When describing a scatterplot, you MUST address direction, form, and strength. Forgetting any of these costs points on free-response questions.

2.5 Correlation

The correlation coefficient (r) gives us a numerical measure of the strength and direction of a linear relationship between two quantitative variables.

The Correlation Coefficient (r)

Correlation (r): A numerical measure of the strength and direction of the linear relationship between two quantitative variables.

r = [1 / (n − 1)] · Σ [(xᵢ − x̄) / sₓ] · [(yᵢ − ȳ) / sᵧ]

Note: You don't need to calculate r by hand — use your calculator!
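Even though the calculator does this for you, seeing the formula in code can make it concrete: r is the average (using n − 1) of the products of the z-scores. A sketch with invented (hours, scores) data:

```python
# Computing r directly from the definition. The data values are
# invented for illustration only.
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]          # hours studied (hypothetical)
y = [52, 60, 61, 68, 74, 75, 83, 86]  # test scores (hypothetical)
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

# r = [1/(n-1)] * sum of products of z-scores
r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
        for xi, yi in zip(x, y)) / (n - 1)
print(round(r, 3))  # 0.991: a strong positive linear relationship
```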

Properties of Correlation

• Always between −1 and +1: −1 ≤ r ≤ +1. r = +1 is a perfect positive linear relationship, r = −1 is a perfect negative linear relationship, and r = 0 means no linear relationship.

• Sign indicates direction: r > 0 means a positive relationship (y increases as x increases); r < 0 means a negative relationship (y decreases as x increases).

• Magnitude indicates strength: |r| close to 1 means a strong linear relationship; |r| close to 0 means a weak or no linear relationship.

• Unitless: r has no units — it doesn't depend on the units of x or y.

• Order doesn't matter: r(x, y) = r(y, x) — switching x and y doesn't change r.

• Not resistant: outliers can greatly affect r.

Interpreting Correlation Values

The scale runs from −1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). Common strength benchmarks:

• |r| > 0.8 → strong

• 0.5 < |r| < 0.8 → moderate

• |r| < 0.5 → weak

⚠️ Critical Limitations of Correlation:

1. r only measures LINEAR relationships. A strong curved relationship can have r ≈ 0!

2. r is NOT resistant. A single outlier can dramatically change r.

3. Correlation does NOT imply causation! Just because two variables are correlated doesn't mean one causes the other.

4. r doesn't tell you the slope. Two scatterplots can have the same r but very different slopes.

Calculator Commands (TI-83/84)

STAT → CALC → 8:LinReg(a+bx)

This gives you r (correlation), along with the regression equation. Make sure "DiagnosticOn" is enabled to see r and r².

Exam Tip: Always say "linear" when interpreting r. Don't say "r = 0.85 shows a strong relationship" — say "r = 0.85 shows a strong positive LINEAR relationship."

2.6 Linear Regression Models

When a scatterplot shows a linear pattern, we can fit a regression line to predict the response variable (y) from the explanatory variable (x).

The Regression Line Equation

Regression Line (Line of Best Fit): A line that describes how the response variable (y) changes as the explanatory variable (x) changes.

ŷ = a + bx

ŷ (y-hat) = predicted value of y
a = y-intercept (predicted y when x = 0)
b = slope (change in ŷ for each 1-unit increase in x)

🎯 Why ŷ instead of y?
We use ŷ (y-hat) to emphasize that these are predicted values, not actual observed values. The actual y-values may differ from ŷ due to natural variation.

Interpreting the Slope (b)

Slope Interpretation Template:

"For each additional [1 unit of x], the predicted [y variable] increases/decreases by [|b|] [units of y]."

Example: ŷ = 45 + 5x, where x = hours studied and y = test score

Slope interpretation: "For each additional hour studied, the predicted test score increases by 5 points."

✓ Includes: "each additional," "predicted," context, and units!

Interpreting the Y-Intercept (a)

Y-Intercept Interpretation Template:

"When [x variable] is 0, the predicted [y variable] is [a] [units of y]."

Example: ŷ = 45 + 5x

Y-intercept interpretation: "When a student studies 0 hours, their predicted test score is 45 points."

⚠️ Y-Intercept Warning:

The y-intercept often doesn't make practical sense! If x = 0 is outside the range of your data, the y-intercept is just a mathematical artifact. For example:

• "A person with 0 years of education has a predicted salary of −$5,000" — Nonsense!

When x = 0 is not meaningful, say: "The y-intercept has no practical interpretation in this context because [x = 0 is outside the range of the data / x = 0 is not possible]."

Using the Regression Line for Prediction

Example: Using ŷ = 45 + 5x, predict the test score for a student who studies 6 hours.

ŷ = 45 + 5(6) = 45 + 30 = 75 points
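The prediction above is just substitution into ŷ = a + bx. A minimal sketch, using the hypothetical a = 45 and b = 5 from this example (the function name is ours, not standard notation):

```python
# Predict a test score from hours studied using y-hat = a + b*x.
# a = 45 and b = 5 come from the hypothetical model in this section.
def predict_score(hours, a=45, b=5):
    """Return the predicted test score for a given number of hours studied."""
    return a + b * hours

print(predict_score(6))  # 45 + 5*6 = 75 points
```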

Exam Tip: When interpreting slope, you MUST include the words "predicted" (not "actual"), "for each additional" (not "for every"), and proper context/units. These exact phrases earn points!

2.7 Residuals

A residual measures how far an actual data point is from the predicted value on the regression line. Residuals help us assess how well the line fits the data.

What is a Residual?

Residual: The difference between the observed (actual) value of y and the predicted value of y.

Residual = y − ŷ = Observed − Predicted
Equivalently, y = ŷ + residual: every observed value equals its predicted value plus its residual.

Interpreting Residuals

• Positive (+): actual value > predicted value. The point is ABOVE the line — the model UNDERESTIMATED this y-value.

• Negative (−): actual value < predicted value. The point is BELOW the line — the model OVERESTIMATED this y-value.

• Zero (0): actual value = predicted value. The point is ON the line — the model predicted this y-value exactly.

Example: Using ŷ = 45 + 5x, a student who studied 6 hours scored 82 points.

Predicted: ŷ = 45 + 5(6) = 75

Actual: y = 82

Residual = 82 − 75 = +7

Interpretation: "The student's actual score was 7 points higher than the model predicted."

Properties of Residuals

🎯 Key Properties:

• The sum of all residuals = 0 (positive and negative residuals balance out)

• The mean of residuals = 0

• Residuals are measured in the same units as y

• Smaller residuals = better predictions = better fit
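The residual arithmetic from the worked example can be sketched as a small function. The model ŷ = 45 + 5x and the second observation are hypothetical:

```python
# Residual = observed - predicted, using the hypothetical model
# y-hat = 45 + 5x from this section.
def residual(hours, actual_score):
    predicted = 45 + 5 * hours         # y-hat
    return actual_score - predicted    # observed - predicted

print(residual(6, 82))   # +7: the model underestimated this student's score
print(residual(4, 60))   # -5: the model overestimated this (invented) score
```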

Exam Tip: When asked to interpret a residual, always say whether the actual was HIGHER or LOWER than predicted, and by how much. Include context!

2.8 Least Squares Regression

Among all possible lines we could draw through a scatterplot, the least squares regression line (LSRL) is the "best" line — it minimizes the sum of the squared residuals.

What Makes the LSRL "Best"?

Least Squares Regression Line (LSRL): The unique line that minimizes the sum of squared residuals:

Minimize: Σ(y − ŷ)² = Σ(residuals)²

This means the LSRL makes the squared vertical distances from points to the line as small as possible overall.

Formulas for the LSRL

LSRL Equation: ŷ = a + bx

Slope (b):

b = r · (sᵧ / sₓ)

Y-Intercept (a):

a = ȳ − b·x̄

Where r = correlation, sₓ = std dev of x, sᵧ = std dev of y, x̄ = mean of x, ȳ = mean of y

Important Property: The LSRL Passes Through (x̄, ȳ)

🎯 The LSRL always passes through the point (x̄, ȳ) — the point of averages. This is guaranteed by the formula a = ȳ − b·x̄.

This means: if you plug in x = x̄ into the equation, you get ŷ = ȳ.
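The formulas above can be sketched in a few lines: build the LSRL from summary statistics, then confirm it passes through the point of averages. All numeric values here are invented for illustration:

```python
# Building the LSRL from summary statistics (all values invented).
r, sx, sy = 0.9, 2.0, 10.0   # correlation and standard deviations
xbar, ybar = 5.0, 70.0       # means of x and y

b = r * sy / sx              # slope: b = r * (sy / sx) = 4.5
a = ybar - b * xbar          # intercept: a = y-bar - b * x-bar = 47.5

# Plugging x = x-bar into y-hat = a + b*x returns y-bar exactly,
# so the LSRL passes through the point of averages.
yhat_at_mean = a + b * xbar
print(b, a, yhat_at_mean)    # 4.5 47.5 70.0
```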

The Coefficient of Determination (r²)

r² (R-squared): The proportion of the variation in y that is explained by the linear relationship with x.

r² = (correlation)²

Interpreting r²

Template: "[r² × 100]% of the variation in [y variable] is explained by the linear relationship with [x variable]."

Example: If r = 0.9, then r² = 0.81

Interpretation: "81% of the variation in test scores is explained by the linear relationship with hours studied."

This means 19% of the variation in test scores is due to other factors not captured by hours studied.
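The arithmetic from this example, as a two-line sketch (r = 0.9 is the value used above):

```python
# r-squared from the example: r = 0.9 gives r^2 = 0.81.
r = 0.9
r_squared = r ** 2
print(round(r_squared, 2))       # 0.81 -> 81% of variation in y explained
print(round(1 - r_squared, 2))   # 0.19 -> 19% due to other factors
```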

• r² close to 1: most of the variation in y is explained by x — the line fits well.

• r² close to 0: little of the variation in y is explained by x — the line fits poorly.

• r² = 0.64: 64% of the variation in y is explained by x; 36% is unexplained.

Calculator Commands (TI-83/84)

STAT → CALC → 8:LinReg(a+bx) L1, L2

Output includes: a (y-intercept), b (slope), r² (coefficient of determination), r (correlation)

Remember: Turn on DiagnosticOn first! (2nd → 0 opens the CATALOG; select DiagnosticOn and press ENTER twice.)

Exam Tip: The AP exam often asks you to interpret r². Always use the phrase "variation in [y] explained by" — not "correlation" or "relationship."

2.9 Analyzing Departures from Linearity

Before trusting a regression line, we must check if a linear model is appropriate. Residual plots are the key diagnostic tool for detecting problems with linear models.

Residual Plots

Residual Plot: A scatterplot of the residuals (y-axis) against the explanatory variable x or the predicted values ŷ (x-axis).

Residual plots help us see patterns that aren't obvious in the original scatterplot.

What to Look For in Residual Plots

✓ GOOD: Random Scatter

No pattern, evenly scattered around 0. Linear model is appropriate!

✗ BAD: Curved Pattern

U-shape or curve = nonlinear relationship. Linear model is NOT appropriate!

✗ BAD: Fan Shape

Spread increases/decreases = non-constant variance. Predictions less reliable at extremes.
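To see why a curved pattern in the residuals flags a nonlinear relationship, here is a minimal sketch that fits the LSRL to deliberately quadratic data (values invented) and computes the residuals, which come out in a clear U-shape:

```python
# Fit the LSRL to y = x^2 (an obviously nonlinear relationship) and
# examine the residuals. Data values are invented for illustration.
x = [0, 1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]            # curved relationship: y = x^2
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar                  # LSRL here works out to y-hat = -5 + 6x

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(residuals)  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0] -- a U-shape
```

Plotted against x, these residuals curve down and back up — exactly the "BAD: Curved Pattern" case above — even though they still sum to 0, as LSRL residuals always do.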

Outliers and Influential Points

• Outlier: a point with an unusually large residual — far from the regression line in the y-direction. May or may not affect the LSRL much, depending on its x-value.

• High Leverage Point: a point with an extreme x-value — far from the mean of x. Has the POTENTIAL to strongly influence the LSRL.

• Influential Point: a point that, if removed, would substantially change the LSRL (slope, intercept, or r). Usually a high leverage point that doesn't follow the pattern of the other points.

[Figure: the three types of unusual points — an outlier (large residual), a high leverage point (extreme x, follows the pattern), and an influential point (would change the line)]

Extrapolation: A Dangerous Practice

⚠️ Extrapolation: Using a regression line to predict y for x-values outside the range of the original data.

This is dangerous because:

• The relationship may not continue beyond the observed data

• Predictions become increasingly unreliable farther from the data

• The pattern could change (become curved, level off, etc.)

Example: A regression model predicts test scores based on hours studied, using data where students studied 1-8 hours.

Interpolation (OK): Predicting score for 5 hours studied — within the data range.

Extrapolation (Risky): Predicting score for 15 hours studied — outside the data range. The relationship might not hold!

Checklist: Is a Linear Model Appropriate?

🎯 Before using a linear model, verify:

✓ The scatterplot shows a linear pattern (not curved)

✓ The residual plot shows random scatter around 0 (no pattern)

✓ The spread in the residual plot is roughly constant (no fan shape)

✓ There are no highly influential points distorting the model

Exam Tip: The AP exam loves asking you to interpret residual plots. A "good" residual plot has NO pattern — just random scatter. Any pattern indicates the linear model is not appropriate.

Unit 2 Key Takeaways

Two-way tables: Joint, marginal, and conditional distributions

Association = different conditional distributions across categories

Scatterplots: Describe with D.O.F.S. (Direction, Outliers, Form, Strength)

Correlation (r): Measures strength & direction of LINEAR relationship

LSRL: ŷ = a + bx minimizes sum of squared residuals

r² = proportion of variation in y explained by x

Residual plots: Random scatter = linear model is appropriate

Residual = y − ŷ  |  b = r(sᵧ/sₓ)  |  a = ȳ − bx̄  |  LSRL passes through (x̄, ȳ)

End of Unit 2 Study Guide.
