2.1 Introducing Statistics: Are Variables Related?
In Unit 1, we explored one variable at a time. Now we ask a more interesting question: Are two variables related? This is the foundation of two-variable (bivariate) statistics.
Bivariate Data: Data that involves two variables measured on the same individuals. We're interested in whether knowing one variable helps us predict or understand the other.
Types of Two-Variable Relationships
| Variable Types | Display Method | Analysis Method |
|---|---|---|
| Categorical + Categorical | Two-way table (contingency table) | Conditional distributions, segmented bar graphs |
| Quantitative + Quantitative | Scatterplot | Correlation, linear regression |
| Categorical + Quantitative | Side-by-side boxplots, parallel dotplots | Compare distributions (covered in Unit 1) |
Explanatory and Response Variables
Explanatory Variable (x): The variable we think might explain or predict changes in another variable. Also called the independent variable or predictor.
Response Variable (y): The variable we think might be affected or predicted by the explanatory variable. Also called the dependent variable or outcome.
Examples:
• Studying hours (x) → Test score (y)
• Age (x) → Blood pressure (y)
• Fertilizer amount (x) → Plant height (y)
The explanatory variable goes on the x-axis; the response variable goes on the y-axis.
⚠️ Important Distinction:
Identifying explanatory and response variables does NOT mean one causes the other! It simply indicates which variable we're using to predict or explain the other. Causation requires a controlled experiment, not just an observed relationship.
2.2 Representing Two Categorical Variables
When both variables are categorical, we use a two-way table (also called a contingency table) to display the data. This table shows the frequency for every combination of categories.
Two-Way Tables
Two-Way Table: A table that displays the count (frequency) of individuals falling into each combination of categories for two categorical variables.
Example: Survey of 400 students on phone type and grade level
| Grade Level | iPhone | Android | Other | Total |
|---|---|---|---|---|
| Freshman | 45 | 35 | 20 | 100 |
| Sophomore | 55 | 30 | 15 | 100 |
| Junior | 60 | 28 | 12 | 100 |
| Senior | 70 | 22 | 8 | 100 |
| Total | 230 | 115 | 55 | 400 |
Key Terminology
| Term | Definition | Example from Table |
|---|---|---|
| Joint Frequency | The count in a single cell — individuals in both categories. | 45 freshmen with iPhones |
| Marginal Frequency | The row or column totals — count for one variable, ignoring the other. | 230 total iPhone users; 100 total freshmen |
| Grand Total | The total count of all individuals (bottom-right cell). | 400 students total |
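The terminology above can be illustrated with a short Python sketch using the survey counts from the table (the dictionary layout is just one convenient representation, not a required one):

```python
# Two-way table from the survey of 400 students
# (rows = grade level, columns = phone type; counts from the example above)
table = {
    "Freshman":  {"iPhone": 45, "Android": 35, "Other": 20},
    "Sophomore": {"iPhone": 55, "Android": 30, "Other": 15},
    "Junior":    {"iPhone": 60, "Android": 28, "Other": 12},
    "Senior":    {"iPhone": 70, "Android": 22, "Other": 8},
}

# Joint frequency: the count in a single cell
joint = table["Freshman"]["iPhone"]                               # 45

# Marginal frequencies: row and column totals, ignoring the other variable
freshman_total = sum(table["Freshman"].values())                  # 100
iphone_total = sum(row["iPhone"] for row in table.values())       # 230

# Grand total: all individuals in the table
grand_total = sum(sum(row.values()) for row in table.values())    # 400
```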
Segmented (Stacked) Bar Graphs
Segmented Bar Graph: A bar graph where each bar represents one category of a variable, and the bar is divided into segments showing the breakdown by a second variable. Can show counts or percentages.
🎯 When to use segmented bar graphs:
• Use percentages (not counts) when comparing groups of different sizes
• All bars should reach 100% when using percentages
• Makes it easy to compare the distribution of one variable across categories of another
2.3 Statistics for Two Categorical Variables
To determine if two categorical variables are associated, we compare conditional distributions. If the distributions differ across categories, the variables may be related.
Types of Distributions
Marginal Distribution
Definition: The distribution of one variable alone, ignoring the other variable.
Found in the margins (row/column totals) of a two-way table.
Example: "What percent of all students use iPhones?" → 230/400 = 57.5%
Conditional Distribution
Definition: The distribution of one variable given a specific category of the other variable.
Calculated within a row or column of the table.
Example: "What percent of seniors use iPhones?" → 70/100 = 70%
Calculating Conditional Distributions
Conditional Distribution of Y given X: divide each cell count by the total for that category of X (the row or column total), not by the grand total.
Example: Conditional distribution of phone type for SENIORS:
| Phone Type | Count | Conditional % |
|---|---|---|
| iPhone | 70 | 70/100 = 70% |
| Android | 22 | 22/100 = 22% |
| Other | 8 | 8/100 = 8% |
| Total | 100 | 100% |
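The same calculation can be done programmatically. Here is a minimal sketch using two rows from the survey table; dividing by the row total (not the grand total) is what makes the percentages comparable across groups of different sizes:

```python
# Two rows from the survey table (counts from the example above)
table = {
    "Freshman": {"iPhone": 45, "Android": 35, "Other": 20},
    "Senior":   {"iPhone": 70, "Android": 22, "Other": 8},
}

def conditional_distribution(row):
    """Divide each cell count by the row total (not the grand total)."""
    total = sum(row.values())
    return {category: count / total for category, count in row.items()}

seniors = conditional_distribution(table["Senior"])
freshmen = conditional_distribution(table["Freshman"])
print(seniors["iPhone"])   # 0.7  (70% of seniors use iPhones)
print(freshmen["iPhone"])  # 0.45 (45% of freshmen use iPhones)
```

Because the two conditional distributions clearly differ, this computation is also the basis of the association check in the next part of this section.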
Association Between Categorical Variables
Association: Two categorical variables are associated if the conditional distribution of one variable differs across categories of the other variable.
No Association (Independence): The conditional distributions are approximately the same across all categories — knowing one variable doesn't help predict the other.
Testing for Association
Compare the conditional distributions:
• Different conditional distributions → the variables ARE associated
• Approximately the same conditional distributions → the variables are NOT associated
Is grade level associated with phone type?
Compare conditional distributions:
• Freshmen: 45% iPhone, 35% Android, 20% Other
• Seniors: 70% iPhone, 22% Android, 8% Other
Conclusion: The conditional distributions are different (iPhone use increases with grade level), so there IS an association between grade level and phone type.
⚠️ Association ≠ Causation:
Finding an association between two categorical variables does NOT mean one causes the other. There could be lurking variables or the relationship could be coincidental.
2.4 Representing the Relationship Between Two Quantitative Variables
When both variables are quantitative, we use a scatterplot to visualize the relationship. Scatterplots reveal patterns, trends, and unusual observations.
Scatterplots
Scatterplot: A graph that displays the relationship between two quantitative variables. Each point represents one individual, with coordinates (x, y).
• x-axis: Explanatory variable
• y-axis: Response variable
Describing a Scatterplot: D.O.F.S.
D.O.F.S.: The 4 Things to Describe
Direction — Outliers — Form — Strength
Always describe ALL FOUR when analyzing a scatterplot!
| Feature | Description | Options |
|---|---|---|
| Direction | The overall trend of the relationship | Positive: as x increases, y increases. Negative: as x increases, y decreases. None: no clear trend |
| Outliers | Points that fall outside the overall pattern | Identify any unusual points that don't fit the trend |
| Form | The shape of the relationship | Linear: points follow a straight line. Nonlinear: curved pattern (quadratic, exponential, etc.) |
| Strength | How closely points follow the pattern | Strong: points cluster tightly. Moderate: some scatter. Weak: lots of scatter |
Complete Scatterplot Description:
"The scatterplot shows a strong, positive, linear relationship between hours studied and test score. As students study more hours, their test scores tend to increase. There are no apparent outliers."
✓ Includes direction (positive), form (linear), strength (strong), outliers (none), and context!
🎯 Remember: Always describe the relationship in context. Don't just say "positive linear" — say "as hours studied increases, test scores tend to increase."
2.5 Correlation
The correlation coefficient (r) gives us a numerical measure of the strength and direction of a linear relationship between two quantitative variables.
The Correlation Coefficient (r)
Correlation (r): A numerical measure of the strength and direction of the linear relationship between two quantitative variables.
Note: You don't need to calculate r by hand — use your calculator!
Properties of Correlation
| Property | Explanation |
|---|---|
| Always between −1 and +1 | −1 ≤ r ≤ +1. r = +1: perfect positive linear relationship; r = −1: perfect negative linear relationship; r = 0: no linear relationship |
| Sign indicates direction | r > 0: positive relationship (y increases as x increases); r < 0: negative relationship (y decreases as x increases) |
| Magnitude indicates strength | \|r\| close to 1 = strong linear relationship; \|r\| close to 0 = weak or no linear relationship |
| Unitless | r has no units — it doesn't depend on the units of x or y |
| Order doesn't matter | r(x, y) = r(y, x) — switching x and y doesn't change r |
| Not resistant | Outliers can greatly affect r |
Interpreting Correlation Values
| \|r\| Value | Strength |
|---|---|
| \|r\| > 0.8 | Strong |
| 0.5 < \|r\| < 0.8 | Moderate |
| \|r\| < 0.5 | Weak |
⚠️ Critical Limitations of Correlation:
1. r only measures LINEAR relationships. A strong curved relationship can have r ≈ 0!
2. r is NOT resistant. A single outlier can dramatically change r.
3. Correlation does NOT imply causation! Just because two variables are correlated doesn't mean one causes the other.
4. r doesn't tell you the slope. Two scatterplots can have the same r but very different slopes.
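Limitation 1 is easy to demonstrate. Below is a minimal Pearson-r implementation written from the standard definition (not from any particular library), applied to a perfect quadratic relationship:

```python
import math

def correlation(xs, ys):
    """Pearson correlation r, computed directly from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect quadratic relationship (y = x^2), symmetric around x = 0:
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]
print(correlation(x, y))  # 0.0: a strong curved relationship, yet r = 0
```

This is why a scatterplot should always be examined before trusting r: the number alone cannot distinguish "no relationship" from "a strong nonlinear relationship."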
Calculator Commands (TI-83/84)
STAT → CALC → 8:LinReg(a+bx)
This gives you r (correlation), along with the regression equation. Make sure "DiagnosticOn" is enabled to see r and r².
2.6 Linear Regression Models
When a scatterplot shows a linear pattern, we can fit a regression line to predict the response variable (y) from the explanatory variable (x).
The Regression Line Equation
Regression Line (Line of Best Fit): A line that describes how the response variable (y) changes as the explanatory variable (x) changes.
ŷ = a + bx
ŷ (y-hat) = predicted value of y
a = y-intercept (predicted y when x = 0)
b = slope (change in ŷ for each 1-unit increase in x)
🎯 Why ŷ instead of y?
We use ŷ (y-hat) to emphasize that these are predicted values, not actual observed values. The actual y-values may differ from ŷ due to natural variation.
Interpreting the Slope (b)
Slope Interpretation Template:
"For each additional [1 unit of x], the predicted [y variable] increases/decreases by [|b|] [units of y]."
Example: ŷ = 45 + 5x, where x = hours studied and y = test score
Slope interpretation: "For each additional hour studied, the predicted test score increases by 5 points."
✓ Includes: "each additional," "predicted," context, and units!
Interpreting the Y-Intercept (a)
Y-Intercept Interpretation Template:
"When [x variable] is 0, the predicted [y variable] is [a] [units of y]."
Example: ŷ = 45 + 5x
Y-intercept interpretation: "When a student studies 0 hours, their predicted test score is 45 points."
⚠️ Y-Intercept Warning:
The y-intercept often doesn't make practical sense! If x = 0 is outside the range of your data, the y-intercept is just a mathematical artifact. For example:
• "A person with 0 years of education has a predicted salary of −$5,000" — Nonsense!
When x = 0 is not meaningful, say: "The y-intercept has no practical interpretation in this context because [x = 0 is outside the range of the data / x = 0 is not possible]."
Using the Regression Line for Prediction
Example: Using ŷ = 45 + 5x, predict the test score for a student who studies 6 hours.
ŷ = 45 + 5(6) = 45 + 30 = 75 points
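The prediction is a single substitution into the equation; as a one-line sketch in Python:

```python
# The example model: y-hat = 45 + 5x (x = hours studied, y = test score)
a, b = 45, 5
x = 6
y_hat = a + b * x
print(y_hat)  # 75
```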
2.7 Residuals
A residual measures how far an actual data point is from the predicted value on the regression line. Residuals help us assess how well the line fits the data.
What is a Residual?
Residual: The difference between the observed (actual) value of y and the predicted value of y.
Interpreting Residuals
| Residual Value | Meaning | Interpretation |
|---|---|---|
| Positive (+) | Actual value > predicted value; point is ABOVE the line | The model UNDERESTIMATED this y-value |
| Negative (−) | Actual value < predicted value; point is BELOW the line | The model OVERESTIMATED this y-value |
| Zero (0) | Actual value = predicted value; point is ON the line | The model predicted this y-value exactly |
Example: Using ŷ = 45 + 5x, a student who studied 6 hours scored 82 points.
Predicted: ŷ = 45 + 5(6) = 75
Actual: y = 82
Residual = 82 − 75 = +7
Interpretation: "The student's actual score was 7 points higher than the model predicted."
Properties of Residuals
🎯 Key Properties:
• The sum of all residuals = 0 (positive and negative residuals balance out)
• The mean of residuals = 0
• Residuals are measured in the same units as y
• Smaller residuals = better predictions = better fit
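These properties hold for any least squares fit. The sketch below fits a line to a tiny made-up dataset (the x/y values are illustrative only) and checks that the residuals sum to zero:

```python
# Fit an LSRL to a tiny dataset and verify the residual properties.
x = [1, 2, 3, 4]
y = [2, 5, 4, 7]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least squares slope and intercept
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
    / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

# Residual = observed y minus predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(residuals)        # positive and negative residuals balance out
print(sum(residuals))   # 0 (up to floating-point rounding)
```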
2.8 Least Squares Regression
Among all possible lines we could draw through a scatterplot, the least squares regression line (LSRL) is the "best" line — it minimizes the sum of the squared residuals.
What Makes the LSRL "Best"?
Least Squares Regression Line (LSRL): The unique line that minimizes the sum of squared residuals, Σ(yᵢ − ŷᵢ)².
This means the LSRL makes the squared vertical distances from points to the line as small as possible overall.
Formulas for the LSRL
LSRL Equation: ŷ = a + bx
Slope (b): b = r · (sᵧ / sₓ)
Y-Intercept (a): a = ȳ − b·x̄
Where r = correlation, sₓ = std dev of x, sᵧ = std dev of y, x̄ = mean of x, ȳ = mean of y
Important Property: The LSRL Passes Through (x̄, ȳ)
🎯 The LSRL always passes through the point (x̄, ȳ) — the point of averages. This is guaranteed by the formula a = ȳ − b·x̄.
This means: if you plug in x = x̄ into the equation, you get ŷ = ȳ.
The Coefficient of Determination (r²)
r² (R-squared): The proportion of the variation in y that is explained by the linear relationship with x.
Interpreting r²
Template: "[r² × 100]% of the variation in [y variable] is explained by the linear relationship with [x variable]."
Example: If r = 0.9, then r² = 0.81
Interpretation: "81% of the variation in test scores is explained by the linear relationship with hours studied."
This means 19% of the variation in test scores is due to other factors not captured by hours studied.
| r² | Interpretation |
|---|---|
| r² close to 1 | Most of the variation in y is explained by x — the line fits well |
| r² close to 0 | Little variation in y is explained by x — the line fits poorly |
| r² = 0.64 | 64% of variation in y is explained by x; 36% is unexplained |
Calculator Commands (TI-83/84)
STAT → CALC → 8:LinReg(a+bx) L1, L2
Output includes: a (y-intercept), b (slope), r² (coefficient of determination), r (correlation)
Remember: Turn on DiagnosticOn first! (2nd → 0 → DiagnosticOn)
2.9 Analyzing Departures from Linearity
Before trusting a regression line, we must check if a linear model is appropriate. Residual plots are the key diagnostic tool for detecting problems with linear models.
Residual Plots
Residual Plot: A scatterplot of the residuals (y-axis) against the explanatory variable x or the predicted values ŷ (x-axis).
Residual plots help us see patterns that aren't obvious in the original scatterplot.
What to Look For in Residual Plots
✓ GOOD: Random Scatter
No pattern, evenly scattered around 0. Linear model is appropriate!
✗ BAD: Curved Pattern
U-shape or curve = nonlinear relationship. Linear model is NOT appropriate!
✗ BAD: Fan Shape
Spread increases/decreases = non-constant variance. Predictions less reliable at extremes.
Outliers and Influential Points
| Term | Definition | Effect |
|---|---|---|
| Outlier | A point with an unusually large residual — far from the regression line in the y-direction. | May or may not affect the LSRL much, depending on its x-value. |
| High Leverage Point | A point with an extreme x-value — far from the mean of x. | Has the POTENTIAL to strongly influence the LSRL. |
| Influential Point | A point that, if removed, would substantially change the LSRL (slope, intercept, or r). | Usually a high leverage point that doesn't follow the pattern of other points. |
Extrapolation: A Dangerous Practice
⚠️ Extrapolation: Using a regression line to predict y for x-values outside the range of the original data.
This is dangerous because:
• The relationship may not continue beyond the observed data
• Predictions become increasingly unreliable farther from the data
• The pattern could change (become curved, level off, etc.)
Example: A regression model predicts test scores based on hours studied, using data where students studied 1-8 hours.
Interpolation (OK): Predicting score for 5 hours studied — within the data range.
Extrapolation (Risky): Predicting score for 15 hours studied — outside the data range. The relationship might not hold!
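Using the study-hours model from this example (ŷ = 45 + 5x, with data covering 1 to 8 hours), a hypothetical helper function can flag extrapolation automatically; the `x_min`/`x_max` bounds are an assumption taken from the stated data range:

```python
def predict(x, a=45, b=5, x_min=1, x_max=8):
    """Predict y-hat = a + bx, flagging x-values outside the observed range."""
    if not (x_min <= x <= x_max):
        print(f"Warning: x = {x} is outside [{x_min}, {x_max}]; extrapolating!")
    return a + b * x

print(predict(5))   # 70  (interpolation: within the data range)
print(predict(15))  # 120, but untrustworthy: the relationship may not hold
```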
Checklist: Is a Linear Model Appropriate?
🎯 Before using a linear model, verify:
✓ The scatterplot shows a linear pattern (not curved)
✓ The residual plot shows random scatter around 0 (no pattern)
✓ The spread in the residual plot is roughly constant (no fan shape)
✓ There are no highly influential points distorting the model
Unit 2 Key Takeaways
Two-way tables: Joint, marginal, and conditional distributions
Association = different conditional distributions across categories
Scatterplots: Describe with D.O.F.S. (Direction, Outliers, Form, Strength)
Correlation (r): Measures strength & direction of LINEAR relationship
LSRL: ŷ = a + bx minimizes sum of squared residuals
r² = proportion of variation in y explained by x
Residual plots: Random scatter = linear model is appropriate
Residual = y − ŷ | b = r(sᵧ/sₓ) | a = ȳ − bx̄ | LSRL passes through (x̄, ȳ)
End of Unit 2 Study Guide.