1.1 Introducing Statistics: What Can We Learn from Data?
Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions. In AP Statistics, you'll learn to transform raw numbers into meaningful insights — a skill that's essential in science, business, medicine, and everyday life.
Statistics: The science of learning from data. It involves collecting data, describing data through numerical summaries and graphs, and drawing conclusions about a larger group based on a sample.
Key Vocabulary
| Term | Definition | Example |
|---|---|---|
| Individual | The objects or people described by a set of data. Also called cases or observational units. | Each student in a class, each car in a parking lot |
| Variable | A characteristic that varies from one individual to another. | Height, eye color, GPA, favorite subject |
| Population | The entire group of individuals we want information about. | All high school students in the U.S. |
| Sample | A subset of the population that we actually collect data from. | 500 randomly selected high school students |
| Parameter | A number that describes a characteristic of the population. Usually unknown. | The true average GPA of ALL U.S. high schoolers (μ) |
| Statistic | A number that describes a characteristic of a sample. Calculated from data. | The average GPA of 500 sampled students (x̄) |
🎯 Memory Trick: Parameter = Population, Statistic = Sample. Parameters are usually written with Greek letters (μ, σ) or a plain p for a proportion, while statistics use Roman letters or a hat (x̄, s, p̂).
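To make the parameter/statistic distinction concrete, here is a minimal Python sketch. The ten GPA values, the sample size of 4, and the seed are all made up for illustration; in real studies the population is usually far too large to measure, which is why the parameter is normally unknown.

```python
import random
import statistics

# Hypothetical population: GPAs of ten students (made-up values for illustration).
population = [2.1, 2.8, 3.0, 3.2, 3.4, 3.5, 3.6, 3.7, 3.9, 4.0]

# Parameter: computed from every individual in the population.
mu = statistics.mean(population)

# Statistic: computed from a random sample drawn from the population.
rng = random.Random(1)               # fixed seed so the sample is reproducible
sample = rng.sample(population, 4)   # sample size 4 is an arbitrary choice
x_bar = statistics.mean(sample)

print(f"parameter mu = {mu:.2f}")
print(f"statistic x-bar = {x_bar:.2f} from sample {sample}")
```

Changing the seed changes the sample and therefore x̄, while μ stays fixed.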
Two Branches of Statistics
Descriptive Statistics
Purpose: Organize, summarize, and present data.
- Tables and graphs
- Numerical summaries (mean, median, etc.)
- Describes what the data shows
Inferential Statistics
Purpose: Draw conclusions about a population based on sample data.
- Confidence intervals
- Hypothesis tests
- Makes predictions about what we don't see
1.2 The Language of Variation: Variables
Every statistical analysis begins by identifying what variables are being measured. Variables are classified into two fundamental types: categorical and quantitative. Getting this distinction right is crucial — it determines which graphs and statistics you can use.
Two Types of Variables
Categorical (Qualitative) Variables
Definition: Places individuals into groups or categories. Values are labels, not numbers.
Examples:
- Gender (male, female, non-binary)
- Eye color (blue, brown, green)
- Class rank (freshman, sophomore, junior, senior)
- Zip code, phone number, jersey number
⚠️ Numbers can be categorical! If you can't do meaningful math with them (like averaging zip codes), they're categorical.
Quantitative (Numerical) Variables
Definition: Takes numerical values for which arithmetic operations (like averaging) make sense.
Examples:
- Height (68 inches)
- Weight (150 lbs)
- Test score (85 points)
- Income ($45,000)
✓ Ask: "Does calculating an average make sense?" If yes → quantitative.
Discrete vs. Continuous Quantitative Variables
Discrete Variable: Can only take certain values (often counts/whole numbers). There are "gaps" between possible values.
Continuous Variable: Can take any value within an interval. No "gaps" — measurements can be infinitely precise.
| Type | Characteristics | Examples |
|---|---|---|
| Discrete | Countable values; often whole numbers; answers "How many?" | Number of siblings (0, 1, 2, 3...); number of cars owned; number of texts sent per day |
| Continuous | Measurable values; can be decimals; answers "How much?" | Height (5.75 feet); time to run a mile (7.23 minutes); temperature (98.6°F) |
🎯 Quick Classification Test:
1️⃣ Can you do meaningful arithmetic (such as averaging) with the values? No → Categorical. Yes → Quantitative.
2️⃣ If quantitative, can it take any value in a range?
• Yes → Continuous
• No, there are gaps between possible values → Discrete
Common AP Exam Traps
⚠️ Watch out for these tricky variables:
• Zip codes, phone numbers, jersey numbers, Social Security numbers — these are CATEGORICAL even though they're numbers. You can't meaningfully average them.
• Age groups (20-29, 30-39, etc.) — CATEGORICAL (they're categories, not actual ages)
• Survey responses on a 1-5 scale — often treated as categorical on the AP exam, though this can vary by context
1.3 Representing a Categorical Variable with Tables
When working with categorical data, we summarize the distribution using frequency tables and relative frequency tables. These tables count how many individuals fall into each category.
Frequency Tables
Frequency Table: Shows the count (frequency) of individuals in each category of a categorical variable.
Example: Survey of 200 students' favorite subjects:
| Subject | Frequency |
|---|---|
| Math | 52 |
| English | 38 |
| Science | 45 |
| History | 35 |
| Art | 30 |
| Total | 200 |
Relative Frequency Tables
Relative Frequency: The proportion (or percentage) of individuals in each category.
Example continued: Converting to relative frequencies:
| Subject | Frequency | Relative Frequency | Percent |
|---|---|---|---|
| Math | 52 | 52/200 = 0.26 | 26% |
| English | 38 | 38/200 = 0.19 | 19% |
| Science | 45 | 45/200 = 0.225 | 22.5% |
| History | 35 | 35/200 = 0.175 | 17.5% |
| Art | 30 | 30/200 = 0.15 | 15% |
| Total | 200 | 1.00 | 100% |
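
The conversion in the table above is a single division per category. A short Python sketch, using the survey counts from the example (the variable names are just illustrative):

```python
# Counts from the favorite-subject survey above.
counts = {"Math": 52, "English": 38, "Science": 45, "History": 35, "Art": 30}

total = sum(counts.values())  # 200 students in all

# Relative frequency = category count / total count.
relative = {subject: n / total for subject, n in counts.items()}

# Print subject, frequency, relative frequency, and percent.
for subject, n in counts.items():
    print(f"{subject:<8} {n:>4} {relative[subject]:>6.3f} {relative[subject]:>6.1%}")
```

The relative frequencies should always sum to 1 (100%), which is a quick way to catch arithmetic slips.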
🎯 When to use each:
• Frequency: When you need to know actual counts ("How many students chose Math?")
• Relative Frequency: When comparing groups of different sizes, or when context is about proportions ("What proportion of students chose Math?")
Describing the Distribution of a Categorical Variable
When describing a categorical distribution, focus on:
- Most common category (mode) and least common category
- Proportions/percentages of each category
- Context: always name the variable and the group being described
Good description: "In the sample of 200 students, Math was the most popular subject (26%), followed by Science (22.5%) and English (19%). Art was the least popular choice at 15%."
1.4 Representing a Categorical Variable with Graphs
Visual displays make patterns in categorical data easy to see. The two main graphs for categorical data are bar graphs and pie charts.
Bar Graphs (Bar Charts)
Bar Graph: Displays each category as a rectangular bar. The height of each bar represents the frequency or relative frequency. Bars do not touch (unlike histograms).
Key Features of Bar Graphs
| Feature | Description |
|---|---|
| Gaps between bars | Bars should NOT touch — this distinguishes bar graphs from histograms and emphasizes that categories are distinct. |
| Order of bars | Can be in any order (alphabetical, by frequency, by natural order like months). Choice should aid interpretation. |
| Height represents value | Bar height = frequency or relative frequency. Never use area for bar graphs. |
| Labels required | Must have a title, labeled axes, and labeled categories. |
Pie Charts (Circle Graphs)
Pie Chart: Displays each category as a "slice" of a circle. The angle (or area) of each slice is proportional to the relative frequency of that category. The whole circle represents 100% of the data.
Bar Graph vs. Pie Chart
| Aspect | Bar Graph | Pie Chart |
|---|---|---|
| Best for | Comparing frequencies across categories; any number of categories | Showing parts of a whole (when categories sum to 100%) |
| Limitations | Harder to see proportion of whole | Hard to read with many categories (>6); slices must sum to 100% |
| AP Preference | Bar graphs are generally preferred — easier to compare exact values and work with any data | |
⚠️ Common Mistakes to Avoid:
• Don't use pie charts when categories don't represent parts of a whole
• Don't use 3D effects — they distort proportions and make graphs harder to read
• Don't let bars touch in a bar graph (that's for histograms only)
1.5 Representing a Quantitative Variable with Graphs
Quantitative data requires different graphs than categorical data. The three main displays are dotplots, stemplots, and histograms. Each shows the distribution — the pattern of how values are spread out.
Dotplots
Dotplot: A simple graph where each data value is represented by a dot above its location on a number line. Dots stack when values repeat.
Best for: Small datasets (n < 30), showing individual values, identifying clusters and gaps.
Stemplots (Stem-and-Leaf Plots)
Stemplot: Separates each data value into a stem (leading digit(s)) and a leaf (trailing digit). Shows shape of distribution while preserving actual data values.
Example: Test scores: 72, 75, 78, 81, 83, 85, 85, 87, 91, 94, 97
| 7 | 2 5 8 |
| 8 | 1 3 5 5 7 |
| 9 | 1 4 7 |
Key: 7|2 = 72
Reading: The stem "8" with leaves "1 3 5 5 7" represents scores of 81, 83, 85, 85, and 87.
🎯 Stemplot Rules:
• Always include a key (e.g., "Key: 7|2 = 72")
• Leaves should be single digits
• Order leaves from smallest to largest
• Use split stems (e.g., 7* for 70-74, 7· for 75-79) if there aren't enough stems
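The stemplot rules above can be sketched in Python. This is a minimal version that assumes two-digit whole numbers (stem = tens digit, leaf = ones digit) and does not handle split stems; `stemplot` is a hypothetical helper name, not a library function.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows for two-digit whole numbers (stem = tens, leaf = ones)."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting keeps leaves in increasing order
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    # Include every stem from smallest to largest, even one with no leaves.
    stems = range(min(leaves), max(leaves) + 1)
    return [f"{s} | {' '.join(str(leaf) for leaf in leaves[s])}".rstrip() for s in stems]

scores = [72, 75, 78, 81, 83, 85, 85, 87, 91, 94, 97]
for row in stemplot(scores):
    print(row)
print("Key: 7|2 = 72")
```

Keeping empty stems in the display matters: dropping them would hide gaps in the distribution.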
Histograms
Histogram: Groups quantitative data into bins (intervals) and displays the frequency or relative frequency of each bin as a bar. Bars touch (unlike bar graphs) because the bins are adjacent intervals that cover the number line with no gaps between them.
Histogram vs. Bar Graph: Key Differences
| Feature | Histogram | Bar Graph |
|---|---|---|
| Data type | Quantitative (numerical) | Categorical |
| Bars | Touch (bins are adjacent intervals) | Don't touch (separate categories) |
| X-axis | Numerical scale (intervals) | Category names |
| Bar order | Must follow numerical order | Can be any order |
⚠️ Critical Histogram Rules:
• Bins must be equal width (unless you adjust for area)
• Bars for adjacent bins must touch; a gap should appear only where a bin contains no observations
• Include values on the boundary in only ONE bin (standard: lower ≤ x < upper)
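The boundary rule (lower ≤ x < upper) is easy to get wrong by hand, so here is a rough Python sketch of equal-width binning. The function name `bin_counts` and the choice of bins starting at 70 with width 10 are assumptions for illustration; the data are the test scores from the stemplot example.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in equal-width bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in values:
        i = int((x - start) // width)   # index of the bin containing x
        if 0 <= i < n_bins:             # values outside all bins are ignored
            counts[i] += 1
    return counts

scores = [72, 75, 78, 81, 83, 85, 85, 87, 91, 94, 97]
counts = bin_counts(scores, start=70, width=10, n_bins=3)

# Text version of the histogram: one row per bin.
for i, c in enumerate(counts):
    lower = 70 + 10 * i
    print(f"{lower} <= x < {lower + 10}: {'#' * c}")
```

A score of exactly 80 lands in the 80 to 90 bin, not the 70 to 80 bin, so no value is ever counted twice.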
1.6 Describing the Distribution of a Quantitative Variable
When describing a quantitative distribution, you need to address four key characteristics. Remember the acronym SOCS (or CUSS): Shape, Outliers, Center, Spread.
SOCS: The 4 Things to Describe
Shape — Outliers — Center — Spread
Always describe ALL FOUR when analyzing a distribution!
1. Shape
Shape describes the overall pattern of the distribution. Key aspects: symmetry and modality.
Symmetric: Left and right sides are mirror images.
Skewed Right: Tail extends to the RIGHT (high values).
Skewed Left: Tail extends to the LEFT (low values).
🎯 Memory Trick for Skewness: The skew is named for the direction of the TAIL, not the peak. Think: "The tail tells the tale!" A right-skewed distribution has most values on the left with a tail stretching right (like income data).
Modality (Number of Peaks)
| Term | Meaning | Example |
|---|---|---|
| Unimodal | One clear peak | Heights of adult men |
| Bimodal | Two distinct peaks | Heights of all adults (men + women combined) |
| Multimodal | More than two peaks | Rare in practice |
| Uniform | No peaks; approximately flat | Rolling a fair die many times |
2. Outliers
Outlier: A data value that is unusually far from the other values in the distribution. Outliers can significantly affect statistical calculations.
Look for values that stand apart from the main body of data. You'll learn a formal rule (1.5 × IQR) in section 1.8.
3. Center
Center: A "typical" or "middle" value. Common measures: mean (average) and median (middle value).
When describing center, estimate where the "middle" of the data lies. We'll calculate exact values in section 1.7.
4. Spread (Variability)
Spread: How much the data values vary. Are they tightly clustered or widely dispersed? Common measures: range, IQR, and standard deviation.
Complete Description Example:
"The distribution of test scores is roughly symmetric and unimodal, with most scores concentrated around the center of approximately 78. The scores spread from about 52 to 98. There appears to be one outlier at 52, which is much lower than the rest of the data."
✓ Includes all four SOCS elements with context!
1.7 Summary Statistics for a Quantitative Variable
Summary statistics are numerical measures that describe the center and spread of a distribution. Choosing the right statistics depends on the shape of the data.
Measures of Center
Mean (x̄)
Definition: The arithmetic average — sum of all values divided by the count.
• Balancing point of the data
• Sensitive to outliers and skewness
• Uses all data values
Median (M)
Definition: The middle value when data is ordered. If n is even, average the two middle values.
• Resistant to outliers
• Better for skewed data
• 50th percentile
🎯 Mean vs. Median:
• Symmetric distribution: Mean ≈ Median (use either)
• Skewed right: Mean > Median (mean pulled toward high values)
• Skewed left: Mean < Median (mean pulled toward low values)
• Outliers present: Median is more reliable
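A quick way to see why the median is called resistant: add one extreme value and watch what moves. The incomes below are made-up values (in thousands of dollars) chosen only to illustrate the effect.

```python
import statistics

# Hypothetical incomes in thousands of dollars (made-up values).
incomes = [32, 35, 38, 40, 45]
with_outlier = incomes + [400]   # add one very high value

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"before: mean = {mean_before}, median = {median_before}")
print(f"after:  mean = {mean_after:.1f}, median = {median_after}")
```

The single high value drags the mean far to the right while the median barely changes, which matches the "skewed right: Mean > Median" pattern.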
Measures of Spread
| Measure | Formula/Definition | Notes |
|---|---|---|
| Range | Max − Min | Simple but very sensitive to outliers. Uses only 2 values. |
| IQR (Interquartile Range) | Q₃ − Q₁ | Range of the middle 50% of data. Resistant to outliers. Pairs with median. |
| Standard Deviation (s) | s = √[Σ(xᵢ − x̄)² / (n−1)] | Roughly the typical distance of values from the mean. Not resistant to outliers. Pairs with mean. |
| Variance (s²) | s² = Σ(xᵢ − x̄)² / (n−1) | Standard deviation squared. Units are squared (harder to interpret). |
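
As a check on the formulas in the table, the sketch below computes the range, sample variance, and sample standard deviation with Python's standard `statistics` module, then recomputes the variance directly from the formula. The data are the test scores from Section 1.5; the variable names are illustrative.

```python
import statistics

scores = [72, 75, 78, 81, 83, 85, 85, 87, 91, 94, 97]  # test scores from Section 1.5

data_range = max(scores) - min(scores)    # Max - Min
variance = statistics.variance(scores)    # sample variance, divides by n - 1
std_dev = statistics.stdev(scores)        # square root of the sample variance

# The same variance computed directly from the formula in the table.
x_bar = statistics.mean(scores)
by_hand = sum((x - x_bar) ** 2 for x in scores) / (len(scores) - 1)

print(f"range = {data_range}, s^2 = {variance:.2f}, s = {std_dev:.2f}")
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by n instead of n − 1; those are the population versions, not the sample versions used here.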
Quartiles and Percentiles
Quartiles divide ordered data into four equal parts:
• Q₁ (25th percentile): 25% of data falls below this value
• Q₂ (50th percentile): The median — 50% below
• Q₃ (75th percentile): 75% of data falls below this value
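Quartiles can be found by hand as the medians of the lower and upper halves of the ordered data. The sketch below assumes that convention, leaving the median itself out of both halves when n is odd. Software packages often interpolate differently, so results may not match exactly. `quartiles` is a hypothetical helper name.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # values above it (median skipped when n is odd)
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

scores = [72, 75, 78, 81, 83, 85, 85, 87, 91, 94, 97]  # test scores from Section 1.5
q1, median, q3 = quartiles(scores)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
```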
Choosing Summary Statistics
Match Statistics to Distribution Shape
• Symmetric, no outliers: Mean & Standard Deviation
• Skewed or outliers present: Median & IQR
The five-number summary (Min, Q₁, Median, Q₃, Max) is always appropriate.
1.8 Graphical Representations of Summary Statistics
The boxplot (box-and-whisker plot) is a powerful visual display of the five-number summary. It's especially useful for identifying outliers and comparing distributions.
The Five-Number Summary
Five-Number Summary: Minimum, Q₁, Median, Q₃, Maximum
These five values divide the data into four quarters, each containing approximately 25% of the observations.
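Anatomy of a Boxplot
• Box: spans from Q₁ to Q₃, so its length is the IQR
• Line inside the box: the median
• Whiskers: extend from the box to the smallest and largest values that are not outliers
• Outliers: plotted as individual points beyond the whiskers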
Identifying Outliers: The 1.5 × IQR Rule
Outlier Rule: A data value is an outlier if it falls more than 1.5 × IQR below Q₁ or above Q₃.
Lower fence: Q₁ − 1.5 × IQR
Upper fence: Q₃ + 1.5 × IQR
Values beyond these fences are outliers.
Example: Given Q₁ = 25, Q₃ = 45
IQR = Q₃ − Q₁ = 45 − 25 = 20
Lower fence = 25 − 1.5(20) = 25 − 30 = −5
Upper fence = 45 + 1.5(20) = 45 + 30 = 75
Any value below −5 or above 75 is an outlier.
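The fence calculation above can be wrapped in a small function. In this sketch the quartiles come from the example, while the list of values being checked is made up for illustration; `outlier_fences` is a hypothetical helper name.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = outlier_fences(25, 45)   # quartiles from the example above

# Hypothetical data values to check against the fences.
values = [-8, 10, 30, 60, 75, 80]
outliers = [x for x in values if x < lower or x > upper]
print(lower, upper, outliers)
```

A value exactly on a fence (75 here) is not flagged, because the rule requires a value to fall beyond the fence.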
Reading Shape from Boxplots
Symmetric: Median centered in box; whiskers about equal length
Skewed Right: Median toward left of box; right whisker longer
Skewed Left: Median toward right of box; left whisker longer
⚠️ Boxplot Limitations:
• Cannot show modality (number of peaks) — a bimodal distribution looks like a unimodal one
• Cannot show exact shape — only gives 5 summary values
• Best used for comparing distributions, not describing single distributions in detail
1.9 Comparing Distributions of a Quantitative Variable
One of the most important skills in AP Statistics is comparing two or more distributions. Side-by-side displays and parallel analysis allow you to identify meaningful differences and similarities.
Displays for Comparison
| Display | Best For | Example Use |
|---|---|---|
| Back-to-back stemplot | Comparing two small datasets (n < 30 each) | Boys vs. girls test scores |
| Parallel boxplots | Comparing 2+ groups using five-number summaries; best for identifying outliers | Comparing SAT scores across multiple schools |
| Parallel dotplots | Comparing small datasets while showing individual values | Quiz scores for two class periods |
| Side-by-side histograms | Comparing shape of larger distributions | Age distributions for two populations |
Back-to-Back Stemplot
Example: Comparing quiz scores for Class A and Class B
| Class A | Stem | Class B |
|---|---|---|
| 8 5 2 | 6 | 3 7 |
| 9 6 5 3 1 | 7 | 2 4 5 8 |
| 7 4 2 0 | 8 | 1 3 5 6 8 9 |
| 5 1 | 9 | 2 4 7 |
Key: 2|6|3 means Class A: 62, Class B: 63
Note: Class A leaves read RIGHT-to-LEFT from the stem.
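Parallel Boxplots
Parallel boxplots place one boxplot per group on a shared numerical axis, so medians, IQRs, and outliers can be compared at a glance.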
How to Write a Comparison
When comparing distributions, you must make explicit comparative statements using words like "greater than," "more spread," "both," etc. for each SOCS element:
Comparison Framework
Shape: "Distribution A is skewed right while Distribution B is approximately symmetric..."
Center: "The median score for Class B (75) is higher than Class A (68)..."
Spread: "Class A has more variability (IQR = 15) compared to Class B (IQR = 10)..."
Outliers: "Class C has one high outlier at 92, while Classes A and B have no apparent outliers..."
⚠️ Common Comparison Mistakes:
• No comparison words: "Class A median is 68. Class B median is 75." ❌
• Better: "Class B's median (75) is higher than Class A's (68)." ✓
• Missing SOCS elements: Only comparing centers without addressing shape, spread, or outliers.
• No context: Forgetting to mention what the data represents.
Complete Comparison Example:
"Both Class A and Class B test score distributions are roughly symmetric with no apparent outliers. However, Class B's scores are generally higher, with a median of 78 compared to Class A's median of 71. Additionally, Class B shows less variability (IQR = 12) than Class A (IQR = 18), indicating that Class B's scores are more consistently clustered around the center."
✓ Uses comparison words, addresses SOCS, includes context and numbers!
1.10 The Normal Distribution
The normal distribution is the most important distribution in statistics. Its distinctive bell shape appears naturally in many real-world contexts and forms the foundation for statistical inference.
What Makes a Distribution Normal?
Normal Distribution: A symmetric, bell-shaped curve described entirely by two parameters:
• μ (mu) — the mean (center of the distribution)
• σ (sigma) — the standard deviation (spread)
Notation: N(μ, σ)
The 68-95-99.7 Rule (Empirical Rule)
For ANY Normal Distribution:
• About 68% of values fall within 1σ of μ
• About 95% of values fall within 2σ of μ
• About 99.7% of values fall within 3σ of μ
Example: Heights of adult males are normally distributed with μ = 70 inches and σ = 3 inches.
• 68% of men are between 67 and 73 inches (70 ± 3)
• 95% of men are between 64 and 76 inches (70 ± 6)
• 99.7% of men are between 61 and 79 inches (70 ± 9)
Only about 2.5% of men are taller than 76 inches (above μ + 2σ), because the 5% outside 64 to 76 inches is split equally between the two tails.
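The 68-95-99.7 figures are rounded. A short sketch using `NormalDist` from Python's standard `statistics` module shows the more precise areas for the height example (μ = 70, σ = 3); the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=70, sigma=3)   # model from the example above

# Area within k standard deviations of the mean, for k = 1, 2, 3.
areas = {}
for k in (1, 2, 3):
    low, high = 70 - 3 * k, 70 + 3 * k
    areas[k] = heights.cdf(high) - heights.cdf(low)
    print(f"within {k} SD ({low} to {high} in): {areas[k]:.4f}")

# Area above 76 inches (more than 2 SD above the mean).
taller_than_76 = 1 - heights.cdf(76)
print(f"taller than 76 in: {taller_than_76:.4f}")
```

The exact upper-tail area beyond 2σ is a little under the 2.5% that the rule gives, which is why answers based on the empirical rule are described as approximate.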
The Standard Normal Distribution
Standard Normal Distribution: A normal distribution with μ = 0 and σ = 1.
Notation: N(0, 1) or Z-distribution
Z-Scores (Standardized Values)
Z-score: Measures how many standard deviations a value is from the mean: z = (x − μ) / σ
• z > 0: value is above the mean
• z < 0: value is below the mean
• z = 0: value equals the mean
Example: If μ = 70 inches and σ = 3 inches, what is the z-score for a height of 76 inches?
z = (76 − 70) / 3 = 6 / 3 = 2
Interpretation: A height of 76 inches is 2 standard deviations above the mean.
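The same calculation as a one-line Python function (the name `z_score` is just illustrative):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

print(z_score(76, 70, 3))   # the example above: 2 SD above the mean
print(z_score(64, 70, 3))   # a height of 64 inches: 2 SD below the mean
```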
Using Z-Scores for Percentiles
Once you have a z-score, you can use Table A (the standard normal table) or your calculator to find:
- The area to the left of the z-score (percentile)
- The area to the right of the z-score (1 − area to the left)
- The area between two z-scores
Calculator Commands (TI-83/84)
normalcdf(lower, upper, μ, σ)
Finds the area (probability) between two values.
invNorm(area, μ, σ)
Finds the value corresponding to a given percentile (area to the left).
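For readers without a TI calculator, the two commands can be approximated in Python with `NormalDist` from the standard `statistics` module. The wrapper names `normalcdf` and `inv_norm` below are chosen to mirror the calculator and are not built-in Python functions.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma) curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area, mu=0.0, sigma=1.0):
    """Value with the given area to its left under the N(mu, sigma) curve."""
    return NormalDist(mu, sigma).inv_cdf(area)

# Proportion of heights between 64 and 76 inches when mu = 70 and sigma = 3.
print(round(normalcdf(64, 76, 70, 3), 4))

# z-score with 97.5% of the area to its left on the standard normal curve.
print(round(inv_norm(0.975), 2))
```

For a one-sided area, pass `float("-inf")` or `float("inf")` as the missing bound, much like entering a very large negative or positive number on the calculator.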
When Can We Use Normal Calculations?
🎯 Assessing Normality:
Before using normal distribution methods, check if the data is approximately normal:
1. Histogram/Dotplot: Should be roughly symmetric and bell-shaped
2. Normal Probability Plot (QQ Plot): Points should fall approximately along a straight line
If the data is clearly skewed or has multiple modes, normal calculations may not be appropriate.
Unit 1 Key Takeaways
Variables: Categorical vs. Quantitative (Discrete/Continuous)
Categorical displays: Bar graphs, pie charts, frequency tables
Quantitative displays: Dotplots, stemplots, histograms, boxplots
Describe distributions with SOCS: Shape, Outliers, Center, Spread
Summary statistics: Mean & SD (symmetric) or Median & IQR (skewed)
Normal distribution: 68-95-99.7 rule, z-scores, percentiles
z = (x − μ) / σ | IQR = Q₃ − Q₁ | Outlier if beyond Q₁ − 1.5(IQR) or Q₃ + 1.5(IQR)
End of Unit 1 Study Guide.