unit-1 – HighFiveAP

1.1 Introducing Statistics: What Can We Learn from Data?

Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions. In AP Statistics, you'll learn to transform raw numbers into meaningful insights — a skill that's essential in science, business, medicine, and everyday life.

Statistics: The science of learning from data. It involves collecting data, describing data through numerical summaries and graphs, and drawing conclusions about a larger group based on a sample.

Key Vocabulary

Term	Definition	Example
Individual	The objects or people described by a set of data. Also called cases or observational units.	Each student in a class, each car in a parking lot
Variable	A characteristic that varies from one individual to another.	Height, eye color, GPA, favorite subject
Population	The entire group of individuals we want information about.	All high school students in the U.S.
Sample	A subset of the population that we actually collect data from.	500 randomly selected high school students
Parameter	A number that describes a characteristic of the population. Usually unknown.	The true average GPA of ALL U.S. high schoolers (μ)
Statistic	A number that describes a characteristic of a sample. Calculated from data.	The average GPA of 500 sampled students (x̄)

🎯 Memory Trick: Parameter = Population, Statistic = Sample. Parameters are usually written with Greek letters (μ, σ), although the population proportion uses the Roman letter p. Statistics use Roman letters (x̄, s, p̂).

Types of Statistical Studies

Descriptive Statistics

Purpose: Organize, summarize, and present data.

Tables and graphs
Numerical summaries (mean, median, etc.)
Describes what the data shows

Inferential Statistics

Purpose: Draw conclusions about a population based on sample data.

Confidence intervals
Hypothesis tests
Makes predictions about what we don't see

Exam Tip: Unit 1 focuses on descriptive statistics. Inferential statistics are covered in Units 6-9. Understand the distinction — the AP exam loves asking whether a conclusion is appropriate based on the type of analysis performed.

1.2 The Language of Variation: Variables

Every statistical analysis begins by identifying what variables are being measured. Variables are classified into two fundamental types: categorical and quantitative. Getting this distinction right is crucial — it determines which graphs and statistics you can use.

Two Types of Variables

Categorical (Qualitative) Variables

Definition: Places individuals into groups or categories. Values are labels, not numbers.

Examples:

Gender (male, female, non-binary)
Eye color (blue, brown, green)
Class year (freshman, sophomore, junior, senior)
Zip code, phone number, jersey number

⚠️ Numbers can be categorical! If you can't do meaningful math with them (like averaging zip codes), they're categorical.

Quantitative (Numerical) Variables

Definition: Takes numerical values for which arithmetic operations (like averaging) make sense.

Examples:

Height (68 inches)
Weight (150 lbs)
Test score (85 points)
Income ($45,000)

✓ Ask: "Does calculating an average make sense?" If yes → quantitative.

Discrete vs. Continuous Quantitative Variables

Discrete Variable: Can only take certain values (often counts/whole numbers). There are "gaps" between possible values.

Continuous Variable: Can take any value within an interval. No "gaps" — measurements can be infinitely precise.

Type	Characteristics	Examples
Discrete	• Countable values • Often whole numbers • "How many?"	• Number of siblings (0, 1, 2, 3...) • Number of cars owned • Number of texts sent per day
Continuous	• Measurable values • Can be decimals • "How much?"	• Height (5.75 feet) • Time to run a mile (7.23 minutes) • Temperature (98.6°F)

🎯 Quick Classification Test:
1️⃣ Can you do meaningful arithmetic (like averaging) with the values? No → Categorical. Yes → Quantitative.
2️⃣ If quantitative, decide which kind:
• Can it take any value in a range? Yes → Continuous
• Are there gaps between possible values? Yes → Discrete

Common AP Exam Traps

⚠️ Watch out for these tricky variables:

• Zip codes, phone numbers, jersey numbers, Social Security numbers — these are CATEGORICAL even though they're numbers. You can't meaningfully average them.

• Age groups (20-29, 30-39, etc.) — CATEGORICAL (they're categories, not actual ages)

• Survey responses on a 1-5 scale — often treated as categorical on the AP exam, though this can vary by context

Exam Tip: The AP exam frequently asks you to identify variable types. Always ask: "Does averaging these values make sense in context?"

1.3 Representing a Categorical Variable with Tables

When working with categorical data, we summarize the distribution using frequency tables and relative frequency tables. These tables count how many individuals fall into each category.

Frequency Tables

Frequency Table: Shows the count (frequency) of individuals in each category of a categorical variable.

Example: Survey of 200 students' favorite subjects:

Subject	Frequency
Math	52
English	38
Science	45
History	35
Art	30
Total	200

Relative Frequency Tables

Relative Frequency: The proportion (or percentage) of individuals in each category.

Relative Frequency = (Frequency of category) / (Total count)

Example continued: Converting to relative frequencies:

Subject	Frequency	Relative Frequency	Percent
Math	52	52/200 = 0.26	26%
English	38	38/200 = 0.19	19%
Science	45	45/200 = 0.225	22.5%
History	35	35/200 = 0.175	17.5%
Art	30	30/200 = 0.15	15%
Total	200	1.00	100%
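The conversion in the table above can be sketched in a few lines of Python. This is only an illustration using the survey counts from the example; the variable names (counts, rel_freq) are made up for this sketch.

```python
# Counts from the favorite-subject survey in the example above.
counts = {"Math": 52, "English": 38, "Science": 45, "History": 35, "Art": 30}

total = sum(counts.values())  # total number of students surveyed

# Relative frequency = frequency of category / total count
rel_freq = {subject: n / total for subject, n in counts.items()}

for subject, p in rel_freq.items():
    # Print the count, the proportion, and the percent for each category.
    print(f"{subject:8s} {counts[subject]:4d}  {p:.3f}  {p:.1%}")

# The relative frequencies should add up to 1 (100%).
print("Sum of relative frequencies:", round(sum(rel_freq.values()), 10))
```

A quick check that the proportions sum to 1 is a useful habit: if they do not, a category was miscounted or left out.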

🎯 When to use each:
• Frequency: When you need to know actual counts ("How many students chose Math?")
• Relative Frequency: When comparing groups of different sizes, or when context is about proportions ("What proportion of students chose Math?")

Describing the Distribution of a Categorical Variable

When describing a categorical distribution, focus on:

Most common category (mode) and least common category
Proportions/percentages of each category
Context — always use the variable name and units in your description

Good description: "In the sample of 200 students, Math was the most popular subject (26%), followed by Science (22.5%) and English (19%). Art was the least popular choice at 15%."

Exam Tip: On free-response questions, always include context (variable name, who/what was studied) and specific numbers (percentages or counts) in your descriptions.

1.4 Representing a Categorical Variable with Graphs

Visual displays make patterns in categorical data easy to see. The two main graphs for categorical data are bar graphs and pie charts.

Bar Graphs (Bar Charts)

Bar Graph: Displays each category as a rectangular bar. The height of each bar represents the frequency or relative frequency. Bars do not touch (unlike histograms).

Key Features of Bar Graphs

Feature	Description
Gaps between bars	Bars should NOT touch — this distinguishes bar graphs from histograms and emphasizes that categories are distinct.
Order of bars	Can be in any order (alphabetical, by frequency, by natural order like months). Choice should aid interpretation.
Height represents value	Bar height = frequency or relative frequency. Never use area for bar graphs.
Labels required	Must have a title, labeled axes, and labeled categories.

Pie Charts (Circle Graphs)

Pie Chart: Displays each category as a "slice" of a circle. The angle (or area) of each slice is proportional to the relative frequency of that category. The whole circle represents 100% of the data.

Bar Graph vs. Pie Chart

Aspect	Bar Graph	Pie Chart
Best for	Comparing frequencies across categories; any number of categories	Showing parts of a whole (when categories sum to 100%)
Limitations	Harder to see proportion of whole	Hard to read with many categories (>6); slices must sum to 100%
AP Preference	Bar graphs are generally preferred — easier to compare exact values and work with any data

⚠️ Common Mistakes to Avoid:

• Don't use pie charts when categories don't represent parts of a whole

• Don't use 3D effects — they distort proportions and make graphs harder to read

• Don't let bars touch in a bar graph (that's for histograms only)

Exam Tip: When creating graphs on the AP exam, always include: (1) a title, (2) labeled axes, (3) a scale, and (4) appropriate spacing. Missing labels = lost points!

1.5 Representing a Quantitative Variable with Graphs

Quantitative data requires different graphs than categorical data. The three main displays are dotplots, stemplots, and histograms. Each shows the distribution — the pattern of how values are spread out.

Dotplots

Dotplot: A simple graph where each data value is represented by a dot above its location on a number line. Dots stack when values repeat.

Best for: Small datasets (n < 30), showing individual values, identifying clusters and gaps.

Stemplots (Stem-and-Leaf Plots)

Stemplot: Separates each data value into a stem (leading digit(s)) and a leaf (trailing digit). Shows shape of distribution while preserving actual data values.

Example: Test scores: 72, 75, 78, 81, 83, 85, 85, 87, 91, 94, 97

7 | 2 5 8
8 | 1 3 5 5 7
9 | 1 4 7
Key: 7|2 = 72

Reading: The stem "8" with leaves "1 3 5 5 7" represents scores of 81, 83, 85, 85, and 87.

🎯 Stemplot Rules:
• Always include a key (e.g., "Key: 7|2 = 72")
• Leaves should be single digits
• Order leaves from smallest to largest
• Use split stems (e.g., 7* for 70-74, 7· for 75-79) if there aren't enough stems
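As a rough sketch of how these rules turn into a procedure, the Python below builds a stemplot for the test scores in the example. It assumes two-digit whole numbers (stem = tens digit, leaf = ones digit) and does not handle split stems; the function name stemplot is hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows for two-digit whole numbers (stem = tens, leaf = ones)."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in increasing order
        leaves[v // 10].append(v % 10)
    rows = []
    # Include every stem from smallest to largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows

scores = [72, 75, 78, 81, 83, 85, 85, 87, 91, 94, 97]
plot = stemplot(scores)
print("\n".join(plot))
print("Key: 7|2 = 72")
```

Keeping empty stems in the output matters: dropping them would hide gaps in the distribution.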

Histograms

Histogram: Groups quantitative data into bins (intervals) and displays the frequency or relative frequency of each bin as a bar. Bars touch (unlike bar graphs) because the data is continuous.

Histogram vs. Bar Graph: Key Differences

Feature	Histogram	Bar Graph
Data type	Quantitative (numerical)	Categorical
Bars	Touch (continuous data)	Don't touch (separate categories)
X-axis	Numerical scale (intervals)	Category names
Bar order	Must follow numerical order	Can be any order

⚠️ Critical Histogram Rules:

• Bins must be equal width (unless you adjust for area)

• Bars must touch, because adjacent bins share a boundary; a gap should appear only where no observations fell in a bin

• Include values on the boundary in only ONE bin (standard: lower ≤ x < upper)
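The boundary rule can be illustrated with a small Python sketch. It assumes equal-width bins and the lower ≤ x < upper convention described above; the function name bin_counts and its parameters are invented for this example.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in equal-width bins [start, start+width), [start+width, ...), etc."""
    counts = [0] * n_bins
    for x in values:
        index = int((x - start) // width)   # floor division puts a boundary value in the upper bin
        if 0 <= index < n_bins:             # ignore values outside the binned range
            counts[index] += 1
    return counts

# Test scores from the stemplot example, binned as 70-80, 80-90, 90-100.
scores = [72, 75, 78, 81, 83, 85, 85, 87, 91, 94, 97]
score_counts = bin_counts(scores, start=70, width=10, n_bins=3)
print(score_counts)

# Boundary values land in exactly one bin each: 80 counts in 80-90, not 70-80.
boundary_counts = bin_counts([70, 80, 90], start=70, width=10, n_bins=3)
print(boundary_counts)
```

Whichever convention you choose, apply it to every bin so that no value is counted twice or skipped.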

Exam Tip: On the AP exam, "histogram" and "bar graph" are NOT interchangeable. Using a histogram for categorical data (or vice versa) is incorrect and will cost you points.

1.6 Describing the Distribution of a Quantitative Variable

When describing a quantitative distribution, you need to address four key characteristics. Remember the acronym SOCS: Shape, Outliers, Center, Spread. (Some courses use CUSS instead: Center, Unusual features, Shape, Spread.)

SOCS: The 4 Things to Describe

Shape — Outliers — Center — Spread

Always describe ALL FOUR when analyzing a distribution!

1. Shape

Shape describes the overall pattern of the distribution. Key aspects: symmetry and modality.

Symmetric

Left and right sides are mirror images

Skewed Right

Tail extends to the RIGHT (high values)

Skewed Left

Tail extends to the LEFT (low values)

🎯 Memory Trick for Skewness: The skew is named for the direction of the TAIL, not the peak. Think: "The tail tells the tale!" A right-skewed distribution has most values on the left with a tail stretching right (like income data).

Modality (Number of Peaks)

Term	Meaning	Example
Unimodal	One clear peak	Heights of adult men
Bimodal	Two distinct peaks	Heights of all adults (men + women combined)
Multimodal	More than two peaks	Rare in practice
Uniform	No peaks; approximately flat	Rolling a fair die many times

2. Outliers

Outlier: A data value that is unusually far from the other values in the distribution. Outliers can significantly affect statistical calculations.

Look for values that stand apart from the main body of data. You'll learn a formal rule (1.5 × IQR) in section 1.8.

3. Center

Center: A "typical" or "middle" value. Common measures: mean (average) and median (middle value).

When describing center, estimate where the "middle" of the data lies. We'll calculate exact values in section 1.7.

4. Spread (Variability)

Spread: How much the data values vary. Are they tightly clustered or widely dispersed? Common measures: range, IQR, and standard deviation.

Complete Description Example:

"The distribution of test scores is roughly symmetric and unimodal, with most scores concentrated around the center of approximately 78. The scores spread from about 52 to 98. There appears to be one outlier at 52, which is much lower than the rest of the data."

✓ Includes all four SOCS elements with context!

Exam Tip: On free-response questions, you MUST address all four SOCS elements AND use context (the variable name and units). Saying "the distribution is skewed right" without context loses points. Say "the distribution of household incomes is skewed right."

1.7 Summary Statistics for a Quantitative Variable

Summary statistics are numerical measures that describe the center and spread of a distribution. Choosing the right statistics depends on the shape of the data.

Measures of Center

Mean (x̄)

Definition: The arithmetic average — sum of all values divided by the count.

x̄ = Σxᵢ / n

• Balancing point of the data
• Sensitive to outliers and skewness
• Uses all data values

Median (M)

Definition: The middle value when data is ordered. If n is even, average the two middle values.

Position of median: (n+1)/2

• Resistant to outliers
• Better for skewed data
• 50th percentile

🎯 Mean vs. Median:
• Symmetric distribution: Mean ≈ Median (use either)
• Skewed right: Mean > Median (mean pulled toward high values)
• Skewed left: Mean < Median (mean pulled toward low values)
• Outliers present: Median is more reliable
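To see why the median is called resistant, here is a small Python sketch using made-up incomes (in thousands of dollars). The data are hypothetical and chosen only to make the effect obvious.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars: a small symmetric data set.
incomes = [30, 35, 40, 45, 50]
print(mean(incomes), median(incomes))        # mean and median agree here

# Add one extreme high value (an outlier on the right).
with_outlier = incomes + [500]
mean_out = mean(with_outlier)
median_out = median(with_outlier)
print(round(mean_out, 1), median_out)        # the mean jumps; the median barely moves
```

One extreme value pulls the mean far to the right while the median shifts only slightly, which matches the "skewed right: Mean > Median" pattern above.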

Measures of Spread

Measure	Formula/Definition	Notes
Range	Max − Min	Simple but very sensitive to outliers. Uses only 2 values.
IQR (Interquartile Range)	Q₃ − Q₁	Range of the middle 50% of data. Resistant to outliers. Pairs with median.
Standard Deviation (s)	s = √[Σ(xᵢ − x̄)² / (n−1)]	Roughly the typical distance of values from the mean. Not resistant to outliers. Pairs with mean.
Variance (s²)	s² = Σ(xᵢ − x̄)² / (n−1)	Standard deviation squared. Units are squared (harder to interpret).
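The variance and standard deviation formulas in the table can be worked step by step. The Python sketch below uses a small hypothetical data set and compares the hand calculation with the standard library's statistics.stdev, which also divides by n − 1.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values for illustration

n = len(data)
xbar = sum(data) / n                              # sample mean
squared_devs = [(x - xbar) ** 2 for x in data]    # (x_i - xbar)^2 for each value
variance = sum(squared_devs) / (n - 1)            # s^2: divide by n - 1, not n
s = math.sqrt(variance)                           # s: back in the original units

print(f"mean = {xbar}, variance = {variance:.3f}, SD = {s:.3f}")
print(f"statistics.stdev gives {statistics.stdev(data):.3f}")
```

Taking the square root is what returns the measure to the original units, which is why the standard deviation is easier to interpret than the variance.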

Quartiles and Percentiles

Quartiles divide ordered data into four equal parts:

• Q₁ (25th percentile): 25% of data falls below this value
• Q₂ (50th percentile): The median — 50% below
• Q₃ (75th percentile): 75% of data falls below this value
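Here is one way to compute quartiles in Python, offered as a sketch. It assumes the "median of each half" method (leaving the median out of both halves when n is odd), which is a common textbook approach; calculators and software sometimes use slightly different rules, so results can differ a little. The function name quartiles is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                  # values below the median position
    upper = data[half + 1:] if n % 2 else data[half:]    # skip the median itself when n is odd
    return median(lower), median(data), median(upper)

# Test scores from the stemplot example in section 1.5.
scores = [72, 75, 78, 81, 83, 85, 85, 87, 91, 94, 97]
q1, q2, q3 = quartiles(scores)
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```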

Choosing Summary Statistics

Match Statistics to Distribution Shape

Symmetric, No Outliers

Mean & Standard Deviation

Skewed or Outliers

Median & IQR

The five-number summary (Min, Q₁, Median, Q₃, Max) is always appropriate.

Exam Tip: Standard deviation can NEVER be negative (it's a distance). It equals zero only when all data values are identical. The AP exam loves asking about properties of standard deviation.

1.8 Graphical Representations of Summary Statistics

The boxplot (box-and-whisker plot) is a powerful visual display of the five-number summary. It's especially useful for identifying outliers and comparing distributions.

The Five-Number Summary

Five-Number Summary: Minimum, Q₁, Median, Q₃, Maximum

These five values divide the data into four quarters, each containing approximately 25% of the observations.

Anatomy of a Boxplot

Identifying Outliers: The 1.5 × IQR Rule

Outlier Rule: A data value is an outlier if it falls more than 1.5 × IQR below Q₁ or above Q₃.

Lower fence: Q₁ − 1.5 × IQR

Upper fence: Q₃ + 1.5 × IQR

Values beyond these fences are outliers.

Example: Given Q₁ = 25, Q₃ = 45

IQR = Q₃ − Q₁ = 45 − 25 = 20

Lower fence = 25 − 1.5(20) = 25 − 30 = −5

Upper fence = 45 + 1.5(20) = 45 + 30 = 75

Any value below −5 or above 75 is an outlier.
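The fence calculation above is easy to script. This Python sketch reproduces the example with Q₁ = 25 and Q₃ = 45, then checks a few made-up values against the fences; the function name outlier_fences and the sample values are assumptions for illustration.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = outlier_fences(25, 45)
print(f"Lower fence = {lower}, upper fence = {upper}")

# Hypothetical data values to test against the fences.
values = [-8, 10, 30, 44, 75, 80]
outliers = [x for x in values if x < lower or x > upper]
print("Outliers:", outliers)
```

Note that a value sitting exactly on a fence (75 here) is not flagged, because the rule requires a value to fall beyond the fence.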

Reading Shape from Boxplots

Symmetric

Median centered in box
Whiskers equal length

Skewed Right

Median toward left
Right whisker longer

Skewed Left

Median toward right
Left whisker longer

⚠️ Boxplot Limitations:

• Cannot show modality (number of peaks) — a bimodal distribution looks like a unimodal one

• Cannot show exact shape — only gives 5 summary values

• Best used for comparing distributions, not describing single distributions in detail

Exam Tip: When asked to identify outliers, SHOW YOUR WORK using the 1.5 × IQR rule. Simply circling a point isn't enough — calculate the fences and compare.

1.9 Comparing Distributions of a Quantitative Variable

One of the most important skills in AP Statistics is comparing two or more distributions. Side-by-side displays and parallel analysis allow you to identify meaningful differences and similarities.

Displays for Comparison

Display	Best For	Example Use
Back-to-back stemplot	Comparing two small datasets (n < 30 each)	Boys vs. girls test scores
Parallel boxplots	Comparing 2+ groups using five-number summaries; best for identifying outliers	Comparing SAT scores across multiple schools
Parallel dotplots	Comparing small datasets while showing individual values	Quiz scores for two class periods
Side-by-side histograms	Comparing shape of larger distributions	Age distributions for two populations

Back-to-Back Stemplot

Example: Comparing quiz scores for Class A and Class B

Class A | Stem | Class B
8 5 2 | 6 | 3 7
9 6 5 3 1 | 7 | 2 4 5 8
7 4 2 0 | 8 | 1 3 5 6 8 9
5 1 | 9 | 2 4 7
Key: 2|6|3 means Class A: 62, Class B: 63

Note: Class A leaves read RIGHT-to-LEFT from the stem.

Parallel Boxplots

How to Write a Comparison

When comparing distributions, you must make explicit comparative statements using words like "greater than," "more spread," "both," etc. for each SOCS element:

Comparison Framework

Shape: "Distribution A is skewed right while Distribution B is approximately symmetric..."

Center: "The median score for Class B (75) is higher than Class A (68)..."

Spread: "Class A has more variability (IQR = 15) compared to Class B (IQR = 10)..."

Outliers: "Class C has one high outlier at 92, while Classes A and B have no apparent outliers..."

⚠️ Common Comparison Mistakes:

• No comparison words: "Class A median is 68. Class B median is 75." ❌

• Better: "Class B's median (75) is higher than Class A's (68)." ✓

• Missing SOCS elements: Only comparing centers without addressing shape, spread, or outliers.

• No context: Forgetting to mention what the data represents.

Complete Comparison Example:

"Both Class A and Class B test score distributions are roughly symmetric with no apparent outliers. However, Class B's scores are generally higher, with a median of 78 compared to Class A's median of 71. Additionally, Class B shows less variability (IQR = 12) than Class A (IQR = 18), indicating that Class B's scores are more consistently clustered around the center."

✓ Uses comparison words, addresses SOCS, includes context and numbers!

Exam Tip: On the AP exam, comparison questions require COMPARATIVE LANGUAGE. You won't receive full credit for describing each distribution separately without explicitly comparing them.

1.10 The Normal Distribution

The normal distribution is the most important distribution in statistics. Its distinctive bell shape appears naturally in many real-world contexts and forms the foundation for statistical inference.

What Makes a Distribution Normal?

Normal Distribution: A symmetric, bell-shaped curve described entirely by two parameters:

• μ (mu) — the mean (center of the distribution)
• σ (sigma) — the standard deviation (spread)

Notation: N(μ, σ)

The 68-95-99.7 Rule (Empirical Rule)

For ANY Normal Distribution:

• About 68% of values fall within 1σ of μ
• About 95% of values fall within 2σ of μ
• About 99.7% of values fall within 3σ of μ

Example: Heights of adult males are normally distributed with μ = 70 inches and σ = 3 inches.

• 68% of men are between 67 and 73 inches (70 ± 3)

• 95% of men are between 64 and 76 inches (70 ± 6)

• 99.7% of men are between 61 and 79 inches (70 ± 9)

Because the remaining 5% is split evenly between the two tails, only about 2.5% of men are taller than 76 inches (above μ + 2σ).
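The rule's percentages are rounded. As a check, this Python sketch uses the standard library's statistics.NormalDist to compute the areas for the heights example (μ = 70, σ = 3); the variable names are chosen for this illustration.

```python
from statistics import NormalDist

mu, sigma = 70, 3                 # heights example from the text
heights = NormalDist(mu, sigma)

areas = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    # Area between the two cutoffs = cdf(high) - cdf(low)
    areas[k] = heights.cdf(high) - heights.cdf(low)
    print(f"within {k} SD ({low} to {high} in): {areas[k]:.4f}")

above_76 = 1 - heights.cdf(76)    # upper tail beyond mu + 2*sigma
print(f"taller than 76 in: {above_76:.4f}")
```

The computed tail area above 76 inches is a little under 2.3%, so the 2.5% from the rule is a convenient approximation rather than an exact value.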

The Standard Normal Distribution

Standard Normal Distribution: A normal distribution with μ = 0 and σ = 1.

Notation: N(0, 1) or Z-distribution

Z-Scores (Standardized Values)

Z-score: Measures how many standard deviations a value is from the mean.

z = (x − μ) / σ

• z > 0: value is above the mean

• z < 0: value is below the mean

• z = 0: value equals the mean

Example: If μ = 70 inches and σ = 3 inches, what is the z-score for a height of 76 inches?

z = (76 − 70) / 3 = 6 / 3 = 2

Interpretation: A height of 76 inches is 2 standard deviations above the mean.
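The same calculation as a short Python sketch, using the heights example. The function name z_score is just a label for this illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

z_tall = z_score(76, mu=70, sigma=3)    # above the mean, so z is positive
z_short = z_score(64, mu=70, sigma=3)   # below the mean, so z is negative
z_mean = z_score(70, mu=70, sigma=3)    # exactly at the mean

print(z_tall, z_short, z_mean)
```

The sign of z tells you the direction from the mean, and its size tells you the distance in standard deviations.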

Using Z-Scores for Percentiles

Once you have a z-score, you can use Table A (the standard normal table) or your calculator to find:

The area to the left of the z-score (percentile)
The area to the right of the z-score (1 − percentile)
The area between two z-scores

Calculator Commands (TI-83/84)

normalcdf(lower, upper, μ, σ)

Finds the area (probability) between two values.

invNorm(area, μ, σ)

Finds the value corresponding to a given percentile (area to the left).
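For readers without a TI calculator, the two commands can be approximated in Python with the standard library's statistics.NormalDist. The wrapper names normalcdf and inv_norm below are written for this sketch to mirror the calculator commands; they are not built-in Python functions.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area, mu=0.0, sigma=1.0):
    """Value with the given area (between 0 and 1) to its left."""
    return NormalDist(mu, sigma).inv_cdf(area)

# Proportion of men between 67 and 73 inches when heights are N(70, 3).
middle = normalcdf(67, 73, mu=70, sigma=3)
print(f"{middle:.4f}")

# Height at the 97.5th percentile of the same distribution.
cutoff = inv_norm(0.975, mu=70, sigma=3)
print(f"{cutoff:.2f}")
```

As on the calculator, a very small lower bound such as -999 stands in for "negative infinity" when you want the whole area to the left of a value.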

When Can We Use Normal Calculations?

🎯 Assessing Normality:

Before using normal distribution methods, check if the data is approximately normal:

1. Histogram/Dotplot: Should be roughly symmetric and bell-shaped

2. Normal Probability Plot (QQ Plot): Points should fall approximately along a straight line

If the data is clearly skewed or has multiple modes, normal calculations may not be appropriate.

Exam Tip: The AP exam provides Table A for z-scores but also allows calculators. Know both methods! When showing work, write the calculator command you used (e.g., normalcdf(-999, 1.5, 0, 1) = 0.9332).

Unit 1 Key Takeaways

Variables: Categorical vs. Quantitative (Discrete/Continuous)

Categorical displays: Bar graphs, pie charts, frequency tables

Quantitative displays: Dotplots, stemplots, histograms, boxplots

Describe distributions with SOCS: Shape, Outliers, Center, Spread

Summary statistics: Mean & SD (symmetric, no outliers) or Median & IQR (skewed or outliers)

Normal distribution: 68-95-99.7 rule, z-scores, percentiles

z = (x − μ) / σ | IQR = Q₃ − Q₁ | Outlier if beyond Q₁ − 1.5(IQR) or Q₃ + 1.5(IQR)

End of Unit 1 Study Guide.