AP Statistics – Unit 8: Inference for Categorical Data: Chi-Square

8.1 Introducing Statistics: Are My Results Unexpected?

Until now, our categorical inference has been limited to just one or two proportions (like "success" vs "failure"). But what if we have a categorical variable with three or more categories? We can't use a z-test anymore. We need a way to measure how much an entire set of Observed counts differs from a set of Expected counts.

The Chi-Square ($\chi^2$) Statistic: A measure of how far the observed counts in a sample are from the expected counts assuming a null hypothesis is true. It squares the differences so that positive and negative deviations don't cancel each other out.

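Because every term in the $\chi^2$ statistic is a squared difference divided by a positive expected count, the statistic can never be negative. The $\chi^2$ distribution is skewed to the right, most strongly for small degrees of freedom, and it becomes more symmetric as the degrees of freedom increase.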

[Figure: The Chi-Square Distribution Family. Density curves for df = 2, df = 5, and df = 10 on a $\chi^2$ axis starting at 0, with the p-value area shaded.]

Exam Tip: Chi-Square tests are ALWAYS one-sided (right-tailed). A large $\chi^2$ value means the observations are far from what was expected, which gives evidence against the null hypothesis.
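
To make the "right-tailed" idea concrete, here is a minimal sketch that computes the area to the right of a given $\chi^2$ value using only Python's standard library. The function name `chi2_right_tail` is an assumption made for illustration (it is not a calculator or library command), and the statistic value 9.0 is an arbitrary example.

```python
import math

def chi2_right_tail(x, df):
    """Area to the right of x under the chi-square curve with df degrees of freedom."""
    if x <= 0:
        return 1.0  # the statistic cannot be negative, so all the area lies to the right
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^j / j! for j = 0 .. df/2 - 1
        total = sum(half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * total
    # Odd df: start from the df = 1 tail area, then add one term per extra 2 df
    tail = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return tail

# Same statistic, different curves from the family (df = 2, 5, 10)
for df in (2, 5, 10):
    print(df, round(chi2_right_tail(9.0, df), 4))
```

For the same statistic, the right-tail area grows as the degrees of freedom increase, which is why the df must be stated with every $\chi^2$ test.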

8.2 Setting Up a Chi-Square Goodness of Fit Test

The Chi-Square Goodness of Fit (GOF) Test is used when we have one single sample from a population and we want to see if the distribution of a categorical variable matches a claimed distribution.

Hypotheses for GOF:

H₀: The stated distribution of the categorical variable in the population IS correct.
Hₐ: The stated distribution of the categorical variable in the population IS NOT correct.

Calculating Expected Counts

To run the test, we must calculate what we *expect* to see if H₀ is true. For a GOF test, the expected count for each category is:

Expected Count = $n \times p_i$

Where $n$ is total sample size, and $p_i$ is the claimed proportion for that specific category.

⚠️ THE CONDITIONS TRAP: For the "Large Counts" condition in Chi-Square, we check that ALL EXPECTED COUNTS $\ge 5$. Do NOT use the $\ge 10$ rule from Unit 6, and do not check the observed counts! It must be Expected Counts $\ge 5$.
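
The expected-count rule and the Large Counts check can be sketched in a few lines of Python. The claimed proportions and sample sizes below are hypothetical values chosen for illustration, and the helper names are assumptions, not calculator commands.

```python
def gof_expected_counts(n, claimed_proportions):
    """Expected count for each category under H0: n * p_i."""
    return [n * p for p in claimed_proportions]

def large_counts_ok(expected_counts, minimum=5):
    """True only if every EXPECTED count is at least the minimum."""
    return all(e >= minimum for e in expected_counts)

# Hypothetical claim: four categories with these proportions
claimed = [0.5, 0.3, 0.15, 0.05]

expected_small = gof_expected_counts(60, claimed)   # smallest expected count is about 3
expected_large = gof_expected_counts(200, claimed)  # smallest expected count is about 10

print(large_counts_ok(expected_small))  # False
print(large_counts_ok(expected_large))  # True
```

Notice that the check never looks at observed counts: a sample of 60 fails here only because the rarest category is expected to appear about 3 times.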

8.3 Carrying Out a Chi-Square Test for Goodness of Fit

Once you have your observed counts (from the sample) and your expected counts (calculated from H₀), you calculate the $\chi^2$ test statistic.

Chi-Square Test Statistic Formula

$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

Degrees of Freedom (df): $k - 1$
(where $k$ is the number of categories)

Example: A bag of Skittles claims to have an equal distribution of 5 colors (20% each). You sample 50 Skittles. The Expected Count for each color is $50 \times 0.20 = 10$. If you count 15 Red Skittles, the contribution of Red to the $\chi^2$ total is:

(15 - 10)² / 10 = 25 / 10 = 2.5

You calculate this for all 5 colors and add them up to get your final $\chi^2$ test statistic. The degrees of freedom would be $5 - 1 = 4$.
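
Here is a rough sketch of the full calculation. Only the Red count of 15 comes from the example; the other four colors and their counts are hypothetical, made up so that the sample totals 50.

```python
# Observed counts: Red = 15 comes from the example; the other four colors
# and counts are HYPOTHETICAL values chosen only so the sample totals 50.
observed = {"Red": 15, "Orange": 8, "Yellow": 9, "Green": 7, "Purple": 11}

n = sum(observed.values())   # 50 Skittles in the sample
expected = n * 0.20          # 10 expected for every color under H0

# Contribution of each color: (Observed - Expected)^2 / Expected
contributions = {color: (count - expected) ** 2 / expected
                 for color, count in observed.items()}

chi_square = sum(contributions.values())
df = len(observed) - 1       # k - 1 = 4

print(contributions["Red"])        # 2.5
print(round(chi_square, 4), df)    # 4.0 4
```

With these made-up counts, Red alone supplies 2.5 of the total 4.0, so it is the category that deviates most from the claim.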

Calculator Commands (TI-83/84)

STAT ➔ TESTS ➔ D: $\chi^2$ GOF-Test

Put Observed counts in L1, Expected counts in L2. Enter the $df$ ($k-1$).

8.4 Expected Counts in Two-Way Tables

When we deal with two categorical variables (or two or more populations), our data are presented in a two-way table. The rule for finding expected counts changes here.

Expected Count Formula for a Two-Way Table

Expected Count = $\frac{(\text{Row Total}) \times (\text{Column Total})}{\text{Table Total}}$

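Why does this work? If the row and column variables are independent, the probability of landing in a specific cell is $P(\text{Row}) \times P(\text{Col}) = \frac{\text{Row Total}}{\text{Table Total}} \times \frac{\text{Column Total}}{\text{Table Total}}$. Multiplying this joint probability by the Table Total cancels one Table Total in the denominator, which gives the formula above!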
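
As a quick illustration, the sketch below applies the two-way expected count formula to every cell of a small table. The observed counts are hypothetical, and `expected_counts` is a name invented for this example.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total * column total) / table total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)
    return [[r * c / table_total for c in col_totals] for r in row_totals]

# Hypothetical 2x2 table of observed counts
observed = [[10, 20],
            [30, 40]]

print(expected_counts(observed))  # [[12.0, 18.0], [28.0, 42.0]]
```

The top-left cell, for instance, is $(30 \times 40) / 100 = 12$: row total 30, column total 40, table total 100.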

8.5 Setting Up a Chi-Square Test for Homogeneity or Independence

There are two types of Chi-Square tests that use two-way tables. The math is identical, but the setup, data collection, and hypotheses are different.

Test for Homogeneity

Data: Multiple independent samples, looking at ONE categorical variable.

H₀: There is NO difference in the distribution of the categorical variable across the different populations.

Hₐ: There IS a difference in the distribution.

Test for Independence

Data: ONE single sample, looking at TWO categorical variables.

H₀: There is NO association between Variable A and Variable B (they are independent).

Hₐ: There IS an association between Variable A and Variable B (they are not independent).

Exam Tip: If the problem says "A researcher took a random sample of 200 adults and asked their gender and voting preference," that is ONE sample, TWO variables $\rightarrow$ Independence. If it says "A researcher took a random sample of 100 men and a separate random sample of 100 women and asked their voting preference," that is TWO samples, ONE variable $\rightarrow$ Homogeneity.

8.6 Carrying Out a Chi-Square Test for Homogeneity or Independence

Whether it's Homogeneity or Independence, the execution phase is exactly the same.

Conditions to Check:

  • Random: Data comes from random sample(s) or randomized experiment.
  • 10%: When sampling without replacement, each sample is $\le 10\%$ of its population.
  • Large Counts: ALL expected counts are $\ge 5$.

Test Statistic & Degrees of Freedom

The formula for $\chi^2$ is exactly the same as GOF: $\sum \frac{(O-E)^2}{E}$.

However, the degrees of freedom change for a two-way table:

$df$ = (Rows − 1) × (Columns − 1)
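
Putting the pieces together, here is a minimal sketch of the two-way calculation: expected counts, the Large Counts check, the $\chi^2$ statistic, and the degrees of freedom. The observed table is hypothetical (two groups of 100 and three response categories), and the function name is an assumption for illustration.

```python
def chi_square_two_way(observed):
    """Return (statistic, df, expected counts, large-counts check) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)

    # Expected count for each cell: (row total * column total) / table total
    expected = [[r * c / table_total for c in col_totals] for r in row_totals]

    # Sum of (O - E)^2 / E over every cell
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(observed, expected)
                    for o, e in zip(obs_row, exp_row))

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    large_counts = all(e >= 5 for row in expected for e in row)
    return statistic, df, expected, large_counts

# HYPOTHETICAL counts: rows are two groups of 100, columns are three categories
observed = [[20, 30, 50],
            [30, 20, 50]]

statistic, df, expected, large_counts = chi_square_two_way(observed)
print(statistic, df, large_counts)  # 4.0 2 True
print(expected)                     # [[25.0, 25.0, 50.0], [25.0, 25.0, 50.0]]
```

The third column contributes nothing to the statistic because its observed counts equal the expected counts exactly; all of the 4.0 comes from the first two columns.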

Calculator Commands (TI-83/84)

MATRIX ➔ EDIT ➔ [A] Enter your Observed counts matrix.

STAT ➔ TESTS ➔ C: $\chi^2$-Test

The calculator will automatically calculate the expected counts and store them in Matrix [B] for you! You can go back to MATRIX [B] and copy them onto your paper to verify the Large Counts condition.

8.7 Skills Focus: Selecting an Appropriate Inference Procedure

The AP Exam will often ask you to simply identify WHICH test is appropriate based on the design of the study. Let's summarize the differences.

Test Name | Number of Samples/Groups | Number of Variables | Key Question Being Asked
Goodness of Fit (GOF) | 1 Sample | 1 Categorical Variable | "Does the sample distribution match the claimed population distribution?"
Homogeneity | 2 or more Samples | 1 Categorical Variable | "Are the distributions the same across the different populations?"
Independence | 1 Sample | 2 Categorical Variables | "Is there an association between Variable A and Variable B?"
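
The summary above amounts to a simple decision rule, sketched below. The function name and its two inputs are assumptions made for illustration; on the exam you would read the number of samples and variables from the study description.

```python
def choose_chi_square_test(num_samples, num_variables):
    """Pick the chi-square test from the study design (categorical variables only)."""
    if num_samples == 1 and num_variables == 1:
        return "Goodness of Fit"
    if num_samples >= 2 and num_variables == 1:
        return "Homogeneity"
    if num_samples == 1 and num_variables == 2:
        return "Independence"
    raise ValueError("Design does not match one of the three chi-square tests in this unit")

# One random sample of 200 adults, two variables recorded
print(choose_chi_square_test(1, 2))  # Independence

# Separate random samples of 100 men and 100 women, one variable recorded
print(choose_chi_square_test(2, 1))  # Homogeneity
```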

Unit 8 Key Takeaways

$\chi^2 = \sum \frac{(Obs - Exp)^2}{Exp}$

Large Counts Condition: All EXPECTED counts $\ge 5$.

GOF $df = k - 1$ | Two-Way $df = (r-1)(c-1)$.

Homogeneity = Multiple samples. Independence = 1 sample, 2 variables.

Let the calculator build your Expected Matrix [B] for you!

End of Unit 8 Study Guide. Ready for the slopes in Unit 9!
