Two-proportion Z-test

The two-proportion Z-test (or two-sample proportion Z-test) is a statistical method used to determine whether the difference between the proportions of two groups, each drawn from a binomial distribution, is statistically significant.[https://stattrek.com/hypothesis-test/difference-in-proportions Hypothesis Test: Difference Between Proportions] The approach relies on the sample proportions being approximately normally distributed for large samples, as implied by the central limit theorem, which allows the construction of a z-test for hypothesis testing and of confidence intervals. It is used in various fields to compare success rates, response rates, or other proportions across different groups.

Hypothesis test

The z-test for comparing two proportions is a statistical hypothesis test for evaluating whether the proportion of a certain characteristic differs significantly between two independent samples. The test relies on the fact that each sample proportion (the average of observations from a Bernoulli distribution) is asymptotically normal by the central limit theorem, which enables the construction of a z-test.

The test involves two competing hypotheses:

  • Null hypothesis (H0): The proportions in the two populations are equal, i.e., p_1 = p_2.
  • Alternative hypothesis (H1): The proportions in the two populations are not equal, i.e., p_1 \neq p_2 (two-tailed), or one is larger than the other, i.e., p_1 > p_2 or p_1 < p_2 (one-tailed).

The z-statistic for comparing two proportions is computed using:[https://www.itl.nist.gov/div898/handbook/prc/section3/prc33.htm How can we determine whether two processes produce the same proportion of defectives?]

z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}

Where:

  • \hat{p}_1 = sample proportion in the first sample
  • \hat{p}_2 = sample proportion in the second sample
  • n_1 = size of the first sample
  • n_2 = size of the second sample
  • \hat{p} = pooled proportion, calculated as \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, where x_1 and x_2 are the counts of successes in the two samples.

The pooled proportion is used to estimate the shared probability of success under the null hypothesis, and the standard error accounts for variability across the two samples.

The z-test determines statistical significance by comparing the calculated z-statistic to a critical value. For example, at a significance level of \alpha = 0.05, the null hypothesis is rejected if |z| > 1.96 (for a two-tailed test). Alternatively, one can compute the p-value and reject the null hypothesis if p < \alpha.
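The calculation above can be sketched in a few lines of Python using only the standard library. The function name two_proportion_ztest and its argument names are illustrative assumptions, not taken from any cited source; the counts are the same ones used in the software examples in this article.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled proportion: estimate of the shared success probability under H0
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_ztest(120, 1000, 150, 1000)
print(f"z = {z:.4f}, z^2 = {z * z:.4f}, p-value = {p_value:.5f}")
```

For a two-sided test, the square of this z-statistic should match the chi-squared statistic that R's prop.test reports when the continuity correction is disabled.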

Confidence interval

The confidence interval for the difference between two proportions, based on the definitions above, is:

(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}

Where:

  • z_{\alpha/2} is the critical value of the standard normal distribution (e.g., 1.96 for a 95% confidence level).

This interval provides a range of plausible values for the true difference between population proportions.
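As a hedged illustration of the interval formula above, the sketch below computes the interval with the unpooled standard error using only Python's standard library. The function name diff_proportions_ci and its parameters are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_ci(x1, n1, x2, n2, confidence=0.95):
    """Normal-approximation interval for p_1 - p_2 (unpooled standard error)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    alpha = 1 - confidence
    # Critical value z_{alpha/2}, about 1.96 for a 95% confidence level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Each sample contributes its own variance estimate (no pooling)
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_crit * se, diff + z_crit * se


lower, upper = diff_proportions_ci(120, 1000, 150, 1000)
print(f"95% CI for p_1 - p_2: ({lower:.6f}, {upper:.6f})")
```

With 120 and 150 successes out of 1000 each, the interval lies just below zero, which agrees with the interval shown in the R output in the software section.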

Using the z-test confidence intervals for hypothesis testing would give the same results as the chi-squared test for a two-by-two contingency table.[https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/PASS/Confidence_Intervals_for_the_Difference_Between_Two_Proportions.pdf Confidence Intervals for the Difference Between Two Proportions]{{rp|216-7}} Newcombe, R. G. 1998. 'Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods.' Statistics in Medicine, 17, pp. 873-890.{{rp|875}} Fisher’s exact test is more suitable when the sample sizes are small.

Notice how the variance estimation is different between the hypothesis testing and the confidence intervals. The first uses a pooled variance (based on the null hypothesis), while the second has to estimate the variance using each sample separately (so as to allow for the confidence interval to accommodate a range of differences in proportions). This difference may lead to slightly different results if using the confidence interval as an alternative to the hypothesis testing method.

Minimum detectable effect (MDE)

The minimum detectable effect (MDE) is the smallest difference between two proportions (p_1 and p_2) that a statistical test can detect for a chosen Type I error level (\alpha), statistical power (1-\beta), and sample sizes (n_1 and n_2). It is commonly used in study design to determine whether the sample sizes allow for a test with sufficient sensitivity to detect meaningful differences.

For the two-sided z-test comparing two proportions, the MDE is given by the following formula, which combines the critical values for \alpha and 1-\beta with the standard errors of the proportions. COOLSerdash (https://stats.stackexchange.com/users/21054/coolserdash), Two proportion sample size calculation, URL (version: 2023-04-14): https://stats.stackexchange.com/q/612894. Chow S-C, Shao J, Wang H, Lokhnygina Y (2018): Sample size calculations in clinical research. 3rd ed. CRC Press.

\text{MDE} = |p_1 - p_2| = z_{1-\alpha/2} \sqrt{p_0(1-p_0)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} + z_{1-\beta} \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}

Where:

  • z_{1-\alpha/2}: Critical value for the significance level.
  • z_{1-\beta}: Quantile for the desired power.
  • p_0: The common proportion assumed under the null hypothesis, where p_0 = p_1 = p_2.

The MDE depends on the sample sizes, baseline proportions (p_1, p_2), and test parameters. When the baseline proportions are not known, they need to be assumed or roughly estimated from a small study. Larger samples or smaller power requirements lead to a smaller MDE, making the test more sensitive to smaller differences. Researchers may use the MDE to assess the feasibility of detecting meaningful differences before conducting a study.
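Because p_2 appears on both sides of the MDE formula, one way to evaluate it for a known baseline p_1 is a simple fixed-point iteration. The sketch below is illustrative only. It assumes an upward effect (p_2 = p_1 + MDE, with the result staying below 1) and takes p_0 as the sample-size-weighted average of p_1 and p_2. The function names mde_formula and solve_mde are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde_formula(p1, p2, n1, n2, alpha=0.05, power=0.8):
    """Right-hand side of the MDE formula for given p_1 and p_2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{1-beta}
    # Assumption: p_0 is the sample-size-weighted average of p_1 and p_2
    p0 = (n1 * p1 + n2 * p2) / (n1 + n2)
    se_null = sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    se_alt = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return z_alpha * se_null + z_beta * se_alt


def solve_mde(p1, n1, n2, alpha=0.05, power=0.8, tol=1e-12, max_iter=200):
    """Find delta with delta = mde_formula(p1, p1 + delta, ...) by iteration."""
    delta = 0.0
    for _ in range(max_iter):
        new_delta = mde_formula(p1, p1 + delta, n1, n2, alpha, power)
        if abs(new_delta - delta) < tol:
            return new_delta
        delta = new_delta
    return delta  # last iterate if the tolerance was not reached


mde = solve_mde(p1=0.12, n1=1000, n2=1000)
print(f"MDE for a baseline of 0.12 with 1000 per group: {mde:.4f}")
```

Calling solve_mde with larger sample sizes, or with a lower power, should return a smaller value, in line with the behaviour described above.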

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}}

The Minimal Detectable Effect (MDE) is the smallest difference, denoted as \Delta = |p_1 - p_2|, that satisfies two essential criteria in hypothesis testing:

  1. The null hypothesis (H_0: p_1 = p_2) is rejected at the specified significance level (\alpha).
  2. Statistical power (1 - \beta) is achieved under the alternative hypothesis (H_a: p_1 \neq p_2).

Assuming that the difference in sample proportions is normally distributed under both the null and the alternative hypothesis, the two criteria hold together when |p_1 - p_2| is large enough that the critical value for rejecting the null (X_\text{critical}) is exceeded with probability \alpha under the null and with probability 1 - \beta under the alternative.

The first criterion establishes the critical value required to reject the null hypothesis. The second criterion specifies how far the alternative distribution must be from X_\text{critical} to ensure that the probability of exceeding it under the alternative hypothesis is at least 1 - \beta.[https://blog.statsig.com/calculating-sample-sizes-for-a-b-tests-7854d56c2646 Calculating Sample Sizes for A/B Tests][https://blog.x.com/engineering/en_us/a/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests Power, minimal detectable effect, and bucket size estimation in A/B tests] (has some nice figures to illustrate the tradeoffs)

Condition 1: Rejecting H_0

Under the null hypothesis, the test statistic is based on the pooled standard error (\text{SE}_\text{null}):

Z_\text{test} = \frac{p_1 - p_2}{\text{SE}_\text{null}}, \quad \text{where } \text{SE}_\text{null} = \sqrt{p_0(1-p_0)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.

p_0 might be estimated (as described above).

To reject H_0, the observed difference must exceed the critical threshold (Z_\text{critical} = z_{1-\alpha/2}) after scaling it by the standard error:

|p_1 - p_2| \geq X_\text{critical} = z_{1-\alpha/2} \cdot \text{SE}_\text{null}

If the MDE were defined solely as \text{MDE} = z_{1-\alpha/2} \cdot \text{SE}_\text{null}, the statistical power would be only 50%, because the alternative distribution would be centered on the threshold and is symmetric about it. To achieve a higher power level, an additional component is required in the MDE calculation.

Condition 2: Achieving power 1 - \beta

Under the alternative hypothesis, the standard error is (\text{SE}_\text{alt} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}).

If the alternative distribution were centered exactly on X_\text{critical}, only half of its mass would lie beyond the threshold. To ensure that the probability of detecting the difference under the alternative hypothesis is at least 1 - \beta, the center of the alternative distribution, |p_1 - p_2|, must exceed X_\text{critical} by at least z_{1-\beta} \cdot \text{SE}_\text{alt}.

Combining conditions

To meet both conditions, the total detectable difference incorporates components from both the null and alternative distributions. The MDE is defined as:

\text{MDE} = z_{1- \alpha/2} \cdot \text{SE}_\text{null} + z_{1- \beta} \cdot \text{SE}_\text{alt}.

By adding the relevant quantile of the alternative distribution to the critical threshold from the null distribution, the MDE ensures that the test satisfies the dual requirements of rejecting H_0 at significance level \alpha and achieving statistical power of at least 1 - \beta.
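The argument can be checked numerically with a short sketch. The function below, whose name approx_power is hypothetical, evaluates the normal-approximation power for a given true difference. It considers only the tail in the direction of the effect (the opposite tail is assumed to be negligible) and takes p_0 as the sample-size-weighted average of p_1 and p_2.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of the two-sided test (tail toward the effect only)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Assumption: p_0 is the sample-size-weighted average of p_1 and p_2
    p0 = (n1 * p1 + n2 * p2) / (n1 + n2)
    se_null = sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    se_alt = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    x_critical = z_alpha * se_null
    # Probability that a normal variable centered at |p1 - p2| with
    # standard deviation se_alt exceeds x_critical
    return NormalDist().cdf((abs(p1 - p2) - x_critical) / se_alt)


power_at_threshold = approx_power(0.12, 0.15, 1000, 1000)
power_larger_n = approx_power(0.12, 0.15, 2000, 2000)
print(f"n = 1000 per group: power = {power_at_threshold:.3f}")
print(f"n = 2000 per group: power = {power_larger_n:.3f}")
```

With 1000 observations per group, a true difference of 0.03 sits almost exactly on X_\text{critical}, so the approximate power is close to one half. Doubling the sample sizes shrinks both standard errors and moves the power toward the commonly used 80% level.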

{{hidden end}}

Assumptions and conditions

To ensure valid results, the following assumptions must be met:

  1. Independent random samples: The samples must be drawn independently from the populations of interest.
  2. Large sample sizes: Typically, n_1 and n_2 should exceed 30. {{citation needed|date=November 2024}}
  3. Success/failure condition: each sample should contain more than 10 successes and more than 10 failures: {{citation needed|date=November 2024}}
     • n_1 \hat{p}_1 > 10 and n_1(1-\hat{p}_1) > 10
     • n_2 \hat{p}_2 > 10 and n_2(1-\hat{p}_2) > 10

The z-test is most reliable when sample sizes are large, and all assumptions are satisfied.
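The numeric rules of thumb above can be checked mechanically before running the test. The helper below is a minimal sketch with a hypothetical name and default thresholds taken from the list above. Independence of the samples cannot be verified from counts alone and has to be justified by the study design.

```python
def check_ztest_conditions(x1, n1, x2, n2, min_n=30, min_count=10):
    """Return rule-of-thumb checks for the two-proportion z-test as a dict."""
    return {
        "large_samples": n1 > min_n and n2 > min_n,
        # n * p_hat is the success count; n * (1 - p_hat) is the failure count
        "success_failure_sample_1": x1 > min_count and (n1 - x1) > min_count,
        "success_failure_sample_2": x2 > min_count and (n2 - x2) > min_count,
    }


checks = check_ztest_conditions(120, 1000, 150, 1000)
for name, passed in checks.items():
    print(f"{name}: {'ok' if passed else 'not met'}")
```

If any check fails, an exact method such as Fisher's exact test may be the safer choice.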

Software implementation

= R =

Use prop.test() with continuity correction disabled:

prop.test(x = c(120, 150), n = c(1000, 1000), correct = FALSE)

Output includes z-test equivalent results: chi-squared statistic, p-value, and confidence interval:

2-sample test for equality of proportions without continuity correction

data: c(120, 150) out of c(1000, 1000)
X-squared = 3.8536, df = 1, p-value = 0.04964
alternative hypothesis: two.sided
95 percent confidence interval:
-5.992397e-02 -7.602882e-05
sample estimates:
prop 1 prop 2
0.12 0.15

= Python =

Use proportions_ztest from statsmodels:

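from statsmodels.stats.proportion import proportions_ztest

# Two-sided test of H0: p_1 - p_2 = 0; returns the z-statistic and the p-value
z, p = proportions_ztest([120, 150], [1000, 1000], value=0)

# For the confidence interval of the difference:
from statsmodels.stats.proportion import confint_proportions_2indep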

References

{{reflist}}