Two-proportion Z-test

The two-proportion Z-test (also called the two-sample proportion Z-test) is a statistical hypothesis test for assessing whether two groups differ in the proportion of a binary outcome by more than chance alone would explain. Examples include the proportion of patients responding positively to a treatment versus a control in a clinical trial, the defect rates of two production lines in quality control, or the click-through rates of two alternative webpage designs in an A/B test.

The test is appropriate when each observation is independent of the others, each observation can be classified as a success or a failure (i.e., a Bernoulli trial), and the sample sizes are large enough that, by the central limit theorem, the sampling distribution of each sample proportion is well approximated by a normal distribution. Under those conditions the observed difference of sample proportions can be converted to a standardized z-statistic (using a pooled standard error) and compared to the standard normal distribution to obtain p-values or to form confidence intervals for the difference in proportions. This article explains the z-statistic and the choice between pooled and unpooled variance, describes confidence-interval and sample-size / minimum-detectable-effect calculations, and notes common alternatives and caveats (for example, Pearson's chi-squared test or Fisher's exact test for small samples, and McNemar's test for paired binary data).

Definition

The two-proportion Z-test (or two-sample proportion Z-test) is a statistical method used to determine whether the difference between the proportions of two groups, each coming from a binomial distribution, is statistically significant.[1] This approach relies on the observation that, for sufficiently large samples, the sample proportions follow an approximately normal distribution by the central limit theorem, allowing the construction of a z-test for hypothesis testing and confidence interval estimation.[2] It is used in various fields to compare success rates, response rates, or other proportions across different groups.

Hypothesis test

The z-test for comparing two proportions is a frequentist statistical hypothesis test used to evaluate whether two independent samples have different population proportions for a binary outcome. Under mild regularity conditions (sufficiently large sample sizes and independent sampling), the sample proportions (each the average of observations from a Bernoulli distribution) are approximately normally distributed by the central limit theorem, which permits using a z-statistic constructed from the difference of sample proportions and an estimated standard error.[2]

The test involves two competing hypotheses:

  • null hypothesis ($H_0$): The proportions in the two populations are equal, i.e., $p_1 = p_2$.
  • alternative hypothesis ($H_1$): The proportions in the two populations are not equal, i.e., $p_1 \neq p_2$ (two-tailed), or $p_1 > p_2$ / $p_1 < p_2$ (one-tailed).

The z-statistic for comparing two proportions is computed using:[3][4][5][2]: 10.6.2 

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},$$

where $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions in the first and second sample, $n_1$ and $n_2$ are the sizes of the first and second sample, respectively, and $\hat{p}$ is the pooled proportion, calculated as $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$, where $x_1$ and $x_2$ are the counts of successes in the two samples. The pooled proportion is used to estimate the shared probability of success under the null hypothesis, and the standard error accounts for variability across the two samples.

The z-test determines statistical significance by comparing the calculated z-statistic to a critical value: e.g., for a significance level of $\alpha = 0.05$, the null hypothesis is rejected if $|z| > 1.96$ (for a two-tailed test). Alternatively, one computes the p-value and rejects the null hypothesis if $p < \alpha$.
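
For illustration, the following is a minimal Python sketch of this calculation (standard library only; the function name two_proportion_ztest is ours, and the counts are taken from the worked example below):

from math import sqrt, erf

def two_proportion_ztest(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 = p2, using the pooled SE."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)                 # pooled proportion
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))     # standard normal CDF
    return z, 2 * (1 - phi(abs(z)))

z, p = two_proportion_ztest(120, 1000, 150, 1000)    # z ≈ -1.96, p ≈ 0.0497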

Confidence interval

The confidence interval for the difference between two proportions, based on the definitions above, is:[5][2]: 10.6.3 

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}},$$

where $z_{\alpha/2}$ is the critical value of the standard normal distribution (e.g., 1.96 for a 95% confidence level).

This interval provides a range of plausible values for the true difference between population proportions.

Notice that the variance estimate differs between the hypothesis test and the confidence interval. The hypothesis test uses a pooled variance (in accordance with the null hypothesis), while the confidence interval estimates the variance from each sample separately (so that the interval can accommodate a range of differences in proportions). This discrepancy may lead to slightly different results if the confidence interval is used as an alternative to the hypothesis testing method.
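
A minimal Python sketch of the interval computation, continuing the same counts as above (note the unpooled standard error, in contrast to the test):

from math import sqrt

def two_proportion_confint(x1, n1, x2, n2, z_crit=1.96):
    """Wald CI for p1 - p2; z_crit = 1.96 gives a 95% interval."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)  # unpooled SE
    diff = p1_hat - p2_hat
    return diff - z_crit * se, diff + z_crit * se

lo, hi = two_proportion_confint(120, 1000, 150, 1000)   # ≈ (-0.0599, -0.0001)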

Sample size determination and minimum detectable effect

Sample size determination is the act of choosing the number of observations to include in each group when running a statistical test. For the two-proportion Z-test, this is closely related to choosing the minimum detectable effect.

For finding the required sample size (given an effect size $|p_1 - p_2|$, power $\pi = 1-\beta$, and type I error $\alpha$), define $n_1 = \kappa n_2$ (when $\kappa = 1$, equal sample sizes are assumed for the two groups); then:[6][7]

$$n_2 = \frac{(z_{\alpha/2} + z_{\beta})^2}{(p_1 - p_2)^2}\left(\frac{p_1(1-p_1)}{\kappa} + p_2(1-p_2)\right)$$
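
A minimal Python sketch of this formula (the function name sample_size and the default values are our assumptions; the z-quantiles come from statistics.NormalDist, available in Python 3.8+):

from math import ceil
from statistics import NormalDist

def sample_size(p1, p2, alpha=0.05, power=0.80, kappa=1.0):
    """Return (n1, n2) with n1 = kappa * n2, for a two-sided test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    z_b = NormalDist().inv_cdf(power)           # z_{beta}
    n2 = ((z_a + z_b) ** 2 / (p1 - p2) ** 2) * (p1 * (1 - p1) / kappa + p2 * (1 - p2))
    return ceil(kappa * n2), ceil(n2)

print(sample_size(0.12, 0.15))   # ≈ (2033, 2033) for these assumed proportions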


The minimum detectable effect (MDE) is the smallest difference between two proportions ($p_1$ and $p_2$) that a statistical test can detect for a chosen type I error level ($\alpha$), statistical power ($1-\beta$), and sample sizes ($n_1$ and $n_2$). It is commonly used in study design to determine whether the sample sizes allow for a test with sufficient sensitivity to detect meaningful differences.

The MDE when using the (two-sided) z-test formula for comparing two proportions, incorporating the critical values for $\alpha$ and $1-\beta$ and the standard errors of the proportions, is:[8][9]

$$\mathrm{MDE} = |p_1 - p_2| = z_{1-\alpha/2}\sqrt{p_0(1-p_0)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} + z_{1-\beta}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}},$$

where $z_{1-\alpha/2}$ is the critical value for the significance level, $z_{1-\beta}$ is the quantile for the desired power, and $p_0$ is the common proportion under the null hypothesis ($p_0 = p_1 = p_2$).

The MDE depends on the sample sizes, the baseline proportions ($p_1$, $p_2$), and the test parameters. When the baseline proportions are not known, they need to be assumed or roughly estimated from a small study. Larger samples or lower power requirements lead to a smaller MDE, making the test more sensitive to smaller differences. Researchers may use the MDE to assess the feasibility of detecting meaningful differences before conducting a study.
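
A minimal Python sketch of the MDE computation (evaluating both standard errors at the baseline proportion is our simplifying assumption, appropriate when $p_2$ is unknown at design time):

from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(p_base, n1, n2, alpha=0.05, power=0.80):
    """MDE for a two-sided test, with both SEs evaluated at p_base."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    se_null = sqrt(p_base * (1 - p_base) * (1 / n1 + 1 / n2))
    se_alt = sqrt(p_base * (1 - p_base) / n1 + p_base * (1 - p_base) / n2)
    return z_a * se_null + z_b * se_alt

print(minimum_detectable_effect(0.12, 1000, 1000))   # ≈ 0.041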

Proof

The minimal detectable effect (MDE) is the smallest difference, denoted as $\Delta = |p_1 - p_2|$, that satisfies two essential criteria in hypothesis testing:

  1. The null hypothesis ($H_0: p_1 = p_2$) is rejected at the specified significance level ($\alpha$).
  2. Statistical power ($1-\beta$) is achieved under the alternative hypothesis ($H_a: p_1 \neq p_2$).

Given that the test statistic is normally distributed under both the null and the alternative hypothesis, both criteria hold when the distance $|p_1 - p_2|$ places the critical value for rejecting the null ($X_{\mathrm{critical}}$) exactly where the probability of exceeding it is $\alpha$ under the null and $1-\beta$ under the alternative.

The first criterion establishes the critical value required to reject the null hypothesis. The second criterion specifies how far the alternative distribution must be from $X_{\mathrm{critical}}$ to ensure that the probability of exceeding it under the alternative hypothesis is at least $1-\beta$.[10][11]

Condition 1: Rejecting H0

Under the null hypothesis, the test statistic is based on the pooled standard error ($SE_{\mathrm{null}}$):

$$Z_{\mathrm{test}} = \frac{|p_1 - p_2|}{SE_{\mathrm{null}}}, \qquad SE_{\mathrm{null}} = \sqrt{p_0(1-p_0)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.$$

$p_0$ might be estimated (e.g., by the pooled proportion described above).

To reject $H_0$, the observed difference must exceed the critical threshold ($Z_{\mathrm{critical}} = z_{\alpha/2}$) after scaling by the standard error:

$$|p_1 - p_2| \ge X_{\mathrm{critical}} = z_{\alpha/2} \cdot SE_{\mathrm{null}}$$

If the MDE is defined solely as $\mathrm{MDE} = z_{\alpha/2} \cdot SE_{\mathrm{null}}$, the statistical power would be only 50%, because the alternative distribution is symmetric about the threshold. To achieve a higher power level, an additional component is required in the MDE calculation.

Condition 2: Achieving power $1-\beta$

Under the alternative hypothesis, the standard error is

$$SE_{\mathrm{alt}} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}.$$

Since an alternative distribution centered exactly at $X_{\mathrm{critical}}$ would yield only 50% power, the true difference $|p_1 - p_2|$ must exceed $X_{\mathrm{critical}}$ by at least $z_{1-\beta} \cdot SE_{\mathrm{alt}}$ to ensure that the probability of detecting the difference under the alternative hypothesis is at least $1-\beta$.

Combining conditions

To meet both conditions, the total detectable difference incorporates components from both the null and the alternative distributions. The MDE is defined as:

$$\mathrm{MDE} = z_{1-\alpha/2} \cdot SE_{\mathrm{null}} + z_{1-\beta} \cdot SE_{\mathrm{alt}}.$$

By adding the relevant quantile of the alternative distribution to the critical threshold from the null distribution, the MDE ensures that the test satisfies the dual requirements of rejecting $H_0$ at significance level $\alpha$ and achieving statistical power of at least $1-\beta$.
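
As a numeric sanity check (our construction, reusing the simplified MDE sketch above): if the true difference equals the MDE, the probability that the observed difference exceeds the rejection threshold should come out close to $1-\beta$:

from math import sqrt
from statistics import NormalDist

p_base, n1, n2, alpha, power = 0.12, 1000, 1000, 0.05, 0.80
z_a = NormalDist().inv_cdf(1 - alpha / 2)
se_null = sqrt(p_base * (1 - p_base) * (1 / n1 + 1 / n2))
se_alt = se_null                       # both SEs evaluated at p_base, as above
mde = z_a * se_null + NormalDist().inv_cdf(power) * se_alt

# P(observed difference > z_a * SE_null) when the true difference is mde:
achieved_power = 1 - NormalDist().cdf((z_a * se_null - mde) / se_alt)
print(round(achieved_power, 3))        # 0.8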

Assumptions and conditions

To ensure valid results, the following assumptions must be met:

  1. Independent random samples: The samples must be drawn independently from the populations of interest.
  2. Large sample sizes: Typically, $n_1 + n_2$ should exceed 20.[12]: 48 
  3. Success or failure condition:[13][2]: 10.6.1 
    1. $n_1\hat{p}_1 > 10$ and $n_1(1-\hat{p}_1) > 10$
    2. $n_2\hat{p}_2 > 10$ and $n_2(1-\hat{p}_2) > 10$

The z-test is most reliable when sample sizes are large and all assumptions are satisfied.
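
A minimal helper for checking the success/failure condition (the function name and the threshold default are ours, following the rules of thumb above; note that $n_1\hat{p}_1$ is simply the success count $x_1$):

def conditions_met(x1, n1, x2, n2, threshold=10):
    """True if each sample has more than `threshold` successes and failures."""
    return (min(x1, n1 - x1) > threshold
            and min(x2, n2 - x2) > threshold
            and n1 + n2 > 20)

print(conditions_met(120, 1000, 150, 1000))   # True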

Relation to other statistical methods

Using the z-test for hypothesis testing gives results equivalent to those of the chi-squared test for a two-by-two contingency table (the squared z-statistic equals the chi-squared statistic).[14]: 216–7 [15]: 875  Fisher's exact test is more suitable when the sample sizes are small.

The treatment of the 2-by-2 contingency table was investigated as early as the 19th century,[16] with further work during the 20th century.[17]

Alternatives to the asymptotic method described here include a continuity correction, as well as a modification similar to the Wilson score interval.[18]

Notice that:

  • When one or more cell counts are small (e.g. below 5[12]: 48 ), prefer exact tests (e.g., Fisher's exact test) or exact confidence intervals.
  • For paired or matched binary data use McNemar's test rather than the two-sample z-test.
  • The choice between pooled and unpooled variance matters: pooled variance is appropriate for hypothesis testing of equality ($H_0: p_1 = p_2$), whereas the unpooled variance is used for confidence intervals.
  • Multiple testing, selection effects, and nonrandom sampling can invalidate p-values and CIs; these design issues should be addressed in the study methods.

In a Bayesian inference context, proportions can be modeled using the Beta distribution. The parallel to the two-proportion z-test is to perform similar inference using the distribution of the difference of two Beta random variables.[19]
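
A minimal Monte Carlo sketch of this Bayesian parallel (our construction, assuming uniform Beta(1, 1) priors and the counts from the example below):

import random

def posterior_diff_samples(x1, n1, x2, n2, draws=100_000, seed=42):
    """Draws from the posterior of p1 - p2 under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    return [rng.betavariate(1 + x1, 1 + n1 - x1)
            - rng.betavariate(1 + x2, 1 + n2 - x2)
            for _ in range(draws)]

samples = posterior_diff_samples(120, 1000, 150, 1000)
print(sum(s < 0 for s in samples) / len(samples))   # P(p1 < p2 | data) ≈ 0.97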

Example

Suppose group 1 has 120 successes out of 1000 trials ($\hat{p}_1 = 0.12$) and group 2 has 150 successes out of 1000 trials ($\hat{p}_2 = 0.15$). The pooled proportion is $\hat{p} = (120 + 150)/(1000 + 1000) = 0.135$. The pooled standard error is

$$SE_{\mathrm{pooled}} = \sqrt{0.135 \times 0.865 \times \left(\frac{1}{1000} + \frac{1}{1000}\right)} \approx 0.01528.$$

The z-statistic is

$$z = \frac{0.12 - 0.15}{0.01528} \approx -1.96,$$

giving a two-sided p-value of about 0.0497 (just under 0.05). An approximate 95% confidence interval for the difference, using the unpooled standard error, is

$$(0.12 - 0.15) \pm 1.96\sqrt{\frac{0.12 \times 0.88}{1000} + \frac{0.15 \times 0.85}{1000}} \approx (-0.0599,\ -0.0001).$$

Because the 95% CI (just barely) excludes 0 and the p-value is ≈ 0.0497, the difference is statistically significant at the 5% level by the usual large-sample criteria (but it is borderline; conclusions should account for study context and multiple testing if applicable).

Software implementation

Implementations are available in many statistical environments; see below for implementation details in some popular languages. Implementations also exist for SPSS,[20] SAS,[21] and Minitab.[5]

R

Use prop.test() with continuity correction disabled:

prop.test(x = c(120, 150), n = c(1000, 1000), correct = FALSE)

Output includes results equivalent to the z-test: the chi-squared statistic (the square of the z-statistic), the p-value, and the confidence interval:

	2-sample test for equality of proportions without continuity correction

data:  c(120, 150) out of c(1000, 1000)
X-squared = 3.8536, df = 1, p-value = 0.04964
alternative hypothesis: two.sided
95 percent confidence interval:
 -5.992397e-02 -7.602882e-05
sample estimates:
prop 1 prop 2 
  0.12   0.15

Python

Use proportions_ztest from statsmodels:[22]

from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

z, p = proportions_ztest([120, 150], [1000, 1000], 0)  # pooled z-test of H0: p1 = p2
# Wald confidence interval for p1 - p2 (unpooled variance):
ci_low, ci_upp = confint_proportions_2indep(120, 1000, 150, 1000, compare="diff", method="wald")

SQL

A direct implementation of the formulas above, using the Presto flavour of SQL (relying on VALUES,[23] inverse_normal_cdf,[24] and normal_cdf[25]):

WITH input_data AS (
  SELECT *, ((n_1 * p_1 + n_2 * p_2) / (n_1 + n_2)) AS p_pooled
  FROM (
    VALUES
      (1000, 1000, 0.12, 0.15)
  ) AS t (n_1, n_2, p_1, p_2)
),
stats_computed AS (
  SELECT
    n_1, n_2, p_1, p_2,
    p_2 - p_1 AS p2_minus_p1,
    SQRT(
      (p_1 * (1 - p_1) / n_1) +
      (p_2 * (1 - p_2) / n_2)
    ) AS se_p2_minus_p1,  -- unpooled SE, used for the confidence interval
    SQRT( p_pooled * (1 -  p_pooled) * (1.0 / n_1 + 1.0 / n_2) ) AS pooled_se,  -- pooled SE, used for the test
    inverse_normal_cdf(0, 1, 0.975) AS z_975  -- for 95% CI (1.96)
  FROM input_data
)
SELECT
  n_1,
  n_2,
  ROUND(p_1, 3) AS p_1,
  ROUND(p_2, 3) AS p_2,
  ROUND(p2_minus_p1, 3) AS p2_minus_p1,
  ROUND(se_p2_minus_p1, 3) AS se_p2_minus_p1,
  ROUND(p2_minus_p1 - z_975 * se_p2_minus_p1, 3) AS p2_minus_p1_ci_lower,
  ROUND(p2_minus_p1 + z_975 * se_p2_minus_p1, 3) AS p2_minus_p1_ci_upper,
  ROUND(2 * (1 - normal_cdf(0, 1, ABS(p2_minus_p1)/pooled_se )), 3) AS p_value
FROM stats_computed;

References

  1. Hypothesis Test: Difference Between Proportions
  2. Su, Wanhua (2024). "Introduction to Applied Statistics (10.6 Inferences for Two Population Proportions)". MacEwan University Open Textbooks. https://openbooks.macewan.ca/introstats/chapter/10-6-inferences-for-two-population-proportions/. 
  3. "§10.3 Comparing Two Independent Population Proportions". Introductory Statistics 2e. OpenStax. 2023. https://openstax.org/books/introductory-statistics-2e/pages/10-3-comparing-two-independent-population-proportions. 
  4. Guthrie, William F. (2012). "§7.3.3 How can we determine whether two processes produce the same proportion of defectives?". e-Handbook of Statistical Methods. NIST/SEMATECH. doi:10.18434/M32189. https://www.itl.nist.gov/div898/handbook/prc/section3/prc33.htm. 
  5. Kiernan, D. (2014). "4. Inferences about the Differences of Two Populations — Section 4". Natural Resources Biometrics. Milne Publishing (SUNY Geneseo). https://milnepublishing.geneseo.edu/natural-resources-biometrics/chapter/kiernan-chapter-4/. 
  6. "Statistical notes for clinical researchers: Sample size calculation 2. Comparison of two independent proportions". Restor Dent Endod 41 (2): 154–6. May 2016. doi:10.5395/rde.2016.41.2.154. PMID 27200285. 
  7. Wang, H.; Chow, S.-C. (2008). "Sample Size Calculation for Comparing Proportions". Wiley Encyclopedia of Clinical Trials. ISBN 978-1-78034-239-9. 
  8. COOLSerdash (https://stats.stackexchange.com/users/21054/coolserdash). "Two proportion sample size calculation". Cross Validated, version 2023-04-14. https://stats.stackexchange.com/q/612894
  9. Chow, S-C; Shao, J; Wang, H; Lokhnygina, Y (2018). Sample size calculations in clinical research. CRC Biostatistics (3rd ed.). CRC Press. ISBN 978-1-351-72712-9. https://books.google.com/books?id=BjkPEAAAQBAJ&pg=PR5. 
  10. Calculating Sample Sizes for A/B Tests
  11. Power, minimal detectable effect, and bucket size estimation in A/B tests (has some nice figures to illustrate the tradeoffs)
  12. VanVoorhis, C.R. Wilson; Morgan, Betsy L. (2007). "Understanding power and rules of thumb for determining sample sizes". Tutorials in Quantitative Methods for Psychology 3 (2): 43–50. doi:10.20982/tqmp.03.2.p043. https://www.tqmp.org/RegularArticles/vol03-2/p043/p043.pdf. 
  13. Course notes for STAT 200: Elementary Statistics. 9.1 - Two Independent Proportions, Penn State's Department of Statistics
  14. "Confidence Intervals for the Difference Between Two Proportions". PASS Sample Size Software. NCSS.com. https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/PASS/Confidence_Intervals_for_the_Difference_Between_Two_Proportions.pdf. 
  15. "Interval estimation for the difference between independent proportions: comparison of eleven methods". Stat Med 17 (8): 873–90. April 1998. doi:10.1002/(sici)1097-0258(19980430)17:8<873::aid-sim779>3.0.co;2-i. PMID 9595617. 
  16. Stigler, Stephen M. (2002). "The missing early history of contingency tables". Annales de la Faculté des sciences de Toulouse : Mathématiques 11 (4): 563–573. doi:10.5802/afst.1039. http://eudml.org/doc/73594. 
  17. Hitchcock, David B. (2009). "Yates and contingency tables: 75 years later". Electronic Journal for History of Probability & Statistics. https://people.stat.sc.edu/hitchcock/yates75tech.pdf. 
  18. Newcombe, R.G. (1998). "Interval estimation for the difference between independent proportions: comparison of eleven methods". Statist. Med. 17: 873–890. doi:10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I. 
  19. Chen, Y., & Luo, S. (2011). A few remarks on 'Statistical distribution of the difference of two proportions' by Nadarajah and Kotz, Statistics in Medicine 2007; 26 (18): 3518-3523. Statistics in Medicine, 30(15), 1913-1915.
  20. Z-Test for 2 Independent Proportions – Quick Tutorial
  21. Usage Note 22561: Testing the equality of two or more proportions from independent samples
  22. statsmodels.stats.proportion.proportions_ztest
  23. Presto 0.295 Documentation - VALUES
  24. Mathematical Functions and Operators - Probability Functions: inverse_cdf
  25. Mathematical Functions and Operators - Probability Functions: cdf