Multilevel regression with poststratification

{{Short description|Statistical regression technique}}

{{Use dmy dates|date=May 2024}}

{{Regression bar}}

Multilevel regression with poststratification (MRP) is a statistical technique for correcting model estimates for known differences between a sample population (the population from which the available data were drawn) and a target population (the population for which estimates are wanted).

Poststratification is the adjustment step: the final estimate is essentially a weighted average of the estimates for all possible combinations of attributes (for example, age and sex), with each combination weighted by its size in the target population. Each combination is sometimes called a "cell". The multilevel regression uses a multilevel model to smooth noisy estimates in cells with too little data by drawing on overall or nearby averages.

One application is estimating preferences in sub-regions (e.g., states, individual constituencies) based on individual-level survey data gathered at other levels of aggregation (e.g., national surveys).{{cite journal |last1=Buttice |first1=Matthew K. |last2=Highton |first2=Benjamin |title=How Does Multilevel Regression and Poststratification Perform with Conventional National Surveys? |journal=Political Analysis |date=Autumn 2013 |volume=21 |issue=4 |pages=449–451 |jstor=24572674 |doi=10.1093/pan/mpt017 |url=https://escholarship.org/content/qt5wc2g12h/qt5wc2g12h.pdf?t=o18zd3 |url-access= |url-status=live |archive-url=https://web.archive.org/web/20250212074916/https://escholarship.org/content/qt5wc2g12h/qt5wc2g12h.pdf?t=o18zd3 |archive-date=12 February 2025 |access-date=16 February 2025 }}

Polls of individual seats often struggle to reach an adequate sample size. MRP surveys instead pool respondents across many seats, so the overall sample is typically large enough that even smaller demographic subgroups (e.g., groups defined by age or cultural background) are adequately represented; the subgroup estimates can then be used to adjust seat-level forecasts.

Mathematical formulation

Following the MRP model description of Downes et al.,{{cite journal |last1=Downes |first1=Marnie |display-authors=etal |title=Multilevel Regression and Poststratification: A Modeling Approach to Estimating Population Quantities From Highly Selected Survey Samples |journal=American Journal of Epidemiology |date=August 2018 |volume=187 |issue=8 |pages=1780–1790 |doi=10.1093/aje/kwy070 |pmid=29635276 |url=https://doi.org/10.1093/aje/kwy070}} assume Y represents a single outcome measurement and that the population mean of Y, \mu_Y, is the target parameter of interest. In the underlying population, each individual i belongs to one of J poststratification cells, indexed by j = 1, 2, \cdots, J, each characterized by a unique set of covariates. The multilevel regression with poststratification model involves the following two steps:

MRP step 1 (multilevel regression): The multilevel regression model specifies a linear predictor for the mean \mu_j of Y in poststratification cell j, or for the logit transform of that mean in the case of a binary outcome,

g\left(\mu_j\right) = g\left(E\left[Y_{j[i]}\right]\right) = \beta_0 + \boldsymbol{X}_j^T \boldsymbol{\beta} + \sum_{k=1}^K a_{l[j]}^k,

where g is the link function (the identity, or the logit for a binary outcome), Y_{j[i]} is the outcome measurement for respondent i in cell j, \beta_0 is the fixed intercept, \boldsymbol{X}_j is the unique covariate vector for cell j, \boldsymbol{\beta} is a vector of regression coefficients (fixed effects), a_{l[j]}^k is the varying coefficient (random effect) for variable k, and l[j] maps the cell index j to the corresponding category index l of variable k \in \{1, 2, \cdots, K\}. For each variable k, the varying coefficients form an exchangeable batch with independent normal prior distributions

a_l^k\sim \mathrm N\left(0,\mathrm\sigma_k^2\right),\ l\in\{1,\dots,L_k\}.
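The shrinkage implied by this normal prior can be illustrated with a deliberately simplified special case: a single grouping variable, an identity link, no covariates, and both variances treated as known. Under those assumptions the estimate for each category is a precision-weighted compromise between the raw category mean and the overall mean. The sketch below is not a full multilevel fit (in practice the coefficients and variances are estimated jointly, usually with dedicated software); the function name `partial_pool`, its arguments, and the example numbers are all hypothetical.

```python
def partial_pool(group_means, group_sizes, grand_mean, sigma_y, sigma_a):
    """Shrink raw group means toward the grand mean.

    Simplified random-intercept model (an assumption of this sketch):
        y_i ~ Normal(grand_mean + a_l, sigma_y^2),  a_l ~ Normal(0, sigma_a^2)
    with grand_mean, sigma_y and sigma_a treated as known.
    """
    prior_precision = 1.0 / sigma_a ** 2
    estimates = []
    for raw_mean, n in zip(group_means, group_sizes):
        if n == 0:
            # No data in this group: fall back entirely on the grand mean.
            estimates.append(grand_mean)
            continue
        data_precision = n / sigma_y ** 2
        # Weight on the raw mean grows with the group's sample size.
        weight = data_precision / (data_precision + prior_precision)
        estimates.append(grand_mean + weight * (raw_mean - grand_mean))
    return estimates


# Hypothetical groups: a tiny one, a large one, and an empty one.
estimates = partial_pool(
    group_means=[0.80, 0.55, None],
    group_sizes=[2, 200, 0],
    grand_mean=0.50,
    sigma_y=0.5,
    sigma_a=0.1,
)
print(estimates)
```

With these made-up inputs, the two-respondent group is pulled most of the way toward the grand mean, the 200-respondent group stays close to its raw mean, and the empty group receives the grand mean itself.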

MRP step 2 (poststratification): The poststratification (PS) estimate for the population parameter of interest is

\hat{\mu}^{PS} = \frac{\sum_{j=1}^{J} N_j \hat{\mu}_j}{\sum_{j=1}^{J} N_j},

where \hat{\mu}_j is the estimated outcome of interest for poststratification cell j and N_j is the size of the j-th poststratification cell in the population. Estimates at any subpopulation level s are similarly derived:

\hat{\mu}^{PS}_s = \frac{\sum_{j \in J_s} N_j \hat{\mu}_j}{\sum_{j \in J_s} N_j},

where J_s is the set of poststratification cells that make up s.
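The poststratification step is a population-weighted average, which can be sketched in a few lines. The function name `poststratify`, the cell labels, the cell estimates, and the population counts below are all hypothetical and serve only to show the arithmetic; in a real analysis the cell estimates would come from the fitted multilevel model and the counts from a census or similar source.

```python
def poststratify(cell_estimates, cell_counts, cells=None):
    """Population-weighted average of cell estimates.

    cell_estimates: dict mapping cell -> estimated outcome for that cell
    cell_counts:    dict mapping cell -> population size N_j of that cell
    cells:          optional iterable of cells defining a subpopulation;
                    defaults to every cell
    """
    selected = list(cell_estimates) if cells is None else list(cells)
    total = sum(cell_counts[c] for c in selected)
    if total == 0:
        raise ValueError("selected cells have zero total population")
    weighted = sum(cell_counts[c] * cell_estimates[c] for c in selected)
    return weighted / total


# Hypothetical cells defined by (age group, sex).
cell_estimates = {
    ("18-29", "F"): 0.60,
    ("18-29", "M"): 0.50,
    ("30+", "F"): 0.40,
    ("30+", "M"): 0.30,
}
cell_counts = {
    ("18-29", "F"): 100,
    ("18-29", "M"): 100,
    ("30+", "F"): 400,
    ("30+", "M"): 400,
}

overall = poststratify(cell_estimates, cell_counts)

# Subpopulation estimate: restrict the sums to the cells that make up s.
female_cells = [c for c in cell_estimates if c[1] == "F"]
women = poststratify(cell_estimates, cell_counts, cells=female_cells)

print(round(overall, 2), round(women, 2))
```

Because the older cells are four times as large as the younger ones in this made-up population, the overall estimate sits closer to the older cells' values than an unweighted average of the four cells would.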

The technique and its advantages

The technique has two steps. In the first, individual-level survey data are used to estimate the relationship between types of people, defined by characteristics such as age and race, and their preferences (the multilevel regression). In the second, this relationship is combined with data, for example from a census, on the number of people of each type in a sub-region to estimate the preference of that sub-region (a process known as "poststratification").{{cite web |title=What is MRP? |archive-date=16 February 2025 |archive-url=https://web.archive.org/web/20250216015942/https://www.survation.com/what-is-mrp/ |url-status=live |url=https://www.survation.com/what-is-mrp/ |date=5 November 2018 |publisher=Survation |accessdate=31 October 2019}} This avoids the need to conduct surveys at the sub-regional level, which can be expensive and impractical in an area (e.g. a country) with many sub-regions (e.g. counties, ridings, or states).
It also avoids the consistency problems that arise when comparing different surveys conducted in different areas.{{cite journal |archive-date=16 February 2025 |url-status=live |archive-url=https://web.archive.org/web/20250216015649/https://sites.stat.columbia.edu/gelman/research/unpublished/MRT(1).pdf |last1=Gelman |first1=Andrew |last2=Lax |first2=Jeffrey |last3=Phillips |first3=Justin |last4=Gabry |first4=Jonah |last5=Trangucci |first5=Robert |title=Using Multilevel Regression and Poststratification to Estimate Dynamic Public Opinion |date=28 August 2018 |pages=1–3 |url=http://www.stat.columbia.edu/~gelman/research/unpublished/MRT(1).pdf |website=sites.stat.columbia.edu |access-date=31 October 2019}} Additionally, it allows preferences within a specific locality to be estimated from a survey taken across a wider area, even when that survey includes relatively few people from the locality in question or when the sample is highly unrepresentative.{{cite journal |last1=Downes |first1=Marnie |last2=Gurrin |first2=Lyle C. |last3=English |first3=Dallas R. |last4=Pirkis |first4=Jane |last5=Currier |first5=Diane |last6=Spittal |first6=Matthew J. |last7=Carlin |first7=John B. |title=Multilevel Regression and Poststratification: A Modeling Approach to Estimating Population Quantities From Highly Selected Survey Samples |journal=American Journal of Epidemiology |publisher=Oxford University Press |doi=10.1093/aje/kwy070 |orig-date=9 April 2018 |volume=187 |date=August 2018 |issue=8 |pages=1780–1790 |url=https://academic.oup.com/aje/article/187/8/1780/4964985 |access-date=31 October 2019}}

History

The technique was originally developed by Gelman and T. Little in 1997,{{cite journal |last1=Gelman |first1=Andrew |last2=Little |first2=Thomas |title=Poststratification into many categories using hierarchical logistic regression |journal=Survey Methodology |date=1997 |volume=23 |pages=127–135 |url=https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X19970023616}} building upon ideas of Fay and Herriot{{cite journal |last1=Fay |first1=Robert |last2=Herriot |first2=Roger |title=Estimates of income for small places: An application of James-Stein procedures to census data |journal=Journal of the American Statistical Association |date=1979 |volume=74 |issue=366a |pages=269–277 |doi=10.1080/01621459.1979.10482505 |jstor=2286322}} and R. Little.{{cite journal |last1=Little |first1=Roderick |title=Post-stratification: A modeler's perspective |journal=Journal of the American Statistical Association |date=1993 |volume=88 |issue=423 |pages=1001–1012 |jstor=2290792 |doi=10.1080/01621459.1993.10476368}} It was subsequently expanded on by Park, Gelman, and Bafumi in 2004 and 2006. It was proposed for use in estimating US-state-level voter preference by Lax and Phillips in 2009. Warshaw and Rodden subsequently proposed it for use in estimating district-level public opinion in 2012.
Later, Wang et al.{{cite journal |last1=Wang |first1=Wei |last2=Rothschild |first2=David |last3=Goel |first3=Sharad |last4=Gelman |first4=Andrew |date=2015 |title=Forecasting elections with non-representative polls |url=https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/forecasting-with-nonrepresentative-polls.pdf |journal=International Journal of Forecasting |volume=31 |issue=3 |pages=980–991 |doi=10.1016/j.ijforecast.2014.06.001 |doi-access=free |archive-date=1 November 2020 |access-date=1 December 2019 |archive-url=https://web.archive.org/web/20201101150039/https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/forecasting-with-nonrepresentative-polls.pdf |url-status=live }} used survey data of Xbox users to predict the outcome of the 2012 US presidential election. The Xbox gamers were 65% 18- to 29-year-olds and 93% male, while the electorate as a whole was 19% 18- to 29-year-olds and 47% male. Even though the original data was highly biased, after multilevel regression with poststratification the authors were able to get estimates that agreed with those coming from polls using large amounts of random and representative data. Since then it has also been proposed for use in the field of epidemiology.

YouGov used the technique to successfully predict the overall outcome of the 2017 UK general election,{{cite news |last1=Revell |first1=Timothy |title=How YouGov's experimental poll correctly called the UK election |url=https://www.newscientist.com/article/2134144-how-yougovs-experimental-poll-correctly-called-the-uk-election/#ixzz63vulf5ZP |accessdate=31 October 2019 |work=New Scientist |date=9 June 2017 |archive-date=9 June 2017 |archive-url=https://web.archive.org/web/20170609134438/https://www.newscientist.com/article/2134144-how-yougovs-experimental-poll-correctly-called-the-uk-election/#ixzz63vulf5ZP |url-status=live }} correctly predicting the result in 93% of constituencies.{{cite news |last1=Cohen |first1=Daniel |title='I've never known voters be so promiscuous': the pollsters working to predict the next UK election |url=https://www.theguardian.com/politics/2019/sep/27/voters-so-promiscuous-the-pollsters-working-to-predict-next-election |accessdate=31 October 2019 |work=The Guardian |date=27 September 2019 |archive-date=11 September 2024 |archive-url=https://web.archive.org/web/20240911131900/https://www.theguardian.com/politics/2019/sep/27/voters-so-promiscuous-the-pollsters-working-to-predict-next-election |url-status=live }} In the 2019 and 2024 elections, other pollsters also used MRP, including Survation (Survation 2019, https://www.survation.com/2019-general-election-mrp-predictions-survation-and-dr-chris-hanretty/ {{Webarchive|url=https://web.archive.org/web/20250216020436/https://www.survation.com/2019-general-election-mrp-predictions-survation-and-dr-chris-hanretty |date=16 February 2025 }}) and Ipsos (Ipsos 2024, https://www.ipsos.com/en-uk/uk-opinion-polls/ipsos-election-mrp {{Webarchive|url=https://web.archive.org/web/20240618165229/https://www.ipsos.com/en-uk/uk-opinion-polls/ipsos-election-mrp |date=18 June 2024 }}).

Limitations and extensions

MRP can be extended to estimate changes in opinion over time. When used to predict elections, it works best relatively close to polling day, after nominations have closed.{{cite news |last1=James |first1=William |last2=MacLellan |first2=Kylie |title=A question of trust: British pollsters battle to call looming election |url=https://www.reuters.com/article/us-britain-eu-polls/a-question-of-trust-british-pollsters-battle-to-call-looming-election-idUSKBN1WU0GN |accessdate=31 October 2019 |date=15 October 2019 |work=Reuters |archive-date=31 October 2019 |archive-url=https://web.archive.org/web/20191031123619/https://www.reuters.com/article/us-britain-eu-polls/a-question-of-trust-british-pollsters-battle-to-call-looming-election-idUSKBN1WU0GN |url-status=live }}

Both the "multilevel regression" and "poststratification" ideas of MRP can be generalized. Multilevel regression can be replaced by nonparametric regression{{cite journal|last1=Bisbee |first1=James |title=BARP: Improving Mister P Using Bayesian Additive Regression Trees | journal = American Political Science Review| date=2019 |volume=113 |issue=4 |pages=1060–1065|doi=10.1017/S0003055419000480 |s2cid=201385400 }} or regularized prediction, and poststratification can be generalized to allow for non-census variables, i.e. poststratification totals that are estimated rather than being known.{{cite web| url=https://statmodeling.stat.columbia.edu/2018/10/28/mrp-rpp-non-census-variables/| title=MRP (or RPP) with non-census variables| last=Gelman| first=Andrew| website=Statistical Modeling, Causal Inference, and Social Science| date=28 October 2018| access-date=1 December 2019| archive-date=22 June 2019| archive-url=https://web.archive.org/web/20190622114533/https://statmodeling.stat.columbia.edu/2018/10/28/mrp-rpp-non-census-variables/| url-status=live}}

References

{{reflist}}

{{DEFAULTSORT:Multilevel regression with poststratification}}

Category:Analysis of variance

Category:Regression models