This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Stratified Sampling for Model-Assisted Estimation with Surrogate Outcomes
Citations: 0
Authors: 3
Year: 2026
Abstract
In many randomized trials, outcomes such as essays or open-ended responses must be manually scored as a preliminary step to impact analysis, a process that is costly and limiting. Model-assisted estimation offers a way to combine surrogate outcomes generated by machine learning or large language models with a human-coded subset, yet typical implementations use simple random sampling and therefore overlook systematic variation in surrogate prediction error. We extend this framework by incorporating stratified sampling to allocate human coding effort more efficiently. We derive the exact variance of the stratified model-assisted estimator, characterize conditions under which stratification improves precision, and identify a Neyman-type optimal allocation rule that oversamples strata with larger residual variance. We evaluate our methods through a comprehensive simulation study to assess finite-sample performance. We find that stratification consistently improves efficiency when surrogate prediction errors exhibit structured bias or heteroskedasticity. We also present two empirical applications, one using data from an education RCT and one using a large observational corpus, to illustrate how these methods can be implemented in practice using ChatGPT-generated surrogate outcomes. Overall, this framework provides a practical design-based approach for leveraging surrogate outcomes and strategically allocating human coding effort to obtain unbiased estimates with greater efficiency. While motivated by text-as-data applications, the methodology applies broadly to any setting where outcome measurement is costly or prohibitive, and it can be applied both to comparisons across groups and to estimating the mean of a single group.
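The abstract describes a stratified model-assisted estimator and a Neyman-type allocation rule but does not state their formulas on this page. The following is a minimal Python sketch of the general idea for the single-group mean, assuming the textbook difference estimator (surrogate mean plus the mean human-minus-surrogate residual within each stratum) and the classical Neyman allocation with sample sizes proportional to N_h times the residual standard deviation S_h. The function names, the data layout, and the rounding rule are illustrative assumptions, not the authors' implementation.

```python
import random
import statistics


def neyman_allocation(stratum_sizes, residual_sds, n_total):
    """Classical Neyman allocation: n_h proportional to N_h * S_h.

    residual_sds are assumed (e.g. pilot-based) standard deviations of the
    human-minus-surrogate residual in each stratum. Because of rounding and
    the bounds below, the result may not sum exactly to n_total.
    """
    weights = [N * s for N, s in zip(stratum_sizes, residual_sds)]
    total = sum(weights)
    # Keep at least 2 coded units per stratum and never exceed its size.
    return [min(N, max(2, round(n_total * w / total)))
            for N, w in zip(stratum_sizes, weights)]


def stratified_difference_estimate(strata, n_per_stratum, rng):
    """Estimate the population mean of the human score.

    strata: list of strata, each a list of (surrogate, human_score) pairs.
    Human scores are read only for the units sampled for coding.
    """
    N = sum(len(units) for units in strata)
    estimate = 0.0
    for units, n_h in zip(strata, n_per_stratum):
        # Surrogate mean uses every unit in the stratum.
        surrogate_mean = statistics.fmean(f for f, _ in units)
        # Residual correction uses only the human-coded subsample.
        coded = rng.sample(units, n_h)
        mean_residual = statistics.fmean(y - f for f, y in coded)
        estimate += (len(units) / N) * (surrogate_mean + mean_residual)
    return estimate


# Hypothetical usage: two strata, the second with noisier surrogates.
rng = random.Random(0)
strata = [
    [(f, f + rng.gauss(0, 0.5)) for f in (rng.uniform(1, 5) for _ in range(200))],
    [(f, f + rng.gauss(0, 2.0)) for f in (rng.uniform(1, 5) for _ in range(200))],
]
allocation = neyman_allocation([200, 200], [0.5, 2.0], n_total=50)
estimate = stratified_difference_estimate(strata, allocation, rng)
```

Under this allocation the noisier stratum receives most of the coding budget, which matches the abstract's description of oversampling strata with larger residual variance.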