
Subgroup Average True Scores

Updated 11 December 2025
  • Subgroup Average True Scores are the expected values of subgroup outcomes computed by adjusting for measurement error.
  • They are crucial in causal inference and fairness studies, enabling precise evaluation of treatment effects and subgroup differences.
  • Recent methods like regression calibration and maximum likelihood provide robust frameworks that reduce bias and improve covariate balance.

A subgroup average true score is the expected value of an observed variable (e.g., student achievement, response, or treatment outcome) computed within a specified subgroup and purged of measurement noise or random fluctuation. Subgroup average true scores are central to evaluating fairness, effect heterogeneity, and causal impact in settings where subgroup identification (by covariates or demographics) is of primary policy or scientific interest. Their estimation and use introduce distinct methodological considerations due to the prevalence of measurement error—especially in small subgroups—and the need for inferential or covariate-balance guarantees. Recent work has established rigorous frameworks for both valid subgroup selection and robust adjustment for subgroup-level true scores, especially in regression, causal inference, and educational evaluation contexts (Cheng et al., 23 Sep 2025, Wasserman et al., 9 Dec 2025).

1. Formal Definition and Measurement Error Model

In classical measurement-error models, consider units $i=1,\ldots,N$, each belonging to subgroups indexed by $g=1,\ldots,G$. The observed mean for subgroup $g$ in unit $i$ is

$$Y_{ig} = Y^{\text{true}}_{ig} + \epsilon_{ig},$$

where $Y^{\text{true}}_{ig}$ is the average true score, defined as the expected observed score if members of the subgroup were repeatedly measured, and $\epsilon_{ig}$ is a zero-mean error term with variance $\sigma_g^2 / m_{ig}$, where $m_{ig}$ is the size of subgroup $g$ in unit $i$ (Wasserman et al., 9 Dec 2025). The vector $W_i = (Y_{i1},\ldots,Y_{iG})^\top$ collects the observed averages, $X_i = (Y^{\text{true}}_{i1},\ldots,Y^{\text{true}}_{iG})^\top$ is the vector of true subgroup averages, and $\Sigma_i = \operatorname{diag}(\sigma_g^2 / m_{ig})$.

The measurement error is typically assumed independent of the true score and covariates, satisfying $E[\epsilon_{ig}] = 0$ and $E[\epsilon_{ig}\, Y^{\text{true}}_{ig}] = 0$.
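The error model above is straightforward to simulate. The sketch below (all sizes and variances illustrative, not from the cited work) draws true subgroup averages $X_{ig}$, adds noise with variance $\sigma_g^2/m_{ig}$, and shows that the observed means $W_{ig}$ carry extra variance, more so when subgroup sizes are small:

```python
import numpy as np

rng = np.random.default_rng(0)
N, G = 500, 3                          # units (e.g., schools) and subgroups
sigma2 = np.array([1.0, 1.5, 2.0])     # within-subgroup variances sigma_g^2
m = rng.integers(5, 40, size=(N, G))   # subgroup sizes m_ig

# True subgroup averages X_ig and observed means W_ig = X_ig + eps_ig,
# where Var(eps_ig) = sigma_g^2 / m_ig as in the model above.
X = rng.normal(0.0, 1.0, size=(N, G))
eps = rng.normal(0.0, np.sqrt(sigma2 / m))
W = X + eps

# The observed means have strictly more variance than the true scores;
# the gap is roughly the average noise variance per subgroup.
print(W.var(axis=0) - X.var(axis=0))
```

Everything downstream (calibration, ML adjustment) is about recovering balance on `X` when only `W` is observed.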

2. Subgroup Average True Scores in Causal Inference and Regression

Subgroup average true scores underpin effect estimation and subgroup analysis in regression and causal inference. For i.i.d. samples $(X_i, Y_i)$, one defines the average response or treatment effect in a subgroup $S$ as

$$\theta(S) = \mathbb{E}[Y \mid X \in S].$$

In randomized trials with binary treatment $W_i$ and potential outcomes $Y_i(1), Y_i(0)$, inverse-propensity-weighted (IPW) or augmented IPW pseudo-outcomes are employed so that $\mathbb{E}[Y \mid X \in S]$ reflects the subgroup average treatment effect (ATE) (Cheng et al., 23 Sep 2025).

Controlled subgroup selection thus seeks a subset of the covariate space where the subgroup average (true) score or treatment effect meets policy-relevant criteria, e.g.,

$$P(\theta(\hat S) \leq \tau) \leq \alpha$$

for a threshold $\tau$ and level $\alpha$, while maximizing a utility functional such as $\mathbb{E}[(Y - \tau)\,\mathbf{1}\{X \in \hat S\}]$.
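A toy illustration of the IPW pseudo-outcome construction: because treatment is randomized here, the propensity $e = 0.5$ is known, and averaging the pseudo-outcome over a subgroup estimates the subgroup ATE. The data-generating process is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(-1, 1, size=n)             # covariate
T = rng.integers(0, 2, size=n)             # randomized treatment, P(T=1) = 0.5
tau_x = np.where(X > 0, 1.0, 0.0)          # true effect: 1 when X > 0, else 0
Y = 0.5 * X + T * tau_x + rng.normal(0, 1, n)

# IPW pseudo-outcome: its conditional mean given X equals the treatment effect,
# so its average over {X in S} estimates the subgroup ATE theta(S).
e = 0.5
Gamma = T * Y / e - (1 - T) * Y / (1 - e)

S = X > 0                                  # candidate subgroup S = {x > 0}
theta_hat = Gamma[S].mean()
print(theta_hat)                           # should be near the true effect 1.0
```

AIPW pseudo-outcomes add an outcome-regression term to the same construction, reducing variance; the subgroup averaging step is identical.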

3. Estimation and Adjustment Methodologies

Measurement error in subgroup means leads to attenuation bias and non-negligible residual confounding in matching or weighting estimators if ignored. Addressing this, two principal adjustment strategies target balance on subgroup average true scores: regression calibration and maximum likelihood (Wasserman et al., 9 Dec 2025).

Regression Calibration (RC)

A two-level hierarchical linear model (HLM) is fit to the observed $W_{ig}$ to produce empirical-Bayes (EB) predictions $\widehat{X}_{ig} \approx Y^{\text{true}}_{ig}$, incorporating covariates and random intercepts. The calibrated propensity score is then constructed via

$$\operatorname{logit}\{P(T_i = 1 \mid \widehat{X}_i, Z_i)\} = \beta_0 + \widehat{X}_i^\top \beta_x + Z_i^\top \beta_z,$$

guaranteeing balance in the estimated true scores used as plug-ins.
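The EB shrinkage at the heart of regression calibration can be sketched for a single subgroup per unit. The grand mean and variance components, which in practice come from the fitted HLM, are assumed known here, and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 400
m = rng.integers(2, 10, size=N)   # small subgroup sizes, where shrinkage matters
sigma2 = 1.5                      # within-subgroup (level-1) variance
tau2 = 1.0                        # between-unit variance of the true scores
mu = 0.0                          # grand mean (from the HLM fit, assumed known)

X_true = rng.normal(mu, np.sqrt(tau2), N)
W = X_true + rng.normal(0, np.sqrt(sigma2 / m))

# Empirical-Bayes prediction: shrink the observed mean toward the grand mean
# by the reliability lambda_i = tau^2 / (tau^2 + sigma^2 / m_i).
lam = tau2 / (tau2 + sigma2 / m)
X_hat = mu + lam * (W - mu)

# The shrunken predictions are closer to the truth than the raw observed means.
print(np.mean((X_hat - X_true) ** 2), np.mean((W - X_true) ** 2))
```

It is these `X_hat` values, not the raw `W`, that enter the calibrated propensity-score model above.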

Maximum Likelihood (ML)

Relaxing ignorability, the ML approach leverages the likelihood $W_i \mid X_i, Z_i \sim N(X_i, \Sigma_i)$ and a marginalized treatment-assignment model to yield

$$e(X_i, Z_i, m_i) = \int \operatorname{expit}(\beta_0 + w^\top \beta_w + Z_i^\top \beta_z)\, d\Phi(w; X_i, \Sigma_i),$$

with $\Phi(\cdot\,; X_i, \Sigma_i)$ the multivariate normal distribution with mean $X_i$ and covariance $\Sigma_i$. This estimator incorporates Monahan's normal-mixture approximation of the logistic function for computational tractability, permitting direct balance in $X_i$.

Both methods, in contrast to naive propensity-score models using only $W_i$, ensure that matching or weighting schemes balance $E[X \mid T]$, not merely $E[W \mid T]$.
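The marginalized propensity integral can be approximated numerically. The sketch below substitutes Gauss-Hermite quadrature for Monahan's normal-mixture approximation (one-dimensional case, hypothetical coefficients), and shows the qualitative effect: measurement error flattens the propensity toward 0.5:

```python
import numpy as np

def marginal_propensity(beta0, beta_w, x_i, s2_i, n_gh=40):
    """Approximate e(x_i) = integral of expit(beta0 + beta_w * w) against
    the N(x_i, s2_i) distribution, via probabilists' Gauss-Hermite quadrature.
    (1-D illustration; the cited method uses a normal-mixture approximation.)"""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_gh)
    w = x_i + np.sqrt(s2_i) * nodes          # quadrature nodes for N(x_i, s2_i)
    expit = 1.0 / (1.0 + np.exp(-(beta0 + beta_w * w)))
    return np.sum(weights * expit) / np.sqrt(2 * np.pi)

# With negligible measurement error the marginal propensity matches expit(2).
print(marginal_propensity(0.0, 2.0, 1.0, 1e-8))
# With substantial error it is pulled toward 0.5, weakening apparent selection.
print(marginal_propensity(0.0, 2.0, 1.0, 1.0))
```

This attenuation of the fitted propensity is exactly why naive models on $W_i$ under-correct for selection on the true scores.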

4. Theoretical Properties and Inferential Guarantees

Both RC and ML estimators are constructed so that the pointwise balancing equations for each subgroup $g$,

$$\sum_{i=1}^N \bigl(T_i - \hat{e}(V_i)\bigr)\, \widehat{X}_{ig} = 0,$$

hold, achieving covariate balance in unobserved subgroup true means between treatment groups (Wasserman et al., 9 Dec 2025). This property reduces bias in estimators of the average treatment effect, as

$$\operatorname{Bias}(\hat{\tau}) \approx \mathbb{E}[(\hat{e} - e)\, f(X)]$$

for matched or weighted estimators $\hat{\tau}$, and smaller variance in $\hat{e}$ improves overlap, reducing bias from trimming or caliper matching.
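The balancing equations are just the score equations of a logistic MLE in which $\widehat{X}_{ig}$ enters the model, which a short numerical check makes concrete (simulated data, hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 300
Xhat = rng.normal(0, 1, size=(N, 2))         # EB-predicted subgroup true scores
V = np.column_stack([np.ones(N), Xhat])      # design matrix with intercept
T = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * Xhat[:, 0] - 0.5 * Xhat[:, 1]))))

# Fit logistic regression by Newton-Raphson. At the MLE the score equations
# sum_i (T_i - e_hat_i) * V_ig = 0 hold exactly for every column g of V.
beta = np.zeros(3)
for _ in range(25):
    e = 1 / (1 + np.exp(-V @ beta))
    grad = V.T @ (T - e)                     # score vector
    H = V.T @ (V * (e * (1 - e))[:, None])   # Fisher information
    beta += np.linalg.solve(H, grad)

e = 1 / (1 + np.exp(-V @ beta))
print(V.T @ (T - e))                         # numerically zero in each column
```

So any covariate placed in the propensity model is balanced by construction; the point of RC and ML is to make that covariate a good stand-in for the unobserved $X_{ig}$.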

In controlled subgroup selection for regression or causal effect heterogeneity, rigorous error control is achieved by sequential testing with conditionally valid tests. The chiseling procedure constructs a nested sequence of candidate regions, tests $H_{0,t}\colon \theta(\mathcal{R}_t) \leq \tau$ at level $\alpha_t$, and ensures via a sequential error-control lemma that

$$P(\theta(\hat S) \leq \tau) \leq \alpha$$

for the discovered subgroup region $\hat{S}$ (Cheng et al., 23 Sep 2025). Validity over all possible paths follows from "untarnishedness", a property ensuring the subset is i.i.d. conditional on being chosen solely by features outside the region.
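A heavily simplified, hypothetical stand-in for the sequential idea: fixed nested threshold regions, equal $\alpha$-spending, one-sided z-tests, stopping at the first (largest) region whose test rejects. The actual chiseling procedure is considerably more refined (data-adaptive regions, conditionally valid tests), but the shape of the loop is the same:

```python
import numpy as np
from scipy.stats import norm

def toy_sequential_selection(x, y, tau, alpha, thresholds=(0.0, 0.5, 1.0, 1.5)):
    """Toy stand-in for chiseling: test nested regions R_t = {x >= c_t}
    (largest first) for H0: theta(R_t) <= tau, spending alpha_t = alpha/T
    per test, and return the first region whose test rejects."""
    alpha_t = alpha / len(thresholds)
    z_crit = norm.ppf(1 - alpha_t)
    for c in thresholds:
        inS = x >= c
        n = int(inS.sum())
        if n < 2:
            continue
        z = (y[inS].mean() - tau) / (y[inS].std(ddof=1) / np.sqrt(n))
        if z > z_crit:
            return c            # certified region {x >= c}
    return None                 # no region certified at level alpha

rng = np.random.default_rng(3)
x = rng.uniform(-1, 2, 5000)
y = np.maximum(x, 0) + rng.normal(0, 1, 5000)   # effect grows with x
print(toy_sequential_selection(x, y, tau=0.3, alpha=0.05))
```

Because at most `alpha` total error probability is spent across the nested tests, the returned region satisfies a union-bound version of the guarantee above; chiseling achieves the same level without the conservatism of alpha-splitting.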

5. Empirical Evaluation and Practical Impact

Simulation studies involving 500 schools and multiple subgroups show that propensity score adjustments based on RC or ML methods yield substantially improved balance and reduced bias, especially for small subgroups. For example, the unmatched rate under 1-logit caliper matching for the smallest subgroups was 2% (ML) and 5% (RC), compared to 21% for naive PS; RMSE reductions for the effect estimator were similarly pronounced (Wasserman et al., 9 Dec 2025). In weighting and penalized regression (PENCOMP), these advantages persist.

A real-world application assessed the efficacy of the Texas ADSY intervention, restricted to Black and Hispanic students. Both RC and ML approaches provided better covariate balance and increased effective sample size relative to naive models, notably without needing to impute missing subgroup averages. Even under these sophisticated adjustments, however, subgroup residual confounding can persist for unmeasured or misclassified groups, as illustrated by significant differences in placebo analyses for Asian students.

In controlled subgroup selection using chiseling, empirical comparisons in both synthetic and real datasets demonstrated that chiseling recovers subgroups with 20–150% higher expected utility than best competing valid methods, with error rates at or below nominal levels and even outperforming certain oracle baselines under some conditions (Cheng et al., 23 Sep 2025).

6. Connections, Extensions, and Methodological Considerations

Subgroup average true score adjustment intersects with other inferential strategies:

  • Simultaneous-inference and CATE-based approaches: Require parametric or VC-dimension restrictions for subgroup classes, often less feasible in high-dimensional settings.
  • Data splitting: Naive data splitting discards information and sacrifices power; chiseling, by contrast, yields valid inference with no such loss.
  • AIPW transformation: Enhances efficiency and robustness under bounded moment conditions in causal analysis by targeting average treatment effects.
  • Interpretable subgroups: Imposing constraints such as axis-aligned hyperrectangles by monotonic regression allows more interpretable subgroup definitions, especially in policy contexts (Cheng et al., 23 Sep 2025).

For practitioners, empirical-Bayes estimation of subgroup true means, high-dimensional covariate handling, and the selection of $\alpha$-spending strategies are pivotal for robust application. Both the chiseling and measurement-error-aware propensity-score adjustment frameworks offer theoretically and empirically validated alternatives to naive subgroup analysis, supporting rigorous group-level inference under realistic data conditions.

