Randomization Inference For the Always-Reporter Average Treatment Effect

Published 26 Mar 2026 in econ.EM and stat.ME | (2603.24970v2)

Abstract: This article studies randomization inference for treatment effects in randomized controlled trials with attrition, where outcomes are observed for only a subset of units. We assume monotonicity in reporting behavior as in Lee [lee2009training] and focus on the average treatment effect for always-reporters (AR-ATE), where always-reporters are units whose outcomes are observed under both treatment and control. Because always-reporter status is only partially revealed by observed assignment and response patterns, we propose a worst-case randomization test that maximizes the randomization p-value over all always-reporter configurations consistent with the data, with an optional pretest to prune implausible configurations. Using studentized Hajek- and chi-square-type statistics, we show the resulting procedure is finite-sample valid for the sharp null and asymptotically valid for the weak null. We also discuss computational implementations for discrete outcomes and integer-programming-based bounds for continuous outcomes.


Authors (2)

Summary

The paper develops a randomization inference method for estimating the average treatment effect among always-reporters under monotone attrition in RCTs.
It introduces a worst-case testing approach that maximizes p-values over feasible always-reporter configurations using integer programming strategies.
Empirical simulations confirm correct size and high power, demonstrating both finite-sample validity under the sharp null and asymptotic control under weak nulls.

Randomization Inference for the Always-Reporter Average Treatment Effect: A Technical Essay

Introduction

This paper develops a randomization-based inference framework for the average treatment effect among always-reporters (AR-ATE) in randomized controlled trials (RCTs) with attrition. Always-reporters are the units that would report an outcome regardless of assignment. Under monotone response selection (treatment can only weakly increase reporting), the AR-ATE is only partially identified. The problem arises throughout field and clinical experiments, political economy, and policy evaluation, where treatment-induced nonresponse is common. The contribution is a set of randomization tests for sharp and weak null hypotheses on the AR-ATE, together with algorithms for computing them, that control size in finite samples under the sharp null even though always-reporter status is not fully observed.

Problem Setup and Identification

The setting involves $n$ randomized units, each with two observed variables: the treatment assignment $D_i$ and the post-assignment reporting indicator $R_i$, which equals 1 when the unit's outcome is observed. Writing $r_i(d)$ for the reporting indicator unit $i$ would have under assignment $d$, the monotonicity assumption $r_i(1)\geq r_i(0)$ (no unit reports under control but not under treatment) induces three principal strata: always-reporters, if-reporters, and never-reporters.
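The way monotonicity restricts each unit's stratum can be sketched in a few lines. The function name `compatible_strata` and the string labels are illustrative choices, not notation from the paper.

```python
def compatible_strata(d, r):
    """Principal strata compatible with one unit's observed (D_i, R_i),
    assuming monotone reporting r_i(1) >= r_i(0)."""
    if d == 0 and r == 1:
        # Reports under control, hence also under treatment.
        return {"always-reporter"}
    if d == 1 and r == 0:
        # Silent under treatment, hence also under control.
        return {"never-reporter"}
    if d == 1 and r == 1:
        # Reports under treatment; control behavior is unobserved.
        return {"always-reporter", "if-reporter"}
    # d == 0 and r == 0: silent under control; treatment behavior is unobserved.
    return {"if-reporter", "never-reporter"}
```

Only control reporters are revealed as always-reporters; among treated reporters, always-reporters and if-reporters cannot be told apart from $(D_i, R_i)$ alone.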

The target parameter is the average treatment effect among always-reporters: $\text{AR-ATE} = \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \left[y_i(1) - y_i(0)\right],$ where $\mathcal{A}$ is the set of units who would report under both treatment and control.

Standard identification results—building on Lee [lee2009training]—establish that:

The share of always-reporters is identified by the reporting rate in controls.
The mean potential control outcome for always-reporters is point-identified by observed reporters in controls.
The mean treated outcome for always-reporters is only bounded, via monotonicity and trimming arguments.
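A minimal sketch of the trimming argument, for intuition only. The function `lee_bounds` and its rounding rule are assumptions made for this illustration; Lee's estimator is defined through quantile trimming, and this is not the paper's procedure.

```python
from statistics import mean

def lee_bounds(y_treated_rep, n_treated, y_control_rep, n_control):
    """Illustrative trimming bounds on the AR-ATE under monotone reporting."""
    m1, m0 = len(y_treated_rep), len(y_control_rep)
    # Expected number of always-reporters in the treated arm: the control
    # reporting rate (m0 / n_control) applied to the n_treated treated units.
    k = round(m0 * n_treated / n_control)
    k = max(1, min(k, m1))  # keep the trimmed sample non-empty and feasible
    ordered = sorted(y_treated_rep)
    control_mean = mean(y_control_rep)
    # Extreme cases: always-reporters hold the k smallest or the k largest
    # outcomes among treated reporters.
    lower = mean(ordered[:k]) - control_mean
    upper = mean(ordered[-k:]) - control_mean
    return lower, upper
```

With four treated reporters out of five treated units and two control reporters out of five controls, two of the four treated reporters are expected to be always-reporters, so the bounds compare the two smallest and the two largest treated outcomes with the control mean.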

However, always-reporter status is only partially revealed by the data. Thus, inference for AR-ATE, and even for the sharp null $y_i(1)=y_i(0)$ over always-reporters, is nontrivial.

Randomization-Based Inference: Methodology

If always-reporter status were observed, randomization inference for the AR-ATE (e.g., permuting assignments, evaluating difference-of-means in the always-reporter subset) would be straightforward. The challenge arises from the fact that in any observed data, always-reporters in the treated group are only a latent subset of observed reporters.
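The benchmark case with known always-reporter labels can be sketched as a Monte Carlo randomization test. The names `abs_diff_in_means` and `oracle_p_value` are hypothetical, the unstudentized difference in means is used for brevity, and treating re-draws with an empty arm as extreme is a conservative assumption of this sketch.

```python
import random
from statistics import mean

def abs_diff_in_means(y, d, is_ar):
    """Absolute difference in means within the always-reporter subset."""
    treated = [yi for yi, di, a in zip(y, d, is_ar) if a and di == 1]
    control = [yi for yi, di, a in zip(y, d, is_ar) if a and di == 0]
    if not treated or not control:
        return float("inf")  # assumption: an empty arm counts as extreme
    return abs(mean(treated) - mean(control))

def oracle_p_value(y, d, is_ar, draws=2000, seed=0):
    """Randomization p-value under the sharp null y_i(1) = y_i(0) for
    always-reporters, assuming complete randomization and known labels."""
    rng = random.Random(seed)
    observed = abs_diff_in_means(y, d, is_ar)
    d_star = list(d)
    hits = 0
    for _ in range(draws):
        rng.shuffle(d_star)  # re-draw the assignment over all n units
        if abs_diff_in_means(y, d_star, is_ar) >= observed - 1e-12:
            hits += 1
    return (1 + hits) / (1 + draws)
```

Under the sharp null, every always-reporter's outcome is observed and unchanged by reassignment, so `y` only needs to be filled in for units with `is_ar` true; entries for other units are never read.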

Worst-Case Approach

The core methodological innovation is a "worst-case" randomization test. For any labeling of always-reporters consistent with the observed $D_i$ and $R_i$, a randomization $p$-value can be computed as if that labeling were the true one. The overall procedure is then:

Take the maximum $p$-value across all feasible always-reporter configurations compatible with the observed data.
Optionally, prune implausible always-reporter sets using a pretest on the marginal balance of always-reporters across treatment arms.
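For a very small experiment, the maximization step can be sketched by brute force. This is an illustration under stated assumptions, not the paper's algorithm: it uses an unstudentized difference in means, enumerates every assignment exactly, skips the optional pretest, and considers only configurations with at least one treated always-reporter. All function names are hypothetical.

```python
from itertools import combinations
from statistics import mean

def stat(y, treated, ar):
    """Absolute difference in means among the units in `ar`."""
    t = [y[i] for i in ar if i in treated]
    c = [y[i] for i in ar if i not in treated]
    if not t or not c:
        return float("inf")  # assumption: an empty arm counts as extreme
    return abs(mean(t) - mean(c))

def exact_p_value(y, d, ar):
    """Exact randomization p-value for one always-reporter configuration."""
    n, n1 = len(d), sum(d)
    observed = stat(y, {i for i in range(n) if d[i] == 1}, ar)
    assignments = list(combinations(range(n), n1))
    hits = sum(stat(y, set(a), ar) >= observed - 1e-12 for a in assignments)
    return hits / len(assignments)

def worst_case_p_value(y, d, r):
    """Maximum p-value over configurations compatible with (D, R)."""
    n = len(d)
    # Control reporters are always-reporters under monotonicity.
    control_rep = [i for i in range(n) if d[i] == 0 and r[i] == 1]
    # Any subset of treated reporters may be the treated always-reporters.
    treated_rep = [i for i in range(n) if d[i] == 1 and r[i] == 1]
    best = 0.0
    for k in range(1, len(treated_rep) + 1):
        for subset in combinations(treated_rep, k):
            best = max(best, exact_p_value(y, d, control_rep + list(subset)))
    return best
```

The number of subsets doubles with each additional treated reporter, so this enumeration is only usable for toy examples.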

The validity of the resulting test rests on two properties:

Under the sharp null, the randomization $p$-value computed at the true always-reporter configuration is valid in finite samples; because the maximum over feasible configurations is never smaller than that $p$-value, the worst-case test is valid as well.
Under weak nulls (the average effect among always-reporters is zero, but heterogeneity is allowed), asymptotic size control is established through a limiting analysis of the studentized statistics.
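The finite-sample argument can be written in one line. The notation here is assumed for illustration: $\mathcal{C}$ is the set of always-reporter configurations compatible with the data, $p(A)$ is the randomization $p$-value computed as if $A$ were the always-reporter set, and $\mathcal{A} \in \mathcal{C}$ is the true set.

$$
\Pr\Big(\max_{A \in \mathcal{C}} p(A) \le \alpha\Big) \;\le\; \Pr\big(p(\mathcal{A}) \le \alpha\big) \;\le\; \alpha .
$$

The first inequality holds because the maximum is at least $p(\mathcal{A})$; the second is the usual validity of a randomization test under the sharp null when the labels are known.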

Test Statistics and Implementation

The main statistics used are:

Studentized Hajek-type difference-in-means (with unknown, potentially unbalanced denominators).
Joint $\chi^2$-type statistics testing both mean balance and the balance in the number of always-reporters.
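One plausible form of these two statistics is sketched below. The exact definitions in the paper may differ: the Neyman-style variance estimator, the hypergeometric standardization of the always-reporter count, and the unweighted sum of squares are all assumptions of this sketch, as are the function names.

```python
from math import sqrt
from statistics import mean, variance

def studentized_diff(y_treated, y_control):
    """Difference in means among always-reporters divided by a Neyman-style
    standard error; the arm sizes are random, hence 'Hajek-type'.
    Needs at least two units and some outcome variation per arm."""
    se2 = variance(y_treated) / len(y_treated) + variance(y_control) / len(y_control)
    return (mean(y_treated) - mean(y_control)) / sqrt(se2)

def balance_z(k_treated_ar, n_ar, n, n1):
    """Standardized count of treated always-reporters. Under complete
    randomization of n1 of n units, the count is hypergeometric."""
    share = n_ar / n
    expected = n1 * share
    var = n1 * share * (1 - share) * (n - n1) / (n - 1)
    return (k_treated_ar - expected) / sqrt(var)

def joint_statistic(y_treated, y_control, n, n1):
    """Illustrative chi-square-type statistic: squared mean imbalance plus
    squared imbalance in the number of always-reporters."""
    k = len(y_treated)
    n_ar = len(y_treated) + len(y_control)
    return studentized_diff(y_treated, y_control) ** 2 + balance_z(k, n_ar, n, n1) ** 2
```

Here `y_treated` and `y_control` hold the outcomes of the units labeled always-reporters in each arm under the configuration being evaluated.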

For discrete outcomes, symmetry and the small number of support points allow exhaustive enumeration; for continuous outcomes, the $p$-value computation is bounded through a sequence of integer programming (IP) problems whose decision variables encode the always-reporter labels.

For computational tractability, the search is organized by principal stratum, and the algorithms exploit problem symmetries such as labeling invariance among observed treated reporters.

Statistical Validity and Theoretical Guarantees

Finite-sample validity is proven for the sharp null, irrespective of the composition of the finite population or the outcome distribution [(2603.24970), aronow2024randomization]. The guarantee relies only on complete randomization (CR) and holds whatever the actual always-reporter labeling is, because the maximization runs over every labeling compatible with the data.

Asymptotic validity for the weak null is established using combinatorial CLT machinery and recent Berry–Esseen bounds for conditional randomization [shi2022berry, li2017general]. The critical technical feature is that the procedure is uniformly valid even when the number of always-reporters is nearly the population size (unlike classical Lee/Imbens bounds, which rely on the limiting share of each stratum being bounded away from 0 and 1).

The analysis also accommodates non-regular cases, in which the asymptotic null distribution of the test statistic may be degenerate depending on the asymptotic regime for the principal strata shares. Asymptotic approximation (i.e., using normal/chi-square critical values) is justified, and its limits are carefully characterized.

Pretesting/pruning via balance of always-reporters (Berger–Boos) is shown to reduce computational burden without compromising statistical validity, provided that significance level adjustment is done correctly.
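A sketch of how such a pretest and level adjustment could look, in the spirit of Berger and Boos. The details are assumptions of this illustration rather than the paper's construction: candidate counts of treated always-reporters are screened with an exact hypergeometric test at level `beta`, and `beta` is then added to the maximized $p$-value. The function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, n, n_ar, n1):
    """P(k of the n_ar always-reporters are treated) when n1 of n units
    are treated by complete randomization."""
    if k < max(0, n1 + n_ar - n) or k > min(n1, n_ar):
        return 0.0
    return comb(n_ar, k) * comb(n - n_ar, n1 - k) / comb(n, n1)

def plausible_treated_ar_counts(n, n1, control_ar, treated_rep, beta):
    """Counts k of treated always-reporters not rejected at level beta."""
    keep = []
    for k in range(treated_rep + 1):
        n_ar = control_ar + k  # implied total number of always-reporters
        p_obs = hypergeom_pmf(k, n, n_ar, n1)
        # Two-sided exact p-value: total mass of outcomes no more likely than k.
        p_val = 0.0
        for j in range(n1 + 1):
            p_j = hypergeom_pmf(j, n, n_ar, n1)
            if p_j <= p_obs + 1e-12:
                p_val += p_j
        if p_val >= beta:
            keep.append(k)
    return keep

def berger_boos_p(max_p_over_kept, beta):
    """Adjusted p-value: maximum over the pruned set plus the pretest level."""
    return min(1.0, max_p_over_kept + beta)
```

Comparing the adjusted value with $\alpha$ is the same as comparing the pruned maximum with $\alpha - \beta$, which is the level adjustment that keeps the overall test valid.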

Computational Aspects and Algorithms

The randomization inference problem is inherently computationally challenging: in the worst case, the feasible set of always-reporter configurations grows exponentially in the number of treated reporters. For binary or categorical outcomes, the problem reduces to grouping by support points and can be solved in polynomial time (given small support). In the continuous case, the integer programming relaxation efficiently handles the maximization over feasible always-reporter matchings for each Monte Carlo permutation and outcome configuration.
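The gain from grouping by support points can be illustrated with a simple count. The sketch assumes that any subset of treated reporters is a candidate always-reporter set and that, by labeling invariance, only the number of always-reporters at each outcome value matters; `count_configurations` is a hypothetical helper.

```python
from collections import Counter
from math import prod

def count_configurations(treated_reporter_outcomes):
    """Return (labeled, grouped) configuration counts for treated reporters."""
    m = len(treated_reporter_outcomes)
    labeled = 2 ** m  # every subset of treated reporters
    counts = Counter(treated_reporter_outcomes)
    # Per support point with c reporters, choose how many (0..c) are
    # always-reporters; which ones does not matter.
    grouped = prod(c + 1 for c in counts.values())
    return labeled, grouped
```

For a binary outcome with ten treated reporters at each value, the labeled search space has $2^{20} = 1{,}048{,}576$ subsets, while the grouped search has only $11 \times 11 = 121$ cases.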

In practice, the approach deploys heuristic lower bounding in a first stage (for potential rejection), and only continues to exact enumeration or IP computation if the decision boundary is close.

Empirical Performance: Simulation Results

Simulations confirm the theoretical size and power properties. At the simulated sample size and number of treatment-assignment draws, the test controls size (empirical rejection rate under the null of 0.002 at the nominal level), and power under a unit effect is 0.922. Runtime is manageable (medians ranging from seconds to hours), with the main computational burden arising when the decision is close to the rejection boundary.

Implications, Extensions, and Future Directions

This work rigorously establishes a design-based, randomization-inference protocol for AR-ATE under monotone attrition in RCTs. Strong claims include finite-sample level control under the sharp null, and asymptotic control of size under weak nulls in all principal strata regimes (including near-degenerate cases).

The implications are significant where attrition or missing data may confound classical (e.g., intention-to-treat or per-protocol) analyses. The method extends directly to settings with more nuanced principal stratification, but further work is needed for covariate-dependent monotonicity (as in "Generalized Lee Bounds" [semenova2025generalized]) and for complex assignment mechanisms beyond complete randomization.

Algorithmically, future work should address further scalability (possibly through sharper relaxations or approximations), especially for larger-scale trials and/or under richer outcome supports. Theoretical extension to randomized assignment with stratification or re-randomization procedures is natural.

Conclusion

The paper delivers a rigorous and computationally implementable framework for randomization inference on the always-reporter ATE under monotone sample selection. It advances the literature by guaranteeing finite-sample validity under the sharp null and asymptotic validity under the weak null, by addressing computation for general outcomes, and by offering a template for principled inference with missing outcome data in modern experimental designs.
