Propensity Score Estimation

Updated 29 May 2026

Propensity score estimation is a method that models the probability of treatment assignment from observed covariates to balance groups and reduce confounding.
Classical logistic regression and advanced techniques like CBPS, IPS, and deep learning are used to achieve unbiased, efficient estimates.
Recent methods incorporate regularization, covariate balancing, and machine learning to address high-dimensionality, model misspecification, and subgroup heterogeneity.

Propensity score estimation is a cornerstone of modern causal inference. By modeling the probability of assignment to treatment given observed covariates, it supports estimation of treatment effects in observational data that is unbiased when those covariates capture all confounding. Originating with Rosenbaum and Rubin (1983), the propensity score (the conditional probability of treatment assignment) serves as a balancing score, allowing reweighting, matching, stratification, and related methods to approximate randomized experiments. As empirical demands grew, a diverse ecosystem of estimation techniques emerged to address model misspecification, high dimensionality, sampling bias, measurement error, subgroup heterogeneity, and computational scalability.

1. Fundamental Principles and Identification

Let $T\in\{0,1\}$ denote treatment assignment and $\mathbf{Z}\in\mathbb{R}^M$ the covariate vector. The propensity score is defined as $p(\mathbf{Z}) = \Pr(T=1 \mid \mathbf{Z})$. Under strong ignorability (unconfoundedness and overlap), conditioning on $p(\mathbf{Z})$ renders $T$ independent of both observed covariates and potential outcomes, ensuring

$T \;\perp\!\!\!\perp\; \mathbf{Z} \mid p(\mathbf{Z})$

and enabling unbiased estimation of average treatment effects (ATE) or effects on the treated (ATT) through various reweighting schemes. The balancing property is foundational: any function $b(\mathbf{Z})$ that achieves this conditional independence is a balancing score. The canonical result of Rosenbaum and Rubin (1983) is that $p(\mathbf{Z})$ is the coarsest balancing score, in the sense that $p(\mathbf{Z})$ is a function of every balancing score $b(\mathbf{Z})$.
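As a minimal sketch of why reweighting by the propensity score removes confounding, the simulation below compares a naive difference in means with a normalized IPW estimate that uses the true score. The data-generating process is hypothetical (one confounder, logistic assignment, a true ATE of 2.0); none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process: one confounder z drives both T and Y.
z = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-z))                  # true propensity score p(Z)
t = rng.binomial(1, p)                        # treatment assignment
y = 2.0 * t + 1.5 * z + rng.normal(size=n)    # true ATE is 2.0 by construction

# Naive contrast: ignores that treated units tend to have larger z.
naive = y[t == 1].mean() - y[t == 0].mean()

# Normalized (Hajek) inverse probability weighting for the ATE.
w1 = t / p
w0 = (1 - t) / (1 - p)
ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

print(f"naive: {naive:.3f}  IPW: {ipw:.3f}  truth: 2.000")
```

In practice the true score is unknown and must be estimated, which is the subject of the sections that follow.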

2. Classical and Covariate-Balancing Estimation Methods

Maximum Likelihood and Logistic Regression

The classical approach posits a parametric model $p(\mathbf{Z};\theta)$, typically logistic, estimated via maximum likelihood using the observed treatment assignments. The log-likelihood is

$\ell(\theta) = \sum_{i=1}^N \left\{ T_i \mathbf{Z}_i^T\theta - \log(1+\exp(\mathbf{Z}_i^T\theta)) \right\}$

yielding an MLE $\hat{\theta}$. These estimated scores form the basis for IPW, stratification, or matching.
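The logistic log-likelihood above can be maximized with a few Newton-Raphson steps. The following is a minimal NumPy sketch under assumed settings: the function name `fit_logistic`, the simulated covariates, and the coefficient values are illustrative, and an intercept column is added to $\mathbf{Z}$ by hand.

```python
import numpy as np

def fit_logistic(Z, t, n_iter=25):
    """Maximize the logistic log-likelihood by Newton-Raphson."""
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))
        grad = Z.T @ (t - p)                          # score vector
        hess = (Z * (p * (1 - p))[:, None]).T @ Z     # negative Hessian
        theta = theta + np.linalg.solve(hess, grad)
    return theta

rng = np.random.default_rng(1)
n = 5_000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
theta_true = np.array([-0.5, 1.0, -0.7])                    # hypothetical coefficients
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-Z @ theta_true)))

theta_hat = fit_logistic(Z, t)
p_hat = 1.0 / (1.0 + np.exp(-Z @ theta_hat))   # estimated propensity scores
print(theta_hat)
```

A production implementation would add a convergence check and guard against separation, where the MLE does not exist.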

Covariate Balancing Propensity Score (CBPS) and Optimal CBPS

CBPS and its extensions estimate the propensity score by enforcing empirical covariate balance through moment conditions: $\frac{1}{N}\sum_{i=1}^N \left\{ \frac{T_i}{p(\mathbf{Z}_i;\theta)} - \frac{1-T_i}{1-p(\mathbf{Z}_i;\theta)} \right\} f(\mathbf{Z}_i) = 0$ for a set of functions $f$. Recent work establishes that optimal choices of $f$ can deliver estimators with minimal asymptotic bias under local misspecification, and when extended by sieve approximation in high dimensions, can achieve semiparametric efficiency without reliance on restrictive parametric models (Fan et al., 2021).
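To make the balancing idea concrete, the sketch below solves a just-identified version of the ATE balancing condition, $\frac{1}{N}\sum_i \{T_i/p_i - (1-T_i)/(1-p_i)\}\mathbf{Z}_i = 0$, for a logistic score by damped Newton iteration. Taking $f(\mathbf{Z}) = \mathbf{Z}$ (with an intercept), the function names, and the simulated data are all assumptions made for illustration; this is not the optimal CBPS of Fan et al.

```python
import numpy as np

def balance_moments(theta, Z, t):
    """Sample mean of (T/p - (1-T)/(1-p)) * Z for a logistic score."""
    eta = Z @ theta
    # For logistic p: T/p = T(1 + e^{-eta}) and (1-T)/(1-p) = (1-T)(1 + e^{eta}).
    resid = t * (1 + np.exp(-eta)) - (1 - t) * (1 + np.exp(eta))
    return Z.T @ resid / len(t)

def fit_balancing_score(Z, t, n_iter=100, tol=1e-10):
    theta = np.zeros(Z.shape[1])
    g = balance_moments(theta, Z, t)
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        eta = Z @ theta
        d = -(t * np.exp(-eta) + (1 - t) * np.exp(eta))   # derivative of resid in eta
        jac = (Z * d[:, None]).T @ Z / len(t)
        step = np.linalg.solve(jac, g)
        s = 1.0
        while s > 1e-8:                                   # halve until the imbalance shrinks
            theta_new = theta - s * step
            g_new = balance_moments(theta_new, Z, t)
            if np.linalg.norm(g_new) < np.linalg.norm(g):
                break
            s /= 2
        theta, g = theta_new, g_new
    return theta

rng = np.random.default_rng(2)
n = 5_000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([-0.5, 1.0, -0.7])                  # hypothetical coefficients
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-Z @ theta_true)))

theta_bal = fit_balancing_score(Z, t)
p_bal = 1.0 / (1.0 + np.exp(-Z @ theta_bal))
w1, w0 = t / p_bal, (1 - t) / (1 - p_bal)
print((w1 @ Z) / w1.sum(), (w0 @ Z) / w0.sum())           # weighted covariate means
```

Because the intercept is among the balanced functions, the two weighted covariate means printed at the end coincide up to numerical error, which is exactly what the moment condition demands.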

Integrated Propensity Score (IPS)

IPS estimation generalizes classical covariate balancing to enforce infinite-dimensional balance conditions: $\mathbb{E}\left[\{w_1(T,\mathbf{Z};\theta) - w_0(T,\mathbf{Z};\theta)\}\,\omega(\mathbf{Z};u)\right] = 0$ for all $u$, where $w_1$ and $w_0$ are stabilized weights. IPS minimizes the integrated squared empirical covariate imbalance

$\int \left[ \frac{1}{N}\sum_{i=1}^N \{w_1(T_i,\mathbf{Z}_i;\theta) - w_0(T_i,\mathbf{Z}_i;\theta)\}\,\omega(\mathbf{Z}_i;u) \right]^2 \, d\Psi(u)$

over a rich class of weight functions $\omega(\cdot;u)$. This global approach more fully exploits covariate distribution balance compared to finite-moment CBPS and empirically yields improved finite-sample and robustness properties (Sant'Anna et al., 2018).
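The criterion can be illustrated by evaluating an integrated imbalance for a given set of scores. The sketch below assumes indicator weight functions $1\{\mathbf{Z} \le u\}$ and integrates over the empirical distribution of $\mathbf{Z}$; both choices, along with the function name and simulated data, are illustrative assumptions rather than the authors' implementation. It only evaluates the criterion and does not minimize it.

```python
import numpy as np

def integrated_imbalance(p, Z, t):
    """Mean over u of H(u)^2, where H(u) = mean((w1 - w0) * 1{Z <= u})."""
    w1 = t / p
    w0 = (1 - t) / (1 - p)
    diff = w1 / w1.mean() - w0 / w0.mean()     # contrast of stabilized weights
    # ind[i, j] is 1 when every coordinate of Z[i] is <= the same coordinate of Z[j].
    ind = np.all(Z[:, None, :] <= Z[None, :, :], axis=2).astype(float)
    H = diff @ ind / len(t)                    # H evaluated at each u = Z[j]
    return np.mean(H ** 2)

rng = np.random.default_rng(3)
n = 2_000
Z = rng.normal(size=(n, 2))
p_true = 1.0 / (1.0 + np.exp(-(0.8 * Z[:, 0] + 0.5 * Z[:, 1])))   # hypothetical score
t = rng.binomial(1, p_true)

imb_true = integrated_imbalance(p_true, Z, t)                  # correct score
imb_const = integrated_imbalance(np.full(n, t.mean()), Z, t)   # ignores covariates
print(imb_true, imb_const)
```

A constant score leaves the treated and control covariate distributions as they are, so its imbalance is the larger of the two; an IPS-type estimator would search over $\theta$ for the score that drives this quantity down.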

3. Nonparametric and Machine Learning Advances

Deep-Learning–Based Nonparametric Estimation: LBC-Net

LBC-Net achieves nonparametric estimation with a three-layer feed-forward neural network, optimizing an objective enforcing two necessary and sufficient conditions for identification:

Local balance: For a candidate score $p(\mathbf{Z})$, $T$ is conditionally independent of $\mathbf{Z}$ given $p(\mathbf{Z})$, enforced locally across score neighborhoods by kernel-weighted moment constraints.
Local calibration: $p(\mathbf{Z})$ matches local average treatment probabilities.

The loss function is

$Q(\theta) = Q_1(\theta) + \lambda\, Q_2(\theta)$

with $Q_1$ enforcing local balance, $Q_2$ local calibration, and $\lambda$ a tuning parameter. This architecture flexibly adapts to arbitrarily complex covariate–treatment relationships and circumvents parametric misspecification, yielding robust covariate balance and low bias under both correct specification and severe model violations (Peng et al., 2024).
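The two conditions can be checked for any candidate score with kernel-weighted moments. The sketch below is a diagnostic only, not the LBC-Net loss or network. It assumes a Gaussian kernel in score space, a bandwidth of 0.05, and the moment $(T - p)\,[1, \mathbf{Z}]$, whose first component probes local calibration and whose remaining components probe local balance.

```python
import numpy as np

def local_moments(score, Z, t, grid, h=0.05):
    """Kernel-weighted averages of (T - score) * [1, Z] around each grid point."""
    X = np.column_stack([np.ones(len(t)), Z])        # column 0 checks calibration
    rows = []
    for c in grid:
        k = np.exp(-0.5 * ((score - c) / h) ** 2)    # Gaussian kernel in score space
        rows.append((k * (t - score)) @ X / k.sum())
    return np.array(rows)                            # shape (len(grid), 1 + Z.shape[1])

rng = np.random.default_rng(4)
n = 4_000
Z = rng.normal(size=(n, 2))
p_true = 1.0 / (1.0 + np.exp(-(Z[:, 0] + Z[:, 1])))  # hypothetical true score
t = rng.binomial(1, p_true)
p_mis = 1.0 / (1.0 + np.exp(-Z[:, 0]))               # candidate that omits Z[:, 1]

grid = np.linspace(0.2, 0.8, 7)
err_true = np.mean(local_moments(p_true, Z, t, grid) ** 2)
err_mis = np.mean(local_moments(p_mis, Z, t, grid) ** 2)
print(err_true, err_mis)
```

The candidate that omits a covariate shows clearly larger local moments, because treatment still depends on the omitted covariate within each score neighborhood.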

Gradient Boosted Trees, Random Forests, and Ensemble Methods

Gradient boosted trees (bCART), random forests, and Super Learner/ensemble approaches have also been deployed for propensity scoring, where they are particularly adept at capturing nonlinearities and interactions among many covariates. Empirical results indicate that boosting and CBPS with higher-order balancing achieve low bias and variance for a wide range of estimands (Orihara, 2022).

4. Robustness: Regularization, High-Dimensional, and Calibration Approaches

Regularized Calibration (RCAL)

Calibrated estimating equations and associated $\ell_1$-regularized losses address high-dimensional propensity score modeling. The calibration loss controls both likelihood divergence and a mean-squared relative error term directly linked to the MSE of the resulting IPW estimator. This approach, especially when combined with penalized M-estimation, yields robust, stable weights, which are essential for high-dimensional, possibly misspecified propensity scores (Tan, 2017).
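One way to write such a penalized calibration objective is given below. It is stated as an assumption, since the text above does not give the formula. It uses a logistic score $p(\mathbf{Z};\theta) = \{1+\exp(-\mathbf{Z}^T\theta)\}^{-1}$ and targets the treated-arm weights $1/p$:

$\ell_{\mathrm{CAL}}(\theta) = \frac{1}{N}\sum_{i=1}^N \left\{ T_i \exp(-\mathbf{Z}_i^T\theta) + (1-T_i)\,\mathbf{Z}_i^T\theta \right\}, \qquad \hat{\theta} = \arg\min_\theta \left\{ \ell_{\mathrm{CAL}}(\theta) + \lambda \|\theta\|_1 \right\}$

Since $\exp(-\mathbf{Z}^T\theta) = 1/p(\mathbf{Z};\theta) - 1$, the gradient is

$\nabla \ell_{\mathrm{CAL}}(\theta) = \frac{1}{N}\sum_{i=1}^N \left\{ 1 - \frac{T_i}{p(\mathbf{Z}_i;\theta)} \right\} \mathbf{Z}_i$

Without the penalty, the minimizer therefore equates the inverse-probability-weighted covariate means among the treated with the full-sample means. With the penalty, the first-order conditions bound each coordinate of this imbalance by $\lambda$ in absolute value. Penalizing every coefficient is a simplification here; leaving an intercept unpenalized is a common variant.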

High-Dimensional Covariate Balancing Propensity Score (HD-CBPS)

HD-CBPS couples penalized M-estimation for both PS and outcome regressions with a calibration step enforcing balance on outcome-relevant covariates. This estimator achieves semiparametric efficiency under correct specification and retains root-$N$ consistency and asymptotic normality if either model is correct, enabling valid inference in very high dimensions (Ning et al., 2018).

Calibration and Conformal Approaches

Recent work has established that probabilistic calibration of the PS model, i.e., ensuring $\Pr(T=1 \mid \hat{p}(\mathbf{Z}) = q) = q$ for all $q \in [0,1]$, is necessary for unbiased IPTW and AIPW estimation. Simple recalibration via isotonic regression or Platt scaling can guarantee this property, improving finite-sample error bounds and reducing the instability due to extreme weights. This recalibration is effective for standard, high-dimensional, and even unstructured covariate settings (e.g., GWAS or image data), with empirical error reductions and computational advantages (Deshpande et al., 2023).
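Isotonic recalibration can be sketched with the pool-adjacent-violators algorithm, which fits a nondecreasing map from the raw score to the observed treatment rate. The implementation below is a minimal NumPy version. The function name, the deliberately overconfident raw score, and the simulated data are assumptions made for illustration; the sketch recalibrates on the same sample it is fitted to and assumes no tied scores.

```python
import numpy as np

def isotonic_recalibrate(score, t):
    """Pool-adjacent-violators fit of T on the score; returns recalibrated scores."""
    order = np.argsort(score)
    blocks = []                                   # each block is [sum of T, count]
    for v in t[order]:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean is not below the current block's mean.
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = np.concatenate([np.full(c, s / c) for s, c in blocks])
    fitted = np.empty_like(fitted_sorted)
    fitted[order] = fitted_sorted                 # undo the sort
    return fitted

rng = np.random.default_rng(7)
n = 10_000
z = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-z))                 # hypothetical true score
t = rng.binomial(1, p_true)
score = 1.0 / (1.0 + np.exp(-2.5 * z))            # overconfident raw score

p_cal = isotonic_recalibrate(score, t)
print(np.mean(np.abs(score - p_true)), np.mean(np.abs(p_cal - p_true)))
```

Within every level set of the recalibrated score, the observed treatment rate equals the score exactly, which is the in-sample version of the calibration property. The fit can return values of exactly 0 or 1 at the extremes, so scores are usually clipped away from the boundary before forming inverse weights.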

5. Extensions: Sampling, Missingness, Measurement Error, Subgroup Balance

Nonrandom Sampling: Oversampling, Length-Bias

In oversampled cohorts (e.g., rare exposome studies), PS estimation must adjust for induced nonidentifiability. Weighted likelihoods, using external estimates of exposure prevalence, restore identifiability and achieve asymptotic optimality. Such weighting can be applied universally to any model class that accepts sample weights, enabling flexible, consistent PS estimation in complex survey or case–control settings (Rose, 2018).
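A weighted-likelihood correction of this kind can be sketched as follows. The simulation oversamples exposed subjects, then reweights each subject by the ratio of the population to the sample prevalence of their exposure status. The function name, the sampling design, the coefficients, and the use of the simulated population prevalence in place of an external estimate are all illustrative assumptions.

```python
import numpy as np

def fit_weighted_logistic(Z, t, w, n_iter=30):
    """Newton-Raphson for a logistic log-likelihood with observation weights w."""
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))
        grad = Z.T @ (w * (t - p))
        hess = (Z * (w * p * (1 - p))[:, None]).T @ Z
        theta = theta + np.linalg.solve(hess, grad)
    return theta

rng = np.random.default_rng(5)
N = 100_000
z = rng.normal(size=N)
theta_true = np.array([-3.0, 1.0])                     # rare exposure, hypothetical values
t_pop = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta_true[0] + theta_true[1] * z))))
q_pop = t_pop.mean()                                   # stands in for an external prevalence estimate

# Oversampled cohort: every exposed subject plus an equal number of unexposed.
idx1 = np.flatnonzero(t_pop == 1)
idx0 = rng.choice(np.flatnonzero(t_pop == 0), size=len(idx1), replace=False)
idx = np.concatenate([idx1, idx0])
Z = np.column_stack([np.ones(len(idx)), z[idx]])
t = t_pop[idx]

q_s = t.mean()                                         # sample prevalence (0.5 here)
w = np.where(t == 1, q_pop / q_s, (1 - q_pop) / (1 - q_s))
theta_w = fit_weighted_logistic(Z, t, w)               # prevalence-weighted fit
theta_u = fit_weighted_logistic(Z, t, np.ones(len(t))) # unweighted fit
print(theta_w, theta_u)
```

In this logistic setting the unweighted fit mainly distorts the intercept, so its propensity scores are shifted toward 0.5, while the weighted fit recovers the population-level intercept. Because the correction enters only through observation weights, the same idea carries over to any learner that accepts sample weights.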

For length-biased sampling, as in prevalent cohort studies with survival outcomes, estimation via properly weighted logistic regression—where weights incorporate censoring and length-bias corrections—yields asymptotically unbiased PS estimates, outperforming both unadjusted and model-based alternatives when standard sampling assumptions are violated (Ertefaie et al., 2013).

Measurement Error in Covariates

When covariates are measured with error, naïve PS estimation using the observed proxies leads to attenuation and biased causal effect estimation. The impact depends on the correlation structure among the true confounders and measurement errors. Incorporating correctly measured auxiliary variables correlated with mismeasured confounders in the PS model can mitigate bias. Sensitivity analyses and, when possible, measurement-error–corrected or Bayesian approaches are recommended for robust inference (1706.02283).

Subgroup Covariate Balance and Heterogeneous Effects

G-SBPS and its kernelized extension (kG-SBPS) achieve simultaneous mean covariate balance in user-specified subgroups (overlapping or non-overlapping) via augmented design matrices or high-dimensional kernel features, respectively. The resulting weights enforce exact balance not just globally, but within all designated subgroups, dramatically improving subgroup effect estimation and covariate control—especially critical in studies of treatment effect heterogeneity (Li et al., 2024).

6. Practical Implementation, Algorithmic Considerations, and Empirical Findings

Common PS-based causal estimands include the ATE and ATT. Inverse probability weighting, matching, and stratification are the dominant analytic strategies. Recent comprehensive simulation studies spanning a range of model complexities confirm that stabilized IPW yields the smallest bias and MSE for the ATE in large samples, with matching and stratification providing more transparent estimands applicable to the domain of observed individuals but sometimes exhibiting attenuation or modest residual bias. Empirically, robust, balancing-motivated propensity score methods consistently outperform purely likelihood-based or naïve models, particularly under misspecification or high covariate dimensionality (Poletto et al., 2024).

Table: Major Classes of Propensity Score Estimation Techniques

Method Class	Balance Mechanism	Robust to Misspecification?
Logistic/ML	Model-based (likelihood)	No
Covariate Balancing (CBPS)	Explicit moment conditions	Partial
Integrated PS (IPS)	Global functional balance	Stronger
Deep Learning (LBC-Net)	Local balance/calibration	Yes
Regularization/RCAL	Penalized calibration loss	Yes
Kernelized/kG-SBPS	RKHS-based subgroup balance	Yes
Ensemble/Boosted Trees	Flexible, data-driven fitting	Partial
Bayesian Sparse PS	Spike-and-slab selection	Yes (oracle)

Empirical and theoretical evaluations consistently show that balancing-focused and regularized methods (e.g., LBC-Net, CBPS, RCAL, HD-CBPS, kG-SBPS) provide the most robust and reliable weighting for causal inference, especially under model uncertainty, data complexity, and high-dimensional covariates.

7. Theoretical Guarantees and Double Robustness

Necessary and sufficient conditions for proper PS estimation are well characterized:

The score must satisfy local balance (conditional independence via balancing moments in neighborhoods of the score).
Local calibration ensures the score equals the true treatment probability.

Constructing estimators with these properties, as in LBC-Net, achieves identification and root-$N$-consistent estimation. The classical double robust property (the estimator is consistent if either the PS model or the outcome regression model is correct) is attained in several frameworks: augmented matching weights, calibrated information projection, and optimized CBPS estimators. Recent analyses show that optimal balancing and kernel-based approaches can match the semiparametric efficiency bound in large samples or under appropriate sieve-approximation schemes (Peng et al., 2024, Ning et al., 2018, Fan et al., 2021, Wang et al., 2023).
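Double robustness can be illustrated with the standard augmented IPW (AIPW) estimator of the ATE, used here as a generic example rather than any of the specific frameworks cited above. The simulation is hypothetical (one confounder, a linear outcome, a true ATE of 2.0) and deliberately misspecifies one model at a time.

```python
import numpy as np

def aipw_ate(y, t, p, m1, m0):
    """Augmented IPW estimate of the ATE from scores p and outcome predictions m1, m0."""
    return np.mean(m1 - m0 + t * (y - m1) / p - (1 - t) * (y - m0) / (1 - p))

rng = np.random.default_rng(6)
n = 20_000
z = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-z))
t = rng.binomial(1, p_true)
y = 2.0 * t + 1.5 * z + rng.normal(size=n)        # true ATE is 2.0 by construction

X = np.column_stack([np.ones(n), z])
# Correct outcome model: a separate linear regression in each arm.
b1 = np.linalg.lstsq(X[t == 1], y[t == 1], rcond=None)[0]
b0 = np.linalg.lstsq(X[t == 0], y[t == 0], rcond=None)[0]
m1_ok, m0_ok = X @ b1, X @ b0
# Wrong outcome model: arm means only, ignoring z.
m1_bad = np.full(n, y[t == 1].mean())
m0_bad = np.full(n, y[t == 0].mean())
# Wrong propensity model: a constant score.
p_bad = np.full(n, t.mean())

ate_ps_wrong = aipw_ate(y, t, p_bad, m1_ok, m0_ok)      # outcome model rescues it
ate_om_wrong = aipw_ate(y, t, p_true, m1_bad, m0_bad)   # propensity score rescues it
ate_both_wrong = aipw_ate(y, t, p_bad, m1_bad, m0_bad)  # nothing rescues it
print(ate_ps_wrong, ate_om_wrong, ate_both_wrong)
```

With either model correct the estimate stays close to the true effect; with both wrong it collapses to the confounded difference in means.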

In summary, the estimation of propensity scores is a mature yet rapidly evolving area, incorporating classical parametric, machine learning, nonparametric, regularized, calibration-based, and robustness-oriented techniques. Advances in theory and practice increasingly favor approaches that optimize covariate balance directly, enforce local or global calibration, and adapt to model complexity and empirical challenges, ensuring valid causal inference even under challenging data-generating mechanisms.