Meta-Learners: Frameworks for Learning to Learn
- Meta-learners are meta-algorithmic frameworks that decompose complex learning tasks into simpler supervised subproblems, facilitating rapid adaptation when data is scarce.
- In causal inference, they enable estimation of conditional treatment effects via techniques like the S-, T-, and X-learners, which recast the problem as standard regression tasks.
- These frameworks find applications in few-shot learning, weakly-supervised segmentation, and adversarial online learning, offering robust performance even in complex task distributions.
A meta-learner is a meta-algorithmic framework that leverages the solution of simpler subproblems—often standard supervised learning or regression problems—to perform “learning to learn” at the task or distributional level. The goal is to extract, transfer, or regularize statistical structure across tasks or environments, and thereby achieve rapid adaptation, robust estimation, or efficient predictive inference, even when base task data are limited or the underlying task distribution is complex. Meta-learners are prominent in modern causal inference, few-shot learning, weakly/sparsely-supervised segmentation, Bayesian sequence learning, adversarial online learning, and foundational theoretical studies of multi-task and representation learning.
1. Canonical Meta-Learner Frameworks in Causal Inference
Meta-learners in the context of heterogeneous treatment effect estimation decompose the conditional average treatment effect (CATE) problem—which cannot be directly posed as standard supervised learning—into one or several regression subproblems solvable by any regression or machine learning base learner (e.g., random forests (RF), Bayesian additive regression trees (BART), neural nets) (Künzel et al., 2017).
Let the observed data be i.i.d. units $(X_i, W_i, Y_i)$ with covariates $X_i \in \mathcal{X}$, binary treatment indicator $W_i \in \{0, 1\}$, and observed outcome $Y_i = Y_i(W_i)$. The CATE is defined as $\tau(x) = \mathbb{E}\left[Y_i(1) - Y_i(0) \mid X_i = x\right]$.
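As a concrete illustration of the regression decomposition described above, the following sketch simulates a randomized design with known CATE $\tau(x) = 2x_1$ and recovers it with two of the meta-learners. This is an illustrative assumption, not Künzel et al.'s implementation: the data-generating process, the choice of scikit-learn random forests as base learners, and all variable names are invented for the example.

```python
# Illustrative sketch (assumed setup): simulate a randomized experiment with a
# known CATE and estimate it by decomposing the problem into regressions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-1, 1, size=(n, 2))           # covariates X_i
W = rng.integers(0, 2, size=n)                # randomized binary treatment W_i
tau_true = 2.0 * X[:, 0]                      # true CATE: tau(x) = 2 * x_1
Y = X[:, 1] + W * tau_true + rng.normal(0, 0.1, size=n)  # observed outcome Y_i

x_test = np.array([[0.5, 0.0], [-0.5, 0.0]])  # true CATE here: +1.0 and -1.0

# S-learner: one model mu_hat(x, w) with W as a feature;
# CATE estimate is mu_hat(x, 1) - mu_hat(x, 0).
s = RandomForestRegressor(n_estimators=100, random_state=0)
s.fit(np.column_stack([X, W]), Y)
tau_s = (s.predict(np.column_stack([x_test, np.ones(len(x_test))]))
         - s.predict(np.column_stack([x_test, np.zeros(len(x_test))])))

# T-learner: separate models on treated and control units;
# CATE estimate is mu1_hat(x) - mu0_hat(x).
m1 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[W == 1], Y[W == 1])
m0 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[W == 0], Y[W == 0])
tau_t = m1.predict(x_test) - m0.predict(x_test)
```

Both estimators reduce the unobservable target $\tau(x)$ to ordinary supervised fits, and any regression method can be substituted for the random forests.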
Major meta-learners:
- S-Learner: Fit a single regression model $\hat{\mu}(x, w) \approx \mathbb{E}[Y \mid X = x, W = w]$ for $Y$ on $(X, W)$, treating $W$ as an ordinary feature. Estimate the CATE as $\hat{\tau}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0)$. Pools all data, but can bias estimates toward zero if $W$ is a weak predictor.
- T-Learner: Fit two separate models, $\hat{\mu}_1(x)$ using treated units and $\hat{\mu}_0(x)$ using controls. Compute the CATE as $\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)$. Handles highly non-overlapping response functions, but does not borrow strength across arms when the CATE is structurally simpler than the base outcome functions.
- X-Learner: A three-stage procedure that exploits both shared structure and unbalanced designs:
- Fit $\hat{\mu}_0(x)$ and $\hat{\mu}_1(x)$ with base learners, as in the T-learner.
- Impute pseudo-outcomes (difference-in-differences): $D_i^{(1)} = Y_i - \hat{\mu}_