Heterogeneous Multi-treatment Uplift Modeling
- HMUM is the estimation of individualized causal effects for several treatments using control-referenced contrasts to determine optimal interventions.
- Methodologies include decomposed X-/R-learners, uplift trees/forests, and joint neural architectures that tackle the challenges of multi-action decision making.
- Empirical studies in marketing, pricing, and platform recommendation demonstrate HMUM’s practical impact on revenue uplift and policy optimization.
Heterogeneous Multi-treatment Uplift Modeling (HMUM) is the study of individualized causal effect estimation and treatment assignment when there are multiple candidate interventions rather than a single treatment–control contrast. In the most common discrete formulation, the objective is to estimate treatment-specific effects relative to control, such as $\tau_k(x)=\mathbb{E}[Y(k)-Y(0)\mid X=x]$, and then select the treatment that maximizes expected response for each covariate profile; in related extensions, the treatment space may be ordered or continuous, and the decision criterion may incorporate multiple responses, costs, or trade-offs rather than a single outcome alone (Zhao et al., 2017, Sun et al., 2024, Nandy et al., 2022, Zhai et al., 24 Nov 2025).
1. Formalization and scope
HMUM inherits its causal foundation from the potential-outcomes formulation that also unifies binary uplift modeling and conditional average treatment effect estimation. In the binary case, the core estimand is $\tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x]$, while uplift is written as $u(x)=\mathbb{E}[Y\mid X=x,T=1]-\mathbb{E}[Y\mid X=x,T=0]$; under overlap, SUTVA, and unconfoundedness, these coincide. The multi-treatment generalization used throughout the HMUM literature replaces $T\in\{0,1\}$ with $T\in\{0,1,\dots,K\}$ or $T\in\{0\}\cup\mathcal Z$, yielding treatment-specific contrasts relative to control, $\tau_k(x)=\mathbb{E}[Y(k)-Y(0)\mid X=x]$ or $\tau_z(x)=\mathbb{E}[Y(z)-Y(0)\mid X=x]$ for $z\in\mathcal Z$, and individualized decision rules of the form $\pi^*(x)=\arg\max_{k}\mathbb{E}[Y(k)\mid X=x]$ (Zhang et al., 2020, Zhao et al., 2017, Nandy et al., 2022).
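The control-referenced decision rule can be illustrated with a minimal sketch. It assumes a randomized experiment and a single discrete covariate, so that each conditional mean can be estimated by a cell average; the function names `estimate_contrasts` and `recommend` and the toy data are hypothetical and are not taken from any cited paper.

```python
from collections import defaultdict

def estimate_contrasts(rows, control=0):
    """rows: iterable of (x, t, y). Returns {x: {arm: mean(y | x, arm) - mean(y | x, control)}}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, t, y in rows:
        sums[(x, t)] += y
        counts[(x, t)] += 1
    means = {key: sums[key] / counts[key] for key in sums}
    contrasts = defaultdict(dict)
    for (x, t), m in means.items():
        if t != control and (x, control) in means:
            contrasts[x][t] = m - means[(x, control)]
    return dict(contrasts)

def recommend(contrasts, control=0):
    """Pick the arm with the largest contrast; keep control when no contrast is positive."""
    policy = {}
    for x, taus in contrasts.items():
        best = max(taus, key=taus.get)
        policy[x] = best if taus[best] > 0 else control
    return policy

# Toy randomized data: (segment, assigned arm, observed response). Illustrative only.
rows = [
    ("new", 0, 0.0), ("new", 0, 1.0), ("new", 1, 1.0), ("new", 1, 1.0),
    ("new", 2, 1.0), ("new", 2, 0.0),
    ("loyal", 0, 1.0), ("loyal", 0, 1.0), ("loyal", 1, 0.0), ("loyal", 1, 1.0),
    ("loyal", 2, 1.0), ("loyal", 2, 0.0),
]
contrasts = estimate_contrasts(rows)
policy = recommend(contrasts)
```

Falling back to control when every contrast is non-positive is equivalent to taking the argmax over all arms including control, since the control contrast is zero by construction.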
A notable empirical regularity is that most published HMUM methods are control-referenced. Multi-treatment X-/R-learner extensions, treatment-comparison causal forests, calibration pipelines, and neural architectures such as M$^3$TN all primarily estimate $\tau_k(x)$ for each active treatment $k$ relative to a common baseline. When no control arm exists, some work falls back to pairwise treatment comparisons followed by majority voting rather than a unified multi-action learner (Zhao et al., 2019, Gubela et al., 2021, Park et al., 2024).
The treatment space itself varies across the literature. Some papers study multiple discrete arms such as promotions, coupons, emails, or short-video strategies; others generalize uplift estimation to ordinal or continuous treatments, learning $\tau(x,t)$ over treatment intervals or dose levels rather than only over categorical arms (Nandy et al., 2022, Wan et al., 2022). Outcome types also vary materially: methods are described for binary, discrete, and continuous responses, and recent recommendation work further treats multiple responses jointly, including APP usage time and video view counts, with personalized trade-off optimization over them (Zhao et al., 2017, Gubela et al., 2021, Zhai et al., 24 Nov 2025).
Across these settings, identification assumptions remain central. Randomized experiments are the dominant design in the directly multi-treatment papers, while observational extensions usually rely on strong ignorability and overlap and often do not provide explicit multi-arm debiasing inside the model itself (Sun et al., 2024, Nandy et al., 2022, Zhai et al., 24 Nov 2025).
2. Methodological families
The HMUM literature contains three broad design patterns: decomposed treatment-vs-control learners, direct tree/forest methods that optimize treatment selection or subgroup heterogeneity, and joint neural architectures that share information across treatments while preserving treatment-specific structure (Zhao et al., 2019, Zhao et al., 2017, Sun et al., 2024, Zhai et al., 24 Nov 2025).
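Decomposed meta-learner approaches extend familiar binary estimators arm by arm. In the multi-treatment X-Learner of Zhao et al., one first estimates outcome models $\hat\mu_{t}(x)$ for each arm, constructs pseudo-effects for treatment $t_j$ versus control $t_0$, estimates propensities $\hat e_{t}(x)$, and combines the pseudo-effect regressions fitted on the control group, $\hat\tau^{t_j}_{0}$, and on the arm-$t_j$ group, $\hat\tau^{t_j}_{1}$, into
$$\hat\tau^{t_j}(x)=\frac{\hat e_{t_j}(x)}{\hat e_{t_j}(x)+\hat e_{t_0}(x)}\,\hat\tau^{t_j}_{0}(x)+\frac{\hat e_{t_0}(x)}{\hat e_{t_j}(x)+\hat e_{t_0}(x)}\,\hat\tau^{t_j}_{1}(x).$$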
The same paper extends the R-learner idea and then selects the arm with the largest predicted uplift or net-value uplift (Zhao et al., 2019). A closely related applied variant uses separate causal forests for each treatment-versus-control comparison in continuous-outcome marketing settings, which the paper terms multiple treatment revenue uplift modeling (MT-Rev) (Gubela et al., 2021).
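The arm-by-arm X-learner steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes one continuous covariate, univariate least-squares base learners, known randomization probabilities in place of estimated propensities, and the standard X-learner convention that the control-group regression is weighted by the arm's share of the two propensities. The helpers `fit_line` and `x_learner` and the synthetic effects are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns a predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def x_learner(data, arms, propensity, control=0):
    """data: list of (x, t, y); propensity: {arm: assignment probability}."""
    by_arm = {t: [(x, y) for x, tt, y in data if tt == t] for t in [control] + arms}
    # Stage 1: one outcome model per arm.
    mu = {t: fit_line(*zip(*pts)) for t, pts in by_arm.items()}
    tau = {}
    for j in arms:
        # Stage 2: pseudo-effects for arm-j units and for control units.
        xj, dj = zip(*[(x, y - mu[control](x)) for x, y in by_arm[j]])
        x0, d0 = zip(*[(x, mu[j](x) - y) for x, y in by_arm[control]])
        from_treated, from_control = fit_line(xj, dj), fit_line(x0, d0)
        # Stage 3: propensity-weighted combination of the two regressions.
        g = propensity[j] / (propensity[j] + propensity[control])
        tau[j] = (lambda x, g=g, c=from_control, t=from_treated:
                  g * c(x) + (1 - g) * t(x))
    return tau

# Synthetic randomized data with assumed true effects 2x (arm 1) and 1 - x (arm 2).
random.seed(0)
true_effect = {1: lambda x: 2 * x, 2: lambda x: 1 - x}
data = []
for _ in range(3000):
    x, t = random.random(), random.choice([0, 1, 2])
    y = 1 + x + (true_effect[t](x) if t else 0.0) + random.gauss(0, 0.1)
    data.append((x, t, y))

tau_hat = x_learner(data, arms=[1, 2], propensity={0: 1 / 3, 1: 1 / 3, 2: 1 / 3})

def recommend(x):
    """Plug-in argmax over the active arms."""
    return max(tau_hat, key=lambda j: tau_hat[j](x))
```

With identical linear learners in both stages, the two pseudo-effect regressions coincide with the plain difference of outcome models, so the weighting only matters for nonlinear or regularized learners. The linear choice here is for brevity.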
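Direct tree and forest methods instead optimize uplift or policy value more explicitly. The CTS algorithm defines an uplift model as a policy $h:\mathcal X\to\{0,1,\dots,K\}$, evaluates it with the unbiased estimator
$$\bar z=\frac{1}{N}\sum_{i=1}^{N} z^{(i)},\qquad z^{(i)}=\sum_{t=0}^{K}\frac{1}{p_t}\,y^{(i)}\,\mathbb{1}\{h(x^{(i)})=t\}\,\mathbb{1}\{t^{(i)}=t\},$$
and builds randomized trees using a split criterion based on the increase in expected response under personalized treatment assignment: $$\Delta\mu(s)=P(X\in\phi_l\mid X\in\phi)\max_{t}\mathbb{E}[Y\mid X\in\phi_l,T=t]+P(X\in\phi_r\mid X\in\phi)\max_{t}\mathbb{E}[Y\mid X\in\phi_r,T=t]-\max_{t}\mathbb{E}[Y\mid X\in\phi,T=t],$$ where the split $s$ divides node $\phi$ into children $\phi_l$ and $\phi_r$. This makes CTS a genuine multi-treatment policy learner for randomized experiments with arbitrary response types (Zhao et al., 2017).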
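A split criterion of this kind, the gain in expected response when each child node is allowed its own best treatment, can be sketched with plain sample means. The sketch omits any regularization or minimum-sample safeguards that a practical implementation would need, and the names `best_arm_mean` and `split_gain` are hypothetical.

```python
def best_arm_mean(rows):
    """Largest per-treatment mean response among rows of (x, t, y)."""
    totals = {}
    for _, t, y in rows:
        s, n = totals.get(t, (0.0, 0))
        totals[t] = (s + y, n + 1)
    return max(s / n for s, n in totals.values())

def split_gain(rows, threshold):
    """Weighted best-arm response of the two children minus that of the parent."""
    left = [r for r in rows if r[0] < threshold]
    right = [r for r in rows if r[0] >= threshold]
    if not left or not right:
        return 0.0
    p_left = len(left) / len(rows)
    return (p_left * best_arm_mean(left)
            + (1 - p_left) * best_arm_mean(right)
            - best_arm_mean(rows))

# Toy data: arm 0 works for x < 0.5, arm 1 works for x >= 0.5.
rows = [
    (0.2, 0, 1.0), (0.3, 1, 0.0), (0.1, 0, 1.0), (0.4, 1, 0.0),
    (0.7, 0, 0.0), (0.8, 1, 1.0), (0.6, 0, 0.0), (0.9, 1, 1.0),
]
gain_good = split_gain(rows, threshold=0.5)   # separates the two regimes
gain_poor = split_gain(rows, threshold=0.25)  # mixes them in the right child
```

A split that isolates subgroups with different best treatments scores higher than one that leaves them mixed, which is the property a policy-oriented tree exploits.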
Later tree-based generalizations extend the treatment space itself. Generalized Causal Tree (GCT) learns joint partitions of covariate space and treatment space, estimates $\tau(x,t)$, and post-processes the learned tree into a subgroup-by-treatment table $\{\hat\tau_{ij}\}$, where $\{\mathcal X_i\}$ partitions the covariate space $\mathcal X$ and $\{\mathcal T_j\}$ partitions the treatment space $\mathcal T$ (Nandy et al., 2022). Generalized Causal Forest (GCF) replaces scalar heterogeneous effects with function-valued dose-response objects and splits nodes by a distance $D$ between the estimated treatment-effect curves of the candidate child nodes, with $D$ instantiated as one of three distance types over treatment-effect functions (Wan et al., 2022).
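Neural HMUM architectures aim to avoid both the sample fragmentation of one-vs-control decomposition and the inefficiency of naively replicating binary-treatment heads. M$^3$TN uses Multi-gate Mixture-of-Experts for head-specific representation learning, with shared experts $f_i$ and a softmax gate $g^{k}$ for each head $k$,
$$f^{k}(x)=\sum_{i=1}^{n} g^{k}_i(x)\,f_i(x),\qquad g^{k}(x)=\operatorname{softmax}(W_{g}^{k}x),$$
and models treatment response through an explicit uplift reparameterization,
$$\hat y(x,t)=\hat\mu_0(x)+\sum_{k=1}^{K}\mathbb{1}\{t=k\}\,\hat\tau_k(x),$$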
so that uplift is a primitive output rather than a subtraction of two independently learned response heads (Sun et al., 2024). The short-video HMUM framework of 2025 takes a hybrid branch-wise approach: separate treatment branches estimate individual causal effects, joint optimization and a KL alignment term stabilize control representations across branches, and an online Dynamic Decision-Making module later combines multiple response uplifts with request-level weights (Zhai et al., 24 Nov 2025).
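The two ingredients described above, gate-weighted mixing of shared experts and a response expressed as control prediction plus uplift, can be sketched as a forward pass in plain Python. This is a schematic illustration under assumed shapes, not the published architecture; `mmoe_head_input` and `predict_response` are hypothetical names, and no training is shown.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mmoe_head_input(expert_outputs, gate_logits):
    """Gate-weighted sum of expert output vectors for a single head."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [sum(w * e[d] for w, e in zip(weights, expert_outputs)) for d in range(dim)]

def predict_response(control_pred, uplift_preds, t):
    """Response under arm t: the control prediction, plus the arm's uplift if t is active."""
    return control_pred if t == 0 else control_pred + uplift_preds[t]

# Two experts with 2-dimensional outputs and equal gate logits (illustrative values).
mixed = mmoe_head_input([[1.0, 0.0], [0.0, 1.0]], gate_logits=[0.0, 0.0])
y_hat = predict_response(0.2, {1: 0.1, 2: -0.05}, t=2)
```

Because the uplift for each arm is produced directly by its own head, the factual loss on the observed arm trains the control head and that arm's uplift head together, rather than two independent response heads.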
| Method family | Treatment space | Distinctive formulation |
|---|---|---|
| Multi-treatment X/R learners | Discrete arms, usually with control | Armwise pseudo-effects, propensities, plug-in argmax |
| CTS / uplift trees | Multiple discrete treatments | Split on expected-response gain under personalized assignment |
| GCT / GCF | Discrete, ordinal, or continuous | Joint $\mathcal X\times\mathcal T$ partitioning or function-valued dose-response splitting |
| M$^3$TN / HUM-DDM | Multi-valued discrete treatments | Shared experts or hybrid branches with explicit uplift outputs |
3. Objectives beyond raw uplift
A central development in HMUM is the shift from raw response uplift to decision-specific utility criteria. In the equal-cost setting, the standard objective is to recommend the treatment with highest predicted uplift. When costs differ across arms, however, the relevant estimand becomes a net-value CATE rather than a response CATE. Zhao et al. introduce impression cost $c_{t_j}$, triggered cost $s_{t_j}$, and per-conversion value $v$, define the expected net value under treatment $t_j$ as
$$\mathbb{E}\big[(v-s_{t_j})\,Y(t_j)-c_{t_j}\mid X=x\big]$$
and then estimate the net-value uplift relative to control
$$\tau^{t_j}_{\mathrm{net}}(x)=\mathbb{E}\big[(v-s_{t_j})\,Y(t_j)-c_{t_j}\mid X=x\big]-\mathbb{E}\big[(v-s_{t_0})\,Y(t_0)-c_{t_0}\mid X=x\big].$$
In this formulation, the recommended treatment is the one with maximal expected incremental net value, not the one with maximal conversion uplift (Zhao et al., 2019).
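A small worked example shows how the two criteria can disagree. It assumes that the triggered cost is incurred only on conversion and the impression cost on every exposure, so that expected net value is (value minus triggered cost) times conversion probability, minus impression cost. All numbers and the names `net_value` and `recommend_by_net_value` are hypothetical.

```python
def net_value(p_conv, value, triggered_cost, impression_cost):
    """Expected net value: (value - triggered_cost) * P(conversion) - impression_cost."""
    return (value - triggered_cost) * p_conv - impression_cost

def recommend_by_net_value(p_conv, costs, value, control=0):
    """p_conv: {arm: conversion probability}; costs: {arm: (triggered, impression)}."""
    base = net_value(p_conv[control], value, *costs[control])
    uplift = {t: net_value(p, value, *costs[t]) - base
              for t, p in p_conv.items() if t != control}
    best = max(uplift, key=uplift.get)
    return (best if uplift[best] > 0 else control), uplift

# Illustrative inputs for one covariate profile.
p_conv = {0: 0.10, 1: 0.16, 2: 0.13}
costs = {0: (0.0, 0.0), 1: (3.0, 0.1), 2: (0.0, 0.05)}

best_net, net_uplift = recommend_by_net_value(p_conv, costs, value=10.0)
best_conv = max((1, 2), key=lambda t: p_conv[t] - p_conv[0])
```

Here arm 1 has the larger conversion uplift (0.06 versus 0.03), but its triggered cost erodes most of the gain, so arm 2 has the larger net-value uplift and is the one recommended.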
An analogous reframing occurs when the outcome is continuous revenue rather than binary conversion. Multiple treatment revenue uplift modeling explicitly distinguishes ST-Conv, ST-Rev, MT-Conv, and MT-Rev, and evaluates treatment choice by incremental cumulative revenue rather than by response probability alone. The practical consequence is that a treatment may be favorable under conversion uplift yet unfavorable under revenue uplift, or vice versa (Gubela et al., 2021).
A different but related issue is cross-treatment comparability of scores. When HMUM is built from separately trained treatment-vs-control models, treatment-specific uplift scores may live on incompatible scales. Park et al. therefore propose calibration of the underlying meta-learner outputs together with cross-treatment score ranking and Z-score normalization, effectively comparing standardized treatment-specific scores rather than raw outputs (Park et al., 2024). This is not a new causal estimand, but it addresses a recurrent operational problem in HMUM deployment: comparing outputs across treatment-specific models without assuming scale equivalence.
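The scale problem and the effect of per-treatment standardization can be illustrated with a minimal sketch. It does not reproduce the calibration pipeline of Park et al.; it only standardizes each treatment's scores across users before taking the per-user argmax. The scores and the names `zscore_by_treatment` and `assign` are hypothetical.

```python
from statistics import mean, pstdev

def zscore_by_treatment(scores):
    """scores: {treatment: [raw score per user]}; standardizes within each treatment."""
    out = {}
    for t, vals in scores.items():
        m, s = mean(vals), pstdev(vals)
        out[t] = [(v - m) / s if s > 0 else 0.0 for v in vals]
    return out

def assign(scores):
    """For each user index, pick the treatment with the largest score."""
    n_users = len(next(iter(scores.values())))
    return [max(scores, key=lambda t: scores[t][i]) for i in range(n_users)]

# Offer B's model outputs live on a larger scale than offer A's (illustrative values).
raw = {
    "offer_a": [0.10, 0.20, 0.30, 0.40],
    "offer_b": [1.00, 1.10, 1.20, 1.05],
}
raw_assignment = assign(raw)
normalized_assignment = assign(zscore_by_treatment(raw))
```

On raw scores every user is assigned offer B purely because of its scale. After standardization, the last user, who ranks highest for offer A and low for offer B, is assigned offer A instead.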
Recent recommendation work pushes the objective further toward multi-response trade-off optimization. In the Kuaishou HMUM framework, the offline model first estimates branch-specific outcomes and converts them into relative uplift
$$u^{(r)}_k(x)=\frac{\hat y^{(r)}_k(x)-\bar y^{(r)}_0(x)}{\bar y^{(r)}_0(x)},$$
where $\bar y^{(r)}_0(x)$ is the averaged control estimate across treatment branches for response $r$. The online decision module then estimates user- and request-specific value weights $w_r(x)$ and scores each treatment by
$$s_k(x)=\sum_{r} w_r(x)\,u^{(r)}_k(x).$$
This replaces fixed global scalarization with personalized dynamic trade-off optimization across conflicting outcomes such as APP usage time and video view counts (Zhai et al., 24 Nov 2025).
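The request-level trade-off can be sketched as follows. The sketch assumes that relative uplift is the difference from the branch-averaged control estimate divided by that estimate, and that a treatment's score is the weighted sum of its relative uplifts across responses; both are assumptions made for illustration, as are the numbers and the names `relative_uplift` and `choose_treatment`.

```python
def relative_uplift(treated_preds, control_preds):
    """Both inputs: {branch: {response: prediction}}. Control is averaged over branches."""
    responses = list(next(iter(treated_preds.values())))
    control_avg = {r: sum(c[r] for c in control_preds.values()) / len(control_preds)
                   for r in responses}
    return {k: {r: (preds[r] - control_avg[r]) / control_avg[r] for r in responses}
            for k, preds in treated_preds.items()}

def choose_treatment(uplift, weights):
    """Score each treatment by the weighted sum of its relative uplifts."""
    scores = {k: sum(weights[r] * u[r] for r in weights) for k, u in uplift.items()}
    return max(scores, key=scores.get), scores

# Illustrative branch outputs for one user.
treated = {"A": {"usage_time": 103.0, "views": 49.0},
           "B": {"usage_time": 99.0, "views": 52.0}}
control = {"A": {"usage_time": 101.0, "views": 50.5},
           "B": {"usage_time": 99.0, "views": 49.5}}
uplift = relative_uplift(treated, control)

# Two requests from the same user with different value weights.
choice_1, _ = choose_treatment(uplift, {"usage_time": 0.8, "views": 0.2})
choice_2, _ = choose_treatment(uplift, {"usage_time": 0.2, "views": 0.8})
```

The same uplift estimates lead to different treatments once the weights change, which is the behavior a fixed global scalarization cannot express.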
4. Evaluation and benchmarking
Offline evaluation is unusually delicate in HMUM because the final object is often a treatment assignment rule rather than a scalar regression function. The most explicit early solution is the CTS estimator for randomized experiments. With known assignment probabilities $p_t$, the sample average $\bar z$ of the importance-weighted variable $z$ is an unbiased estimate of the expected response under policy $h$. This permits both model selection and final offline policy evaluation without observing counterfactual outcomes for every arm (Zhao et al., 2017).
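The importance-weighted estimate is short enough to write out. The sketch below assumes a randomized log of (covariate, assigned arm, response) triples with known assignment probabilities: each response counts only when the logged arm matches the policy's choice, and it is then divided by that arm's assignment probability. The function name `policy_value` and the toy log are hypothetical.

```python
def policy_value(rows, policy, assign_prob):
    """Average of y * 1{policy(x) == t} / p_t over a randomized log of (x, t, y)."""
    total = 0.0
    for x, t, y in rows:
        if policy(x) == t:
            total += y / assign_prob[t]
    return total / len(rows)

# Balanced toy log: every (segment, arm) cell appears twice.
rows = [
    ("a", 0, 1.0), ("a", 0, 1.0), ("a", 1, 3.0), ("a", 1, 3.0),
    ("b", 0, 2.0), ("b", 0, 2.0), ("b", 1, 0.0), ("b", 1, 0.0),
]
assign_prob = {0: 0.5, 1: 0.5}

personalized = policy_value(rows, lambda x: 1 if x == "a" else 0, assign_prob)
always_arm_1 = policy_value(rows, lambda x: 1, assign_prob)
```

In this toy log the personalized policy is valued at 2.5 and the fixed policy at 1.5, matching the segment-weighted averages of the best and arm-1 responses, respectively.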
Ranking-based metrics remain common, but they require adaptation once there are multiple active treatments. Zhao et al. adapt AUUC to the multi-treatment-with-control setting by sorting observations by predicted uplift under the recommended treatment and, within each bin, comparing control observations with observations whose randomized treatment actually matches the recommendation. They also note that, unlike in the two-arm case, the multi-arm uplift curve can end at different cumulative values at 100% treatment because the endpoint reflects the ATE of the personalized treatment policy relative to control rather than the ATE of a single fixed treatment (Zhao et al., 2019). M$^3$TN uses multi-treatment ranking metrics based on Qini and Kendall uplift rank correlation, specifically mQini, sdQini, mKendall, and sdKendall, thereby focusing evaluation on treatment ranking quality rather than direct estimation error (Sun et al., 2024). The Kuaishou HMUM paper uses normalized AUUC and normalized QINI, including a modified perfect-curve computation for continuous labels (Zhai et al., 24 Nov 2025).
The distinction between estimation metrics and policy metrics remains important. Semi-synthetic benchmarks can report PEHE,
$$\epsilon_{\mathrm{PEHE}}=\frac{1}{n}\sum_{i=1}^{n}\big(\hat\tau(x_i)-\tau(x_i)\big)^2,$$
but real HMUM systems usually care about treatment choice, policy value, net value, or cumulative revenue rather than only CATE accuracy (Diemert et al., 2021). This tension also appears in M$^3$TN, where training uses MSE on factual outcomes while evaluation emphasizes ranking-oriented uplift metrics (Sun et al., 2024).
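A constructed example makes the gap between estimation accuracy and decision quality concrete. It assumes known ground-truth effects, as in a semi-synthetic benchmark, and averages squared error over users and arms; the effect values and the names `pehe` and `policy_agreement` are hypothetical.

```python
def pehe(tau_hat, tau_true):
    """Mean squared error over users and arms; inputs are lists of per-user effect lists."""
    sq = [(h - t) ** 2 for hs, ts in zip(tau_hat, tau_true) for h, t in zip(hs, ts)]
    return sum(sq) / len(sq)

def policy_agreement(tau_hat, tau_true):
    """Share of users for whom the estimated best arm equals the true best arm."""
    def argmax(values):
        return max(range(len(values)), key=values.__getitem__)
    hits = sum(argmax(h) == argmax(t) for h, t in zip(tau_hat, tau_true))
    return hits / len(tau_true)

# True effects of two active arms for two users (illustrative values).
tau_true = [[0.10, 0.12], [0.10, 0.08]]
# Estimator A: small errors, but the arm ordering is flipped for both users.
tau_a = [[0.11, 0.10], [0.09, 0.10]]
# Estimator B: large errors, but the arm ordering is preserved.
tau_b = [[0.20, 0.30], [0.30, 0.20]]

pehe_a, pehe_b = pehe(tau_a, tau_true), pehe(tau_b, tau_true)
agree_a, agree_b = policy_agreement(tau_a, tau_true), policy_agreement(tau_b, tau_true)
```

Estimator A has the lower squared error yet picks the wrong arm for every user, while estimator B is poorly scaled but ranks the arms correctly, which is why policy-oriented metrics are reported alongside estimation error.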
Benchmarking infrastructure is still asymmetrical. CRITEO-UPLIFTv2 and CRITEO-ITE provide large-scale, statistically meaningful evaluation for binary treatment uplift and ITE estimation, with 13,979,592 samples in the released uplift dataset, strong treatment imbalance, and careful sanity checks after pooling multiple randomized experiments (Diemert et al., 2021). These resources are highly useful for benchmark design, treatment-imbalance-aware evaluation, and simulator construction, but they are not themselves HMUM benchmarks because treatment remains binary throughout. The 2020 unified survey likewise provides a causal and methodological backbone for uplift/HTE modeling but explicitly focuses on a single binary treatment and does not supply a full multi-action evaluation framework (Zhang et al., 2020).
5. Applications and empirical patterns
Marketing is the canonical HMUM application area. Multi-arm experimentation with different communication channels, promotion types, coupon levels, or email variants motivated early cost-sensitive extensions of uplift modeling (Zhao et al., 2019). Revenue-focused HMUM further extends this to continuous commercial outcomes, showing that treatment choice may differ depending on whether the decision criterion is binary conversion or incremental revenue (Gubela et al., 2021). A later Best Buy campaign study uses historical multi-offer data, compares S-, T-, and X-learners under calibration, and reports that Z-score normalization improved the lift of the final per-offer assignments relative to direct ranking of raw scores (Park et al., 2024).
Pricing and incentive optimization provide a second major cluster of applications. CTS was validated on a priority-boarding pricing problem and on seat reservation with four price levels, where the paper compares the expected revenue of personalized pricing via CTS with that of SMA-RF and of the best fixed price (Zhao et al., 2017). GCF targets ordered or continuous interventions such as discount levels and was deployed in a large ride-sharing pricing system; in online A/B testing it improved finished orders relative to causal forest in both a single mobility-option strategy and a dual mobility-option strategy (Wan et al., 2022).
Large-scale platform recommendation has recently become a major HMUM setting. M$^3$TN studies multi-valued treatment uplift on a production dataset from a short-video platform with more than 7 million users, 108 user-related features, and treatments corresponding to three degrees of video sharpening plus control, assigned through randomized experiments over two weeks (Sun et al., 2024). The 2025 HMUM framework for Kuaishou models two heterogeneous recommendation strategies and two responses, APP usage time and video view counts, using offline hybrid uplift modeling and online dynamic decision-making. In online A/B tests, both the Ranking-stage deployment and the Edge Rerank-stage deployment improved APP usage time, APP usage time per capita, and video view counts (Zhai et al., 24 Nov 2025).
A recurrent empirical pattern across these applications is that naive whole-population strategies create severe trade-offs. In the Kuaishou ablation, globally applying only the consumption-side scorer improved APP usage time but reduced video view counts, while globally applying only the commercialization-side scorer improved video view counts but reduced APP usage time. HMUM’s value proposition in this setting is not merely better uplift estimation, but better individualized reconciliation of competing objectives (Zhai et al., 24 Nov 2025).
6. Limitations, misconceptions, and open directions
A persistent limitation is that many HMUM methods still solve the problem through one-vs-control decomposition rather than a unified all-treatment formulation. This is explicit in multi-treatment X-/R-learners, treatment-comparison causal forests, and separate-treatment calibration pipelines, and it remains true even in some applied multi-treatment revenue models (Zhao et al., 2019, Gubela et al., 2021, Park et al., 2024). The practical advantage is modularity; the statistical drawback is that treatment-vs-treatment competition is handled only indirectly, and score comparability becomes a serious issue.
A second limitation concerns causal identification outside randomized experiments. CTS, M$^3$TN, and the short-video HMUM framework are most naturally situated in RCT or near-randomized settings; GCT is derived under randomized assumptions even though the paper remarks that inverse propensity weighting could remove the complete-randomization assumption; GCF adds doubly robust estimation for continuous treatment but still requires consistency, ignorability, and positivity, with continuous-treatment overlap being materially stronger than ordinary multi-arm overlap (Zhao et al., 2017, Sun et al., 2024, Nandy et al., 2022, Wan et al., 2022). This suggests that observational HMUM remains methodologically less mature than randomized HMUM.
A third issue is objective mismatch. Some models train with factual MSE but are judged by uplift ranking metrics; some optimize conversion yet are deployed where net value or revenue matters; and in multi-response systems fixed scalarization weights can create biased decisions because user preferences differ over time (Sun et al., 2024, Zhao et al., 2019, Zhai et al., 24 Nov 2025). Recent work addresses parts of this mismatch through net-value pseudo-effects, revenue uplift, or request-level weighting, but a unified theory connecting HMUM estimation, cross-treatment ranking, and downstream policy value is still incomplete.
Two common misconceptions are clarified by the literature. First, HMUM is not simply “binary uplift with more heads”: several papers explicitly argue that naive extensions of binary architectures can be parameter-inefficient, induce cumulative error, or ignore treatment competition (Sun et al., 2024, Zhai et al., 24 Nov 2025). Second, not every paper labeled as uplift or feature selection is in fact relevant to HMUM. The record associated with “Feature Selection Methods for Uplift Modeling and Heterogeneous Treatment Effect” (Zhao et al., 2020) contains no direct technical content on uplift, causal inference, treatment effects, or feature selection; its subject is sparse Gaussian elimination for linear systems, so it is not a methodological source for HMUM (Zhao et al., 2020).
Taken together, these constraints suggest three durable research directions. The first is unified multi-action modeling beyond one-vs-control decomposition, especially when several active treatments compete directly. The second is stronger observational identification and evaluation for multi-arm decisions, rather than reliance on randomized data or heuristic pairwise comparisons. The third is better integration of multi-response decision objectives with causal effect estimation, so that calibration, ranking, policy value, and trade-off optimization are aligned within a single HMUM framework rather than handled in disconnected stages (Zhang et al., 2020, Diemert et al., 2021, Zhai et al., 24 Nov 2025).