
Nonconvex NegDRO: Robust Causal Invariance

Updated 10 June 2026
  • The paper introduces NegDRO, which leverages negative-weight nonconvex minimax optimization to reliably recover causal outcome models under additive interventions.
  • It establishes identifiability conditions ensuring the sole linear predictor with equal risk across environments is the true causal coefficient under strict heterogeneity.
  • The proposed gradient-based algorithm scales efficiently in high dimensions, outperforming exhaustive combinatorial searches found in traditional causal discovery methods.

Nonconvex Negative-Weight Distributionally Robust Optimization (NegDRO) is a continuous minimax framework for causal invariance learning, introduced to address causal discovery across heterogeneous environments under additive interventions. NegDRO extends classical group distributionally robust optimization by maximizing over environment weights that may be negative. This breaks convexity, yet under appropriate identifiability conditions the method provably recovers the causal outcome model and remains amenable to gradient-based optimization. Unlike prior approaches, NegDRO avoids exhaustive combinatorial searches by leveraging nonconvex optimization with theoretical guarantees, scaling efficiently with the number of covariates and remaining robust in settings where prior methods fail (Wang et al., 2024).

1. Formulation and Nonconvex Minimax Structure

NegDRO operates on multi-environment data indexed by $\mathcal{E} = \{1, \dots, E\}$, where each environment $e$ presents a squared-loss linear prediction risk for $b\in\mathbb{R}^p$ as

$$R_e(b) = \mathbb{E}\bigl[(Y^{(e)} - b^\top X^{(e)})^2\bigr].$$
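In practice the expectation in $R_e(b)$ is replaced by a sample average. The following is a minimal sketch of that plug-in estimate, assuming each environment is stored as a pair `(X, y)` of NumPy arrays; the function names `empirical_risk` and `environment_risks` are hypothetical and not taken from the source.

```python
import numpy as np

def empirical_risk(X, y, b):
    """Sample analogue of R_e(b): mean squared residual of the linear predictor X @ b."""
    residual = y - X @ b
    return float(np.mean(residual ** 2))

def environment_risks(envs, b):
    """Vector (R_1(b), ..., R_E(b)) for a list of (X, y) pairs, one pair per environment."""
    return np.array([empirical_risk(X, y, b) for X, y in envs])

# Tiny illustration with two hand-made environments (illustrative numbers only).
X1, y1 = np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([1.0, 2.0])
X2, y2 = np.array([[2.0, 0.0], [0.0, 2.0]]), np.array([2.0, 2.0])
risks = environment_risks([(X1, y1), (X2, y2)], b=np.array([1.0, 1.0]))
```

A predictor is risk-invariant in the sense used here when all entries of this vector coincide.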

To characterize invariance via risk-equalization across environments, NegDRO defines an uncertainty set for environment weights parameterized by $\gamma \ge 0$:

$$\mathcal{U}(\gamma) = \left\{w \in \mathbb{R}^E \,:\, \sum_{e=1}^E w_e = 1,\; \min_e w_e \ge -\gamma\right\}.$$

The central optimization is

$$b_{\mathrm{Neg}}^\gamma = \underset{b \in \mathbb{R}^p}{\arg\min}\; \underset{w \in \mathcal{U}(\gamma)}{\max}\; \sum_{e=1}^E w_e R_e(b). \tag{1}$$

As $\gamma \to \infty$, invariance is strictly enforced:

$$b_{\mathrm{Neg}}^\infty = \underset{b:\, R_1(b)=\cdots=R_E(b)}{\arg\min}\; R_e(b). \tag{2}$$

Allowing negative $w_e$ makes the problem nonconvex, because the weighted risk $\sum_{e=1}^E w_e R_e(b)$ can lose positive semidefinite curvature in $b$ when any $w_e < 0$. However, under specific identifiability conditions, all stationary points are globally optimal. This property distinguishes NegDRO from standard convex-concave DRO formulations (Wang et al., 2024).
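For a fixed $b$, the inner maximization in (1) is a linear program over the polytope $\mathcal{U}(\gamma)$, so its optimum sits at a vertex: every weight is pushed to $-\gamma$ except the one on the highest-risk environment, which receives $1 + (E-1)\gamma$. The worst-case value is therefore $\max_e R_e(b) + \gamma \sum_{e}(\max_{e'} R_{e'}(b) - R_e(b))$, a derivation from the definition of $\mathcal{U}(\gamma)$ rather than a formula quoted from the paper. The sketch below evaluates it; the function names are hypothetical.

```python
import numpy as np

def worst_case_weights(risks, gamma):
    """A maximizer of sum_e w_e * risks[e] over {w : sum(w) = 1, w_e >= -gamma}."""
    risks = np.asarray(risks, dtype=float)
    E = risks.size
    w = np.full(E, -float(gamma))                  # all weights at the lower bound
    w[np.argmax(risks)] = 1.0 + (E - 1) * gamma    # remaining mass on the worst environment
    return w

def worst_case_risk(risks, gamma):
    """Closed form of the inner maximum: R_max + gamma * sum_e (R_max - R_e)."""
    risks = np.asarray(risks, dtype=float)
    return float(risks.max() + gamma * np.sum(risks.max() - risks))

risks = np.array([1.0, 2.0, 3.0])
w = worst_case_weights(risks, gamma=0.5)    # weights [-0.5, -0.5, 2.0]
value = worst_case_risk(risks, gamma=0.5)   # equals w @ risks
```

The second term acts as a penalty on risk disparity that grows with $\gamma$ and vanishes exactly when all risks are equal, which matches the constrained form (2) in the limit.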

2. Identifiability Conditions under Additive Interventions

The identifiability results assume a linear structural equation model (SEM) on […], with

[…]

where the environment-specific model is […]. Heterogeneity is introduced by

[…]

with the requirement […], so all environment variation is in […].

Condition A (strict heterogeneity) prescribes […], which is both sufficient and, in the case where each […] is one-sparse, nearly necessary for identifiability. Under this condition, the only linear predictor that achieves equal risk across environments is the true causal coefficient. Thus, by (2), $b_{\mathrm{Neg}}^\infty$ coincides with the true causal coefficient under Condition A (Wang et al., 2024).

3. Gradient-Based Optimization Algorithm

To address non-differentiability in the maximization over $w$, NegDRO employs a ridge-regularized objective for $w$, yielding a differentiable function […]: […] for small […]. The unique maximizer […] permits a single-loop algorithm alternating between

  1. Weight maximization: […]
  2. Gradient descent in $b$: […], where the gradient is

[…]

The final estimate […] is the iterate with minimal […]. A proximal or subgradient-based variant can be applied to the unpenalized objective. This iterative method avoids the exponential cost of the exhaustive search used in ICP, EILLS, and related invariant causal discovery approaches (Wang et al., 2024).
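The alternating scheme can be illustrated with a short sketch. Several details here are assumptions of the sketch rather than details confirmed by the source: the ridge penalty is taken to be $\frac{\mu}{2}\lVert w\rVert_2^2$, so that the weight step is the Euclidean projection of $R(b)/\mu$ onto $\mathcal{U}(\gamma)$; the descent direction is $\sum_e w_e \nabla R_e(b)$ with the empirical gradient $\nabla R_e(b) = -\frac{2}{n} X^\top (y - Xb)$; and the last iterate is returned instead of a selected one. The names `mu`, `step`, `n_steps`, and all function names are hypothetical.

```python
import numpy as np

def project_onto_uncertainty_set(v, gamma):
    """Euclidean projection of v onto {w : sum(w) = 1, w_e >= -gamma}.

    Shifting by gamma turns the set into a scaled simplex, handled by the
    standard sort-based simplex projection.
    """
    E = v.size
    total = 1.0 + E * gamma                  # mass of the shifted vector u = w + gamma
    u = np.sort(v + gamma)[::-1]             # shifted entries in decreasing order
    cumulative = np.cumsum(u) - total
    ks = np.arange(1, E + 1)
    rho = ks[u - cumulative / ks > 0][-1]    # number of strictly positive coordinates
    theta = cumulative[rho - 1] / rho
    return np.maximum(v + gamma - theta, 0.0) - gamma

def risks_and_gradients(envs, b):
    """Empirical risks R_e(b) and their gradients in b, stacked over environments."""
    risks, grads = [], []
    for X, y in envs:
        r = y - X @ b
        risks.append(np.mean(r ** 2))
        grads.append(-2.0 * X.T @ r / len(y))
    return np.array(risks), np.array(grads)

def negdro_sketch(envs, gamma, mu=0.1, step=0.005, n_steps=2000):
    """Alternate a smoothed weight maximization with a gradient step in b."""
    b = np.zeros(envs[0][0].shape[1])
    for _ in range(n_steps):
        risks, grads = risks_and_gradients(envs, b)
        # argmax over U(gamma) of  w @ risks - (mu / 2) * ||w||^2  (assumed ridge form)
        w = project_onto_uncertainty_set(risks / mu, gamma)
        b = b - step * (w @ grads)           # weighted gradient step
    return b, w                              # w is from the final weight step

# Synthetic demo: additive mean shifts on covariates, hypothetical coefficient beta.
rng = np.random.default_rng(0)
beta = np.array([1.0, -1.0])
envs = []
for shift in ([0.0, 0.0], [1.0, 0.0], [0.0, 2.0]):
    X = rng.normal(size=(500, 2)) + np.array(shift)
    y = X @ beta + 0.1 * rng.normal(size=500)
    envs.append((X, y))
b_hat, w_last = negdro_sketch(envs, gamma=1.0)
```

Each iteration costs one pass over the data plus a sort of $E$ numbers, so the per-step cost is linear in the number of covariates, in contrast to a search over covariate subsets.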

4. Theoretical Guarantees

Assuming […]-Lipschitz gradients for each $R_e$, and denoting by […] the minimal eigenvalue in Condition A and by […] the minimal sample size per environment, NegDRO provides the following guarantees:

  • Population bound for any […]: […]
  • Finite-sample bound (with high probability): […]
  • Stationary-point convergence: For step-size […] and […] steps,

[…]

so

[…]

Proximal or subgradient variants achieve the […] rate to a generalized stationary point.

In limited-intervention regimes (only outcome-children perturbed), only a principal submatrix of […] needs positivity, with degraded rates […] and a much larger required number of iterations […]. These theoretical results elucidate NegDRO's ability to attain causal identification and robust convergence in both population and finite-sample regimes (Wang et al., 2024).

5. Practical Performance and Empirical Insights

Simulation studies highlight multiple salient aspects of NegDRO's practical efficacy:

  • Convergence in $\gamma$: For large $\gamma$, the error in the estimate scales as […].
  • Sample-size scaling: The estimation error decreases polynomially in the sample size (empirically, slope ≈ -1/4 on log–log plots).
  • High-dimensional scalability: NegDRO solves problems with up to […] covariates within seconds to minutes. In contrast, exhaustive search methods such as ICP and EILLS fail to complete within 30 minutes for […]–[…].
  • Robustness to intervention strength: When interventions are limited or weak, classical methods such as CausalDantzig (requires invertible Gram-matrix gaps) and DRIG (requires a reference environment) fail, but NegDRO still recovers the causal coefficient.
  • Negative weights as invariance-enforcing: Allowing $w_e < 0$ enables the optimizer to subtract non-causal environment risks, enforcing risk invariance; despite nonconvexity, simple gradient-based schemes reliably converge globally (Wang et al., 2024).

6. Relation to Prior Work and Significance

NegDRO generalizes classical group DRO, which constrains $w$ to the simplex ($\gamma = 0$), as in Sagawa et al. (2019), and thereby moves beyond the limitations of the convex formulation. Earlier invariance-based methods (e.g., ICP, EILLS) involve combinatorial searches over covariate subsets with exponential complexity, substantially limiting scalability. CausalDantzig (Rothenhäusler et al., Ann. Stat. 2019) and DRIG further rely on restrictive identifiability conditions (e.g., invertibility or reference environments) and fail in weak or limited-intervention settings. NegDRO, by combining negative weighting and nonconvex minimax optimization, achieves polynomial scalability and theoretical recovery guarantees in a broad array of intervention regimes, significantly broadening the applicability of causal invariance approaches (Wang et al., 2024).
