Consistency of Neural Causal Partial Identification (2405.15673v3)

Published 24 May 2024 in cs.LG, cs.AI, and stat.ML

Abstract: Recent progress in Neural Causal Models (NCMs) showcased how identification and partial identification of causal effects can be automatically carried out via training of neural generative models that respect the constraints encoded in a given causal graph [Xia et al. 2022, Balazadeh et al. 2022]. However, formal consistency of these methods has only been proven for the case of discrete variables or only for linear causal models. In this work, we prove the consistency of partial identification via NCMs in a general setting with both continuous and categorical variables. Further, our results highlight the impact of the design of the underlying neural network architecture in terms of depth and connectivity as well as the importance of applying Lipschitz regularization in the training phase. In particular, we provide a counterexample showing that without Lipschitz regularization this method may not be asymptotically consistent. Our results are enabled by new results on the approximability of Structural Causal Models (SCMs) via neural generative models, together with an analysis of the sample complexity of the resulting architectures and how that translates into an error in the constrained optimization problem that defines the partial identification bounds.

References (16)
  1. Identification and estimation of local average treatment effects, 1995.
  2. Neural network learning: Theoretical foundations, volume 9. Cambridge University Press, Cambridge, 1999.
  3. Partial identification of treatment effects with implicit generative models, October 2022.
  4. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty Proceedings 1994, pages 46–54. Elsevier, 1994.
  5. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171–1176, September 1997.
  6. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.
  7. Alexis Bellot. Towards bounding causal effects under Markov equivalence. arXiv preprint arXiv:2311.07259, 2023.
  8. Blai Bonet. Instrumentality tests revisited, January 2013.
  9. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(11), 2013.
  10. CLIP: Cheap Lipschitz Training of Neural Networks. In Abderrahim Elmoataz, Jalal Fadili, Yvain Quéau, Julien Rabin, and Loïc Simon, editors, Scale Space and Variational Methods in Computer Vision, volume 12679, pages 307–319, Cham, 2021. Springer International Publishing.
  11. Estimation and confidence regions for parameter sets in econometric models. Econometrica, 75(5):1243–1284, 2007.
  12. Faster Wasserstein distance estimation with the Sinkhorn divergence. Advances in Neural Information Processing Systems, 33:2257–2269, 2020.
  13. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pages 854–863. PMLR, 2017.
  14. An automated approach to causal inference in discrete settings, September 2021.
  15. Jean Feydy. Geometric data analysis, beyond convolutions. Applied Mathematics, 2020.
  16. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690, 2019.

Summary

  • The paper proves that partial identification via Neural Causal Models is consistent in settings with both continuous and categorical variables, provided Lipschitz regularization is applied during training.
  • It introduces neural network architectures (both wide and deep) that approximate complex latent distributions and structural functions with small error.
  • Experiments show that the approach yields near-optimal causal effect bounds and scales to continuous variables more readily than discretization-based methods.

This paper, "Consistency of Neural Causal Partial Identification" (2405.15673), addresses the challenge of establishing formal consistency for Neural Causal Models (NCMs) in partial identification of causal effects, particularly for settings involving both continuous and categorical variables. While NCMs have shown promise in automatically deriving identification and partial identification bounds by training neural generative models constrained by a causal graph [xia2022neural, balazadehPartialIdentificationTreatment2022], their consistency was previously proven only for discrete variables or linear causal models.

The core problem is to find the maximum and minimum values of a target causal quantity $\theta(\mathcal{M})$ over all Structural Causal Models (SCMs) $\mathcal{M}$ that are consistent with the observed data distribution $P^{\mathcal{M}^*}(\boldsymbol{V})$ and a given causal graph $\mathcal{G}_{\mathcal{M}^*}$. This is formulated as:

$$\max_{\mathcal{M}\in \mathcal{C}} \, / \, \min_{\mathcal{M}\in \mathcal{C}} \quad \theta(\mathcal{M})$$

subject to $P^{\mathcal{M}}(\boldsymbol{V}) = P^{\mathcal{M}^*}(\boldsymbol{V})$ and $\mathcal{G}_\mathcal{M} = \mathcal{G}_{\mathcal{M}^*}$.

The paper makes several key contributions:

  • Approximation of SCMs by NCMs: It demonstrates that under suitable regularity assumptions (Lipschitz continuity of structural functions, bounded variables), any Lipschitz SCM with continuous or categorical variables can be approximated by an NCM. This approximation ensures that the Wasserstein distance between any interventional distribution of the NCM and the original SCM is small. The paper specifies two neural network architectures (wide and deep) for this purpose that are trainable via gradient-based methods (Theorems 1, 2, 3 and Corollary 1).
  • Novel Representation Theorem for Probability Measures: A new representation theorem (Proposition 1) is developed, showing that under certain conditions, probability distributions on the unit cube can be simulated by pushing forward a multivariate uniform distribution using Hölder continuous curves. This result is instrumental in constructing the NCM approximations.
  • Importance of Lipschitz Regularization: The authors highlight the critical role of Lipschitz regularization by providing a counterexample (Proposition 2) where the NCM approach fails to be asymptotically consistent without such regularization.
  • Consistency of NCM-based Partial Identification: Using Lipschitz regularization, the paper proves the consistency of the partial identification bounds obtained via NCMs (Theorem 4). This means that as the sample size increases, the estimated bounds converge appropriately.

Implementing NCMs for Partial Identification

1. Defining Neural Causal Models (NCMs)

An SCM is defined by a set of observed variables $\boldsymbol{V}$, latent variables $\boldsymbol{U}$, structural equations $V_i = f_i(\text{Pa}(V_i), \boldsymbol{U}_{V_i})$, a distribution over latents $P(\boldsymbol{U})$, and a causal graph $\mathcal{G}$. An NCM is a specific type of SCM where:

  • Latent variables $\boldsymbol{U}$ are i.i.d. multivariate uniform variables (for continuous parts) and i.i.d. Gumbel variables (for generating categorical variables via the Gumbel-max trick).
  • The structural functions $f_i$ are implemented as Neural Networks (NNs).

2. Approximating Structural Causal Models

The paper proposes a canonical representation for SCMs where latent variables correspond to $C^2$-components (maximal sets of variables connected by bi-directed arrows, indicating shared unobserved confounders). The goal is to approximate an SCM $\mathcal{M}^*$ with an NCM $\hat{\mathcal{M}}$.

Theorem 1 decomposes the approximation error (in Wasserstein-1 distance) between the interventional distributions $P^{\mathcal{M}^*}(\boldsymbol{V}(\boldsymbol{t}))$ and $P^{\hat{\mathcal{M}}}(\hat{\boldsymbol{V}}(\boldsymbol{t}))$:

$$W\big(P^{\mathcal{M}^*}(\boldsymbol{V}(\boldsymbol{t})), P^{\hat{\mathcal{M}}}(\hat{\boldsymbol{V}}(\boldsymbol{t}))\big) \le C_{\mathcal{G}}(L,K) \left( \sum_{i=1}^{n_V} \| f_i - \hat{f}_i \|_\infty + W\big(P^{\mathcal{M}^*}(\boldsymbol{U}), P^{\hat{\mathcal{M}}}(\hat{\boldsymbol{U}})\big) \right)$$

This means the error depends on:

  • How well the NNs $\hat{f}_i$ approximate the true structural functions $f_i$.
  • How well the NCM's latent variable distributions (generated from uniform noise via NNs $\hat{g}_j$) approximate the true SCM's latent distributions.

The structural equations for the approximating NCM $\hat{\mathcal{M}}$ take the form:

$$\hat{V}_{i} = \begin{cases} \hat{f}_{i}\left(\text{Pa}(\hat{V}_{i}), (\hat{g}_{j}(Z_{C_{j}}))_{U_{C_{j}} \in \boldsymbol{U}_{V_{i}}}\right), & V_{i} \text{ is continuous} \\ \arg\max_{k\in [n_{i}]}\left\{G_{k} + \log\left(\hat{f}_{i}\left(\text{Pa}(\hat{V}_{i}), (\hat{g}_{j}(Z_{C_{j}}))_{U_{C_{j}} \in \boldsymbol{U}_{V_{i}}}\right)\right)_{k}\right\}, & V_{i} \text{ is categorical} \end{cases}$$

where $Z_{C_j}$ are i.i.d. uniform variables, and $\hat{f}_i, \hat{g}_j$ are neural networks.
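
This mechanism translates almost directly into code. The sketch below is a minimal, hypothetical PyTorch-style illustration (the names ncm_mechanism, f_hat, parents, and latents are assumptions for exposition, not the authors' implementation): the continuous case returns the NN output directly, while the categorical case treats the NN output as class probabilities and applies the Gumbel-max trick.

# Minimal sketch of a single NCM structural equation (assumed PyTorch; names are illustrative)
import torch

def ncm_mechanism(f_hat, parents, latents, categorical=False):
    # parents: values of Pa(V_i); latents: outputs of the latent generators g_hat_j(Z_{C_j})
    inputs = torch.cat([parents, latents], dim=-1)
    out = f_hat(inputs)
    if not categorical:
        return out  # continuous case: V_i = f_hat(Pa(V_i), U_{V_i})
    # categorical case: f_hat outputs class probabilities; sample via the Gumbel-max trick
    gumbel = -torch.log(-torch.log(torch.rand_like(out)))  # G_k ~ Gumbel(0, 1)
    return torch.argmax(gumbel + torch.log(out), dim=-1)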

3. Neural Network Architectures for Latent Distribution Approximation

The paper details how to construct the NNs $\hat{g}_j$ to approximate complex latent variable distributions $U_j$. This involves pushing forward uniform variables $Z_{C_j}$ through NNs.

  • Assumption for Approximation (Mixed Distribution, Assumption 3): The support of the target latent distribution $\mathbb{P}(U_j)$ has finitely many connected components $C_k$, each of which is Lipschitz homeomorphic to a unit cube $[0,1]^{d_k^C}$.
  • Wide Neural Network Architecture (Theorem 2, Figure 1):

1. Space-filling Curve Approximation: For each connected component $C_k$ of the latent variable's support, its preimage $(H_k^{-1})_{\#}\mathbb{P}$ on the unit cube $[0,1]^{d_k^C}$ is approximated. A wide NN $\hat{g}_1^k$ (constant depth, width $W_1$) maps a uniform variable $Z_k \sim U[0,1]$ to approximate this distribution on $[0,1]^{d_k^C}$. This uses results on high-dimensional distribution generation with NNs [perekrestenkoHighDimensionalDistributionGeneration2022a].
2. Lipschitz Homeomorphism Approximation: An NN $\hat{g}_2^k$ (depth $L_2$) approximates the Lipschitz homeomorphism $H_k$ that maps $[0,1]^{d_k^C}$ back to the actual component $C_k$.
3. Gumbel-Softmax Layer: The outputs from the different components are combined. If the original latent variable is a mixture over $N_C$ components with probabilities $p_k = \mathbb{P}(C_k)$, the Gumbel-Softmax trick is used to select one component's output, which allows differentiable sampling from the mixture:

$$\hat{X}^\tau_i = \frac{\exp((\log p_i + G_i)/\tau)}{\sum_{k=1}^{N_C}\exp((\log p_k + G_k)/\tau)}$$

The final output is a weighted sum (effectively a selection as $\tau \to 0$) of the component-specific generations. The Wasserstein approximation error is $O\big(W_1^{-1/\max_i\{d_i^C\}} + L_2^{-2/\max_i\{d_i^C\}} + (\tau - \tau\log\tau)\big)$.

# Pseudocode for Wide NN Latent Generation, made concrete in PyTorch (names are illustrative)
# Inputs: z_uniform (one uniform noise sample per component), g_gumbel (Gumbel(0,1) noise)
# Parameters: nn_g1[k], nn_g2[k] (NNs for component k), log_p (log mixture probabilities)
import torch

def wide_nn_latent_sample(z_uniform, g_gumbel, nn_g1, nn_g2, log_p, tau):
    component_outputs = []
    for k in range(len(nn_g1)):  # for each connected component C_k
        # Part 1: approximate the target distribution on the unit cube [0,1]^{d_k}
        on_cube_k = nn_g1[k](z_uniform[k])
        # Part 2: map onto the actual component support via the learned homeomorphism
        output_k = nn_g2[k](on_cube_k)
        component_outputs.append(output_k)

    # Part 3: Gumbel-Softmax to (softly) select one component's output
    V = torch.stack(component_outputs)                                 # shape [N_C, d_latent]
    softmax_weights = torch.softmax((log_p + g_gumbel) / tau, dim=0)   # shape [N_C]
    # Weighted sum over components; approaches a hard selection as tau -> 0
    return torch.einsum("k,kd->d", softmax_weights, V)

  • Deep Neural Network Architecture (Theorem 3, figure in the appendix):
    • Additional Assumption (Lower Bound, Assumption 4): The probability measure $\mathbb{P}$ on each component, when pulled back to a unit cube $[0,1]^d$ by a Lipschitz homeomorphism $f$, has a density with respect to the Lebesgue measure that is lower bounded ($f^{-1}_{\#}\mathbb{P}(B) \ge C_f \lambda(B)$).
    • Under this, Proposition 1 shows there is a $1/d$-Hölder continuous curve $\gamma: [0,1] \to [0,1]^d$ such that $\gamma_{\#}\lambda = f^{-1}_{\#}\mathbb{P}$.
    • Deep ReLU NNs are known to approximate Hölder continuous functions efficiently. The first part of the wide NN architecture ($\hat{g}_1^k$) is replaced by a deep NN (depth $L_1$, constant width $\Theta(d_j^C)$) approximating this Hölder curve $\gamma_k$.
    • The Wasserstein approximation error becomes $O\big(L_1^{-2/\max_i\{d_i^C\}} + L_2^{-2/\max_i\{d_i^C\}} + (\tau - \tau\log\tau)\big)$. This potentially offers better rates for a given number of parameters ($O(N^{-2/d})$ vs. $O(N^{-1/d})$ for wide NNs, where $N$ is the number of weights); a minimal sketch of such a deep generator follows this list.
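
As a concrete picture of the deep variant, here is a minimal sketch (assuming PyTorch; the width factor and the final Sigmoid squashing are illustrative choices, not the paper's exact construction) of a narrow, deep ReLU generator playing the role of $\hat{g}_1^k$; its output would then be fed into the homeomorphism network $\hat{g}_2^k$.

# Sketch of a deep, narrow latent generator (assumed PyTorch; sizes are placeholders)
import torch.nn as nn

def make_deep_g1(d_cube: int, depth_l1: int) -> nn.Sequential:
    # A 1-D uniform seed is pushed through depth_l1 constant-width ReLU layers,
    # approximating the 1/d-Hölder space-filling curve of Proposition 1.
    width = 2 * d_cube  # constant width Theta(d_cube); the factor 2 is arbitrary here
    layers = [nn.Linear(1, width), nn.ReLU()]
    for _ in range(depth_l1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.Linear(width, d_cube), nn.Sigmoid()]  # keep the output inside [0,1]^{d_cube}
    return nn.Sequential(*layers)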

Corollary 1 combines these results to state the overall approximation error for an NCM approximating an SCM using deep NNs for latent variables and standard NNs for structural functions:

$$W\big(P^{\mathcal{M}^*}(\boldsymbol{V}(\boldsymbol{t})), P^{\hat{\mathcal{M}}}(\hat{\boldsymbol{V}}(\boldsymbol{t}))\big) \le O\big(L_0^{-2/d_{in}^{\max}} + L_1^{-2/d_U^{\max}} + L_2^{-2/d_U^{\max}} + (\tau - \tau\log\tau)\big)$$

where $L_0$ is the depth of the NNs for the structural functions $f_i$, and $L_1, L_2$ are the depths of the NNs for the latent generators $g_j$.

4. Consistency and Lipschitz Regularization

In practice, the partial identification problem is solved using empirical distributions:

$$\min_{\hat{\mathcal{M}} \in \text{NCM}_n} \quad \mathbb{E}_{t \sim \mu_T}\, \mathbb{E}_{\hat{\mathcal{M}}} [F(\boldsymbol{V}(t))]$$

subject to $S_{\lambda_n}\big(P_{m_n}^{\hat{\mathcal{M}}}(\boldsymbol{V}), P_n^{\mathcal{M}^*}(\boldsymbol{V})\big) \le \alpha_n$. Here, $S_{\lambda_n}$ is the Sinkhorn distance, $P_n^{\mathcal{M}^*}$ is the empirical data distribution (sample size $n$), and $P_{m_n}^{\hat{\mathcal{M}}}$ is the empirical distribution of NCM samples (sample size $m_n$). $\text{NCM}_n$ denotes NCMs whose complexity (depth/width) increases as $n$ grows.
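
In practice such a constraint is typically handled with a penalty or augmented Lagrangian scheme. The fragment below is a hedged sketch of one gradient step under that assumption (it uses PyTorch and the geomloss package for the Sinkhorn divergence; ncm.sample_observational, ncm.estimate_target, rho, and alpha_n are illustrative placeholders, not the authors' API).

# Sketch of one training step for the Sinkhorn-constrained objective (assumed, simplified)
import torch
from geomloss import SamplesLoss

sinkhorn = SamplesLoss(loss="sinkhorn", p=1, blur=0.05)  # entropic regularization ~ lambda_n

def training_step(ncm, data_batch, optimizer, rho, alpha_n, m_n):
    v_hat = ncm.sample_observational(m_n)      # m_n samples of V from the NCM (placeholder call)
    objective = ncm.estimate_target(m_n)       # Monte Carlo estimate of E[F(V(t))] (placeholder call)
    constraint = sinkhorn(v_hat, data_batch)   # S_lambda between NCM samples and observed data
    loss = objective + rho * torch.relu(constraint - alpha_n)  # penalize constraint violation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()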

  • Counterexample (Proposition 2): Without regularization on the NCM, consistency may not hold. There can be an identifiable SCM $\mathcal{M}^*$ and a sequence of SCMs $\mathcal{M}_\epsilon$ such that $W(P^{\mathcal{M}^*}(\boldsymbol{V}), P^{\mathcal{M}_\epsilon}(\boldsymbol{V})) \le \epsilon$ but $|\text{ATE}_{\mathcal{M}^*} - \text{ATE}_{\mathcal{M}_\epsilon}| > c > 0$. This occurs if $\mathcal{M}_\epsilon$ has exploding Lipschitz constants.
  • Lipschitz Regularized NNs: To ensure consistency, the NNs in the NCM must have controlled Lipschitz constants. The paper defines a class of truncated Lipschitz NNs: $\mathcal{NN}_{d_1,d_2}^{L_f,K}(W,L) = \{\max\{-K, \min\{f,K\}\} : f \in \mathcal{NN}_{d_1,d_2}(W,L),\ \text{Lip}(f) \le L_f \}$. Techniques from [cisse2017parseval, virmauxLipschitzRegularityDeep2018a, goukRegularisationNeuralNetworks2021, pauliTrainingRobustNeural2021, bungertCLIPCheapLipschitz2021] can be used to enforce this during training; a simplified sketch of one such projection appears after this list.
  • Consistency Theorem (Theorem 4):

    If $\mathcal{M}^*$ satisfies the assumptions for NCM approximation (Corollary 1), the target functional $F$ is Lipschitz, and the NCMs are constructed using Lipschitz-regularized NNs (specifically, wide NNs for the structural functions $\hat{f}_i$ so that their Lipschitz constants can be controlled, and potentially deep NNs for the latent generators $\hat{g}_j$, where the Lipschitz control applies to the target function rather than to the NN itself and the output is bounded), then with probability 1 the solution $F_n$ of the empirical problem satisfies:

    $$[\liminf_{n \to \infty} F_n,\ \limsup_{n \to \infty} F_n] \subset [\underline{F}^{\hat{L}_f}, F_*]$$

    where $F_*$ is the true causal quantity from $\mathcal{M}^*$, and $\underline{F}^{\hat{L}_f}$ is the true lower bound of the partial identification problem when SCMs are restricted to structural functions with Lipschitz constant $\hat{L}_f$. $\hat{L}_f$ may be slightly larger than the true SCM's $L_f$ due to NN approximation properties, so the NCM approach provides a valid (though potentially slightly wider than sharp if $\hat{L}_f > L_f$) lower bound. If the quantity is point-identified, $F_n \to F_*$.

  • Finite Sample Rate for ATE (No Confounding): Proposition 3 establishes Hölder continuity of the ATE under no confounding and overlap. Corollary 2 leverages this to give a finite-sample convergence rate $|F_n - F_*| \le O(\sqrt{\alpha_n})$ for the ATE in this special setting.
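
One simple way to keep the structural-function NNs inside the truncated Lipschitz class described above, in the spirit of the layer-wise weight normalization cited there, is to rescale each weight matrix after every optimizer step so that its operator norm stays below a per-layer budget. The following is a hedged sketch under that assumption (not the paper's exact procedure); lip_per_layer is a placeholder hyperparameter.

# Sketch of layer-wise Lipschitz projection applied after each optimizer step (assumed PyTorch)
import torch
import torch.nn as nn

@torch.no_grad()
def project_lipschitz(model: nn.Module, lip_per_layer: float):
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Spectral norm of the weight = operator norm of the layer (ReLU is 1-Lipschitz),
            # so bounding each layer bounds the network's overall Lipschitz constant.
            op_norm = torch.linalg.matrix_norm(module.weight, ord=2)
            if op_norm > lip_per_layer:
                module.weight.mul_(lip_per_layer / op_norm)  # rescale to meet the per-layer bound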

5. Experiments

The NCM approach was tested on:

  • Binary IV setting: Compared to Autobounds [duarteAutomatedApproachCausal2021]. NCM bounds were close to optimal and comparable to Autobounds.
  • Continuous IV setting: Treatment binary, other variables continuous. NCMs were compared to Autobounds (which required discretization of continuous variables). NCMs provided tighter bounds, likely because discretization for Autobounds becomes computationally challenging for fine grids.

The experiments used feed-forward NNs, the augmented Lagrangian method (ALM) for optimization, the Sinkhorn distance via "geomloss" [feydy2019interpolating], and Lipschitz regularization via layer-wise weight normalization [gouk2021regularisation]. The Wasserstein ball radius $\alpha_n$ was set using a subsampling technique.

Practical Implications and Deployment:

  • Architecture Choice: The paper provides concrete NN architectural guidance (Figure 1 and the appendix figure) for modeling complex latent distributions, combining specialized NN blocks with Gumbel-Softmax for mixtures.
  • Regularization is Key: Implementers must use Lipschitz regularization on the NNs representing structural functions to ensure consistent bounds.
  • Training and Optimization: Standard gradient-based methods can be used. The Sinkhorn distance is preferred for its computational benefits. The Gumbel-Softmax trick enables end-to-end differentiability for models with categorical choices.
  • Handling Mixed Data: The framework naturally handles SCMs with both continuous and categorical variables by design.
  • Computational Cost: While NCMs can be computationally intensive to train, they might scale better to continuous problems than methods requiring fine discretization, as seen in the experiments. The complexity of NNs (width/depth) needs to be increased with sample size for consistency.

Limitations:

  • The consistency result (Theorem 4) shows convergence to an interval whose lower bound $\underline{F}^{\hat{L}_f}$ depends on the Lipschitz constant $\hat{L}_f$ achievable by the NNs while approximating the true functions with Lipschitz constant $L_f$. If $\hat{L}_f > L_f$, the bound may be slightly conservative.
  • The choice of hyperparameters (network size, $\alpha_n$, $\lambda_n$, $\tau_n$) needs careful tuning.

In summary, this paper provides significant theoretical grounding for using NCMs in causal partial identification for general variable types. It offers practical architectural insights and underscores the necessity of Lipschitz regularization for achieving consistent results.
