Consistency of Neural Causal Partial Identification
(2405.15673v3)
Published 24 May 2024 in cs.LG, cs.AI, and stat.ML
Abstract: Recent progress in Neural Causal Models (NCMs) showcased how identification and partial identification of causal effects can be automatically carried out via training of neural generative models that respect the constraints encoded in a given causal graph [Xia et al. 2022, Balazadeh et al. 2022]. However, formal consistency of these methods has only been proven for the case of discrete variables or only for linear causal models. In this work, we prove the consistency of partial identification via NCMs in a general setting with both continuous and categorical variables. Further, our results highlight the impact of the design of the underlying neural network architecture in terms of depth and connectivity as well as the importance of applying Lipschitz regularization in the training phase. In particular, we provide a counterexample showing that without Lipschitz regularization this method may not be asymptotically consistent. Our results are enabled by new results on the approximability of Structural Causal Models (SCMs) via neural generative models, together with an analysis of the sample complexity of the resulting architectures and how that translates into an error in the constrained optimization problem that defines the partial identification bounds.
The paper proves that Neural Causal Models consistently approximate SCMs for partial identification in mixed-variable settings using Lipschitz regularization.
It introduces novel neural network architectures—both wide and deep—to approximate complex latent distributions and structural functions with minimal error.
Experiments validate that the approach yields near-optimal causal effect bounds and improves computational efficiency over discretization-based methods.
This paper, "Consistency of Neural Causal Partial Identification" (2405.15673), addresses the challenge of establishing formal consistency for Neural Causal Models (NCMs) in partial identification of causal effects, particularly for settings involving both continuous and categorical variables. While NCMs have shown promise in automatically deriving identification and partial identification bounds by training neural generative models constrained by a causal graph [Xia et al. 2022, Balazadeh et al. 2022], their consistency was previously proven only for discrete variables or linear causal models.
The core problem is to find the maximum and minimum values of a target causal quantity $\theta(M)$ over all Structural Causal Models (SCMs) $M$ that are consistent with the observed data distribution $P_{M^*}(V)$ and a given causal graph $G_{M^*}$. This is formulated as:
$$\max_{M\in\mathcal{C}} \big/ \min_{M\in\mathcal{C}}\ \theta(M) \quad \text{subject to } P_M(V) = P_{M^*}(V) \text{ and } G_M = G_{M^*}.$$
The paper makes several key contributions:
Approximation of SCMs by NCMs: It demonstrates that under suitable regularity assumptions (Lipschitz continuity of structural functions, bounded variables), any Lipschitz SCM with continuous or categorical variables can be approximated by an NCM. This approximation ensures that the Wasserstein distance between any interventional distribution of the NCM and the original SCM is small. The paper specifies two neural network architectures (wide and deep) for this purpose that are trainable via gradient-based methods (Theorems 1, 2, 3 and Corollary 1).
Novel Representation Theorem for Probability Measures: A new representation theorem (Proposition 1) is developed, showing that under certain conditions, probability distributions on the unit cube can be simulated by pushing forward a multivariate uniform distribution using Hölder continuous curves. This result is instrumental in constructing the NCM approximations.
Importance of Lipschitz Regularization: The authors highlight the critical role of Lipschitz regularization by providing a counterexample (Proposition 2) where the NCM approach fails to be asymptotically consistent without such regularization.
Consistency of NCM-based Partial Identification: Using Lipschitz regularization, the paper proves the consistency of the partial identification bounds obtained via NCMs (Theorem 4): as the sample size increases, the estimated bounds converge to valid (possibly slightly conservative) bounds on the target quantity.
Implementing NCMs for Partial Identification
1. Defining Neural Causal Models (NCMs)
An SCM is defined by a set of observed variables $V$, latent variables $U$, structural equations $V_i = f_i(\mathrm{Pa}(V_i), U_{V_i})$, a distribution over latents $P(U)$, and a causal graph $G$.
An NCM is a specific type of SCM where:
Latent variables $U$ are i.i.d. multivariate uniform variables (for the continuous parts) and i.i.d. Gumbel variables (for generating categorical variables via the Gumbel-max trick).
The structural functions $f_i$ are implemented as Neural Networks (NNs); a minimal sketch of one continuous mechanism is given below.
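As a concrete illustration, here is a minimal PyTorch-style sketch of a single continuous NCM mechanism, with the structural function implemented as a small MLP fed by its parents and uniform exogenous noise; the class name and layer sizes are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn

class ContinuousMechanism(nn.Module):
    """One continuous NCM mechanism: V_i = f_hat_i(Pa(V_i), U), U ~ Uniform[0,1]^latent_dim."""
    def __init__(self, n_parents, latent_dim, width=64):
        super().__init__()
        self.latent_dim = latent_dim
        self.net = nn.Sequential(
            nn.Linear(n_parents + latent_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1),
        )

    def forward(self, parents):
        # parents: tensor of shape [batch, n_parents]
        u = torch.rand(parents.shape[0], self.latent_dim)  # exogenous uniform noise
        return self.net(torch.cat([parents, u], dim=-1))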
2. Approximating Structural Causal Models
The paper proposes a canonical representation for SCMs where latent variables correspond to C²-components (maximal sets of variables connected by bi-directed arrows, indicating shared unobserved confounders). The goal is to approximate an SCM $M^*$ with an NCM $\hat{M}$.
Theorem 1 decomposes the approximation error (in Wasserstein-1 distance) between the interventional distributions $P_{M^*}(V(t))$ and $P_{\hat{M}}(\hat{V}(t))$ into two terms:
How well the NNs $\hat{f}_i$ approximate the true structural functions $f_i$.
How well the NCM's latent-variable distributions (generated from uniform noise via the NNs $\hat{g}_j$) approximate the true SCM's latent distributions.
The structural equations for the approximating NCM $\hat{M}$ take the form:
$$\hat{V}_i = \begin{cases} \hat{f}_i\big(\mathrm{Pa}(\hat{V}_i),\ (\hat{g}_j(Z_{C_j}))_{U_{C_j}\in U_{V_i}}\big) & \text{if } V_i \text{ is continuous}, \\ \arg\max_{k\in[n_i]}\big\{ G_k + \log\big(\hat{f}_i(\mathrm{Pa}(\hat{V}_i),\ (\hat{g}_j(Z_{C_j}))_{U_{C_j}\in U_{V_i}})\big)_k \big\} & \text{if } V_i \text{ is categorical}, \end{cases}$$
where the $Z_{C_j}$ are i.i.d. uniform variables, the $G_k$ are i.i.d. standard Gumbel variables, and $\hat{f}_i, \hat{g}_j$ are neural networks.
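For the categorical branch, the Gumbel-max step can be sketched as follows; `f_hat` is a hypothetical network that outputs a probability vector over the $n_i$ categories, and the softmax branch is the Gumbel-Softmax relaxation used during training.
import torch

def categorical_mechanism(f_hat, parents, latent, tau=None):
    # f_hat returns a probability vector over the n_i categories of V_i.
    probs = f_hat(torch.cat([parents, latent], dim=-1))
    gumbel = -torch.log(-torch.log(torch.rand_like(probs)))  # i.i.d. standard Gumbel noise
    scores = gumbel + torch.log(probs)
    if tau is None:
        return torch.argmax(scores, dim=-1)         # Gumbel-max: exact categorical sample
    return torch.softmax(scores / tau, dim=-1)      # Gumbel-Softmax relaxation for training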
3. Neural Network Architectures for Latent Distribution Approximation
The paper details how to construct the NNs $\hat{g}_j$ to approximate complex latent-variable distributions. This involves pushing forward uniform variables $Z_{C_j}$ through NNs.
Assumption for Approximation (Mixed Distribution, Assumption 3): The support of the target latent distribution $P(U_j)$ has finitely many connected components $C_k$, each of which is Lipschitz homeomorphic to a unit cube $[0,1]^{d_k^C}$.
1. Space-filling Curve Approximation: For each connected component $C_k$ of the latent variable's support, its preimage $(H_k^{-1})_\# P$ on the unit cube $[0,1]^{d_k^C}$ is approximated. A wide NN $\hat{g}^1_k$ (constant depth, width $W_1$) maps a uniform variable $Z_k \sim U[0,1]$ to approximate this distribution on $[0,1]^{d_k^C}$. This uses results on high-dimensional distribution generation with NNs [Perekrestenko et al. 2022].
2. Lipschitz Homeomorphism Approximation: An NN $\hat{g}^2_k$ (depth $L_2$) approximates the Lipschitz homeomorphism $H_k$ that maps $[0,1]^{d_k^C}$ back to the actual component $C_k$.
3. Gumbel-Softmax Layer: The outputs from different components are combined. If the original latent variable is a mixture over $N_C$ components with probabilities $p_k = P(C_k)$, the Gumbel-Softmax trick is used to select one component's output. This allows for differentiable sampling from the mixture.
The final output is a weighted sum (effectively a hard selection as $\tau \to 0$) of the component-specific generations.
The Wasserstein approximation error is $O\big(W_1^{-1/\max_i\{d_i^C\}} + L_2^{-2/\max_i\{d_i^C\}} + (\tau - \tau\log\tau)\big)$.
# Sketch (PyTorch-style) of latent generation with the wide architecture.
# nn_g1, nn_g2 are lists of per-component networks; log_p holds the log mixture
# weights log p_k; F.gumbel_softmax draws its Gumbel noise internally.
import torch
import torch.nn.functional as F

def sample_latent(Z_uniform, nn_g1, nn_g2, log_p, temperature_tau):
    component_outputs = []
    for k in range(len(nn_g1)):  # for each connected component C_k
        # Part 1: wide NN pushes uniform noise to a distribution on the unit cube
        on_cube_k = nn_g1[k](Z_uniform[k])
        # Part 2: NN approximating the Lipschitz homeomorphism onto component C_k
        component_outputs.append(nn_g2[k](on_cube_k))
    # Part 3: Gumbel-Softmax selects (softly) one component, keeping sampling differentiable
    V = torch.stack(component_outputs)                      # shape [N_C, d_latent]
    weights = F.gumbel_softmax(log_p, tau=temperature_tau)  # shape [N_C]
    return torch.einsum("k,kd->d", weights, V)              # weighted sum; selection as tau -> 0
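A toy usage of the sketch above, with hypothetical sizes (two components, a 3-dimensional latent); it assumes the imports and the sample_latent definition from the block above.
nn_g1 = [torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
         for _ in range(2)]                        # wide blocks: noise -> unit cube
nn_g2 = [torch.nn.Linear(3, 3) for _ in range(2)]  # placeholder homeomorphism approximators
log_p = torch.log(torch.tensor([0.3, 0.7]))        # log mixture weights p_k
z = torch.rand(2, 1)                               # one uniform seed per component
u_sample = sample_latent(z, nn_g1, nn_g2, log_p, temperature_tau=0.1)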
Deep Neural Network Architecture (Theorem 3; see the deep-architecture figure in the appendix):
Additional Assumption (Lower Bound, Assumption 4): The probability measure $P$ on each component, when pulled back to a unit cube $[0,1]^d$ by a Lipschitz homeomorphism $f$, has a density with respect to the Lebesgue measure that is bounded below (i.e., $(f^{-1})_\# P(B) \ge C_f\, \lambda(B)$ for all measurable $B$).
Under this assumption, Proposition 1 shows that there is a $1/d$-Hölder continuous curve $\gamma : [0,1] \to [0,1]^d$ such that $\gamma_\# \lambda = (f^{-1})_\# P$.
Deep ReLU NNs are known to approximate Hölder continuous functions efficiently. The first part of the wide-NN architecture ($\hat{g}^1_k$) is therefore replaced by a deep NN (depth $L_1$, constant width $\Theta(d_j^C)$) approximating the Hölder curve $\gamma_k$.
The Wasserstein approximation error becomes $O\big(L_1^{-2/\max_i\{d_i^C\}} + L_2^{-2/\max_i\{d_i^C\}} + (\tau - \tau\log\tau)\big)$. This potentially offers better rates for a given number of parameters ($O(N^{-2/d})$ vs. $O(N^{-1/d})$ for wide NNs, where $N$ is the number of weights).
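A minimal sketch of such a deep block, assuming a plain constant-width ReLU stack with a final sigmoid to keep outputs in the unit cube (an illustrative choice; the paper's construction is more specific):
import torch.nn as nn

def deep_curve_block(d, depth_L1):
    # Deep, constant-width ReLU network replacing the wide block g^1_k: it approximates
    # the 1/d-Hölder curve gamma_k : [0,1] -> [0,1]^d from Proposition 1.
    layers = [nn.Linear(1, d), nn.ReLU()]
    for _ in range(depth_L1 - 2):
        layers += [nn.Linear(d, d), nn.ReLU()]
    layers += [nn.Linear(d, d), nn.Sigmoid()]  # keep the output in the unit cube (illustrative)
    return nn.Sequential(*layers)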
Corollary 1 combines these results into an overall approximation bound for an NCM that uses deep NNs for the latent variables and standard NNs for the structural functions, where $L_0$ denotes the depth of the $\hat{f}_i$ networks and $L_1, L_2$ the depths of the $\hat{g}_j$ networks.
4. Consistency and Lipschitz Regularization
In practice, the partial identification problem is solved using empirical distributions:
$$\min_{\hat{M}\in \mathrm{NCM}_n}\ \mathbb{E}_{t\sim\mu_T}\,\mathbb{E}_{\hat{M}}\big[F(V(t))\big] \quad \text{subject to } S_{\lambda_n}\big(P^{\hat{M}}_{m_n}(V),\ P^{M^*}_n(V)\big) \le \alpha_n.$$
Here, $S_{\lambda_n}$ is the Sinkhorn distance, $P^{M^*}_n$ is the empirical data distribution (sample size $n$), and $P^{\hat{M}}_{m_n}$ is the empirical distribution of $m_n$ samples drawn from the NCM. $\mathrm{NCM}_n$ denotes a class of NCMs whose complexity (depth/width) grows with $n$.
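For intuition, here is a sketch of one gradient step on this problem, using a simple penalty in place of the paper's augmented Lagrangian; sample_observational and estimate_target are hypothetical NCM methods returning differentiable observational samples and an estimate of the target functional.
import torch
from geomloss import SamplesLoss

sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)  # entropy-regularized OT distance

def training_step(ncm, optimizer, data_batch, m_n, alpha_n, penalty=100.0, sign=+1.0):
    optimizer.zero_grad()
    fake = ncm.sample_observational(m_n)          # samples from P^{M_hat}_{m_n}(V)
    objective = sign * ncm.estimate_target(m_n)   # sign=+1 for the lower bound, -1 for the upper
    violation = torch.relu(sinkhorn(fake, data_batch) - alpha_n)
    (objective + penalty * violation).backward()
    optimizer.step()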
Counterexample (Proposition 2): Without regularization on the NCM, consistency may not hold. There can be an identifiable SCM $M^*$ and a sequence of SCMs $M_\epsilon$ such that $W(P_{M^*}(V), P_{M_\epsilon}(V)) \le \epsilon$ but $|\mathrm{ATE}_{M^*} - \mathrm{ATE}_{M_\epsilon}| > c > 0$. This occurs when the $M_\epsilon$ have exploding Lipschitz constants.
Lipschitz Regularized NNs: To ensure consistency, the NNs in the NCM must have controlled Lipschitz constants. The paper defines a class of truncated Lipschitz NNs:
$$\mathrm{NN}^{L_f, K}_{d_1, d_2}(W, L) = \big\{\, \max\{-K, \min\{f, K\}\} \;:\; f \in \mathrm{NN}_{d_1, d_2}(W, L),\ \mathrm{Lip}(f) \le L_f \,\big\}.$$
Techniques from [Cisse et al. 2017; Virmaux and Scaman 2018; Gouk et al. 2021; Pauli et al. 2021; Bungert et al. 2021] can be used to enforce this constraint during training.
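A minimal sketch of the layer-wise weight-normalization idea of Gouk et al. (2021): after each optimizer step, rescale any linear layer whose operator norm exceeds its per-layer budget, so that the product of budgets bounds $\mathrm{Lip}(f)$; the truncation to $[-K, K]$ can then be applied to the network output with a clamp. Function and argument names are illustrative.
import torch

@torch.no_grad()
def project_lipschitz(layers, per_layer_budget):
    # The product of per-layer operator-norm budgets upper-bounds Lip(f) for a ReLU network.
    for layer in layers:
        if isinstance(layer, torch.nn.Linear):
            norm = torch.linalg.matrix_norm(layer.weight, ord=2)  # spectral norm
            if norm > per_layer_budget:
                layer.weight.mul_(per_layer_budget / norm)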
Consistency Theorem (Theorem 4):
If $M^*$ satisfies the assumptions for NCM approximation (Corollary 1), the target functional $F$ is Lipschitz, and the NCMs are built from Lipschitz-regularized NNs (wide NNs for the structural functions $\hat{f}_i$, whose Lipschitz constants are explicitly controlled; the latent generators $\hat{g}_j$ may be deep NNs, for which the Lipschitz requirement concerns the target function being approximated rather than the network itself, with only boundedness of the output enforced), then with probability 1 the solution $F_n$ of the empirical problem satisfies:
$$\Big[\liminf_{n\to\infty} F_n,\ \limsup_{n\to\infty} F_n\Big] \subset \big[F_{\hat{L}_f},\ F^*\big],$$
where $F^*$ is the true causal quantity of $M^*$, and $F_{\hat{L}_f}$ is the true lower bound of the partial identification problem when SCMs are restricted to structural functions with Lipschitz constant $\hat{L}_f$. Here $\hat{L}_f$ may be slightly larger than the true SCM's constant $L_f$ due to NN approximation properties, so the NCM approach provides a valid lower bound, though potentially slightly wider than the sharp one when $\hat{L}_f > L_f$. If the quantity is point-identified, $F_n \to F^*$.
Finite Sample Rate for ATE (No Confounding): Proposition 3 establishes Hölder continuity of the ATE under no confounding and overlap. Corollary 2 leverages this to give a finite-sample convergence rate $|F_n - F^*| \le O(\alpha_n)$ for the ATE in this special setting.
5. Experiments
The NCM approach was tested on:
Binary IV setting: Compared to Autobounds [Duarte et al. 2021]. NCM bounds were close to optimal and comparable to Autobounds.
Continuous IV setting: Treatment binary, other variables continuous. NCMs were compared to Autobounds (which required discretization of continuous variables). NCMs provided tighter bounds, likely because discretization for Autobounds becomes computationally challenging for fine grids.
The experiments used feed-forward NNs, the Augmented Lagrangian Method (ALM) for constrained optimization, the Sinkhorn distance via the "geomloss" library [Feydy et al. 2019], and Lipschitz regularization via layer-wise weight normalization [Gouk et al. 2021]. The Wasserstein-ball radius $\alpha_n$ was set using a subsampling technique.
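One plausible reading of the subsampling step (the paper's exact recipe may differ): estimate the typical Sinkhorn distance between disjoint subsamples of the observed data and use a high quantile as $\alpha_n$.
import torch
from geomloss import SamplesLoss

def calibrate_radius(data, m, n_reps=20, blur=0.05, q=0.95):
    # Hypothetical calibration of alpha_n via Sinkhorn distances between disjoint subsamples.
    sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=blur)
    dists = []
    for _ in range(n_reps):
        idx = torch.randperm(data.shape[0])
        dists.append(sinkhorn(data[idx[:m]], data[idx[m:2 * m]]).item())
    return torch.tensor(dists).quantile(q).item()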
Practical Implications and Deployment:
Architecture Choice: The paper provides concrete NN architectural guidance (Figure 1 and the deep-architecture figure in the appendix) for modeling complex latent distributions, combining specialized NN blocks with Gumbel-Softmax for mixtures.
Regularization is Key: Implementers must use Lipschitz regularization on the NNs representing structural functions to ensure consistent bounds.
Training and Optimization: Standard gradient-based methods can be used. The Sinkhorn distance is preferred for its computational benefits. The Gumbel-Softmax trick enables end-to-end differentiability for models with categorical choices.
Handling Mixed Data: The framework naturally handles SCMs with both continuous and categorical variables by design.
Computational Cost: While NCMs can be computationally intensive to train, they might scale better to continuous problems than methods requiring fine discretization, as seen in the experiments. The complexity of NNs (width/depth) needs to be increased with sample size for consistency.
Limitations:
The consistency result (Theorem 4) shows convergence to an interval whose lower bound $F_{\hat{L}_f}$ depends on the Lipschitz constant $\hat{L}_f$ achievable by the NNs while approximating the true structural functions of Lipschitz constant $L_f$. If $\hat{L}_f > L_f$, the bound might be slightly conservative.
The choice of hyperparameters (network size, $\alpha_n$, $\lambda_n$, $\tau_n$) needs careful tuning.
In summary, this paper provides significant theoretical grounding for using NCMs in causal partial identification for general variable types. It offers practical architectural insights and underscores the necessity of Lipschitz regularization for achieving consistent results.