
Multi-Domain Causal Representation Learning via Weak Distributional Invariances (2310.02854v3)

Published 4 Oct 2023 in cs.LG and stat.ML

Abstract: Causal representation learning has emerged as the center of action in causal machine learning research. In particular, multi-domain datasets present a natural opportunity for showcasing the advantages of causal representation learning over standard unsupervised representation learning. While recent works have taken crucial steps towards learning causal representations, they often lack applicability to multi-domain datasets due to over-simplifying assumptions about the data; e.g. each domain comes from a different single-node perfect intervention. In this work, we relax these assumptions and capitalize on the following observation: there often exists a subset of latents whose certain distributional properties (e.g., support, variance) remain stable across domains; this property holds when, for example, each domain comes from a multi-node imperfect intervention. Leveraging this observation, we show that autoencoders that incorporate such invariances can provably identify the stable set of latents from the rest across different settings.


Summary

  • The paper introduces a method that leverages weak distributional invariances to disentangle stable causal factors across multiple domains.
  • It relaxes strong assumptions by allowing for imperfect interventions and flexible causal graphs, achieving block-affine identification.
  • Empirical results across various datasets demonstrate significant improvements in representation learning under diverse domain shifts.

This paper, "Multi-Domain Causal Representation Learning via Weak Distributional Invariances" (2310.02854), introduces a method for learning causal representations from unlabelled multi-domain data by leveraging weak distributional invariances. The core idea is that in many real-world scenarios, while some aspects of the data change across domains (e.g., background, style), others remain stable (e.g., object identity, certain physical properties). The proposed approach aims to identify and disentangle these stable latent factors from the unstable ones.

The authors relax common strong assumptions in causal representation learning, such as requiring perfect single-node interventions or a single fixed causal graph (directed acyclic graph, DAG) governing all data points. Instead, they focus on the observation that a subset of latent variables may exhibit stable distributional properties (such as support or variance) across different domains, even under multi-node imperfect interventions.

Problem Statement and Approach

The data generation process (DGP) is defined as follows: for each domain $j$ out of $k$ domains, latent variables $z \in \mathbb{R}^d$ are sampled from a domain-specific distribution $p_Z^{(j)}$. These latents are then transformed by an injective mixing function $g: \mathbb{R}^d \rightarrow \mathbb{R}^n$ to produce observations $x \in \mathbb{R}^n$.

$$z \sim p_Z^{(j)}, \quad x \leftarrow g(z)$$
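To make the setup concrete, here is a minimal simulation sketch of such a multi-domain DGP, assuming independent Gaussian latents whose noise scale shifts only on an unstable subset and a simple quadratic mixing as a stand-in for a generic polynomial $g$. The dimensions, domain count, and variance ranges are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 6, 20, 8                         # latent dim, observation dim, number of domains
stable, unstable = [0, 1, 2], [3, 4, 5]    # S keeps a fixed noise scale, U does not

# quadratic mixing g(z) = A [z, z**2] as a simple polynomial stand-in
A = rng.normal(size=(n, 2 * d))
def g(z):
    return np.concatenate([z, z ** 2], axis=1) @ A.T

domains = []
for j in range(k):
    scale = np.ones(d)
    scale[unstable] = rng.uniform(0.5, 3.0, size=len(unstable))  # domain-specific shift on U
    z = rng.normal(scale=scale, size=(2000, d))                  # z ~ p_Z^(j)
    domains.append({"z": z, "x": g(z)})                          # x = g(z)
```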

The goal is to learn an encoder $f: \mathbb{R}^n \rightarrow \mathbb{R}^d$ such that its output $\hat{z} = f(x)$ is a good estimate of the true latent $z$. This is typically done by training an autoencoder $(f, h)$ (where $h$ is the decoder) to satisfy the reconstruction identity $h \circ f(x) = x$.

The key innovation is to divide the latent components $z$ into a stable set $\mathcal{S}$ and an unstable set $\mathcal{U}$, so $z = [z_{\mathcal{S}}, z_{\mathcal{U}}]$. The principle is that some functional $F$ of the marginal distribution of $z_{\mathcal{S}}$, i.e., $F[p_{z_{\mathcal{S}}^{(j)}}]$, remains invariant across domains $j$. The proposed autoencoders incorporate this by enforcing a similar invariance on a subset $\hat{\mathcal{S}}$ of the estimated latents $\hat{z}$:

$$h \circ f(x) = x, \quad \forall x \in \mathcal{X}$$

$$F[p_{\hat{z}_{\hat{\mathcal{S}}}^{(p)}}] = F[p_{\hat{z}_{\hat{\mathcal{S}}}^{(q)}}], \quad \forall p \neq q, \; p, q \in [k]$$

The learner can find a suitable $\hat{\mathcal{S}}$ by starting with the largest possible set and reducing its size until a solution satisfying both reconstruction and invariance is found.
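The shrinking idea can be illustrated with a small greedy heuristic: start from all learned coordinates and discard the most variable ones until a chosen distributional functional (variance, used here as an assumed example) agrees across domains. This is an illustration only, not the paper's exact procedure, and the function names and tolerance are hypothetical.

```python
import numpy as np

def invariance_violation(z_hat_by_domain, idx, functional=np.var):
    """Max pairwise gap of a distributional functional (e.g. variance)
    of the candidate-stable coordinates `idx` across domains."""
    stats = [functional(z[:, idx], axis=0) for z in z_hat_by_domain]
    return max(np.max(np.abs(a - b)) for a in stats for b in stats)

def select_stable_subset(z_hat_by_domain, tol=1e-2):
    """Shrink the candidate set from the full dimension downward until the
    invariance check passes (greedy sketch, not the paper's algorithm)."""
    d = z_hat_by_domain[0].shape[1]
    candidate = list(range(d))
    while candidate:
        if invariance_violation(z_hat_by_domain, candidate) < tol:
            return candidate
        # drop the coordinate whose functional varies most across domains (heuristic)
        per_coord = [invariance_violation(z_hat_by_domain, [i]) for i in candidate]
        candidate.pop(int(np.argmax(per_coord)))
    return []
```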

Theoretical Identification Guarantees

The paper provides theoretical guarantees for identifying the stable latents $z_{\mathcal{S}}$ under different assumptions about the latent distribution $p_Z$ and the mixing function $g$.

1. Acyclic Structural Causal Models for $p_Z$

It is initially assumed that the latent distribution $p_Z$ is induced by an acyclic structural causal model (SCM). The approach first leverages prior results (Theorem 1, from (Ciliberto, 2020)) showing that autoencoders with polynomial decoders (Constraint \ref{assm3: h_poly_new}) and polynomial mixing functions (Assumption \ref{assm1: dgp1}) achieve affine identification: $\hat{z} = Az + c$.
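Affine identification can be checked empirically: if $\hat{z} = Az + c$, a linear regression from the true latents $z$ to the learned $\hat{z}$ should fit almost perfectly. A minimal diagnostic of that kind might look as follows; the probe choice and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def affine_identification_score(z_true, z_hat):
    """R^2 of the best affine map z -> z_hat; a value near 1 is consistent
    with z_hat being an affine transformation of the true latents."""
    reg = LinearRegression().fit(z_true, z_hat)
    return reg.score(z_true, z_hat)

# e.g. affine_identification_score(z, encoder_output) > 0.99 as a sanity check
```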

  • Single-Node Imperfect Interventions (Theorem 2):
    • DGP: Latents are generated by $z_i^{(j)} \leftarrow q_i(z_{\mathrm{Pa}(i)}^{(j)}) + \varrho_i^{(j)}$, where the noise $\varrho_i^{(j)}$ can change across domains for $i \in \mathcal{U}$.
    • Assumption \ref{assm: imp_int_structure}: Only one node in $\mathcal{U}$ is imperfectly intervened on in each interventional domain. Nodes in $\mathcal{S}$ are never intervened on. Children of any node in $\mathcal{U}$ must also be in $\mathcal{U}$.
    • Constraint \ref{assm: dist_inv} (Marginal Invariance): The marginal distribution $p_{\hat{z}_i^{(p)}}$ is enforced to be the same across domains for each $i \in \hat{\mathcal{S}}$.
    • Result: Achieves block-affine identification, $\hat{z}_{\hat{\mathcal{S}}} = D z_{\mathcal{S}} + e$, meaning the learned stable latents are an affine transformation of the true stable latents, disentangled from $z_{\mathcal{U}}$.
  • Multi-Node Imperfect Interventions (Theorem 3):
    • Assumption \ref{assm: multi_int_str}: Allows imperfect interventions on multiple nodes in $\mathcal{U}$ simultaneously. Assumes Gaussian noise $\varrho_i$ with variances sampled i.i.d. from a non-atomic density, and requires sufficiently many random multi-node interventions (a simulation sketch of this setup follows the list).
    • Result: With high probability (if the number of interventions $t$ is large enough, scaling with $d \log(d/\delta)$), block-affine identification $\hat{z}_{\hat{\mathcal{S}}} = D z_{\mathcal{S}} + e$ is achieved.
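The multi-node imperfect-intervention setting of Theorem 3 can be simulated roughly as follows: a linear acyclic SCM over the latents whose unstable nodes have their Gaussian noise variances redrawn in each domain. The linear SCM, the variance ranges, and the number of intervened nodes are simplifying assumptions of this sketch; the paper's setting is more general.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
stable, unstable = [0, 1, 2], [3, 4, 5]

# Lower-triangular weights give an acyclic linear SCM: z_i <- W[i] @ z + eps_i.
# Listing stable nodes first ensures children of unstable nodes are unstable.
W = np.tril(rng.normal(size=(d, d)), k=-1)

def sample_domain(n_samples=2000, n_targets=2):
    noise_var = np.ones(d)
    # multi-node imperfect intervention: redraw noise variances of some U nodes,
    # i.i.d. from a continuous (non-atomic) density
    targets = rng.choice(unstable, size=n_targets, replace=False)
    noise_var[targets] = rng.uniform(0.2, 4.0, size=n_targets)
    eps = rng.normal(scale=np.sqrt(noise_var), size=(n_samples, d))
    z = np.zeros((n_samples, d))
    for i in range(d):                       # ancestral (topological) sampling
        z[:, i] = z @ W[i] + eps[:, i]
    return z
```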

2. General Distributions $p_Z$ (Relaxing the Fixed DAG)

This section studies scenarios where a single fixed DAG might not describe the relationships between latents across all data. A weaker form of invariance, marginal support invariance, is considered.

  • Polynomial Mixing (Theorem 4):
    • Assumption \ref{assm: supp_invar}: The minimum and maximum of each true latent $z_i$ for $i \in \mathcal{S}$ are invariant across domains.
    • Assumption \ref{assm: sup_var} (Support Variability): There exist two domains $p, q$ such that for each $z \in \mathcal{Z}^{(p)}$ there is a $z' \in \mathcal{Z}^{(q)}$ with $z'_i \geq z_i$ for all $i$ and $z'_j > z_j$ for all unstable components $j \in \mathcal{U}$.
    • Constraint \ref{assm: supp_inv} (Marginal Support Invariance): The min/max of each learned latent $\hat{z}_i$ for $i \in \hat{\mathcal{S}}$ is enforced to be invariant across domains (a small sketch of this support check follows the list).
    • Result: If the affine transformation $A_i$ (from $\hat{z}_i = A_i^T z + c_i$) lies in the positive orthant ($A_i \succcurlyeq 0$), then $A_{ir} = 0$ for all $r \in \mathcal{U}$; that is, $\hat{z}_i$ depends only on $z_{\mathcal{S}}$.
    • Implementation Consideration: Extending this beyond the positive orthant requires checking all $2^d$ orthants, potentially needing $2^{d+1}$ domains. This can be reduced to $d$ domains if the support is a polytope and satisfies certain diversity conditions (Appendix \ref{sec:poly}).
  • General Diffeomorphisms (Theorem 5, illustrated with two variables):
    • Considers $z = [z_1, z_2]$, where the support of $z_1$ is invariant ($[0,1]$) and the support of $z_2$ varies.
    • Definition \ref{def: lipschitz_a}: Defines a class of functions $\Gamma$ (parameterized by $\theta$) whose global minimum over $[0,1] \times [0,1]$ differs by at least $\eta$ from the minimum when $z_2$ is constrained to certain sub-intervals. Functions that depend only on $z_1$ are not in $\Gamma$.
    • Assumption \ref{assm: diverse_int} (Support Variability for $z_2$): Requires that the support of $z_2$ in randomly drawn domains has a certain probability of being contained in small intervals or of covering large portions such as $[\kappa, 1-\kappa]$.
    • Result ($\Gamma^c$ identification): If enough diverse domains are sampled (number $k \geq N(\delta, \varepsilon, \eta, \iota)$), the learned map $a_1(\cdot)$ (where $\hat{z}_1 = a_1(z_1, z_2)$) will not belong to $\Gamma$, pushing $a_1(\cdot)$ towards being a function of $z_1$ only.
    • Implementation Consideration: The number of required domains $N$ depends on properties of the function class $\Gamma$ and on the diversity of the supports.
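A minimal version of the marginal support check used in this section: compute per-coordinate empirical min/max in every domain and flag coordinates whose support barely moves. The tolerance and the use of raw batch extremes are simplifying assumptions of this sketch.

```python
import numpy as np

def support_gap(z_by_domain, i):
    """Max cross-domain discrepancy of the empirical [min, max] support of coordinate i."""
    mins = np.array([z[:, i].min() for z in z_by_domain])
    maxs = np.array([z[:, i].max() for z in z_by_domain])
    return max(mins.max() - mins.min(), maxs.max() - maxs.min())

def stable_by_support(z_by_domain, tol=0.05):
    """Coordinates whose empirical support is (approximately) invariant across domains."""
    d = z_by_domain[0].shape[1]
    return [i for i in range(d) if support_gap(z_by_domain, i) < tol]
```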

Learning Invariance-Constrained Representations in Practice

A two-stage learning procedure is proposed:

  1. Stage 1: Train an initial autoencoder $(\tilde{f}, \tilde{h})$ to minimize the reconstruction error $\mathbb{E}[\|\tilde{h} \circ \tilde{f}(x) - x\|^2]$. Let $\tilde{x} = \tilde{f}(x)$ denote the output of this stage's encoder.
  2. Stage 2: Train a second autoencoder $(f^{\star}, h^{\star})$ using $\tilde{x}$ as input. The objective combines the reconstruction error with a penalty term for violating the invariance:

     $$\mathbb{E}[\|h^{\star} \circ f^{\star}(\tilde{x}) - \tilde{x}\|^2] + \lambda \cdot \text{penalty}$$

     Two types of penalties are explored (a code sketch of both follows this list):

     • Min-Max Support Invariance Penalty (Equation \ref{eqn: pen_minmax}):

       $$\sum_{p \neq q} \sum_{i \in \hat{\mathcal{S}}} \left( \Big(\min_{z \in \tilde{\mathcal{Z}}_i^{(p)}} z - \min_{z \in \tilde{\mathcal{Z}}_i^{(q)}} z\Big)^2 + \Big(\max_{z \in \tilde{\mathcal{Z}}_i^{(p)}} z - \max_{z \in \tilde{\mathcal{Z}}_i^{(q)}} z\Big)^2 \right)$$

       where $\tilde{\mathcal{Z}}_i^{(p)}$ is the support of the $i$-th component of $f^{\star}(\tilde{x})$ in domain $p$.

     • MMD-based Distribution Invariance Penalty (Equation \ref{eqn: pen_mmd}):

       $$\sum_{p \neq q} \mathrm{MMD}\big(p_{\hat{z}_{\hat{\mathcal{S}}}^{(p)}}, p_{\hat{z}_{\hat{\mathcal{S}}}^{(q)}}\big)$$

       This measures the maximum mean discrepancy between the joint distributions of the selected latent subset $\hat{z}_{\hat{\mathcal{S}}}$ across domain pairs.
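Below is a hedged PyTorch sketch of both penalties, summed over all domain pairs as in the equations above. The robust top-k min/max proxy (mirroring the training detail noted later) and the fixed RBF bandwidth are assumptions; the function names are illustrative, not from the paper's code.

```python
import torch

def minmax_penalty(z_p, z_q, idx, topk=10):
    """Squared gap between robust per-coordinate min/max supports in two domains.
    Uses the mean of the top-k extreme values (assumes batch size >= topk)."""
    zp, zq = z_p[:, idx], z_q[:, idx]
    top_p, _ = zp.topk(topk, dim=0); top_q, _ = zq.topk(topk, dim=0)
    bot_p, _ = (-zp).topk(topk, dim=0); bot_q, _ = (-zq).topk(topk, dim=0)
    max_gap = (top_p.mean(0) - top_q.mean(0)).pow(2).sum()
    min_gap = (bot_p.mean(0) - bot_q.mean(0)).pow(2).sum()
    return max_gap + min_gap

def rbf_mmd(z_p, z_q, idx, bandwidth=1.0):
    """Biased empirical MMD^2 with an RBF kernel between two domains' latents."""
    a, b = z_p[:, idx], z_q[:, idx]
    def k(x, y):
        return torch.exp(-torch.cdist(x, y).pow(2) / (2 * bandwidth ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

def invariance_penalty(z_by_domain, idx, use_mmd=True, use_minmax=True):
    """Sum of the chosen penalties over all ordered domain pairs p < q."""
    pen = 0.0
    for p in range(len(z_by_domain)):
        for q in range(p + 1, len(z_by_domain)):
            if use_minmax:
                pen = pen + minmax_penalty(z_by_domain[p], z_by_domain[q], idx)
            if use_mmd:
                pen = pen + rbf_mmd(z_by_domain[p], z_by_domain[q], idx)
    return pen
```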

Empirical Findings

Experiments were conducted on four types of datasets with varying mixing functions $g$ and latent distributions $p_Z$:

  1. Linear mixing: $x = Az$.
  2. Polynomial mixing: $g(z)$ is a polynomial function.
  3. Image rendering of balls: Latent variables are ball coordinates, and $g$ is an image renderer.
  4. Unlabeled colored MNIST: Digits are (implicitly) $z_{\mathcal{S}}$, color is $z_{\mathcal{U}}$.

For each, two types of $p_Z$ were studied:

  • Independent latents: $z_{\mathcal{S}}$ and $z_{\mathcal{U}}$ are independent.
  • Dependent latents (Dynamic SCM, D-SCM): The SCM for the latents varies across data points, inducing dependencies.

Implementation of Experiments:

  • The two-stage procedure described above was used (a minimal end-to-end sketch follows this list).
  • For linear data, Stage 2 was applied directly.
  • For polynomial data (Stage 1: MLP encoder, polynomial decoder) and image data (Stage 1: ResNet encoder, ConvNet decoder), MLP autoencoders were used in Stage 2.
  • Three penalty variations were tested: Min-Max, MMD, and MMD + Min-Max.
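For concreteness, a minimal end-to-end wiring of the two-stage recipe might look like the following, assuming encoders/decoders and a penalty function (such as `invariance_penalty` sketched earlier) are supplied. The full-domain batching, epoch counts, and function signature are simplifications, not the paper's implementation.

```python
import torch
import torch.nn as nn

def train_two_stage(x_by_domain, enc1, dec1, enc2, dec2, idx_S_hat,
                    penalty_fn, lam=1.0, epochs1=100, epochs2=100, lr=1e-3):
    """Illustrative two-stage loop: stage 1 does plain reconstruction;
    stage 2 re-encodes the stage-1 representations under an invariance penalty."""
    mse = nn.MSELoss()

    # Stage 1: reconstruction-only autoencoder on raw observations
    opt1 = torch.optim.Adam(list(enc1.parameters()) + list(dec1.parameters()), lr=lr)
    for _ in range(epochs1):
        for x in x_by_domain:                       # one full-domain "batch" for brevity
            opt1.zero_grad()
            mse(dec1(enc1(x)), x).backward()
            opt1.step()

    with torch.no_grad():
        x_tilde = [enc1(x) for x in x_by_domain]    # stage-1 representations

    # Stage 2: reconstruction + lambda * invariance penalty on selected coordinates
    opt2 = torch.optim.Adam(list(enc2.parameters()) + list(dec2.parameters()), lr=lr)
    for _ in range(epochs2):
        z_hat = [enc2(xt) for xt in x_tilde]
        recon = sum(mse(dec2(z), xt) for z, xt in zip(z_hat, x_tilde))
        opt2.zero_grad()
        (recon + lam * penalty_fn(z_hat, idx_S_hat)).backward()
        opt2.step()
    return enc2, dec2
```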

Evaluation Metrics:

  • For synthetic/balls data: $R^2_{\mathcal{S}}$ (R-squared for predicting $z_{\mathcal{S}}$ from $\hat{z}_{\hat{\mathcal{S}}}$) and $R^2_{\mathcal{U}}$ (R-squared for predicting $z_{\mathcal{U}}$ from $\hat{z}_{\hat{\mathcal{S}}}$). The ideal outcome is high $R^2_{\mathcal{S}}$ and low $R^2_{\mathcal{U}}$ (a small sketch of these metrics follows the list).
  • For unlabeled colored MNIST: $Acc_{\text{digits}}$ (accuracy of predicting the digit from $\hat{z}_{\hat{\mathcal{S}}}$) and $R^2_{\text{color}}$ (R-squared for predicting the color from $\hat{z}_{\hat{\mathcal{S}}}$).
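A small sketch of the $R^2$ metrics, assuming a linear probe (scikit-learn's `LinearRegression`) from the learned stable block to the ground-truth latents; the probe choice is an assumption of this sketch.

```python
from sklearn.linear_model import LinearRegression

def r2_block(z_hat_S, z_block):
    """R^2 (averaged over coordinates) of linearly predicting a ground-truth
    latent block from the learned stable latents z_hat_S."""
    reg = LinearRegression().fit(z_hat_S, z_block)
    return reg.score(z_hat_S, z_block)

# Ideal outcome: r2_block(z_hat_S, z_S) is high (z_S is captured) while
# r2_block(z_hat_S, z_U) is low (the unstable latents are filtered out).
```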

Key Results:

  • For linear and polynomial mixing, all three penalty types performed well in achieving block-affine disentanglement.
  • For the more complex ball-images and unlabeled colored MNIST datasets, the combination "MMD + Min-Max" penalty worked best.
  • The approach achieved notable disentanglement on the challenging unlabeled colored MNIST without using any labels during training.
  • Increasing the number of domains ($k$) generally improved identification, and the number of domains required for useful identification was often smaller than the worst-case theoretical bounds. For instance, going from $k=2$ to $k=16$ showed significant improvements (Tables \ref{table5_results}, \ref{table6_results}).

Architectural Details for Experiments:

  • Polynomial Mixing (Stage 1):
    • Encoder: MLP (input: $n \rightarrow n/2 \rightarrow n/2 \rightarrow d$) with LeakyReLU.
    • Decoder: Polynomial decoder with learnable coefficient matrix.
  • Balls Dataset (Stage 1):
    • Encoder: ResNet18.
    • Decoder: Standard deconvolutional layers.
    • Encoder output: 128-dim, invariance on first 64-dim.
  • Unlabeled Colored MNIST (Stage 1):
    • Encoder: Linear ($784 \rightarrow 256 \rightarrow 256 \rightarrow 128$) with ReLU & BatchNorm.
    • Decoder: Symmetric.
  • Unlabeled Colored MNIST (Stage 2):
    • Encoder: Linear ($128 \rightarrow 200 \rightarrow 200 \rightarrow 200 \rightarrow 128$) with LeakyReLU & BatchNorm.
    • Decoder: Symmetric to the encoder (a sketch of both MNIST autoencoders follows this list).
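One plausible PyTorch reading of the colored-MNIST layer widths listed above; the placement of BatchNorm and activations, and the exact symmetric decoders, are assumptions of this sketch.

```python
import torch.nn as nn

def mlp(dims, act):
    """Fully connected stack with BatchNorm + activation between hidden layers."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers += [nn.BatchNorm1d(dims[i + 1]), act()]
    return nn.Sequential(*layers)

# Stage 1: 784 -> 256 -> 256 -> 128 encoder (ReLU + BatchNorm), symmetric decoder
stage1_enc = mlp([784, 256, 256, 128], nn.ReLU)
stage1_dec = mlp([128, 256, 256, 784], nn.ReLU)

# Stage 2: 128 -> 200 -> 200 -> 200 -> 128 encoder (LeakyReLU + BatchNorm), symmetric decoder
stage2_enc = mlp([128, 200, 200, 200, 128], nn.LeakyReLU)
stage2_dec = mlp([128, 200, 200, 200, 128], nn.LeakyReLU)
```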

Training Details:

  • Optimizer: Adam (lr $= 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$); see the configuration sketch after this list.
  • LR scheduler: Reduce on plateau (factor 0.5, patience 10 epochs, min_lr $10^{-4}$).
  • Batch size: 1024. Early stopping at 2000 steps.
  • Invariance penalty weight ($\lambda$): 1.0.
  • MMD kernel: RBF (bandwidth 1.0, adaptive for linear mixing).
  • Min-Max penalty: Batch values are sorted and the top 10 values are used as a robust min/max estimate.
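The optimizer and scheduler settings above translate into roughly the following configuration; the placeholder model and the epoch-level `scheduler.step` call are assumptions of this sketch.

```python
import torch

model = torch.nn.Linear(128, 128)   # placeholder for the stage-2 autoencoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=10, min_lr=1e-4)

lam = 1.0  # weight on the invariance penalty
# Per step: loss = reconstruction_mse + lam * invariance_penalty, batch size 1024.
# Call scheduler.step(validation_loss) each epoch; stop training after 2000 steps.
```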

Conclusions

The paper significantly advances multi-domain causal representation learning by relaxing strong assumptions and introducing a framework based on weak distributional invariances. It demonstrates theoretically and empirically that autoencoders constrained by these invariances can identify stable latent factors from unstable ones under complex domain shifts, including multi-node imperfect interventions and settings where a fixed DAG does not govern the entire dataset. The proposed methods show promise for real-world applications where data comes from diverse sources with varying underlying conditions.
