
Hierarchical Causal Models (2401.05330v2)

Published 10 Jan 2024 in stat.ME and stat.ML

Abstract: Scientists often want to learn about cause and effect from hierarchical data, collected from subunits nested inside units. Consider students in schools, cells in patients, or cities in states. In such settings, unit-level variables (e.g. each school's budget) may affect subunit-level variables (e.g. the test scores of each student in each school) and vice versa. To address causal questions with hierarchical data, we propose hierarchical causal models, which extend structural causal models and causal graphical models by adding inner plates. We develop a general graphical identification technique for hierarchical causal models that extends do-calculus. We find many situations in which hierarchical data can enable causal identification even when it would be impossible with non-hierarchical data, that is, if we had only unit-level summaries of subunit-level variables (e.g. the school's average test score, rather than each student's score). We develop estimation techniques for hierarchical causal models, using methods including hierarchical Bayesian models. We illustrate our results in simulation and via a reanalysis of the classic "eight schools" study.


Summary

  • The paper proposes a framework using hierarchical causal models to identify causal effects in nested data, overcoming limitations of aggregated methods.
  • It introduces a stepwise approach that collapses, augments, and marginalizes graphs to enable standard do-calculus for causal inference.
  • The study demonstrates practical estimation techniques, including hierarchical Bayesian models, with applications like the eight schools reanalysis.

This paper introduces Hierarchical Causal Models (HCMs) as an extension of structural causal models (SCMs) and causal graphical models (CGMs) designed for analyzing hierarchical data, where subunits are nested within units (e.g., students in schools, cells in patients). The primary contribution is a framework for identifying and estimating causal effects in such settings, particularly when unobserved unit-level confounders are present. The authors demonstrate that leveraging subunit-level data can enable causal identification in scenarios where it would be impossible with only aggregated unit-level data.

Key Concepts and Definitions:

  • Hierarchical Data: Data collected from subunits (e.g., individual students $j$ in school $i$) nested within units (e.g., school $i$). Variables can be unit-level ($X_i$) or subunit-level ($A_{ij}, Y_{ij}$).
  • Hierarchical Structural Causal Models (HSCMs): These extend SCMs by incorporating plates for subunits.
    • Subunit-level variables ($Y_{ij}$) are generated by a deterministic mechanism $f^y$ that takes as input its endogenous parents (both unit-level $X_i$ and subunit-level $A_{ij}$), unit-level exogenous noise $\gamma_i^y$ (shared by all subunits in unit $i$), and subunit-level exogenous noise $\epsilon_{ij}^y$ (specific to subunit $j$ in unit $i$).

      $$y_{ij} = f^y(x_i, \gamma_i^y, a_{ij}, \epsilon^y_{ij})$$

    • Unit-level variables ($Y_i$) can depend on other unit-level variables ($X_i, \gamma_i^y$) and an entire set of subunit-level variables and their noise within that unit ($\{(a_{ij}, \epsilon^y_{ij})\}_{j=1}^m$). The mechanism $f^y$ must be invariant to the order of these subunit-level inputs.

      $$y_i = f^y(x_i, \gamma_i^y, \{(a_{ij}, \epsilon^y_{ij})\}_{j=1}^m)$$

  • Hierarchical Causal Graphical Models (HCGMs): Derived from HSCMs by integrating out the exogenous noise variables, resulting in stochastic mechanisms.
    • For subunit variables, this introduces $Q$ variables, which are unit-level distributions. For example, $Q_i^{y|a}$ represents the conditional distribution $p(Y_{ij} \mid A_{ij})$ specific to unit $i$. This $Q_i^{y|a}$ is itself drawn from a distribution $p(q^{y|a} \mid x_i)$ that depends on unit-level parents.

      $$Q_i^{y \mid a} \sim p\big(q^{y \mid a} \,\big|\, x_i\big)$$

      $$Y_{ij} \sim q_i^{y \mid a}(y \mid a_{ij})$$

    • Interventions on subunit variables (e.g., $do(A_{ij} \sim q_\star^a(a))$) can be conceptualized as hard interventions on their corresponding $Q$ variables (e.g., $do(Q_i^a = q_\star^a)$).
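
The generative story above can be sketched in simulation. The following is a minimal illustration, not the paper's code: the Beta and Gaussian parameterizations of $q_i^a$ and $q_i^{y|a}$ are assumptions chosen for concreteness, standing in for arbitrary unit-level distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, m_subunits = 500, 50

def sample_unit(rng):
    # Unit-level "Q" variables: each unit i carries its own distributions
    # q_i^a (here a Bernoulli rate) and q_i^{y|a} (here a linear-Gaussian
    # conditional), themselves drawn from population-level distributions.
    rate_a = rng.beta(2.0, 2.0)               # parameterizes q_i^a (assumed form)
    intercept = rng.normal(0.0, 1.0)          # parameterize q_i^{y|a} (assumed form)
    slope = rng.normal(2.0, 0.5)
    a = rng.binomial(1, rate_a, size=m_subunits)                   # A_ij ~ q_i^a
    y = intercept + slope * a + rng.normal(0, 1, size=m_subunits)  # Y_ij ~ q_i^{y|a}(. | a_ij)
    return a, y

units = [sample_unit(rng) for _ in range(n_units)]
```

An intervention $do(Q_i^a = q_\star^a)$ would correspond to replacing `rate_a` with a fixed value for every unit while leaving the $q_i^{y|a}$ mechanism untouched.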

Identification Strategy:

The paper proposes a multi-step graphical procedure to determine if a causal effect is identifiable from an HCGM, assuming infinite data from both units and subunits. The core idea is to transform the HCGM into an equivalent "flat" (non-hierarchical) CGM to which standard do-calculus can be applied.

  1. Collapse: The HCGM is "collapsed" into a flat CGM by treating the unit-level $Q$ variables (e.g., $Q_i^a, Q_i^{y|a}$) as endogenous observed variables (assuming infinite subunit data, $m \to \infty$). Subunit-level endogenous variables are removed. The mechanisms for unit-level variables that depend on subunit-level variables must converge to depend on the $Q$ distributions of those subunit variables. For example, $p(Z_i \mid \{A_{ij}\}_{j=1}^m)$ must converge to $p(Z_i \mid Q_i^a)$.
    • Algorithm: \Cref{alg:collapse} in the paper details this graphical transformation.
    • Assumption: Mechanism convergence (\Cref{asm:inf-subunit}) is key for this step.
  2. Augment: The collapsed model can be augmented by adding new unit-level endogenous variables that are deterministic functions of existing $Q$ variables. For instance, a variable $Q_i^y$ representing the marginal distribution of $Y$ within unit $i$ can be added, where $q_i^y = \int q_i^a(a)\, q_i^{y|a}(y \mid a)\, da$.
    • Algorithm: \Cref{alg:augment} details adding such variables.
  3. Marginalize: Some variables (typically parents of an augmented variable that have only the augmented variable as a child) can be marginalized out of the augmented collapsed model. This can help satisfy positivity assumptions for do-calculus or simplify the graph.
    • Algorithm: \Cref{alg:marginalize} details this removal.
  4. Apply Do-Calculus: Standard do-calculus is applied to the final (collapsed, augmented, and/or marginalized) flat CGM to derive an identification formula.
    • Assumptions: Subunit-level positivity (\Cref{assume:subunit_positive}) ensures $Q$ variables are well-defined. Unit-level positivity (\Cref{assume:unit_positive}, \Cref{asm:positivity_general}) is needed for do-calculus.
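
The augment step's determinism is easiest to see in the discrete case, where the integral $q_i^y = \int q_i^a(a)\, q_i^{y|a}(y \mid a)\, da$ becomes a sum. A minimal sketch with illustrative numbers:

```python
import numpy as np

# Discrete illustration of the "augment" step: given a unit's treatment
# distribution q_i^a and conditional q_i^{y|a}, the marginal q_i^y is a
# deterministic function of the two (a matrix product replaces the integral).
q_a = np.array([0.3, 0.7])               # q_i^a: P(A=0), P(A=1) within unit i
q_y_given_a = np.array([[0.9, 0.1],      # q_i^{y|a}: P(Y=y | A=0)
                        [0.2, 0.8]])     #            P(Y=y | A=1)

q_y = q_a @ q_y_given_a                  # q_i^y(y) = sum_a q_i^a(a) q_i^{y|a}(y|a)
print(q_y)                               # marginal distribution of Y within unit i
```

Because $Q_i^y$ is a deterministic child of $Q_i^a$ and $Q_i^{y|a}$, adding it never changes the model's distribution; it only exposes structure that do-calculus can exploit.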

Estimation:

Once an estimand is identified, the paper outlines how to estimate it from finite data $\{x_i^{\mathcal{U}_{obs}}, \{x_{ij}^{\mathcal{S}_{obs}}\}_{j=1}^m\}_{i=1}^n$.

  1. Estimate Per-Unit $Q$ Variables: For each unit $i$, estimate the relevant $Q$ distributions from its subunit data. For example, $\hat{q}_i^{y|a}$ can be estimated by fitting a regression model (e.g., linear regression, logistic regression, or a neural network) to predict $Y_{ij}$ from $A_{ij}$ using data $\{a_{ij}, y_{ij}\}_{j=1}^m$ from unit $i$.
  2. Estimate Population Distributions: Use the estimated per-unit $Q$ variables $\{\hat{q}_i\}$ from all $n$ units to estimate population-level distributions, such as $p(q^{y|a} \mid q^a, z)$ or $p(z \mid q^a)$. This often involves fitting models where the parameters of the $Q$ distributions are themselves outcomes or predictors (e.g., using logistic regression to predict $Z_i$ from the parameters of $\hat{q}_i^a$).
  3. Plug into Identification Formula: Substitute these estimated components into the identification formula derived via do-calculus. This often involves averaging or integrating over the empirical distribution of estimated $Q$ variables, or over samples from their estimated population distribution.
  4. Hierarchical Bayesian Models: The estimation process naturally lends itself to hierarchical Bayesian modeling. For example, $q_i^{y|a}$ can be parameterized (e.g., by the coefficients of a linear model $\mu_i^{y|a}(a)$), and these parameters can be assumed to be drawn from a population distribution $p(\mu^{y|a} \mid u_i)$. This approach was used in the "Eight Schools" example.
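
Steps 1 and 2 can be sketched as follows. This is an illustrative linear-Gaussian working model for $\hat{q}_i^{y|a}$ (an assumption made here for concreteness, not a form prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 100

# Step 1: per-unit estimation of Q variables. For each unit, fit a linear
# model Y_ij ~ b0_i + b1_i * A_ij as a working parameterization of q_i^{y|a}.
per_unit_params = []
for _ in range(n):
    b0, b1 = rng.normal(0, 1), rng.normal(2, 0.5)   # true unit-level parameters
    a = rng.normal(size=m)
    y = b0 + b1 * a + rng.normal(0, 1, size=m)
    X = np.column_stack([np.ones(m), a])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # \hat{q}_i^{y|a} for unit i
    per_unit_params.append(beta_hat)

# Step 2: the fitted per-unit parameters are themselves data for
# population-level models; here we just summarize their mean.
per_unit_params = np.asarray(per_unit_params)
print(per_unit_params.mean(axis=0))   # approaches [0, 2] as n and m grow
```

In a richer pipeline the fitted $\hat{q}_i$ would feed into regressions such as $p(z \mid q^a)$, and finally into the plug-in identification formula of step 3.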

Illustrative Examples from the Paper:

The paper uses three running examples to illustrate the concepts:

  1. Confounder (Fig. 1b, \Cref{fig:hcm_confounder}): $U_i \to A_{ij}$, $U_i \to Y_{ij}$, $A_{ij} \to Y_{ij}$. The effect $E[Y \mid do(a = a_\star)]$ is identified by averaging per-school predictions: $\frac{1}{n} \sum_i \hat{E}[Y \mid A = a_\star, U = u_i]$. This is essentially stratifying by school, where each school has a fixed $U_i$.
    • Implementation: For each school $i$, fit a model $\hat{\mu}_i(a) \approx E[Y \mid A = a, U = u_i]$ using student data $\{A_{ij}, Y_{ij}\}$. The estimate is $\frac{1}{n} \sum_i \hat{\mu}_i(a_\star)$.
  2. Confounder with Interference (Fig. 2e, \Cref{fig:hcm_interfere}): $U_i \to A_{ij}$, $U_i \to Y_{ij}$, $A_{ij} \to Z_i$, $Z_i \to Y_{ij}$. Here, subunit treatments $A_{ij}$ influence a unit-level variable $Z_i$ (e.g., school-wide discussion), which in turn affects all subunit outcomes $Y_{ij}$. Identification involves a front-door adjustment via $Z_i$ in the collapsed model.
    • Implementation: Requires estimating $p(Z_i \mid Q_i^a)$ and $p(Q_i^{y|a} \mid Z_i, Q_i^a)$ from data, then applying the front-door formula.
  3. Instrument (Fig. 2i, \Cref{fig:hcm_instrument}): $U_i \to A_{ij}$, $U_i \to Y_i$, $Z_{ij} \to A_{ij}$, $A_{ij} \to Y_i$. Here, $Z_{ij}$ is a subunit-level instrument, and $Y_i$ is a unit-level outcome. Identification uses a backdoor adjustment in the collapsed/augmented/marginalized model.
    • Implementation: Estimate the per-unit $Q_i^{a|z}$ and $Q_i^a$. Estimate $p(Y_i \mid Q_i^a, Q_i^{a|z})$. Then apply the backdoor formula by averaging over the empirical distribution of $Q_i^{a|z}$.
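
The confounder example (example 1) can be demonstrated end to end. The sketch below, under an assumed linear-Gaussian data-generating process, simulates the graph $U_i \to A_{ij}$, $U_i \to Y_{ij}$, $A_{ij} \to Y_{ij}$ and compares the stratify-by-school estimator against a naive regression on pooled ("flat") data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 300, 200
tau = 1.0  # true effect of A on Y

mu_hats, all_a, all_y = [], [], []
for _ in range(n):
    u = rng.normal()                               # unobserved school-level confounder U_i
    a = u + rng.normal(size=m)                     # U_i -> A_ij
    y = tau * a + 2.0 * u + rng.normal(size=m)     # A_ij -> Y_ij, U_i -> Y_ij
    X = np.column_stack([np.ones(m), a])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)      # per-school model \hat{mu}_i(a)
    mu_hats.append(b)
    all_a.append(a); all_y.append(y)

a_star = 1.0
# Hierarchical estimate: average per-school predictions at a = a_star.
est = np.mean([b0 + b1 * a_star for b0, b1 in mu_hats])

# Naive flat regression on pooled student data is confounded by U_i.
a_flat, y_flat = np.concatenate(all_a), np.concatenate(all_y)
Xf = np.column_stack([np.ones_like(a_flat), a_flat])
bf, *_ = np.linalg.lstsq(Xf, y_flat, rcond=None)
naive = bf[0] + bf[1] * a_star

print(est, naive)  # est is close to tau * a_star = 1.0; naive is biased upward
```

Each per-school regression holds $U_i$ fixed implicitly, so averaging $\hat{\mu}_i(a_\star)$ over schools recovers the interventional mean, whereas the pooled slope absorbs the confounding path through $U_i$.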

Application: Eight Schools Reanalysis (\Cref{sec:eight_schools})

The authors reanalyze the classic "eight schools" dataset, where students (subunits) in schools (units) were randomized to a test preparation program ($A_{ij}$) and their SAT scores ($Y_{ij}$) were measured, controlling for pre-treatment scores ($X_{ij}$).

  • Standard Analysis as HCM: The standard hierarchical Bayesian analysis is shown to be equivalent to estimation under an HCM assuming $A_{ij}$ is randomized within schools, possibly with unobserved school-level confounders $U_i$ affecting $X_{ij}, A_{ij}, Y_{ij}$ (\Cref{fig:school_confound_hcm}). The estimand $E[Y \mid do(A=1)] - E[Y \mid do(A=0)]$ is identified.
  • Adding Interference: They extend the model to include potential interference: students' enrollment in the program ($A_{ij}$) might affect class size ($C_i$), which then affects scores ($Y_{ij}$) (\Cref{fig:school_interfere_hcm_0}).
    • The identification formula becomes more complex (\Cref{eqn:schools_interfere_id}), involving terms like $p(C_i \mid Q_i^a, S_i)$ (with student interest $S_i$) and $p(Q_i^{y|a} \mid Q_i^a, S_i, C_i)$.
    • Implementation: A hierarchical Bayesian model was built to estimate these components.
    • Result: While the posterior mean ATE was similar, accounting for interference substantially increased the uncertainty (a wider posterior) about the program's effectiveness.
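
The partial-pooling structure underlying the standard analysis can be sketched with the classic per-school summaries (Rubin's estimated coaching effects and standard errors). This is a minimal empirical-Bayes sketch at a fixed between-school scale $\tau$, not the paper's full hierarchical Bayesian model:

```python
import numpy as np

# Classic eight-schools summaries: estimated coaching effects and their
# standard errors, one per school (Rubin 1981).
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 24.])

def shrunken_effects(tau):
    # Normal-normal partial pooling at a fixed between-school sd tau:
    # each school's posterior mean is a precision-weighted average of its
    # own estimate and the precision-weighted grand mean.
    w = 1.0 / (sigma**2 + tau**2)
    mu_hat = np.sum(w * y) / np.sum(w)
    lam = tau**2 / (tau**2 + sigma**2)   # weight on the school's own data
    return lam * y + (1 - lam) * mu_hat

print(shrunken_effects(tau=5.0))   # per-school estimates pulled toward the grand mean
```

A full analysis would place a prior on $\tau$ (and, in the HCM reanalysis, on the interference components) rather than fixing it, but the shrinkage mechanics are the same.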

Theoretical Insights:

  • When Hierarchy Helps: Hierarchy (subunit-level data) primarily aids identification when the treatment variable is at the subunit level. It allows conditioning on (or marginalizing over) the unit, effectively controlling for unit-level unobserved confounders $U_i$.
    • Theorem 3 (\Cref{thm:sufficient_ID}) provides sufficient conditions: if $A$ is subunit-level, its effect is identifiable if there is no bi-directed path from $A$ to its direct unit descendants in the collapsed model, OR if $A$ has a subunit-level instrument.
  • When Hierarchy Doesn't Help (Much): If the treatment variable $A$ is at the unit level, disaggregating other variables into subunits generally does not help with identification if the effect was not already identifiable in the "erased inner plate" model (Theorem 4, \Cref{thm:unit-level-noID}).
  • Marginalization Rules (\Cref{appx:marginalization}): Unlike flat SCMs, where any variable with one child can be marginalized out, in HSCMs a unit-level variable with a subunit parent and a subunit child (an "interferer" like $Z_i$ in \Cref{fig:hcm_interfere}) cannot be marginalized out.

Practical Implications and Limitations:

  • Leveraging Modern Data: HCMs provide a formal framework for causal inference with increasingly available granular data (e.g., individual user data from apps, single-cell data in biology).
  • Nonparametric Nature: The identification results are largely nonparametric regarding the functional forms of causal mechanisms, making them applicable to complex, high-dimensional data.
  • Computational Cost: Estimation, especially using hierarchical Bayesian models with MCMC, can be computationally intensive, particularly with many units or complex per-unit models.
  • Assumptions:
    • Known Causal Graph: The method assumes the causal graph structure is known.
    • Mechanism Convergence: Unit-level mechanisms depending on subunit variables must converge as $m \to \infty$. This holds for mechanisms depending on averages of subunit features, but may not for unnormalized sums (\Cref{sec:exp-family-mechanisms}).
    • Large $n$, $m$: Theoretical guarantees (such as convergence to the collapsed model and learnability of $p(x^U, q(x^S))$) rely on large numbers of units ($n$) and subunits ($m$). Practical estimation with finite data involves standard statistical approximation errors.
    • Positivity: Standard positivity assumptions for do-calculus apply to the transformed flat model.
  • Completeness: The proposed identification procedure (collapsing, augmenting, marginalizing, then do-calculus) is not proven to be complete for HCMs. There might be identifiable effects that this procedure fails to identify because the collapsed model has inherent structural constraints not present in general flat CGMs.
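
The mechanism-convergence assumption listed above has a one-line numerical illustration: a unit-level mechanism that depends on the *average* of subunit variables stabilizes as $m$ grows, while one depending on the unnormalized *sum* does not.

```python
import numpy as np

rng = np.random.default_rng(3)

# As m grows, the subunit average converges (here to 2.0), so a mechanism
# taking the average as input converges too; the raw sum grows without
# bound, so a mechanism taking the sum as input does not.
for m in (10, 1000, 100000):
    a = rng.normal(loc=2.0, size=m)
    print(m, a.mean(), a.sum())
```

This is why collapsing replaces subunit inputs with their $Q$ distributions: any mechanism that converges can only depend on the limiting distribution, not on the raw (diverging) collection of subunit values.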

In summary, this paper offers a significant step towards principled causal inference from hierarchical data. It provides both a theoretical foundation for identification and practical guidance for estimation, bridging ideas from causal graphical models and hierarchical Bayesian statistics. The framework is particularly relevant given the rise of fine-grained, nested datasets across various scientific and engineering domains.