
Hierarchical Causal Models (2401.05330v2)

Published 10 Jan 2024 in stat.ME and stat.ML

Abstract: Scientists often want to learn about cause and effect from hierarchical data, collected from subunits nested inside units. Consider students in schools, cells in patients, or cities in states. In such settings, unit-level variables (e.g. each school's budget) may affect subunit-level variables (e.g. the test scores of each student in each school) and vice versa. To address causal questions with hierarchical data, we propose hierarchical causal models, which extend structural causal models and causal graphical models by adding inner plates. We develop a general graphical identification technique for hierarchical causal models that extends do-calculus. We find many situations in which hierarchical data can enable causal identification even when it would be impossible with non-hierarchical data, that is, if we had only unit-level summaries of subunit-level variables (e.g. the school's average test score, rather than each student's score). We develop estimation techniques for hierarchical causal models, using methods including hierarchical Bayesian models. We illustrate our results in simulation and via a reanalysis of the classic "eight schools" study.


Summary

  • The paper proposes a framework using hierarchical causal models to identify causal effects in nested data, overcoming limitations of aggregated methods.
  • It introduces a stepwise approach that collapses, augments, and marginalizes graphs to enable standard do-calculus for causal inference.
  • The study demonstrates practical estimation techniques, including hierarchical Bayesian models, with applications like the eight schools reanalysis.

This paper introduces Hierarchical Causal Models (HCMs) as an extension of structural causal models (SCMs) and causal graphical models (CGMs) designed for analyzing hierarchical data, where subunits are nested within units (e.g., students in schools, cells in patients). The primary contribution is a framework for identifying and estimating causal effects in such settings, particularly when unobserved unit-level confounders are present. The authors demonstrate that leveraging subunit-level data can enable causal identification in scenarios where it would be impossible with only aggregated unit-level data.

Key Concepts and Definitions:

  • Hierarchical Data: Data collected from subunits (e.g., individual students $j$ in school $i$) nested within units (e.g., school $i$). Variables can be unit-level ($X_i$) or subunit-level ($A_{ij}, Y_{ij}$).
  • Hierarchical Structural Causal Models (HSCMs): These extend SCMs by incorporating plates for subunits.
    • Subunit-level variables ($Y_{ij}$) are generated by a deterministic mechanism $f^y$ that takes as input its endogenous parents (both unit-level $X_i$ and subunit-level $A_{ij}$), unit-level exogenous noise $\gamma_i^y$ (shared by all subunits in unit $i$), and subunit-level exogenous noise $\epsilon_{ij}^y$ (specific to subunit $j$ in unit $i$).

      $$y_{ij} = f^y(x_i, \gamma_i^y, a_{ij}, \epsilon^y_{ij})$$

    • Unit-level variables ($Y_i$) can depend on other unit-level variables ($X_i, \gamma_i^y$) and an entire set of subunit-level variables and their noise within that unit ($\{(a_{ij}, \epsilon^y_{ij})\}_{j=1}^m$). The mechanism $f^y$ must be invariant to the order of these subunit-level inputs.

      $$y_i = f^y(x_i, \gamma_i^y, \{(a_{ij}, \epsilon^y_{ij})\}_{j=1}^m)$$

  • Hierarchical Causal Graphical Models (HCGMs): Derived from HSCMs by integrating out the exogenous noise variables, resulting in stochastic mechanisms.
    • For subunit variables, this introduces $Q$ variables, which are unit-level distributions. For example, $Q_i^{y|a}$ represents the conditional distribution $p(Y_{ij} \mid A_{ij})$ specific to unit $i$. This $Q_i^{y|a}$ is itself drawn from a distribution $p(q^{y|a} \mid x_i)$ that depends on unit-level parents.

      $$Q_i^{y \mid a} \sim p\big(q^{y \mid a} \,\big|\, x_i\big)$$

      $$Y_{ij} \sim q_i^{y \mid a}(y \mid a_{ij})$$

    • Interventions on subunit variables (e.g., $do(A_{ij} \sim q_\star^a(a))$) can be conceptualized as hard interventions on their corresponding $Q$ variables (e.g., $do(Q_i^a = q_\star^a)$).
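
The generative story above can be sketched in simulation. The following is a minimal illustration, not the paper's code: the Beta and Gaussian parameterizations of $q_i^a$ and $q_i^{y|a}$ are assumptions chosen for concreteness, standing in for arbitrary unit-level distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, m_subunits = 500, 50

def sample_unit(rng):
    # Unit-level "Q" variables: each unit i carries its own distributions
    # q_i^a (here a Bernoulli rate) and q_i^{y|a} (here a linear-Gaussian
    # conditional), themselves drawn from population-level distributions.
    rate_a = rng.beta(2.0, 2.0)               # parameterizes q_i^a (assumed form)
    intercept = rng.normal(0.0, 1.0)          # parameterize q_i^{y|a} (assumed form)
    slope = rng.normal(2.0, 0.5)
    a = rng.binomial(1, rate_a, size=m_subunits)                   # A_ij ~ q_i^a
    y = intercept + slope * a + rng.normal(0, 1, size=m_subunits)  # Y_ij ~ q_i^{y|a}(. | a_ij)
    return a, y

units = [sample_unit(rng) for _ in range(n_units)]
```

An intervention $do(Q_i^a = q_\star^a)$ would correspond to replacing `rate_a` with a fixed value for every unit while leaving the $q_i^{y|a}$ mechanism untouched.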

Identification Strategy:

The paper proposes a multi-step graphical procedure to determine if a causal effect is identifiable from an HCGM, assuming infinite data from both units and subunits. The core idea is to transform the HCGM into an equivalent "flat" (non-hierarchical) CGM to which standard do-calculus can be applied.

  1. Collapse: The HCGM is "collapsed" into a flat CGM by treating the unit-level $Q$ variables (e.g., $Q_i^a, Q_i^{y|a}$) as endogenous observed variables (assuming infinite subunit data, $m \to \infty$). Subunit-level endogenous variables are removed. The mechanisms for unit-level variables that depend on subunit-level variables must converge to depend on the $Q$ distributions of those subunit variables. For example, $p(Z_i \mid \{A_{ij}\}_{j=1}^m)$ must converge to $p(Z_i \mid Q_i^a)$.
    • Algorithm: \Cref{alg:collapse} in the paper details this graphical transformation.
    • Assumption: Mechanism convergence (\Cref{asm:inf-subunit}) is key for this step.
  2. Augment: The collapsed model can be augmented by adding new unit-level endogenous variables that are deterministic functions of existing $Q$ variables. For instance, a variable $Q_i^y$ representing the marginal distribution of $Y$ within unit $i$ can be added, where $q_i^y = \int q_i^a(a)\, q_i^{y|a}(y \mid a)\, da$.
    • Algorithm: \Cref{alg:augment} details adding such variables.
  3. Marginalize: Some variables (typically parents of an augmented variable that have only the augmented variable as a child) can be marginalized out of the augmented collapsed model. This can help satisfy positivity assumptions for do-calculus or simplify the graph.
    • Algorithm: \Cref{alg:marginalize} details this removal.
  4. Apply Do-Calculus: Standard do-calculus is applied to the final (collapsed, augmented, and/or marginalized) flat CGM to derive an identification formula.
    • Assumptions: Subunit-level positivity (\Cref{assume:subunit_positive}) ensures $Q$ variables are well-defined. Unit-level positivity (\Cref{assume:unit_positive}, \Cref{asm:positivity_general}) is needed for do-calculus.
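
The augment step's determinism is easiest to see in the discrete case, where the integral $q_i^y = \int q_i^a(a)\, q_i^{y|a}(y \mid a)\, da$ becomes a sum. A minimal sketch with illustrative numbers:

```python
import numpy as np

# Discrete illustration of the "augment" step: given a unit's treatment
# distribution q_i^a and conditional q_i^{y|a}, the marginal q_i^y is a
# deterministic function of the two (a matrix product replaces the integral).
q_a = np.array([0.3, 0.7])               # q_i^a: P(A=0), P(A=1) within unit i
q_y_given_a = np.array([[0.9, 0.1],      # q_i^{y|a}: P(Y=y | A=0)
                        [0.2, 0.8]])     #            P(Y=y | A=1)

q_y = q_a @ q_y_given_a                  # q_i^y(y) = sum_a q_i^a(a) q_i^{y|a}(y|a)
print(q_y)                               # marginal distribution of Y within unit i
```

Because $Q_i^y$ is a deterministic child of $Q_i^a$ and $Q_i^{y|a}$, adding it never changes the model's distribution; it only exposes structure that do-calculus can exploit.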

Estimation:

Once an estimand is identified, the paper outlines how to estimate it from finite data $\{x_i^{\mathcal{U}_{obs}}, \{x_{ij}^{\mathcal{S}_{obs}}\}_{j=1}^m\}_{i=1}^n$.

  1. Estimate Per-Unit $Q$ Variables: For each unit $i$, estimate the relevant $Q$ distributions from its subunit data. For example, $\hat{q}_i^{y|a}$ can be estimated by fitting a regression model (e.g., linear regression, logistic regression, or a neural network) to predict $Y_{ij}$ from $A_{ij}$ using data $\{a_{ij}, y_{ij}\}_{j=1}^m$ from unit $i$.
  2. Estimate Population Distributions: Use the estimated per-unit $Q$ variables $\{\hat{q}_i\}$ from all $n$ units to estimate population-level distributions, such as $p(q^{y|a} \mid q^a, z)$ or $p(z \mid q^a)$. This often involves fitting models where the parameters of the $Q$ distributions are themselves outcomes or predictors (e.g., using logistic regression to predict $Z_i$ from the parameters of $\hat{q}_i^a$).
  3. Plug into Identification Formula: Substitute these estimated components into the identification formula derived via do-calculus. This often involves averaging or integrating over the empirical distribution of estimated $Q$ variables, or over samples from their estimated population distribution.
  4. Hierarchical Bayesian Models: The estimation process naturally lends itself to hierarchical Bayesian modeling. For example, $q_i^{y|a}$ can be parameterized (e.g., by the coefficients of a linear model $\mu_i^{y|a}(a)$), and these parameters can be assumed to be drawn from a population distribution $p(\mu^{y|a} \mid u_i)$. This approach was used in the "Eight Schools" example.
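
Steps 1 and 2 can be sketched as follows. This is an illustrative linear-Gaussian working model for $\hat{q}_i^{y|a}$ (an assumption made here for concreteness, not a form prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 100

# Step 1: per-unit estimation of Q variables. For each unit, fit a linear
# model Y_ij ~ b0_i + b1_i * A_ij as a working parameterization of q_i^{y|a}.
per_unit_params = []
for _ in range(n):
    b0, b1 = rng.normal(0, 1), rng.normal(2, 0.5)   # true unit-level parameters
    a = rng.normal(size=m)
    y = b0 + b1 * a + rng.normal(0, 1, size=m)
    X = np.column_stack([np.ones(m), a])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # \hat{q}_i^{y|a} for unit i
    per_unit_params.append(beta_hat)

# Step 2: the fitted per-unit parameters are themselves data for
# population-level models; here we just summarize their mean.
per_unit_params = np.asarray(per_unit_params)
print(per_unit_params.mean(axis=0))   # approaches [0, 2] as n and m grow
```

In a richer pipeline the fitted $\hat{q}_i$ would feed into regressions such as $p(z \mid q^a)$, and finally into the plug-in identification formula of step 3.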

Illustrative Examples from the Paper:

The paper uses three running examples to illustrate the concepts:

  1. Confounder (Fig. 1b, \Cref{fig:hcm_confounder}): $U_i \to A_{ij}$, $U_i \to Y_{ij}$, $A_{ij} \to Y_{ij}$. The effect $E[Y \mid do(a = a_\star)]$ is identified by averaging per-school predictions: $\frac{1}{n} \sum_i \hat{E}[Y \mid A = a_\star, U = u_i]$. This is essentially stratifying by school, where each school has a fixed $U_i$.
    • Implementation: For each school $i$, fit a model $\hat{\mu}_i(a) \approx E[Y \mid A = a, U = u_i]$ using student data $\{A_{ij}, Y_{ij}\}$. The estimate is $\frac{1}{n} \sum_i \hat{\mu}_i(a_\star)$.
  2. Confounder with Interference (Fig. 2e, \Cref{fig:hcm_interfere}): $U_i \to A_{ij}$, $U_i \to Y_{ij}$, $A_{ij} \to Z_i$, $Z_i \to Y_{ij}$. Here, subunit treatments $A_{ij}$ influence a unit-level variable $Z_i$ (e.g., school-wide discussion), which in turn affects all subunit outcomes $Y_{ij}$. Identification involves a front-door adjustment via $Z_i$ in the collapsed model.
    • Implementation: Requires estimating $p(Z_i \mid Q_i^a)$ and $p(Q_i^{y|a} \mid Z_i, Q_i^a)$ from data, then applying the front-door formula.
  3. Instrument (Fig. 2i, \Cref{fig:hcm_instrument}): $U_i \to A_{ij}$, $U_i \to Y_i$, $Z_{ij} \to A_{ij}$, $A_{ij} \to Y_i$. Here, $Z_{ij}$ is a subunit-level instrument, and $Y_i$ is a unit-level outcome. Identification uses a backdoor adjustment in the collapsed/augmented/marginalized model.
    • Implementation: Estimate the per-unit $Q_i^{a|z}$ and $Q_i^a$. Estimate $p(Y_i \mid Q_i^a, Q_i^{a|z})$. Then apply the backdoor formula by averaging over the empirical distribution of $Q_i^{a|z}$.
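
The confounder example (example 1) can be demonstrated end to end. The sketch below, under an assumed linear-Gaussian data-generating process, simulates the graph $U_i \to A_{ij}$, $U_i \to Y_{ij}$, $A_{ij} \to Y_{ij}$ and compares the stratify-by-school estimator against a naive regression on pooled ("flat") data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 300, 200
tau = 1.0  # true effect of A on Y

mu_hats, all_a, all_y = [], [], []
for _ in range(n):
    u = rng.normal()                               # unobserved school-level confounder U_i
    a = u + rng.normal(size=m)                     # U_i -> A_ij
    y = tau * a + 2.0 * u + rng.normal(size=m)     # A_ij -> Y_ij, U_i -> Y_ij
    X = np.column_stack([np.ones(m), a])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)      # per-school model \hat{mu}_i(a)
    mu_hats.append(b)
    all_a.append(a); all_y.append(y)

a_star = 1.0
# Hierarchical estimate: average per-school predictions at a = a_star.
est = np.mean([b0 + b1 * a_star for b0, b1 in mu_hats])

# Naive flat regression on pooled student data is confounded by U_i.
a_flat, y_flat = np.concatenate(all_a), np.concatenate(all_y)
Xf = np.column_stack([np.ones_like(a_flat), a_flat])
bf, *_ = np.linalg.lstsq(Xf, y_flat, rcond=None)
naive = bf[0] + bf[1] * a_star

print(est, naive)  # est is close to tau * a_star = 1.0; naive is biased upward
```

Each per-school regression holds $U_i$ fixed implicitly, so averaging $\hat{\mu}_i(a_\star)$ over schools recovers the interventional mean, whereas the pooled slope absorbs the confounding path through $U_i$.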

Application: Eight Schools Reanalysis (\Cref{sec:eight_schools})

The authors reanalyze the classic "eight schools" dataset, where students (subunits) in schools (units) were randomized to a test preparation program ($A_{ij}$) and their SAT scores ($Y_{ij}$) were measured, controlling for pre-treatment scores ($X_{ij}$).

  • Standard Analysis as HCM: The standard hierarchical Bayesian analysis is shown to be equivalent to estimation under an HCM assuming $A_{ij}$ is randomized within schools, possibly with unobserved school-level confounders $U_i$ affecting $X_{ij}, A_{ij}, Y_{ij}$ (\Cref{fig:school_confound_hcm}). The estimand $E[Y \mid do(A=1)] - E[Y \mid do(A=0)]$ is identified.
  • Adding Interference: They extend the model to include potential interference: students' enrollment in the program ($A_{ij}$) might affect class size ($C_i$), which then affects scores ($Y_{ij}$) (\Cref{fig:school_interfere_hcm_0}).
    • The identification formula becomes more complex (\Cref{eqn:schools_interfere_id}), involving terms like $p(C_i \mid Q_i^a, S_i)$ (with student interest $S_i$) and $p(Q_i^{y|a} \mid Q_i^a, S_i, C_i)$.
    • Implementation: A hierarchical Bayesian model was built to estimate these components.
    • Result: While the posterior mean ATE was similar, accounting for interference substantially increased the uncertainty (a wider posterior) about the program's effectiveness.
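
The partial-pooling structure underlying the standard analysis can be sketched with the classic per-school summaries (Rubin's estimated coaching effects and standard errors). This is a minimal empirical-Bayes sketch at a fixed between-school scale $\tau$, not the paper's full hierarchical Bayesian model:

```python
import numpy as np

# Classic eight-schools summaries: estimated coaching effects and their
# standard errors, one per school (Rubin 1981).
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 24.])

def shrunken_effects(tau):
    # Normal-normal partial pooling at a fixed between-school sd tau:
    # each school's posterior mean is a precision-weighted average of its
    # own estimate and the precision-weighted grand mean.
    w = 1.0 / (sigma**2 + tau**2)
    mu_hat = np.sum(w * y) / np.sum(w)
    lam = tau**2 / (tau**2 + sigma**2)   # weight on the school's own data
    return lam * y + (1 - lam) * mu_hat

print(shrunken_effects(tau=5.0))   # per-school estimates pulled toward the grand mean
```

A full analysis would place a prior on $\tau$ (and, in the HCM reanalysis, on the interference components) rather than fixing it, but the shrinkage mechanics are the same.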

Theoretical Insights:

  • When Hierarchy Helps: Hierarchy (subunit-level data) primarily aids identification when the treatment variable is at the subunit level. It allows conditioning on (or marginalizing over) the unit, effectively controlling for unit-level unobserved confounders $U_i$.
    • Theorem 3 (\Cref{thm:sufficient_ID}) provides sufficient conditions: if $A$ is subunit-level, its effect is identifiable if there is no bi-directed path from $A$ to its direct unit descendants in the collapsed model, OR if $A$ has a subunit-level instrument.
  • When Hierarchy Doesn't Help (Much): If the treatment variable $A$ is at the unit level, disaggregating other variables into subunits generally does not help with identification if the effect was not already identifiable in the "erased inner plate" model (Theorem 4, \Cref{thm:unit-level-noID}).
  • Marginalization Rules (\Cref{appx:marginalization}): Unlike flat SCMs, where any variable with one child can be marginalized out, in HSCMs a unit-level variable with a subunit parent and a subunit child (an "interferer" like $Z_i$ in \Cref{fig:hcm_interfere}) cannot be marginalized out.

Practical Implications and Limitations:

  • Leveraging Modern Data: HCMs provide a formal framework for causal inference with increasingly available granular data (e.g., individual user data from apps, single-cell data in biology).
  • Nonparametric Nature: The identification results are largely nonparametric regarding the functional forms of causal mechanisms, making them applicable to complex, high-dimensional data.
  • Computational Cost: Estimation, especially using hierarchical Bayesian models with MCMC, can be computationally intensive, particularly with many units or complex per-unit models.
  • Assumptions:
    • Known Causal Graph: The method assumes the causal graph structure is known.
    • Mechanism Convergence: Unit-level mechanisms depending on subunit variables must converge as $m \to \infty$. This holds for mechanisms depending on averages of subunit features, but may not for unnormalized sums (\Cref{sec:exp-family-mechanisms}).
    • Large $n$, $m$: Theoretical guarantees (such as convergence to the collapsed model and learnability of $p(x^U, q(x^S))$) rely on large numbers of units ($n$) and subunits ($m$). Practical estimation with finite data involves standard statistical approximation errors.
    • Positivity: Standard positivity assumptions for do-calculus apply to the transformed flat model.
  • Completeness: The proposed identification procedure (collapsing, augmenting, marginalizing, then do-calculus) is not proven to be complete for HCMs. There might be identifiable effects that this procedure fails to identify because the collapsed model has inherent structural constraints not present in general flat CGMs.
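
The mechanism-convergence assumption listed above has a one-line numerical illustration: a unit-level mechanism that depends on the *average* of subunit variables stabilizes as $m$ grows, while one depending on the unnormalized *sum* does not.

```python
import numpy as np

rng = np.random.default_rng(3)

# As m grows, the subunit average converges (here to 2.0), so a mechanism
# taking the average as input converges too; the raw sum grows without
# bound, so a mechanism taking the sum as input does not.
for m in (10, 1000, 100000):
    a = rng.normal(loc=2.0, size=m)
    print(m, a.mean(), a.sum())
```

This is why collapsing replaces subunit inputs with their $Q$ distributions: any mechanism that converges can only depend on the limiting distribution, not on the raw (diverging) collection of subunit values.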

In summary, this paper offers a significant step towards principled causal inference from hierarchical data. It provides both a theoretical foundation for identification and practical guidance for estimation, bridging ideas from causal graphical models and hierarchical Bayesian statistics. The framework is particularly relevant given the rise of fine-grained, nested datasets across various scientific and engineering domains.