Latent Variable Greedy Equivalence Search (LGES)
- Latent Variable Greedy Equivalence Search is a score-based algorithm designed to recover causal structures in partially observed linear models by decomposing covariance into sparse and low-rank components.
- It employs specialized operators for latent-observed edge deletion and latent structure refinement, ensuring efficient exploration of the Markov equivalence class.
- Empirical results show that LGES achieves robust F1 scores, lower structural Hamming distance (SHD), and interpretable latent factor recovery under realistic Generalized N Factor Model (GNFM) assumptions.
Latent Variable Greedy Equivalence Search (LGES) refers to a class of score-based greedy search algorithms designed for structure identification in causal graphical models, particularly in the presence of latent (unobserved) variables. LGES generalizes classical Greedy Equivalence Search (GES) to partially observed systems, aiming to recover the Markov equivalence class of models that reproduce the observed joint statistics, with rigorous identifiability guarantees. Recent developments have established the first globally consistent score-based framework for latent variable recovery in linear structural models, notably under the Generalized N Factor Model assumption (Dong et al., 5 Oct 2025).
1. Formal Definition and Scope
LGES is a score-based greedy search algorithm over the space of partially observed linear causal models. Given samples of the observed variables (and possibly associated covariates), where the underlying data-generating process includes unobserved latent variables, LGES seeks to identify the optimal graphical structure, including both measurement and latent variable relationships, up to Markov equivalence. The algorithm is applicable even when no prior information (e.g., the number or locations of latents) is available. Its identifiability guarantees rely on the ability to decompose the covariance structure of the observed variables into direct (sparse) and latent (low-rank or factor-structured) contributions.
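The sparse-plus-low-rank decomposition can be illustrated on a toy factor model; the loadings `Lambda` and error variances `omega` below are arbitrary illustrative values, not parameters from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: 6 observed variables loading on 2 independent latent factors,
# X = Lambda @ L + eps.  (Illustrative values, not from the paper.)
n_obs, n_lat = 6, 2
Lambda = rng.normal(size=(n_obs, n_lat))    # latent-to-observed loadings
omega = rng.uniform(0.5, 1.5, size=n_obs)   # independent error variances

low_rank = Lambda @ Lambda.T                # latent-mediated contribution
Sigma = low_rank + np.diag(omega)           # observed covariance

# The latent contribution has rank equal to the number of latents,
# while the remainder is diagonal (the "sparse" direct part).
assert np.linalg.matrix_rank(low_rank) == n_lat
assert np.allclose(Sigma - low_rank, np.diag(omega))
```

The rank of the latent contribution bounds the number of latent factors, which is what makes the decomposition informative for structure recovery.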
2. Generalized N Factor Model
Central to the identifiability and effectiveness of LGES is the Generalized N Factor Model (GNFM) framework (Dong et al., 5 Oct 2025):
- The latent variable model is defined as a directed acyclic graph (DAG) over observed and latent variables, with mutually nonadjacent latent variables grouped into sets $\mathbf{L}_1, \dots, \mathbf{L}_k$.
- For each latent group $\mathbf{L}_i$, there exist sufficiently many observed "effect" variables whose parent set is exactly $\mathbf{L}_i$.
- Latent group membership propagates required equality constraints to the observed covariance matrix, enabling algebraic identifiability.
- If any variable, observed or latent, is causally related to a latent variable in $\mathbf{L}_i$, it must have the same relation to every member of $\mathbf{L}_i$.
The GNFM extends classical one-factor models, and, under ML score maximization together with minimum dimension regularization, LGES provably recovers the correct Markov equivalence class in the sample limit.
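In the classical one-factor special case that the GNFM generalizes, the induced covariance equality constraints are the well-known tetrad constraints; a minimal numerical check, using hypothetical loadings:

```python
import numpy as np

# One-factor model: four observed "effect" variables sharing a single
# latent parent L, X_i = lam_i * L + eps_i.  (Hypothetical values.)
lam = np.array([0.8, 1.2, 0.6, 1.5])      # loadings on L
omega = np.array([1.0, 0.7, 1.3, 0.9])    # error variances

# Population covariance: sigma_ij = lam_i * lam_j for i != j.
Sigma = np.outer(lam, lam) + np.diag(omega)

# Tetrad (rank-one) equality constraint on the observed covariance:
tetrad = Sigma[0, 1] * Sigma[2, 3] - Sigma[0, 2] * Sigma[1, 3]
assert abs(tetrad) < 1e-12   # vanishes exactly in the population
```

Constraints of this kind are what a correctly specified graph must impose on the observed covariance, and what the score-based search implicitly tests.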
3. Algorithmic Framework
LGES operates in two core phases (Dong et al., 5 Oct 2025):
- Latent-Observed Edge Deletion: Initialization with a "supergraph" state containing all putative latent variables, each posited to cause all observed variables. Edge deletions from latent to observed nodes are greedily proposed and accepted if they do not degrade the maximum likelihood (ML) score by more than a pre-specified tolerance $\epsilon$.
- Latent Structure Refinement: Once a minimal latent-observed structure is inferred, additional edge deletions or orientations among latents are performed, subject again to score preservation. This phase eliminates unnecessary latent-latent connections, refining the structure within the Markov equivalence class.
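The two phases can be sketched schematically; the `score` callable, the edge representation, and the `epsilon` handling below are illustrative placeholders, not the paper's implementation:

```python
# Schematic sketch of the greedy two-phase search described above.
# `score` and `epsilon` are placeholders for the ML score and tolerance.

def lges_sketch(data, latents, observed, score, epsilon):
    # Phase 0: supergraph state -- every putative latent causes every observed.
    edges = {(l, o) for l in latents for o in observed}
    best = score(edges, data)

    # Phase 1: greedily delete latent-observed edges that do not
    # degrade the score by more than epsilon.
    for edge in sorted(edges):
        candidate = edges - {edge}
        s = score(candidate, data)
        if s >= best - epsilon:
            edges, best = candidate, max(best, s)

    # Phase 2 (latent structure refinement) would analogously delete or
    # orient latent-latent edges under the same score-preservation rule.
    return edges
```

With a toy score that simply penalizes edge count, e.g. `lges_sketch(None, ["L1"], ["X1", "X2"], lambda e, d: -len(e), 0.0)`, every deletion is accepted and the sketch returns the empty edge set.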
At all steps, the state is represented as a CPDAG (Completed Partially Directed Acyclic Graph) to efficiently encode equivalence classes. Two operator types navigate the search space:
- Latent-observed deletion: removes all edges from a latent set $\mathbf{L}$ to an observed set $\mathbf{O}$ in the current CPDAG $\mathcal{C}$.
- Latent-latent deletion: removes all edges between latent groups $\mathbf{L}_i$ and $\mathbf{L}_j$, with additional orientations toward a helper set as required by the GNFM.
The scoring function is the profiled Gaussian log-likelihood of the hypothesized graph $\mathcal{G}$,

$$\mathcal{S}(\mathcal{G}) = \max_{B,\, \Omega} \; \ell_n\big(\Sigma(B, \Omega)\big), \qquad \Sigma(B, \Omega) = (I - B)^{-1}\, \Omega\, (I - B)^{-\top},$$

where $B$ encodes structural coefficients and $\Omega$ encodes error variances, both constrained to the pattern permitted by $\mathcal{G}$.
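The ML score for a linear Gaussian SEM can be sketched as follows; this is the standard parameterization for such models and may differ in detail from the paper's, and the example coefficients are hypothetical:

```python
import numpy as np

def implied_covariance(B, Omega):
    """Covariance implied by the linear SEM X = B X + eps, Cov(eps) = Omega."""
    A = np.linalg.inv(np.eye(B.shape[0]) - B)
    return A @ Omega @ A.T

def gaussian_loglik(S, Sigma, n):
    """Gaussian log-likelihood of sample covariance S under model covariance
    Sigma, up to an additive constant."""
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * (logdet + np.trace(np.linalg.solve(Sigma, S)))

# Example: X2 <- X1 with coefficient 0.7 (hypothetical values).
B = np.array([[0.0, 0.0],
              [0.7, 0.0]])
Omega = np.diag([1.0, 0.5])
Sigma = implied_covariance(B, Omega)   # [[1.0, 0.7], [0.7, 0.99]]
```

By the Gaussian MLE property, `gaussian_loglik(S, Sigma, n)` is maximized over `Sigma` at `Sigma = S`, which is what the inner maximization over the free parameters of a graph exploits.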
4. Identifiability and Global Consistency
Under mild graphical assumptions enforced by the GNFM (specifically, sufficient observed coverage for each latent group), the score-based greedy search attains global consistency:
- In the large-sample regime, the algorithm selects the minimum-dimension graph maximizing ML score.
- The selected graph imposes the same set of covariance equality constraints as the true data-generating graph.
- For GNFM classes, algebraic equivalence implies recovery of the full Markov Equivalence Class (MEC): all features inferable from the observational distribution are reconstructed.
- The tolerance parameter $\epsilon$ is chosen as a decreasing function of the sample size $n$, analogously to the BIC penalty in classical GES.
5. Operator Properties and Search Efficiency
LGES leverages tailored operators for efficient navigation:

| Operator | Purpose | Acceptance Criterion |
|:-----------------|:------------------------------------------|:----------------------------------------|
| Latent-observed deletion | Delete latent-observed edges | ML score does not degrade by more than $\epsilon$ |
| Latent-latent deletion | Delete and orient latent-latent edges | ML score does not degrade by more than $\epsilon$ |
Each operator maintains the current CPDAG as a supergraph of the true structure, thereby avoiding "over-pruning" and facilitating parallelized evaluation. Deletions monotonically reduce free parameters, leading to efficient exploration even in high-dimensional settings.
6. Evaluation and Applications
Empirical studies (Dong et al., 5 Oct 2025) demonstrate:
- LGES achieves superior F1 scores and lower SHD compared to constraint-based approaches (FOFC, GIN, RLCD) on synthetic data matching GNFM assumptions.
- Robust performance under misspecification (non-Gaussian noise, mild non-linearity) owing to reliance on covariance constraints.
- Inference of interpretable latent structures in real-world datasets, such as personality, burnout, and multitasking behavior, with well-calibrated model fit metrics (RMSEA, CFI, TLI). Extracted latent variables correspond to established psychological factors and reveal item-level cross-loadings.
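The evaluation metrics are standard; a minimal sketch of SHD and edge-level F1 on adjacency matrices, with hypothetical helper names:

```python
import numpy as np

def shd(A_true, A_est):
    """Structural Hamming distance: number of variable pairs whose edge
    status (absent / i->j / j->i) differs between the two graphs."""
    d, p = 0, A_true.shape[0]
    for i in range(p):
        for j in range(i + 1, p):
            if (A_true[i, j], A_true[j, i]) != (A_est[i, j], A_est[j, i]):
                d += 1
    return d

def edge_f1(A_true, A_est):
    """F1 score over directed edges (1 = edge present in adjacency matrix)."""
    tp = np.sum((A_true == 1) & (A_est == 1))
    fp = np.sum((A_true == 0) & (A_est == 1))
    fn = np.sum((A_true == 1) & (A_est == 0))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, comparing a chain `X1 -> X2 -> X3` against its fully reversed version yields an SHD of 2 (both edges flipped) and a directed-edge F1 of 0.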
7. Relation to Broader Latent Variable Model Selection
LGES extends the scope and guarantees of previous latent variable graphical model selection frameworks (Chandrasekaran et al., 2010, Frot et al., 2015) by explicitly handling partially observed systems through direct score-based search. The method incorporates high-dimensional consistency, convex optimization insights, and geometric conditions for identifiability—ensuring unique decomposition of observed covariances into sparse and low-rank components according to tangent space transversality. This synthesis places LGES at the intersection of score-based equivalence search, algebraic latent structure disentanglement, and practical structure discovery in the presence of hidden variables.
In summary, Latent Variable Greedy Equivalence Search (LGES) provides a consistent, principled, and scalable framework for causal structure learning in the presence of latent variables. Its design leverages global covariance constraints, specialized operator definitions, and careful regularization to achieve algebraic and Markov equivalence recovery under realistic graphical assumptions, marking a significant advance for latent variable discovery in empirical sciences.