- The paper introduces a causal framework that models benchmark performance as a linear transformation of latent factors, revealing a three-node hierarchical structure.
- It employs Hierarchical Component Analysis (HCA) to decompose performance data across diverse base models while controlling for confounders.
- Experimental results on over 1500 models demonstrate distinct latent capabilities that inform effective fine-tuning strategies and benchmark design.
This paper introduces a causal representation learning framework to discover and understand the hierarchical latent capabilities of large language models (LMs). The core challenge addressed is the difficulty of faithfully evaluating LM capabilities due to complex confounding effects (especially from varying base models) and the high cost of retraining for controlled studies. The authors propose that observed benchmark performance can be modeled as a linear transformation of a few causally interrelated latent capability factors, once the base model is appropriately controlled as a common confounder.
The central methodology relies on two key hypotheses:
- Capability-Performance Invariance: A small set of latent capability factors consistently governs benchmark performance across diverse base models.
- Hierarchical Capability Structure: These capabilities are organized hierarchically (as a Directed Acyclic Graph - DAG) within any individual base model, where an edge A→B implies that interventions targeting capability A influence capability B.
The paper formalizes this using Pearl's structural causal model (SCM) framework, treating the base model as a shared latent parent influencing all capabilities and fine-tuning as an intervention on these latent factors.
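Concretely, the linear structural model can be sketched as follows (an illustrative reconstruction; the notation approximates, but may not exactly match, the paper's):

```latex
% Latent capabilities z follow a DAG with domain-specific weights W_k;
% observed benchmark scores x are a linear mixing G of the latents.
z^{(k)} = W_k\, z^{(k)} + \epsilon^{(k)}, \qquad x^{(k)} = G\, z^{(k)}
% Solving for the independent noise terms gives
(I - W_k)\, G^{-1}\, x^{(k)} = \epsilon^{(k)}
% so the ICA unmixing matrix factors as M_k = P_k B_k H, with
% B_k = I - W_k (lower-triangular in a topological order of the DAG)
% and H = G^{-1} the shared benchmark-to-capability unmixing.
```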
To identify these hierarchical latent capabilities, the authors propose Hierarchical Component Analysis (HCA). This algorithm leverages the heterogeneity across different base models (referred to as "domains") to recover the latent structure. The key steps of HCA are:
- ICA-based Unmixing: Apply Independent Component Analysis (ICA) separately to the benchmark performance data of each base model (domain). This yields an unmixing matrix M_k for each domain k, which maps the observed benchmark data x^(k) to independent source variables ε^(k), i.e., M_k x^(k) = ε^(k). Theoretically, M_k = P_k B_k H, where P_k is a permutation matrix, B_k is a lower-triangular matrix of domain-specific causal weights, and H is a common unmixing matrix tied to the shared latent capability structure.
- Row-Residual Extraction: The algorithm aims to recover B_k and H by identifying rows of H iteratively. For each component i, it computes the residual of projecting the i-th row of M_k^* (a permuted version of M_k) onto the span of its first i−1 rows. If the set of these residuals across all domains k is rank-1, it indicates a component of H.
- Permutation Alignment and Factor Refinement: Since ICA identifies components only up to permutation, HCA searches over permutations of the rows of M_k. For each permutation, it estimates H and refines B_k by minimizing ‖M'_k − B_k H‖_F², where M'_k is the permuted ICA unmixing matrix. The best set of permutations is chosen by minimizing the Maximum Inexactness Coefficient (MIC), which quantifies how much the recovered source variables deviate from true independence.
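The first step above can be sketched with scikit-learn's FastICA (a minimal sketch; the function name and data layout are assumptions, not the paper's released code):

```python
import numpy as np
from sklearn.decomposition import FastICA

def unmix_domains(domain_data, n_components):
    """Run ICA separately on each base-model domain.

    domain_data: list of arrays, each (n_models, n_benchmarks), holding
    benchmark scores for models fine-tuned from one base model.
    Returns one unmixing matrix M_k per domain, so that
    M_k @ x ~ independent sources (up to permutation and scale).
    """
    unmixing = []
    for X in domain_data:
        ica = FastICA(n_components=n_components, whiten="unit-variance",
                      random_state=0, max_iter=2000)
        ica.fit(X)
        # components_ maps observations to sources: s = components_ @ x
        unmixing.append(ica.components_)
    return unmixing
```

The per-domain unmixing matrices are the raw input to the row-residual and permutation-alignment steps that follow.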
The HCA algorithm is applied to data from the Open LLM Leaderboard, encompassing over 1500 models evaluated on six benchmarks. The analysis focuses on models fine-tuned from four base models (Qwen2.5-7B, Qwen2.5-14B, Llama-3-8B, Llama-3.1-8B) that exhibit similar principal component subspaces in their performance data, suggesting a shared underlying structure.
The experiments reveal a concise three-node linear causal structure (z_1 → z_2 → z_3) that reliably explains observed performance variations, with a low MIC of 0.04.
The latent capabilities are interpreted as:
- z_1 (Foundational General Capability): Correlates strongly with general-reasoning benchmarks like MMLU-Pro and BIG-Bench-Hard (BBH). This capability is more influenced by pre-training compute (FLOPs).
- z_2 (Instruction Following): Correlates strongly with IFEval. Interventions like instruction tuning primarily affect this capability, with minimal changes to z_1.
- z_3 (Advanced Mathematical Reasoning): Correlates strongly with MATH Lvl 5. This capability is causally influenced by z_2, as mathematical tasks often require precise instruction adherence.
Implementation and Application Insights:
- Controlling for Base Models: The study underscores the critical importance of accounting for the base model when evaluating LLMs. The heterogeneity in performance patterns across different base models necessitates this control for accurate causal discovery.
- Practical Tip: When evaluating fine-tuning strategies, report results specific to base models or use methods to adjust for base model effects.
- Matrix Completion: The observation of domain-specific low-rank structures can improve the imputation of missing benchmark scores. Applying matrix completion with nuclear norm regularization within a specific base model's domain yields lower reconstruction error than applying it to the entire leaderboard.
- Application: More accurately fill in missing data on leaderboards by performing matrix completion locally within model families.
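A minimal domain-local completion can use the Soft-Impute style of nuclear-norm regularization via SVD soft-thresholding (a sketch under assumed shapes and a hypothetical `lam` parameter; the paper's exact solver is not reproduced here):

```python
import numpy as np

def soft_impute(X, mask, lam=1.0, n_iters=200):
    """Nuclear-norm-regularized matrix completion (Soft-Impute style).

    X: (models, benchmarks) score matrix, NaN where unobserved.
    mask: boolean array, True for observed entries.
    lam: soft-threshold on singular values (hypothetical default).
    Applied per base-model domain rather than to the full leaderboard.
    """
    Z = np.where(mask, np.nan_to_num(X), 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        # shrink singular values toward a low-rank estimate
        Z_low = (U * np.maximum(s - lam, 0.0)) @ Vt
        # keep observed entries fixed, impute only the missing ones
        Z = np.where(mask, np.nan_to_num(X), Z_low)
    return Z
```

Running this once per base-model family, instead of on the whole leaderboard, mirrors the domain-specific low-rank observation above.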
- HCA Algorithm:
- Input: Benchmark performance data for multiple models, grouped by their base model (domains).
- Output: A shared unmixing matrix H (mapping benchmarks to latent capabilities), domain-specific causal weight matrices B_k, and the inferred causal graph among latent capabilities.
- Steps (Simplified):
- 1. Perform ICA on each domain's data to get M_k.
- 2. Iterate through permutations of M_k's rows.
- 3. For each permutation, iteratively extract orthogonalized components to form H.
- 4. Estimate B_k for each domain.
- 5. Calculate MIC; select the permutation yielding the lowest MIC.
- Computational Cost: The permutation search can be expensive if the number of latent capabilities d_0 is large. For d_0 = 3, as in the paper, it is manageable.
- Code Availability: The paper mentions code release, which would be crucial for practical implementation.
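The simplified steps above can be sketched as a brute-force search (a schematic stand-in, not the released implementation; it scores permutations by the Frobenius residual of ‖M'_k − B_k H‖ rather than the paper's MIC):

```python
import itertools
import numpy as np

def fit_B_H(Ms_perm, d0, n_iters=50):
    """Alternating least squares for min sum_k ||M'_k - B_k H||_F^2
    with each B_k constrained lower-triangular (schematic refinement)."""
    K = len(Ms_perm)
    H = Ms_perm[0].copy()
    Bs = [np.eye(d0) for _ in range(K)]
    for _ in range(n_iters):
        # update each B_k row-wise on its lower-triangular support
        for k in range(K):
            for i in range(d0):
                Hi = H[: i + 1]  # row i of B_k may only use rows 0..i of H
                coef, *_ = np.linalg.lstsq(Hi.T, Ms_perm[k][i], rcond=None)
                Bs[k][i, : i + 1] = coef
        # update H given all B_k via one stacked least-squares solve
        H, *_ = np.linalg.lstsq(np.vstack(Bs), np.vstack(Ms_perm), rcond=None)
    resid = sum(np.linalg.norm(Ms_perm[k] - Bs[k] @ H) ** 2 for k in range(K))
    return H, Bs, resid

def hca_search(Ms, d0):
    """Brute-force the row permutations of each domain's unmixing matrix
    and keep the combination with the smallest fitting residual."""
    best = None
    perms_per_domain = itertools.permutations(range(d0))
    for perms in itertools.product(list(perms_per_domain), repeat=len(Ms)):
        Ms_perm = [M[list(p)] for M, p in zip(Ms, perms)]
        H, Bs, resid = fit_B_H(Ms_perm, d0)
        if best is None or resid < best[0]:
            best = (resid, H, Bs, perms)
    return best
```

The search cost grows as (d_0!)^K over K domains, which is why small d_0 (three, in the paper) keeps it tractable.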
- Interpreting Latent Factors: After identifying latent factors z_i, their semantic meaning is established by:
- Examining the unmixing matrix H: which benchmarks load heavily onto which z_i?
- Correlating z_i values with specific benchmark scores.
- Observing the effects of targeted fine-tuning (e.g., IFEval SFT for z_2) on benchmark scores and on the other z_j.
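The interpretation steps above can be sketched as follows (a hypothetical helper; names and shapes are assumptions, not the paper's API):

```python
import numpy as np

def latent_loadings(H, X, benchmark_names):
    """Correlate recovered latents with benchmarks to aid interpretation.

    H: (d0, n_benchmarks) shared unmixing matrix.
    X: (n_models, n_benchmarks) benchmark score matrix.
    Returns, for each latent z_i, its Pearson correlation with every
    benchmark column, which shows where each latent loads heavily.
    """
    Z = X @ H.T                       # latent values per model
    Zc = (Z - Z.mean(0)) / Z.std(0)   # standardize latents
    Xc = (X - X.mean(0)) / X.std(0)   # standardize benchmarks
    corr = Zc.T @ Xc / len(X)
    return {f"z{i+1}": dict(zip(benchmark_names, np.round(corr[i], 3)))
            for i in range(H.shape[0])}
```

A latent whose strongest correlations concentrate on, say, IFEval would be a candidate instruction-following factor in the paper's sense.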
Fine-tuning Strategy:
- The discovered hierarchy z_1 → z_2 → z_3 suggests a development path:
- Focus on scaling pre-training FLOPs to improve z_1 (general capability), as gains here can cascade.
- For capabilities like z_2 (instruction following), which are less correlated with FLOPs and have higher noise-factor variances, targeted post-training interventions are effective.
- Improving z_2 can subsequently boost z_3 (math reasoning).
Benchmark Design:
- Prioritize benchmarks evaluating general, foundational capabilities (z_1-aligned), as they reflect more substantive improvements.
- Be aware that gains on specialized benchmarks (e.g., MATH) might partly stem from improvements in upstream capabilities (e.g., instruction-following).
Limitations and Future Work:
- The identifiability of HCA relies on assumptions (e.g., linear SCM, sufficient heterogeneity across domains).
- The MIC provides a quantitative measure of inexactness, but the model is still an approximation.
- Interpreting and intervening on latent factors remains a general challenge in causal representation learning (CRL).
- The analysis was performed on a specific set of benchmarks and base models; generalizability to other setups needs further investigation.
- The paper suggests using more advanced causal inference tools (matching, stratification, doubly robust estimation) for future insights.
The paper provides a novel framework for understanding LM capabilities not just as a flat list of scores, but as an interconnected, hierarchical system. This causal perspective offers actionable insights for model development, evaluation, and fine-tuning by revealing how different abilities build upon each other.