- The paper introduces a causal framework that models benchmark performance as a linear transformation of latent factors, revealing a three-node hierarchical structure.
- It employs Hierarchical Component Analysis (HCA) to decompose performance data across diverse base models while controlling for confounders.
- Experimental results on over 1500 models demonstrate distinct latent capabilities that inform effective fine-tuning strategies and benchmark design.
This paper introduces a causal representation learning framework to discover and understand the hierarchical latent capabilities of large language models (LMs). The core challenge addressed is the difficulty of faithfully evaluating LM capabilities due to complex confounding effects (especially from varying base models) and the high cost of retraining for controlled studies. The authors propose that observed benchmark performance can be modeled as a linear transformation of a few causally interrelated latent capability factors, once the base model is appropriately controlled as a common confounder.
The central methodology relies on two key hypotheses:
- Capability-Performance Invariance: A small set of latent capability factors consistently governs benchmark performance across diverse base models.
- Hierarchical Capability Structure: These capabilities are organized hierarchically (as a Directed Acyclic Graph - DAG) within any individual base model, where an edge A→B implies that interventions targeting capability A influence capability B.
The paper formalizes this using Pearl's structural causal model (SCM) framework, treating the base model as a shared latent parent influencing all capabilities and fine-tuning as an intervention on these latent factors.
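Concretely, the linear structural model can be sketched as follows (an illustrative reconstruction; the notation approximates, but may not exactly match, the paper's):

```latex
% Latent capabilities z follow a DAG with domain-specific weights W_k;
% observed benchmark scores x are a linear mixing G of the latents.
z^{(k)} = W_k\, z^{(k)} + \epsilon^{(k)}, \qquad x^{(k)} = G\, z^{(k)}
% Solving for the independent noise terms gives
(I - W_k)\, G^{-1}\, x^{(k)} = \epsilon^{(k)}
% so the ICA unmixing matrix factors as M_k = P_k B_k H, with
% B_k = I - W_k (lower-triangular in a topological order of the DAG)
% and H = G^{-1} the shared benchmark-to-capability unmixing.
```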
To identify these hierarchical latent capabilities, the authors propose Hierarchical Component Analysis (HCA). This algorithm leverages the heterogeneity across different base models (referred to as "domains") to recover the latent structure. The key steps of HCA are:
- ICA-based Unmixing: Apply Independent Component Analysis (ICA) separately to the benchmark performance data of each base model (domain). This yields an unmixing matrix M_k for each domain k, which maps the observed benchmark data x^(k) to independent source variables ε^(k), i.e., M_k x^(k) = ε^(k). Theoretically, M_k = P_k B_k H, where P_k is a permutation matrix, B_k is a lower-triangular matrix of domain-specific causal weights, and H is a common unmixing matrix tied to the shared latent capability structure.
- Row-Residual Extraction: The algorithm aims to recover B_k and H by identifying rows of H iteratively. For each component i, it computes the residual of projecting the i-th row of M_k^* (a permuted version of M_k) onto the span of its first i−1 rows. If the set of these residuals across all domains k is rank-1, it indicates a component of H.
- Permutation Alignment and Factor Refinement: Since ICA identifies components only up to permutation, HCA searches over permutations of the rows of M_k. For each permutation, it estimates H and refines B_k by minimizing ‖M'_k − B_k H‖_F², where M'_k is the permuted ICA unmixing matrix. The best set of permutations is chosen by minimizing the Maximum Inexactness Coefficient (MIC), which quantifies how much the recovered source variables deviate from true independence.
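The first step above can be sketched with scikit-learn's FastICA (a minimal sketch; the function name and data layout are assumptions, not the paper's released code):

```python
import numpy as np
from sklearn.decomposition import FastICA

def unmix_domains(domain_data, n_components):
    """Run ICA separately on each base-model domain.

    domain_data: list of arrays, each (n_models, n_benchmarks), holding
    benchmark scores for models fine-tuned from one base model.
    Returns one unmixing matrix M_k per domain, so that
    M_k @ x ~ independent sources (up to permutation and scale).
    """
    unmixing = []
    for X in domain_data:
        ica = FastICA(n_components=n_components, whiten="unit-variance",
                      random_state=0, max_iter=2000)
        ica.fit(X)
        # components_ maps observations to sources: s = components_ @ x
        unmixing.append(ica.components_)
    return unmixing
```

The per-domain unmixing matrices are the raw input to the row-residual and permutation-alignment steps that follow.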
The HCA algorithm is applied to data from the Open LLM Leaderboard, encompassing over 1500 models evaluated on six benchmarks. The analysis focuses on models fine-tuned from four base models (Qwen2.5-7B, Qwen2.5-14B, Llama-3-8B, Llama-3.1-8B) that exhibit similar principal component subspaces in their performance data, suggesting a shared underlying structure.
The experiments reveal a concise three-node linear causal structure (z_1 → z_2 → z_3) that reliably explains observed performance variations, with a low MIC of 0.04.
The latent capabilities are interpreted as:
- z_1 (Foundational General Capability): Correlates strongly with general-reasoning benchmarks like MMLU-Pro and BIG-Bench-Hard (BBH). This capability is more influenced by pre-training compute (FLOPs).
- z_2 (Instruction Following): Correlates strongly with IFEval. Interventions like instruction tuning primarily affect this capability, with minimal changes to z_1.
- z_3 (Advanced Mathematical Reasoning): Correlates strongly with MATH Lvl 5. This capability is causally influenced by z_2, as mathematical tasks often require precise instruction adherence.
Implementation and Application Insights:
- Controlling for Base Models: The study underscores the critical importance of accounting for the base model when evaluating LLMs. The heterogeneity in performance patterns across different base models necessitates this control for accurate causal discovery.
- Practical Tip: When evaluating fine-tuning strategies, report results specific to base models or use methods to adjust for base model effects.
- Matrix Completion: The observation of domain-specific low-rank structures can improve the imputation of missing benchmark scores. Applying matrix completion with nuclear norm regularization within a specific base model's domain yields lower reconstruction error than applying it to the entire leaderboard.
- Application: More accurately fill in missing data on leaderboards by performing matrix completion locally within model families.
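A minimal domain-local completion can use the Soft-Impute style of nuclear-norm regularization via SVD soft-thresholding (a sketch under assumed shapes and a hypothetical `lam` parameter; the paper's exact solver is not reproduced here):

```python
import numpy as np

def soft_impute(X, mask, lam=1.0, n_iters=200):
    """Nuclear-norm-regularized matrix completion (Soft-Impute style).

    X: (models, benchmarks) score matrix, NaN where unobserved.
    mask: boolean array, True for observed entries.
    lam: soft-threshold on singular values (hypothetical default).
    Applied per base-model domain rather than to the full leaderboard.
    """
    Z = np.where(mask, np.nan_to_num(X), 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        # shrink singular values toward a low-rank estimate
        Z_low = (U * np.maximum(s - lam, 0.0)) @ Vt
        # keep observed entries fixed, impute only the missing ones
        Z = np.where(mask, np.nan_to_num(X), Z_low)
    return Z
```

Running this once per base-model family, instead of on the whole leaderboard, mirrors the domain-specific low-rank observation above.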
- HCA Algorithm:
- Input: Benchmark performance data for multiple models, grouped by their base model (domains).
- Output: A shared unmixing matrix H (mapping benchmarks to latent capabilities), domain-specific causal weight matrices B_k, and the inferred causal graph among latent capabilities.
- Steps (Simplified):
- 1. Perform ICA on each domain's data to get M_k.
- 2. Iterate through permutations of M_k's rows.
- 3. For each permutation, iteratively extract orthogonalized components to form H.
- 4. Estimate B_k for each domain.
- 5. Calculate MIC; select the permutation yielding the lowest MIC.
- Computational Cost: The permutation search can be expensive if the number of latent capabilities d_0 is large. For d_0 = 3, as in the paper, it is manageable.
- Code Availability: The paper mentions code release, which would be crucial for practical implementation.
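The simplified steps above can be sketched as a brute-force search (a schematic stand-in, not the released implementation; it scores permutations by the Frobenius residual of ‖M'_k − B_k H‖ rather than the paper's MIC):

```python
import itertools
import numpy as np

def fit_B_H(Ms_perm, d0, n_iters=50):
    """Alternating least squares for min sum_k ||M'_k - B_k H||_F^2
    with each B_k constrained lower-triangular (schematic refinement)."""
    K = len(Ms_perm)
    H = Ms_perm[0].copy()
    Bs = [np.eye(d0) for _ in range(K)]
    for _ in range(n_iters):
        # update each B_k row-wise on its lower-triangular support
        for k in range(K):
            for i in range(d0):
                Hi = H[: i + 1]  # row i of B_k may only use rows 0..i of H
                coef, *_ = np.linalg.lstsq(Hi.T, Ms_perm[k][i], rcond=None)
                Bs[k][i, : i + 1] = coef
        # update H given all B_k via one stacked least-squares solve
        H, *_ = np.linalg.lstsq(np.vstack(Bs), np.vstack(Ms_perm), rcond=None)
    resid = sum(np.linalg.norm(Ms_perm[k] - Bs[k] @ H) ** 2 for k in range(K))
    return H, Bs, resid

def hca_search(Ms, d0):
    """Brute-force the row permutations of each domain's unmixing matrix
    and keep the combination with the smallest fitting residual."""
    best = None
    perms_per_domain = itertools.permutations(range(d0))
    for perms in itertools.product(list(perms_per_domain), repeat=len(Ms)):
        Ms_perm = [M[list(p)] for M, p in zip(Ms, perms)]
        H, Bs, resid = fit_B_H(Ms_perm, d0)
        if best is None or resid < best[0]:
            best = (resid, H, Bs, perms)
    return best
```

The search cost grows as (d_0!)^K over K domains, which is why small d_0 (three, in the paper) keeps it tractable.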
- Interpreting Latent Factors: After identifying latent factors z_i, their semantic meaning is established by:
- Examining the unmixing matrix H: which benchmarks load heavily onto which z_i?
- Correlating z_i values with specific benchmark scores.
- Observing the effects of targeted fine-tuning (e.g., IFEval SFT for z_2) on benchmark scores and on the other z_j.
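The interpretation steps above can be sketched as follows (a hypothetical helper; names and shapes are assumptions, not the paper's API):

```python
import numpy as np

def latent_loadings(H, X, benchmark_names):
    """Correlate recovered latents with benchmarks to aid interpretation.

    H: (d0, n_benchmarks) shared unmixing matrix.
    X: (n_models, n_benchmarks) benchmark score matrix.
    Returns, for each latent z_i, its Pearson correlation with every
    benchmark column, which shows where each latent loads heavily.
    """
    Z = X @ H.T                       # latent values per model
    Zc = (Z - Z.mean(0)) / Z.std(0)   # standardize latents
    Xc = (X - X.mean(0)) / X.std(0)   # standardize benchmarks
    corr = Zc.T @ Xc / len(X)
    return {f"z{i+1}": dict(zip(benchmark_names, np.round(corr[i], 3)))
            for i in range(H.shape[0])}
```

A latent whose strongest correlations concentrate on, say, IFEval would be a candidate instruction-following factor in the paper's sense.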
Fine-tuning Strategy:
- The discovered hierarchy z_1 → z_2 → z_3 suggests a development path:
- Focus on scaling pre-training FLOPs to improve z_1 (general capability), as gains here can cascade.
- For capabilities like z_2 (instruction following), which are less correlated with FLOPs and have higher noise-factor variances, targeted post-training interventions are effective.
- Improving z_2 can subsequently boost z_3 (math reasoning).
Benchmark Design:
- Prioritize benchmarks evaluating general, foundational capabilities (z_1-aligned), as they reflect more substantive improvements.
- Be aware that gains on specialized benchmarks (e.g., MATH) might partly stem from improvements in upstream capabilities (e.g., instruction-following).
Limitations and Future Work:
- The identifiability of HCA relies on assumptions (e.g., linear SCM, sufficient heterogeneity across domains).
- The MIC provides a quantitative measure of inexactness, but the model is still an approximation.
- Interpreting and intervening on latent factors remains a general challenge in causal representation learning (CRL).
- The analysis was performed on a specific set of benchmarks and base models; generalizability to other setups needs further investigation.
- The paper suggests using more advanced causal inference tools (matching, stratification, doubly robust estimation) for future insights.
The paper provides a novel framework for understanding LM capabilities not just as a flat list of scores, but as an interconnected, hierarchical system. This causal perspective offers actionable insights for model development, evaluation, and fine-tuning by revealing how different abilities build upon each other.