Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs

Published 28 Jan 2026 in cs.LG | (2601.20420v1)

Abstract: Developing human-understandable interpretations of LLMs is becoming increasingly critical for their deployment in essential domains. Mechanistic interpretability seeks to address this by extracting human-interpretable processes and concepts from LLMs' activations. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts by decomposing the LLM internal representations into a dictionary. Despite their empirical progress, SAEs suffer from a fundamental theoretical ambiguity: the correspondence between LLM representations and human-interpretable concepts is not well defined. This lack of theoretical grounding gives rise to several methodological challenges, including difficulties in principled method design and evaluation criteria. In this work, we show that, under mild assumptions, LLM representations can be approximated as a linear mixture of the log-posteriors over concepts given the input context, through the lens of a latent variable model where concepts are treated as latent variables. This motivates a principled framework for concept extraction, namely Concept Component Analysis (ConCA), which aims to recover the log-posterior of each concept from LLM representations through an unsupervised linear unmixing process. We explore a specific variant, termed sparse ConCA, which leverages a sparsity prior to address the inherent ill-posedness of the unmixing problem. We implement 12 sparse ConCA variants and demonstrate their ability to extract meaningful concepts across multiple LLMs, offering theory-backed advantages over SAEs.

Summary

The paper presents Concept Component Analysis (ConCA) to recover log-posteriors representing latent concepts in LLMs, addressing limitations of SAEs.
It employs a latent variable model with sparsity priors and regularization techniques like LayerNorm and GroupNorm for stable feature extraction.
Empirical evaluations show ConCA achieves higher Pearson correlation, lower MSE, and robust performance across diverse tasks.

The paper "Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs" (2601.20420) introduces Concept Component Analysis (ConCA), a framework for extracting interpretable concepts from LLMs. It addresses the lack of theoretical grounding in existing sparse autoencoders (SAEs) and proposes a method for concept extraction that is derived from an explicit latent variable model.

Introduction

Deploying LLMs in critical applications calls for mechanistic interpretability, that is, for extracting human-understandable concepts from LLM representations. Sparse autoencoders (SAEs), commonly used for this purpose, have been empirically successful but leave the correspondence between LLM internal representations and human-interpretable concepts theoretically ambiguous. This paper proposes Concept Component Analysis (ConCA), a framework that uses a latent variable model to approximate LLM representations as a linear mixture of log-posteriors over latent concepts. The aim is to recover these log-posteriors in an unsupervised manner, which gives the method a theoretical grounding that SAE methods lack.

Theoretical Foundations

The paper establishes a theoretical foundation by introducing a latent variable model where text data generation is governed by discrete latent variables representing human-interpretable concepts. Under certain mild assumptions, it is shown that LLM representations can be expressed as a linear mixture of log-posteriors of these latent concepts. This insight motivates ConCA, which is designed to invert this linear mixture, aiming at recovering the log-posterior of each concept in an unsupervised manner.
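The linear-mixture claim can be written schematically as follows. This is only an illustrative sketch: the symbols $f$, $\mathbf{A}$, $\mathbf{W}$, $\mathbf{s}$, and $k$ are notation assumed here rather than taken from the paper, and any constant offset or approximation error term is omitted.

$$
f(\mathbf{x}) \approx \mathbf{A}\,\mathbf{s}(\mathbf{x}), \qquad \mathbf{s}(\mathbf{x}) = \big[\log p(z_1 \mid \mathbf{x}),\ \dots,\ \log p(z_k \mid \mathbf{x})\big]^\top
$$

Here $f(\mathbf{x})$ stands for the LLM representation of the context $\mathbf{x}$, and $\mathbf{A}$ is an unknown mixing matrix. Under this reading, concept extraction amounts to learning an unmixing map $\mathbf{W}$ such that

$$
\hat{\mathbf{s}}(\mathbf{x}) = \mathbf{W} f(\mathbf{x}) \approx \mathbf{s}(\mathbf{x}),
$$

without access to labels for the concepts $z_1, \dots, z_k$.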

Figure 1: We introduce a latent variable generative model in which the observed input context $\mathbf{x}$ and next token $y$ arise from an unknown underlying process over latent concepts $\mathbf{z}$.

Sparse ConCA: Design and Implementation

Recovering log-posteriors from LLM representations is an ill-posed inverse problem. To address this, sparse ConCA adopts a sparsity prior, hypothesizing that only a subset of latent concepts is active for each text sample, which constrains the solution space of the underdetermined problem. Unlike in SAEs, the sparsity is imposed in the exponentially transformed space of the extracted features, which preserves concept activity levels. Several normalization-based regularizers, such as LayerNorm and GroupNorm, are investigated for their effectiveness in stabilizing training and improving feature extraction.
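To make the idea of placing sparsity on the exponentially transformed features concrete, here is a minimal NumPy sketch of one possible objective. It is not the paper's implementation: the function name `concept_loss`, the purely linear encoder and decoder, the L1-style penalty on `exp(z)`, and the weight `sparsity_weight` are all assumptions chosen for illustration, and normalization layers are left out.

```python
import numpy as np

def concept_loss(h, W_enc, b_enc, W_dec, b_dec, sparsity_weight=1e-3):
    """Reconstruction loss plus a sparsity penalty on exp(z).

    h has shape (n, d): a batch of n representations of width d.
    All parameter names are hypothetical, not taken from the paper.
    """
    # Linear unmixing: z plays the role of estimated log-posteriors, shape (n, k).
    z = h @ W_enc + b_enc
    # Linear re-mixing back into representation space, shape (n, d).
    h_hat = z @ W_dec + b_dec
    # Mean squared reconstruction error per sample.
    recon = np.mean(np.sum((h - h_hat) ** 2, axis=1))
    # Sparsity acts on exp(z), i.e. on posterior-like magnitudes,
    # rather than on z itself (assumed form of the penalty).
    sparsity = np.mean(np.sum(np.exp(z), axis=1))
    return recon + sparsity_weight * sparsity, z

# Toy usage with random data and small random weights.
rng = np.random.default_rng(0)
n, d, k = 8, 16, 32
h = rng.normal(size=(n, d))
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))
b_dec = np.zeros(d)
loss, z = concept_loss(h, W_enc, b_enc, W_dec, b_dec)
```

Because `exp(z)` is always positive, pushing it toward zero corresponds to pushing the estimated log-posterior toward large negative values, which is one way to read "inactive concept" in this setting. A real training loop would also need to guard against overflow in `exp(z)` for large activations.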

Empirical Evaluation

ConCA is evaluated against SAE variants across multiple LLM architectures. The evaluation criteria include reconstruction loss (MSE) and the Pearson correlation between extracted features and supervised estimates of concept posteriors. On these metrics, the reported results show ConCA variants outperforming the SAE baselines.
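The two metrics named above can be computed as in the following sketch. The helper names `pearson` and `mse` and the toy arrays are hypothetical; how the paper obtains its supervised estimates of concept posteriors, and how it aggregates scores across features, is not specified here.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D arrays of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))

def mse(x, x_hat):
    """Mean squared error between a target array and its reconstruction."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.mean((x - x_hat) ** 2))

# Toy example: one extracted feature versus a supervised reference estimate.
feature = np.array([1.0, 2.0, 3.0, 4.0])
reference = np.array([2.0, 4.0, 6.0, 8.0])
r = pearson(feature, reference)        # perfectly linear relation, so r is 1 up to rounding
err = mse([1.0, 2.0], [1.0, 4.0])      # (0 + 4) / 2 = 2.0
```

Pearson correlation is invariant to positive rescaling and shifting of either argument, so it measures whether a feature tracks the reference estimate without requiring the two to share a scale.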

Figure 2: Comparison of SAE variants and the proposed ConCA variants across different scales and architectures. ConCA variants achieve higher Pearson correlation and lower MSE than the SAE baselines.

Figure 3: Test AUC of SAE variants and the proposed ConCA variants under different few-shot settings across 113 datasets (Top), and out-of-distribution tasks across 8 datasets (Bottom), respectively.

Discussion and Implications

The implications of ConCA span theoretical and practical dimensions. Theoretically, it offers a principled framework with clear mathematical grounding for interpreting LLM representations. Practically, ConCA provides a robust tool for feature extraction that holds promise in enhancing the interpretability of LLMs, making them more applicable to real-world tasks requiring transparency and accountability.

Figure 4: Rank-based fraction of features exhibiting significant changes between counterfactual pairs for SAE and ConCA variants. ConCA shows smaller feature variations, indicating more stable feature responses across counterfactual pairs.

Conclusion

In conclusion, Concept Component Analysis (ConCA) provides both a theoretical foundation and an effective empirical approach to concept extraction in LLMs. Future work could investigate alternative regularization strategies and test ConCA in larger-scale applications.