- The paper introduces Mechanistic Topic Models (MTMs), which leverage sparse autoencoders to extract interpretable feature directions for topic modeling and to steer text generation.
- The study presents three variants—mLDA, mETM, and mBERTopic—that redefine topic representations in a semantically rich feature space.
- Empirical results show that MTMs outperform traditional baselines in coherence metrics and enable effective control of language model outputs.
Leveraging Sparse Autoencoders for Mechanistic Topic Models
The paper introduces Mechanistic Topic Models (MTMs), a novel approach to topic modeling that uses interpretable features learned by sparse autoencoders (SAEs) instead of traditional bag-of-words representations. These models define topics over a semantically rich space, enabling the discovery of deeper conceptual themes with expressive feature descriptions and controllable text generation through topic-based steering vectors. The paper presents three instantiations of MTMs: mechanistic LDA (mLDA), mechanistic ETM (mETM), and mechanistic BERTopic (mBERTopic).
Figure 1: Sample MTM topic outputs on the PoemSum dataset, demonstrating the model's ability to capture complex semantic content through interpretable, high-level features.
Key Concepts and Implementation
The core idea behind MTMs is to leverage the linear representation hypothesis, which holds that high-level concepts in LLMs are encoded as linear directions within their internal activations. SAEs are used to extract these interpretable features from LLM activations. The activation vector $\mathbf{a} \in \mathbb{R}^{H}$ produced by a transformer model can be decomposed as:
$\mathbf{a} = \sum_{i=1}^{W} \alpha_i \mathbf{w}_i + \mathbf{b}$
where $\mathbf{b}$ is an input-independent constant vector; the set $\{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_W\}$ consists of nearly orthogonal unit vectors, each corresponding to a human-interpretable feature; each scalar $\alpha_i$ gives the strength of feature $i$ in the activation vector $\mathbf{a}$ and is sparsely activated; and the number of features $W$ is typically much larger than the dimension $H$.
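As an illustration, here is a minimal sketch of this decomposition for a standard ReLU SAE. The weight names (`W_enc`, `W_dec`) and toy dimensions are hypothetical, not the paper's setup:

```python
import numpy as np

def sae_decompose(a, W_enc, b_enc, W_dec, b_dec):
    """Decompose an activation vector into sparse feature strengths.

    a:     (H,) activation vector from the transformer
    W_enc: (W, H) encoder weights
    W_dec: (W, H) decoder matrix whose rows are the feature directions w_i
    Returns the sparse coefficients alpha (W,) and the reconstruction (H,).
    """
    alpha = np.maximum(W_enc @ a + b_enc, 0.0)  # ReLU encoder -> sparse alpha_i
    a_hat = alpha @ W_dec + b_dec               # sum_i alpha_i * w_i + b
    return alpha, a_hat

# Toy dimensions: W (number of features) much larger than H (hidden size).
rng = np.random.default_rng(0)
H, W = 8, 32
a = rng.normal(size=H)
alpha, a_hat = sae_decompose(a, rng.normal(size=(W, H)), rng.normal(size=W),
                             rng.normal(size=(W, H)) / np.sqrt(H),
                             rng.normal(size=H))
```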
The implementation involves several key steps:
- Corpus Transformation: The corpus is transformed into SAE feature counts by counting how often each feature activates strongly within each document, using a thresholding approach (see the count sketch after this list):
$\tilde{c}_{d,i} = \sum_{j=1}^{N_{\text{tok}}} \mathbf{1}\{ \alpha_{i}(\mathbf{a}_{d,j}) > q_i \}$
where $\alpha_i(\mathbf{a}_{d,j})$ is feature $i$'s activation on token $j$ of document $d$, and $q_i$ is the 80th percentile of feature $i$'s activation distribution on the original SAE training data.
- Topic-Feature Weight Learning: Topic-feature weights $\beta_k \in \mathbb{R}_{+}^{W}$ and document-topic distributions $\theta_d \in \Delta^{K-1}$ are learned using adaptations of existing topic modeling algorithms.
- Topic Description Generation: Interpretable topic descriptions $t_k$ are generated from the learned features, either by directly concatenating the descriptions of the top features (TopFeatures) or by summarizing them with an LLM (Summarization).
- Steering Vector Construction: Steering vectors $s_k$ are constructed for controllable generation by weighting SAE feature directions according to their importance in topic $k$ (see the steering sketch below). This is defined as:
$s_k = \frac{\sum_{i=1}^{W} \beta_{k,i}\,\mathbf{w}_i}{\left\| \sum_{i=1}^{W} \beta_{k,i}\,\mathbf{w}_i \right\|_2}$
where $s_k$ is a unit vector that points in the direction most characteristic of topic $k$ in the LLM's activation space.
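The corpus-transformation step above amounts to thresholded counting of per-token SAE activations. A minimal sketch, assuming the per-feature thresholds `q` are precomputed 80th percentiles:

```python
import numpy as np

def feature_counts(doc_activations, q):
    """Thresholded feature counts for one document.

    doc_activations: (N_tok, W) per-token SAE activations alpha_i(a_{d,j})
    q:               (W,) per-feature thresholds (80th percentile of each
                     feature's activations on the SAE's training data)
    Returns c_tilde: (W,) number of tokens on which feature i fires above q_i.
    """
    return (doc_activations > q).sum(axis=0)
```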
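The steering-vector construction is likewise a short computation: an L2-normalized, $\beta$-weighted sum of the SAE feature directions. A sketch, assuming the rows of `W_dec` are the directions $\mathbf{w}_i$:

```python
import numpy as np

def steering_vector(beta_k, W_dec):
    """Unit steering vector s_k for topic k.

    beta_k: (W,) nonnegative topic-feature weights
    W_dec:  (W, H) SAE decoder matrix whose rows are the directions w_i
    """
    s = beta_k @ W_dec            # sum_i beta_{k,i} * w_i
    return s / np.linalg.norm(s)  # normalize to unit length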
MTM Variants
The paper explores three MTM variants, each adapting an existing topic modeling algorithm to operate on SAE feature counts rather than word counts:
- mLDA: a mechanistic adaptation of Latent Dirichlet Allocation.
- mETM: a mechanistic adaptation of the Embedded Topic Model.
- mBERTopic: a mechanistic adaptation of BERTopic.
Evaluation Methodology
The paper introduces topic judge, an LLM-based pairwise comparison framework for evaluating topic quality. It addresses limitations of existing metrics by assessing how well topics describe documents, enabling fair cross-vocabulary evaluation while capturing semantic nuance. For each document and each pair of models, an LLM judge is prompted to select which model's topic representation better captures the document; a Bradley-Terry model then aggregates the pairwise outcomes into final scores (see the sketch below).
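The Bradley-Terry aggregation step can be sketched with the standard minorization-maximization updates; the win-count matrix below is illustrative, not the paper's data:

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry scores from a pairwise win-count matrix.

    wins[i, j] = number of times model i's topics beat model j's.
    Returns scores p (normalized to sum to 1) via the standard MM updates.
    """
    n = wins.shape[0]
    games = wins + wins.T              # total comparisons per pair
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            mask = np.arange(n) != i
            denom = (games[i, mask] / (p[i] + p[mask])).sum()
            p[i] = wins[i].sum() / denom
        p /= p.sum()                   # fix the scale at each iteration
    return p

# Example: 3 models, model 0 wins most pairwise judgments.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])
print(bradley_terry(wins))
```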
Figure 2: Heatmap representations showing the similarity between mLDA and LDA topics across different datasets, highlighting novel topics found by mLDA on certain datasets.
Empirical Results
The paper evaluates MTMs on five datasets: 20NG, Bills, Wiki, GoEmotions, and PoemSum. The evaluation uses standard topic modeling metrics such as coherence and topic diversity, along with the newly introduced topic judge. Across these metrics, MTMs outperform their traditional bag-of-words counterparts. The paper also analyzes topic novelty by computing correlations between the document-topic distributions of different models (a sketch follows below); this analysis shows that MTMs discover new topics, particularly on datasets where semantic nuance is critical. For steering, the topic relevance win rate (TWR) exceeds 85% across all datasets and reaches 99% on Bills, indicating that steering reliably shifts text generation toward the intended topics.
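A minimal sketch of the novelty analysis, assuming hypothetical document-topic matrices `theta_a` and `theta_b` from two models:

```python
import numpy as np

def topic_similarity(theta_a, theta_b):
    """Correlate each model-A topic with each model-B topic across documents.

    theta_a: (D, K_a) document-topic proportions from one model
    theta_b: (D, K_b) from the other; returns a (K_a, K_b) correlation heatmap.
    """
    A = theta_a - theta_a.mean(axis=0)
    B = theta_b - theta_b.mean(axis=0)
    cov = A.T @ B
    norm = np.outer(np.linalg.norm(A, axis=0), np.linalg.norm(B, axis=0))
    return cov / norm

# Example: 5 "documents", 3 topics vs 4 topics. A topic whose maximum
# correlation with every baseline topic is low can be flagged as novel.
rng = np.random.default_rng(0)
sim = topic_similarity(rng.dirichlet(np.ones(3), size=5),
                       rng.dirichlet(np.ones(4), size=5))
```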

Figure 4: Heatmap representations comparing mETM and ETM topics, and mBERTopic and BERTopic topics, illustrating topic similarity based on document proportions.
Discussion and Implications
The paper demonstrates that MTMs offer a practical improvement to topic modeling by leveraging SAE features to capture context and semantic nuance. The results suggest that interpretability tools like SAEs can be successfully repurposed for downstream tasks, provided that appropriate filtering steps are applied and the downstream task is robust to some degree of noise and mislabeling.

Figure 5: Document log-likelihood difference for mETM and mBERTopic, showing how steering affects the likelihood of on-topic versus off-topic documents.
The ability of MTMs to enable controllable text generation opens new avenues for research in areas such as content creation and personalized language modeling. The strong performance of MTMs on abstract datasets like PoemSum highlights their potential for uncovering hidden themes and semantic relationships in complex text collections.