
Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders (2507.23220v1)

Published 31 Jul 2025 in cs.CL and cs.LG

Abstract: Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose topic judge, an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.

Summary

  • The paper introduces MTMs that leverage sparse autoencoders to extract interpretable feature directions for topic modeling and steer text generation.
  • The study presents three variants—mLDA, mETM, and mBERTopic—that redefine topic representations in a semantically rich feature space.
  • Empirical results show that MTMs match or exceed traditional and neural baselines on coherence metrics and enable effective control of language model outputs.

Leveraging Sparse Autoencoders for Mechanistic Topic Models

The paper introduces Mechanistic Topic Models (MTMs), a novel approach to topic modeling that uses interpretable features learned by sparse autoencoders (SAEs) instead of traditional bag-of-words representations. These models define topics over a semantically rich space, enabling the discovery of deeper conceptual themes with expressive feature descriptions and controllable text generation through topic-based steering vectors. The paper presents three instantiations of MTMs: mechanistic LDA (mLDA), mechanistic ETM (mETM), and mechanistic BERTopic (mBERTopic).

Figure 1: Sample MTM topic outputs on the PoemSum dataset, demonstrating the model's ability to capture complex semantic content through interpretable, high-level features.

Key Concepts and Implementation

The core idea behind MTMs is to leverage the linear representation hypothesis, which suggests that high-level concepts in LLMs are encoded as linear directions within their internal activations. SAEs are used to extract these interpretable features from LLM activations. The activation vector $\mathbf{a} \in \mathbb{R}^{H}$ produced by a transformer model can be decomposed as:

$\mathbf{a} = \sum_{i=1}^W \alpha_i w_i + b$

where $b$ is an input-independent constant vector, the set $\{ w_1, w_2, \ldots, w_W \}$ consists of nearly orthogonal unit vectors, each vector $w_i$ corresponds to a human-interpretable feature, each scalar $\alpha_i$ represents the strength of feature $i$ in the activation vector $\mathbf{a}$, with sparse activation, and the number of vectors $W$ is typically much larger than their dimension $H$.
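
As a toy illustration of this decomposition, the sketch below rebuilds an activation vector from a sparse set of feature strengths. The directions, strengths, bias, and the `reconstruct` helper are made-up values and names for illustration only; they are not taken from the paper or from any real SAE, and plain Python lists stand in for the tensors a real pipeline would use.

```python
# Toy illustration of a = sum_i alpha_i * w_i + b with sparse alpha.
# All numbers and the helper name `reconstruct` are hypothetical.

def reconstruct(alpha, directions, bias):
    """Return sum_i alpha[i] * directions[i] + bias as a list of floats."""
    out = list(bias)
    for strength, w in zip(alpha, directions):
        if strength == 0.0:  # sparse: inactive features contribute nothing
            continue
        for h in range(len(out)):
            out[h] += strength * w[h]
    return out

# H = 2 activation dimensions, W = 4 unit-norm feature directions (W > H).
# A 2-D toy cannot show near-orthogonality; real SAEs live in far higher H.
directions = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [-0.8, 0.6],
]
bias = [0.5, -0.5]
alpha = [2.0, 0.0, 0.0, 1.0]  # only two of the four features are active

a = reconstruct(alpha, directions, bias)
print(a)
```

Because only two entries of `alpha` are nonzero, the activation is explained by two feature directions plus the constant offset, which is the property MTMs rely on when they treat active features as the "vocabulary" of a document.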

The implementation involves several key steps:

  1. Corpus Transformation: The corpus is transformed into SAE feature counts by counting how often each feature activates strongly within each document using a thresholding approach: $\tilde{c}_{d,i} = \sum_{j=1}^{N_{\text{tok}}} \mathbb{1}\{ \alpha_{i}(\mathbf{a}_{d,j}) > q_i \}$, where $\alpha_i(\mathbf{a}_{d,j})$ is feature $i$'s activation on token $j$ of document $d$, and $q_i$ is the 80th percentile of feature $i$'s activation distribution on the original SAE training data.
  2. Topic-Feature Weight Learning: Topic-feature weights $\boldsymbol{\beta}_k \in \mathbb{R}_+^W$ and document-topic distributions $\boldsymbol{\theta}_d \in \Delta^{K-1}$ are learned using adaptations of existing topic modeling algorithms.
  3. Topic Description Generation: Interpretable topic descriptions $t_k$ are generated from learned features, either through direct concatenation of top feature descriptions (TopFeatures) or by summarizing them using an LLM (Summarization).
  4. Steering Vector Construction: Steering vectors $\mathbf{s}_k$ are constructed for controllable generation by weighting SAE feature directions according to their importance in topic $k$. This is defined as:

    $\mathbf{s}_k = \frac{\sum_{i=1}^{W} \beta_{k,i} w_i}{\left\| \sum_{i=1}^{W} \beta_{k,i} w_i \right\|_2}$

    where $\mathbf{s}_k$ is a unit vector that points in the direction most characteristic of topic $k$ in the LLM's activation space.
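
Steps 1 and 4 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `feature_counts` and `steering_vector`, the toy activations, and the thresholds are all hypothetical, and the thresholds are simply given rather than estimated as 80th percentiles from SAE training data.

```python
import math

def feature_counts(doc_activations, thresholds):
    """Count, per feature, the tokens whose activation exceeds its threshold.

    doc_activations: list of tokens, each a list of W feature activations.
    thresholds: list of W per-feature cutoffs (q_i in the text).
    """
    counts = [0] * len(thresholds)
    for token in doc_activations:
        for i, (act, q) in enumerate(zip(token, thresholds)):
            if act > q:
                counts[i] += 1
    return counts

def steering_vector(beta_k, directions):
    """Weight feature directions by beta_k, then normalize to unit L2 norm."""
    dim = len(directions[0])
    total = [0.0] * dim
    for weight, w in zip(beta_k, directions):
        for h in range(dim):
            total[h] += weight * w[h]
    norm = math.sqrt(sum(x * x for x in total))
    if norm == 0.0:
        raise ValueError("topic has no weight on any feature direction")
    return [x / norm for x in total]

# Hypothetical document with 3 tokens and W = 3 features.
doc = [
    [0.9, 0.0, 0.2],
    [0.7, 0.1, 0.0],
    [0.1, 0.8, 0.0],
]
q = [0.5, 0.5, 0.5]  # assumed thresholds, one per feature
counts = feature_counts(doc, q)

# Hypothetical feature directions (H = 2) and topic-feature weights.
directions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
beta_k = [3.0, 4.0, 0.0]
s_k = steering_vector(beta_k, directions)
print(counts, s_k)
```

The count vector plays the role that word counts play in a bag-of-words model, which is why existing topic modeling algorithms can be adapted in step 2 with little change.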

MTM Variants

The paper explores three MTM variants:

  • mLDA: Adapts LDA by replacing topic-word distributions with distributions over SAE features.
  • mETM: Represents topics as vectors in LLM activation space, leveraging these activations to capture semantic relationships. The topic-feature distribution $\boldsymbol{\beta}_k \in [0, 1]^W$ is obtained by transforming the learned activation $v_k$ as in the SAE encoder: $\boldsymbol{\beta}_k = \sigma\left( W_{\text{in}} v_k + b \right)$.
  • mBERTopic: Constructs document embeddings from SAE feature representations and uses clustering to discover topics. This is defined as:

    $\tilde{e}_d = \frac{1}{N_{\text{tok}}} \sum_{i=1}^W \tilde{c}_{d,i} w_i$
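
The two formulas in the mETM and mBERTopic items can be sketched as follows. The helper names, the toy encoder matrix, and all numeric values are hypothetical, and the sketch assumes that $\sigma$ is the logistic sigmoid and that the bias has one entry per SAE feature; neither detail is confirmed by the summary above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def topic_feature_weights(w_in, v_k, b):
    """mETM-style transform: sigmoid(W_in v_k + b), one value per SAE feature.

    w_in: W rows of length H (assumed encoder matrix).
    v_k:  topic vector of length H in activation space.
    b:    length-W bias (assumption: one entry per feature).
    """
    return [
        sigmoid(sum(w * v for w, v in zip(row, v_k)) + b_i)
        for row, b_i in zip(w_in, b)
    ]

def document_embedding(counts, directions, n_tok):
    """mBERTopic-style embedding: (1 / N_tok) * sum_i counts[i] * directions[i]."""
    dim = len(directions[0])
    emb = [0.0] * dim
    for c, w in zip(counts, directions):
        for h in range(dim):
            emb[h] += c * w[h]
    return [x / n_tok for x in emb]

# Hypothetical encoder with W = 3 features and H = 2 dimensions.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
v_k = [2.0, -2.0]
beta_k = topic_feature_weights(w_in, v_k, b)

# Hypothetical feature counts for a 4-token document.
directions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
e_d = document_embedding([2, 1, 0], directions, n_tok=4)
print(beta_k, e_d)
```

The resulting embeddings `e_d` could then be passed to any clustering routine, which is the step where mBERTopic discovers topics.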

Evaluation Methodology

The paper introduces topic judge, an LLM-based pairwise comparison evaluation framework. This framework addresses limitations of existing metrics by using pairwise comparisons to assess how well topics describe documents, enabling fair cross-vocabulary evaluation while capturing semantic nuance. The method involves performing pairwise comparisons between all model pairs, prompting an LLM judge to select which representation better captures the document. A Bradley-Terry model is then used to compute final scores.
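
To make the scoring step concrete, the sketch below fits Bradley-Terry strengths to a matrix of pairwise wins. The summary does not say how the paper fits the model, so the iterative minorization-maximization (MM) update used here is an assumption, and the win counts are invented for illustration.

```python
def bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths with MM updates.

    wins[i][j] is the number of comparisons in which model i beat model j.
    Assumes at least one comparison was made. Returns strengths that sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iterations):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x / scale for x in new_p]
    return p

# Hypothetical judge outcomes for three topic models, 10 comparisons per pair.
wins = [
    [0, 8, 9],  # model 0 beat model 1 eight times and model 2 nine times
    [2, 0, 7],
    [1, 3, 0],
]
scores = bradley_terry(wins)
print(scores)
```

Under the Bradley-Terry model, the probability that model $i$ is preferred over model $j$ is $p_i / (p_i + p_j)$, so the fitted strengths give a single ranking that stays comparable even when the models describe topics with different vocabularies.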

Figure 2: Heatmap representations showing the similarity between mLDA and LDA topics across different datasets, highlighting novel topics found by mLDA on certain datasets.

Empirical Results

The paper evaluates MTMs on five datasets: 20NG, Bills, Wiki, GoEmotions, and PoemSum. The results demonstrate that:

  • MTMs match or exceed traditional and neural baselines on coherence metrics.
  • MTMs are consistently preferred by topic judge, particularly on challenging datasets like GoEmotions and PoemSum. (Table 1)
  • MTMs enable effective steering of LLM outputs, allowing for controllable text generation. (Table 2)

    Figure 3: Average document log likelihood differences between on-topic and off-topic documents for mLDA, demonstrating successful topic ablation and steering.
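
The summary does not spell out how a steering vector is applied at generation time. A common convention in activation steering, assumed here rather than taken from the paper, is to add a scaled copy of the unit direction to an activation for steering and to project that direction out for ablation. The function names, the `strength` parameter, and the numbers below are all hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def steer(activation, s_k, strength):
    """Shift an activation along the unit topic direction s_k (assumed scheme)."""
    return [a + strength * s for a, s in zip(activation, s_k)]

def ablate(activation, s_k):
    """Remove the component of the activation along the unit vector s_k."""
    proj = dot(activation, s_k)
    return [a - proj * s for a, s in zip(activation, s_k)]

s_k = [0.6, 0.8]        # hypothetical unit-norm topic direction
activation = [1.0, 2.0]  # hypothetical activation at one token position

steered = steer(activation, s_k, strength=5.0)
ablated = ablate(activation, s_k)
print(steered, ablated)
```

After ablation the activation has no component along `s_k`, while steering raises that component by exactly `strength`; both properties hold only because `s_k` is normalized to unit length.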

The evaluation uses standard topic modeling metrics like coherence and topic diversity, along with the newly introduced topic judge metric. The paper also analyzes topic novelty by computing correlations between document-topic distributions of different models. The results show that MTMs discover new topics, particularly on datasets where semantic nuance is critical. The topic relevance win rate (TWR) exceeds 85% across all datasets, reaching 99% on Bills, indicating that steering shifts text generation toward intended topics.
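
The novelty analysis can be illustrated with a small correlation computation over document-topic proportions. The use of Pearson correlation, the helper names, and the toy matrices are assumptions made for this sketch; the summary only states that correlations between document-topic distributions are computed.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def topic_correlations(theta_a, theta_b):
    """Correlate every topic of model A with every topic of model B.

    theta_a, theta_b: document-topic matrices with one row per document.
    Returns a K_A x K_B matrix of correlations across documents.
    """
    cols_a = list(zip(*theta_a))
    cols_b = list(zip(*theta_b))
    return [[pearson(ca, cb) for cb in cols_b] for ca in cols_a]

# Hypothetical proportions for 4 documents and 2 topics per model.
theta_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
theta_b = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]

corr = topic_correlations(theta_a, theta_b)
best_match = [max(row) for row in corr]  # low values would suggest a novel topic
print(corr, best_match)
```

In this toy case each topic of model A has a perfectly correlated counterpart in model B; a topic whose best match stays low across all of the other model's topics is a candidate novel topic, which is the pattern the heatmaps in the figures are meant to reveal.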

Figure 4: Heatmap representations comparing mETM and ETM topics, and mBERTopic and BERTopic topics, illustrating topic similarity based on document proportions.

Discussion and Implications

The paper demonstrates that MTMs offer a practical improvement to topic modeling by leveraging SAE features to capture context and semantic nuance. The results suggest that interpretability tools like SAEs can be successfully repurposed for downstream tasks, provided that appropriate filtering steps are applied and the downstream task is robust to some degree of noise and mislabeling.

Figure 5: Document log-likelihood difference for mETM and mBERTopic, showing how steering affects the likelihood of on-topic versus off-topic documents.

The ability of MTMs to enable controllable text generation opens up new avenues for research in areas such as content creation and personalized language modeling. The strong performance of MTMs on abstract datasets like PoemSum highlights their potential for uncovering hidden themes and semantic relationships in complex text collections.