
LLM Pretraining with Continuous Concepts (2502.08524v1)

Published 12 Feb 2025 in cs.LG and cs.CL

Abstract: Next token prediction has been the standard training objective used in LLM pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.

Summary

  • The paper introduces CoCoMix, a pretraining method that augments token prediction with latent continuous concept modeling derived from a sparse autoencoder.
  • It employs a TopK mechanism with attribution scores to select key semantic concepts, boosting next token prediction in weak-to-strong supervision scenarios.
  • Empirical results demonstrate enhanced interpretability and steerability, with CoCoMix outperforming standard NTP and knowledge distillation baselines.

The paper introduces Continuous Concept Mixing (CoCoMix), a pretraining framework for LLMs that combines next token prediction with continuous concepts learned from a pretrained sparse autoencoder (SAE). The motivation stems from the limitations of relying solely on token-level perplexity for learning high-level reasoning and conceptual understanding. CoCoMix aims to bridge semantic abstraction and fine-grained token-level guidance by augmenting the next token prediction objective with explicit modeling of concepts in a latent representation space.

The paper details the CoCoMix methodology, which involves extracting semantic concepts using a pretrained SAE and selecting the most influential ones based on attribution scores. The model is trained to predict these selected concepts from its hidden state using a cross-entropy loss. The predicted concepts are then compressed into a single continuous concept vector, which is interleaved with the token hidden representations. The SAE decomposes the hidden state into multiple dimensions, each representing a distinct concept, and uses a TopK activation function to enforce sparsity, isolating the most critical dimensions that explain the pretrained model's features. The SAE's reconstruction process is defined as:

$h_t^{\mathtt{pre}} = E\bigl(h_t^{\mathtt{con}}\bigr), \quad h_t = \mathrm{TopK}\bigl(h_t^{\mathtt{pre}}\bigr), \quad \widehat{h}_t^{\mathtt{con}} = D\bigl(h_t\bigr),$

where

  • $h_t^{\mathtt{con}}$ is the pretrained model's hidden state at position $t$,
  • $E$ is a linear encoder mapping $\mathbb{R}^{d_{\mathtt{con}}}$ to $\mathbb{R}^{C}$,
  • $D$ is a linear decoder mapping $\mathbb{R}^{C}$ to $\mathbb{R}^{d_{\mathtt{con}}}$,
  • $C$ is the dimension of the concept space,
  • $h_t^{\mathtt{pre}}$ is the pre-activation concept vector,
  • $\mathrm{TopK}(\cdot)$ zeros out all but the largest $K_{\mathtt{SAE}}$ entries,
  • $\widehat{h}_t^{\mathtt{con}}$ is the reconstruction.
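
As a concrete illustration, the following is a minimal PyTorch sketch of a TopK sparse autoencoder of this form. It is not the authors' implementation, and the hidden size, concept count, and $K_{\mathtt{SAE}}$ in the example are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Linear encoder E, TopK sparsification, linear decoder D."""
    def __init__(self, d_con: int, c: int, k_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_con, c)   # E: R^{d_con} -> R^C
        self.decoder = nn.Linear(c, d_con)   # D: R^C -> R^{d_con}
        self.k_sae = k_sae

    def topk(self, x: torch.Tensor) -> torch.Tensor:
        # Zero out all but the K_SAE largest entries of each concept vector.
        vals, idx = torch.topk(x, self.k_sae, dim=-1)
        return torch.zeros_like(x).scatter(-1, idx, vals)

    def forward(self, h_con: torch.Tensor):
        h_pre = self.encoder(h_con)        # pre-activation concept vector
        h_sparse = self.topk(h_pre)        # sparse concept activations
        h_recon = self.decoder(h_sparse)   # reconstruction of the hidden state
        return h_pre, h_sparse, h_recon

# Example usage with hypothetical GPT-2-sized hidden states.
sae = TopKSAE(d_con=768, c=32768, k_sae=32)
h_con = torch.randn(2, 1024, 768)          # (batch, sequence, d_con)
h_pre, h_sparse, h_recon = sae(h_con)
```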

The attribution score $s_t$ measures the influence of each concept on the output, based on a local linear approximation of the effect of changing the concept value:

$s_t = h_t^{\mathtt{pre}} \,\odot\, \nabla_{h_t}\bigl(-\log f_{\mathtt{con}}\bigl(x_{t+1}\mid D(h_t),\, x_{<t}\bigr)\bigr),$

where

  • $s_t$ is the attribution score,
  • $h_t^{\mathtt{pre}}$ is the pre-activation,
  • $\odot$ denotes element-wise multiplication,
  • $f_{\mathtt{con}}\bigl(x_{t+1}\mid D(h_t),\, x_{<t}\bigr)$ is the probability of predicting the next token $x_{t+1}$ given the decoded concepts and previous tokens.
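
In practice, such a score can be computed with a single backward pass. The sketch below is a hedged illustration of the formula, not the paper's code; it assumes a hypothetical `pretrained_lm` callable that maps decoded concepts and the preceding context to next-token logits.

```python
import torch
import torch.nn.functional as F

def attribution_scores(h_pre, h_sparse, decoder, pretrained_lm, x_next, x_prev):
    """h_pre, h_sparse: (batch, C); x_next: (batch,) next-token ids.
    `pretrained_lm` and its call signature are illustrative placeholders."""
    h_sparse = h_sparse.detach().requires_grad_(True)
    # Next-token logits of the pretrained model given the decoded concepts.
    logits = pretrained_lm(decoder(h_sparse), x_prev)        # (batch, vocab)
    nll = F.cross_entropy(logits, x_next, reduction="sum")   # -log f_con(x_{t+1} | ...)
    (grad,) = torch.autograd.grad(nll, h_sparse)             # gradient w.r.t. h_t
    return h_pre * grad                                      # s_t, shape (batch, C)
```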

The indices of the concepts with high attribution scores are selected as discrete labels for concept prediction. A linear prediction head $M$ outputs logits $l_t = M(h_t) \in \mathbb{R}^{C}$, where $h_t$ is the model's hidden state. The cross-entropy loss $\mathcal{L}_{\mathtt{concept}}$ is defined as:

$\mathcal{L}_{\mathtt{concept}}(h_t) = \frac{1}{K_{\text{attr}}}\sum_{i \in \mathcal{I}} \mathrm{CE}\bigl(l_t,\, i\bigr),$

where

  • $\mathcal{I}$ is the set of indices corresponding to the top $K_{\text{attr}}$ values of $s_t$,
  • $\mathrm{CE}$ is the cross-entropy.
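
A minimal sketch of this loss follows; the function name, argument shapes, and batching are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def concept_loss(l_t: torch.Tensor, s_t: torch.Tensor, k_attr: int) -> torch.Tensor:
    """l_t: (batch, C) concept logits from the head M; s_t: (batch, C) attribution scores."""
    # The top-K_attr attributed concept indices serve as discrete target labels.
    top_idx = torch.topk(s_t, k_attr, dim=-1).indices        # (batch, K_attr)
    log_probs = F.log_softmax(l_t, dim=-1)                   # (batch, C)
    # Average cross-entropy over the selected indices (and the batch).
    return -log_probs.gather(-1, top_idx).mean()
```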

The concept prediction logit $l_t$ is sparsified using a $\mathrm{TopK}$ activation and compressed into a continuous concept vector $\hat{h}_t \in \mathbb{R}^{d}$:

$\hat{h}_t = W\,\mathrm{TopK}(l_t) + b,$

where

  • $W \in \mathbb{R}^{d \times C}$ and $b \in \mathbb{R}^{d}$ project the TopK-sparse vector to a $d$-dimensional embedding.
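
One way this compression step could look in PyTorch is sketched below, where `nn.Linear` stands in for the pair $W$, $b$; the class name and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class ConceptCompressor(nn.Module):
    """Sparsify concept logits with TopK, then project to the model dimension d."""
    def __init__(self, c: int, d: int, k: int):
        super().__init__()
        self.proj = nn.Linear(c, d)   # weight and bias play the roles of W and b
        self.k = k

    def forward(self, l_t: torch.Tensor) -> torch.Tensor:
        vals, idx = torch.topk(l_t, self.k, dim=-1)
        sparse = torch.zeros_like(l_t).scatter(-1, idx, vals)
        return self.proj(sparse)      # continuous concept vector \hat{h}_t in R^d
```

The resulting $\hat{h}_t$ is then interleaved with the token hidden representations, as described in the abstract, so that subsequent layers can attend to both.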

The final training objective combines the standard next token prediction loss and the concept prediction term:

$\sum_{t=1}^{T-1} \Bigl[ -\log f\bigl(x_{t+1}\mid x_{\leq t},\, \hat{h}_{\leq t}\bigr) + \lambda\, \mathcal{L}_{\mathtt{concept}}(h_t) \Bigr],$

where

  • $\lambda$ is a tunable coefficient weighting the concept prediction term.
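
Putting the pieces together, the objective can be sketched as follows, with the concept loss inlined; all tensor shapes and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cocomix_loss(token_logits, targets, concept_logits, s_t, k_attr: int, lam: float):
    """token_logits: (B, T, V); targets: (B, T) next-token ids;
    concept_logits, s_t: (B, T, C); lam is the coefficient lambda."""
    # Standard next-token prediction loss.
    ntp = F.cross_entropy(token_logits.flatten(0, 1), targets.flatten())
    # Concept prediction loss against the top-K_attr attributed concepts.
    top_idx = torch.topk(s_t, k_attr, dim=-1).indices
    log_probs = F.log_softmax(concept_logits, dim=-1)
    concept = -log_probs.gather(-1, top_idx).mean()
    return ntp + lam * concept
```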

The paper presents an empirical evaluation of CoCoMix, examining its performance on next token prediction, weak-to-strong supervision, interpretability, and steerability. The training setup uses a pretrained open-source SAE trained on the 124M-parameter GPT-2. CoCoMix models of 68M, 386M, and 1.38B parameters are trained with a context length of 1024, using the OpenWebText dataset as the pretraining corpus. Baselines include the standard next token prediction (NTP) procedure and knowledge distillation (KD).

The results demonstrate that CoCoMix improves the performance of next token prediction, particularly in weak-to-strong supervision scenarios. CoCoMix achieves comparable performance to NTP with fewer training tokens and shows improvements in downstream tasks. In weak-to-strong supervision, concepts extracted from a smaller model are used to supervise the training of a larger model. CoCoMix also enhances interpretability and steerability, allowing for the analysis and control of the model's output generation. Additionally, the paper analyzes the effectiveness of each component of CoCoMix, including the attribution score, concept prediction, and mixing.

The paper compares CoCoMix with KD across multiple scenarios, including a stronger teacher model teaching a smaller student model, weak-to-strong supervision, and distribution shift. CoCoMix demonstrates improvements over KD in all model configurations, particularly in weak-to-strong supervision. A weight analysis of the compression layer reveals that CoCoMix learns to ignore ineffective concepts. Both concept prediction and concept insertion are critical for performance improvement. Comparing concept conditioning methods, the insertion method, which interleaves the continuous concept, performs better than intervention, which adds the concept vector to the hidden state. CoCoMix also outperforms pause tokens, indicating that the inserted continuous concepts contain useful information.
