Base Models Know How to Reason, Thinking Models Learn When (2510.07364v1)
Abstract: Why do thinking LLMs like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasoning capabilities or repurpose pre-existing base model ones. In this work, we propose a hybrid model where we activate reasoning mechanisms in base models at the right time to elicit thinking-model-level reasoning chains, implying that thinking models exploit already existing capabilities. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning behaviors in thinking models. This approach provides an unbiased method to discover reasoning behaviors without imposing manual or LLM-derived assumptions. Across three base and four thinking models, using GSM8K and MATH500, our hybrid model recovers up to 91% of the performance gap to thinking models without any weight updates while steering only 12% of tokens. Concretely, our empirical setup provides a simple, causal way to test the effectiveness of existing reasoning mechanisms in base models by invoking them directly and measuring the resulting task performance. More broadly, these results reframe our understanding of how thinking models are trained: pre-training is when models acquire most of their reasoning mechanisms, and post-training teaches efficient deployment of these mechanisms at the right time, enabling efficient use of their inference-time compute.
Explain it Like I'm 14
Overview
This paper asks a simple question: why do “thinking” LLMs (the ones that take extra time to write out step-by-step reasoning) beat regular “base” models on hard problems? The authors’ main idea is that base models already know how to reason, but thinking models learn when to use those skills. They show this by lightly “steering” base models at the right moments so they behave much more like thinking models—without retraining or changing the model’s weights.
Key Questions
- Do base models already have the core reasoning skills that thinking models use?
- If so, can we activate those skills at the right time to get thinking-model-level performance?
- What kinds of reasoning steps do thinking models use, and can we map them in a clear, human-friendly way?
Methods and Approach
Building a map of “reasoning habits”
Imagine a model’s chain-of-thought as a student solving problems step-by-step. Different sentences in its reasoning can play different roles—like setting a plan, checking a result, or backtracking and trying again. The authors build a “taxonomy” (a categorized list) of these reasoning habits by analyzing the model’s internal signals.
They do this using a tool called a Sparse Autoencoder (SAE). Think of an SAE like a smart sorter: it looks at the “brain activity” of the model while it reasons and finds a small number of key patterns that keep showing up. Each pattern roughly corresponds to a type of reasoning step. To make these patterns easy to understand, the authors:
- Group sentences into categories based on similar internal signals.
- Ask another AI to read sample sentences and describe each category in plain language (for example, “verification,” “set a subgoal,” “backtrack,” etc.).
- Check that the categories are high-quality: complete (cover what’s happening), consistent (easy to classify), and independent (not just duplicates of each other).
This bottom-up approach discovers what the model actually does, instead of imposing human guesses.
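To make the SAE step concrete, here is a minimal sketch of a Top-K sparse autoencoder over sentence-level activations, written in PyTorch. The dimensions, the value of K, and the training loop are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder over sentence-level activations.
    d_model, n_latents, and k are illustrative, not the paper's settings."""
    def __init__(self, d_model=4096, n_latents=20, k=2):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)
        self.k = k

    def forward(self, x):
        z = self.encoder(x)
        # Keep only the k largest latents per sentence; zero out the rest.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse

# Usage sketch: x holds sentence-averaged activations from one model layer.
sae = TopKSAE()
x = torch.randn(32, 4096)               # placeholder activations
recon, latents = sae(x)
loss = ((recon - x) ** 2).mean()        # plain reconstruction objective
loss.backward()
# Each latent index then becomes a candidate "reasoning mechanism" category,
# named afterwards by an LLM from example sentences that activate it.
```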
Teaching base models “when” to use those habits
The authors then try to activate the same reasoning habits in base models. They add small “steering vectors” to the base model’s activations—this is like turning tiny dials in the model’s brain to trigger a specific habit at the right moment.
Here’s the flow, using an everyday analogy:
- The base model writes most of the answer (like a student doing the work).
- A “thinking model activation classifier” watches the ongoing text and decides which habit would help next (like a coach whispering, “Now, double-check that step”).
- The system adds the corresponding steering vector to the base model briefly, nudging it to do that reasoning step.
- They only nudge on some tokens, not all, which keeps the intervention minimal.
Crucially, they don’t retrain the base model at all—no weight updates. They just add temporary nudges while it’s generating.
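For concreteness, here is a minimal sketch of the hybrid decoding loop under stated assumptions: a Hugging Face Llama/Qwen-style base model exposing `model.layers`, a hypothetical `classify_mechanism` helper standing in for the SAE-based classifier, a dictionary of precomputed steering vectors, and an illustrative coefficient `alpha`. It illustrates the control flow (decide when, inject a direction, steer only that token), not the paper's exact implementation.

```python
import torch

def hybrid_generate(base_model, tokenizer, prompt, classify_mechanism,
                    steering_vectors, layer_idx, alpha=4.0, max_new_tokens=256):
    """Greedy decoding with the base model, nudged by steering vectors.

    classify_mechanism(text) returns a category name or None (a stand-in
    for the paper's activation classifier); steering_vectors maps each
    category to a direction on the right device/dtype. No KV cache, for
    simplicity of the sketch.
    """
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        category = classify_mechanism(tokenizer.decode(ids[0]))  # when to steer
        handle = None
        if category is not None:
            vec = steering_vectors[category]                     # what to activate
            def add_vec(module, inputs, output, v=vec):
                # Add the steering direction to the layer's hidden states.
                if isinstance(output, tuple):
                    return (output[0] + alpha * v,) + output[1:]
                return output + alpha * v
            handle = base_model.model.layers[layer_idx].register_forward_hook(add_vec)
        with torch.no_grad():
            logits = base_model(ids).logits[:, -1, :]
        if handle is not None:
            handle.remove()                                      # steer only this token
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```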
How they tested it
They tested across several base and thinking models, on two math benchmarks:
- GSM8K (grade-school level word problems).
- MATH500 (harder, competition-level problems).
They measured how much of the performance gap between the base and thinking models they could recover using the hybrid approach (base + minimal steering).
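The headline metric, gap recovery, is simply the fraction of the base-to-thinking accuracy gap that the hybrid closes. A tiny sketch with made-up numbers:

```python
def gap_recovery(acc_base, acc_hybrid, acc_thinking):
    """Fraction of the base-to-thinking accuracy gap closed by the hybrid."""
    return (acc_hybrid - acc_base) / (acc_thinking - acc_base)

# Illustrative numbers only, not the paper's reported accuracies:
print(gap_recovery(acc_base=0.60, acc_hybrid=0.78, acc_thinking=0.80))  # 0.9
```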
Main Findings
The following points summarize the most important results:
- Base models already have the core reasoning mechanisms inside them. With the right nudges at the right times, they can produce structured chains of thought similar to thinking models.
- The hybrid approach recovered up to 91% of the performance gap on MATH500 and up to around 82% on GSM8K, depending on the model pair.
- Steering was very sparse: depending on the model pair, only about 6%–21% of tokens per problem were nudged (around 12% in the headline setting). In other words, a small number of well-timed nudges go a long way.
- Bigger base models benefited more than smaller ones, suggesting larger models have cleaner, more steerable reasoning directions.
- Ablation tests showed that two things matter:
- The learned steering directions are specific, not random.
- Timing is critical—activating the right habit at the right moment makes the difference.
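As a concrete illustration of the "specific, not random" ablation, the natural control for a learned steering vector is a random direction of matched norm; a minimal sketch (matched norm is an assumption for illustration, not necessarily the paper's exact control):

```python
import torch

def random_control_vector(steering_vec: torch.Tensor) -> torch.Tensor:
    """Random direction with the same norm as a learned steering vector.
    If this control steers as well as the learned vector, the learned
    direction is not capturing anything mechanism-specific."""
    rand = torch.randn_like(steering_vec)
    return rand * (steering_vec.norm() / rand.norm())
```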
Why This Matters
The big takeaway is about how thinking models are trained and why they work so well:
- Pre-training (the huge initial learning phase) appears to teach models most of the “how” of reasoning—the actual skills.
- Post-training (like reinforcement learning with verifiable rewards, RLVR) mainly teaches “when” to use those skills in a well-ordered sequence, so the model spends its extra thinking time efficiently.
- This reframes reasoning training: we may not need to teach brand-new skills; we might just need better ways to control timing and activation of existing skills.
Implications and Impact
- More efficient training: Instead of heavy retraining, we can use lightweight steering to activate reasoning habits already inside base models.
- Better distillation: When transferring reasoning to smaller models, focus on teaching “when to use” each habit, not just copying answers.
- Debugging and improvement: If a model’s reasoning fails, we can target specific habits (like verification or backtracking) and strengthen or trigger them more reliably.
- Practical systems: Hybrid setups can leverage base models for speed and cost, while using minimal guidance to reach near-thinking-model performance.
In short, the paper suggests that base models know how to reason; thinking models learn when to reason. Unlocking those “when” skills—through smart, sparse steering—can close most of the gap without retraining.
Knowledge Gaps
Below is a concrete list of what remains missing, uncertain, or unexplored, framed to be actionable for future work.
- Reliance on a thinking-model oracle during inference: The hybrid uses a thinking model to (a) classify when to activate a mechanism (via SAE activation/gating) and (b) select steered tokens by minimizing thinking-model perplexity. Does gap recovery persist if the oracle is removed or replaced by a small learned gate trained only on base-model signals?
- Baseline controls for oracle usage: Compare hybrid gains against simpler inference-time guidance baselines (e.g., logit interpolation, shallow fusion with the thinking model, reranking via thinking-model perplexity) without steering to isolate the unique contribution of activation steering.
- Independence from thinking-model style: Steering vectors are trained to increase likelihood of thinking-model completions and decrease base-model completions. How much of the observed gains reflect style imitation (persona, phrasing) versus mechanism activation? Devise style-controlled evaluations and style-invariant objectives.
- Token-selection without thinking-model perplexity: The current selection of steered tokens and coefficients depends on thinking-model perplexity. Can coefficients and windows be chosen via base-model-only signals (e.g., confidence/entropy, internal uncertainty estimators, calibration errors)?
- Causal validation of mechanism claims: Beyond performance, show that steering truly induces specific mechanisms (e.g., verification, backtracking) via targeted unit tests, causal tracing/path patching, and counterfactual evaluations where a mechanism is necessary and sufficient.
- Faithfulness of chains: Do steered chains reflect the model’s true internal computation? Evaluate CoT faithfulness (e.g., via established faithfulness metrics, contradiction checks, and intervention-based tests) to ensure improved accuracy isn’t driven by unfaithful explanations.
- Human-grounded taxonomy validation: Taxonomy quality (completeness, consistency, independence) is judged by an LLM. Add human annotations, inter-annotator agreement, and external task-grounded measures to validate that categories are meaningful to humans.
- Stability of discovered categories: Assess taxonomy robustness across random seeds, datasets (beyond MMLU-Pro), prompt formats, aggregation schemes (token vs sentence vs clause), and SAE hyperparameters (latent size, k in Top-K, sparsity targets).
- Layer and depth sensitivity: Steering effective layer is fixed near ~37% depth; taxonomy layer is separately chosen. Systematically map where mechanisms are encoded and most causally steerable across layers, and whether this varies by model family and scale.
- Top-K SAE design choices: The restricted decoder and very low latent dimensionality are assumed to force “mechanism-level” features. Quantify trade-offs (elbow points) and test whether larger dictionaries uncover additional mechanisms or reduce category entanglement.
- Generalization beyond math: Results are shown on GSM8K and MATH500. Evaluate on diverse domains (code, science QA, logical reasoning, commonsense, multi-hop QA, multilingual tasks) to test whether the “base knows how; thinking learns when” claim generalizes.
- Cross-model universality of mechanisms: Are discovered categories shared across architectures (Llama vs Qwen), sizes (1.5B–70B+), training regimes (SFT, RLVR, distillation), and tokenizers? Quantify overlap (e.g., subspace alignment/CKA) and model-specific mechanisms.
- Small-model failure modes: Hybrid gains are minimal for smaller bases (e.g., 1.5B, 8B). Diagnose whether failures stem from noisier directions, missing mechanisms, capacity limits, or gating errors; test if additional or deeper-layer steering helps.
- Compute and latency overhead: The hybrid incurs extra forward passes for SAE activations and coefficient/window sweeps plus thinking-model scoring. Provide wall-clock, FLOPs, and memory costs vs base-only and full thinking-model inference.
- Steering sparsity vs efficacy: You steer ~6–21% of tokens. Characterize the marginal benefit of steering fraction, sensitivity to window length, and timing errors. Can steering be made sparser with minimal loss (e.g., via confidence-triggered gates)?
- Alternative gating signals: Replace the SAE oracle with learned gates on base-model activations, uncertainty estimators, error predictors, or verifier signals that do not require the thinking model, and compare gap recovery (see the sketch after this list).
- Verifier-driven guidance without thinking model: For QwQ-like settings, can lightweight verifier signals (stepwise checks) replace thinking-model perplexity in both timing and coefficient selection?
- Stronger ablations: Current ablations (only-bias, random-firing, random-vectors) are informative but limited. Add: (a) no-thinker-perplexity selection, (b) wrong-category steering, (c) shuffled sentence labels, (d) steering at wrong layers, (e) steering into orthogonal random subspaces.
- Style vs mechanism disentanglement metrics: Develop automatic diagnostics (e.g., style classifiers, lexical diversity, syntactic templates, verbosity controls) and mechanism probes (e.g., explicit backtracking detectors) to quantify disentanglement.
- Statistical rigor and uncertainty: Report confidence intervals, bootstrap variability, and significance tests on benchmark results, and analyze per-category variance in steering effectiveness across problems.
- Domain shift in taxonomy training: The taxonomy is trained on MMLU-Pro traces but evaluated on math tasks. Quantify transfer, and test task-specific vs task-agnostic taxonomies.
- Safety and robustness: Investigate whether steering increases hallucination risks, brittleness to adversarial prompts, or systematic biases; evaluate on safety benchmarks and out-of-distribution adversarial sets.
- Data contamination checks: Given high absolute accuracies, verify benchmark contamination for both base and thinking models and report training data overlap analyses.
- Scaling laws for “when vs how”: The central claim implies different scaling for capabilities (how) vs orchestration (when). Track these across pretraining snapshots, SFT, and RLVR checkpoints to test acquisition timing empirically.
- Open-source practicality: The approach assumes access to internal activations and repeated passes. Assess feasibility for closed models, quantized deployments, streaming/online decoding, and memory-constrained settings.
- Mechanism composition: Do multiple mechanisms compose linearly or interfere? Study multi-vector steering, additive vs sequential application, and non-linear interactions across categories.
- Cross-lingual and multimodal extensions: Test whether the same mechanism taxonomy and steering approach holds for non-English inputs and for modalities like code, images, or tool-use.
- Interpretability-grounded unit tests: Curate a public suite of “mechanism requisite” tasks (e.g., forced backtracking, required verification) to benchmark mechanism-specific steering effectiveness independently of overall accuracy.
- Fairness against compact distillation: Compare the hybrid to small learned controllers that map base activations to gate signals (few million parameters), isolating the advantage of activation steering versus light parameter updates.
- Theoretical underpinnings: Provide a formal account linking linear directions in activation space to causal mechanism execution (beyond the linear representation hypothesis), and conditions under which such directions should exist and transfer across tasks.
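Several items above ask whether base-model-only signals could replace the thinking-model oracle for timing. A minimal, hypothetical sketch of one such signal, an entropy-triggered gate; both the signal and the threshold are assumptions, not the paper's method:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_gate(base_model, token_ids, threshold=2.0):
    """Trigger steering only when the base model looks uncertain.

    Returns True if the next-token entropy (in nats) exceeds `threshold`.
    The paper instead times steering with a thinking-model-based classifier;
    this gate is an illustrative base-model-only alternative."""
    logits = base_model(token_ids).logits[:, -1, :]
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    return bool(entropy.item() > threshold)
```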
Practical Applications
Below we distill actionable, real-world applications arising from the paper’s findings and methods. We group them by deployment horizon and note sector fit, potential tools/workflows, and feasibility assumptions that could affect rollout.
Immediate Applications
- Sector: Software/AI Infrastructure — Reasoning Scheduler for Cost-Effective Inference
- Description: Deploy a “hybrid” decoding layer that monitors base-model activations and selectively injects steering vectors to activate verification, backtracking, subgoal-setting, etc. Gains approach reasoning-model performance while steering only ~6–21% of tokens and without any weight updates.
- Tools/Workflows: “Reasoning Scheduler SDK” (activation hooks + vector bank + SAE classifier), “Perplexity gate” to prevent out-of-distribution steps (see the sketch after this list), CI pipeline A/B tests on GSM8K/MATH-style tasks.
- Value: Reduce inference cost/latency compared to full chain-of-thought or thinking-model generation; enable on-demand deeper reasoning only when needed.
- Assumptions/Dependencies: Access to model internals for activation injection (open-weight or vendor-supported hooks); availability of an SAE taxonomy and trained steering vectors aligned to the target domain.
- Sector: Education — Tutor Systems with Selective, Teachable Reasoning
- Description: Instrument base models to trigger “explain-when-needed” reasoning (e.g., verifying intermediate steps, decomposition) for math/homework help. Surface reasoning-category tags to teach metacognitive strategies (e.g., “Now I’m verifying,” “Setting a subgoal”).
- Tools/Workflows: Student-facing “reasoning overlay,” teacher dashboards summarizing which mechanisms were used per problem.
- Value: More faithful, structured feedback for students; improved transparency and learning of problem-solving habits.
- Assumptions/Dependencies: Generalization beyond math may require domain-specific taxonomies; human review for pedagogy and safety.
- Sector: Software Engineering — Code Assistants with Verification/Backtracking Hooks
- Description: Trigger verification and test-generation behaviors only on risky edits, ambiguous requirements, or high-complexity code regions. Steer toward structured planning and rollback mechanisms rather than free-form speculation.
- Tools/Workflows: “Risk-aware steering” tied to static analysis or change-size signals; integration with unit-test generation.
- Value: Fewer erroneous changes, higher trust; cost savings by thinking selectively rather than always running expensive agents.
- Assumptions/Dependencies: Calibrated triggers (e.g., complexity heuristics); robust vectors for coding domains; careful evaluation to avoid over-steering.
- Sector: Customer Support/Enterprise Chat — Adaptive Compute Budgeting
- Description: Use SAE-based detectors to apply deeper reasoning (decomposition, lookup-planning, double-checking) only for complex tickets; avoid chain-of-thought verbosity on simple FAQs.
- Tools/Workflows: “Compute governor” that adjusts the steering window and strength; SLAs mapping issue types to reasoning budgets.
- Value: Reduced average handling time and compute cost with maintained quality on hard cases.
- Assumptions/Dependencies: Accurate complexity detection; domain-tuned taxonomies; routing for compliance-sensitive conversations.
- Sector: Compliance/Policy/Regulatory Audits — Reasoning Audit Trails
- Description: Log which reasoning mechanisms were activated and when, to provide structured transparency for regulated decisions (e.g., finance summaries, procurement notes).
- Tools/Workflows: “Reasoning profiler” that tags sentence-level behavior and exports an audit record; dashboards with mechanism-frequency and timing.
- Value: Traceability without exposing raw chain-of-thought; supports internal QA and external audits.
- Assumptions/Dependencies: Organizational acceptance of activation-level logging; consistent mapping from mechanisms to understandable labels.
- Sector: Safety & Reliability — Hallucination Mitigation via Verification Steering
- Description: Proactively increase activation of verification/cross-check mechanisms for claims, numbers, and citations; de-activate rumination for time/budget constraints.
- Tools/Workflows: Rule-based triggers (e.g., presence of numbers/URLs) for selective verification; thresholds for stopping repetitive reasoning.
- Value: Better factual reliability with minimal extra compute.
- Assumptions/Dependencies: Robust detection of “claims needing verification”; domain coverage of steering vectors.
- Sector: Research/Academia — Taxonomy-Driven Interpretability and Benchmarking
- Description: Use the unsupervised SAE pipeline to derive human-interpretable reasoning taxonomies; evaluate with the paper’s completeness/consistency/independence metrics; compare models on “when-to-think” skill rather than only accuracy.
- Tools/Workflows: “SAE Taxonomy Builder,” standard prompts for cluster naming and evaluation, new leaderboards tracking gap recovery (hybrid vs. base vs. thinking).
- Value: Sharper diagnostics for reasoning quality and training effects (pretraining vs. post-training).
- Assumptions/Dependencies: LLM-as-a-judge metrics align with human judgment; reproducible layer/k settings across models and languages.
- Sector: Edge/On-Device Assistants — Battery- and Privacy-Aware Reasoning
- Description: Run base models on-device with lightweight “when-to-think” control to invoke costly reasoning sparingly; keep data local while retaining strong performance on hard tasks.
- Tools/Workflows: Precomputed steering vectors; cached SAE classifier; adaptive steering window to respect power constraints.
- Value: Better UX on mobile/embedded with constrained compute.
- Assumptions/Dependencies: Efficient activation access on device; memory footprint for vector banks; thermal/power budgets.
- Sector: Data/Training Operations — More Efficient Post-Training
- Description: Replace broad SFT/RL passes with policy learning for “when to activate” existing mechanisms; distill timing from a teacher thinking model into a compact classifier or small policy head.
- Tools/Workflows: “When-to-think” distillation recipes; verifier-driven signals aligned to mechanism activation rather than outcome-only rewards.
- Value: Lower training cost and faster iteration for reasoning-capable models.
- Assumptions/Dependencies: Availability of teacher traces; reliable mapping from timing signals to activation control at inference.
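As referenced in the Reasoning Scheduler item above, a “perplexity gate” can reject steered continuations that the thinking model finds implausible. A minimal sketch, assuming Hugging Face-style causal LMs that accept `labels`; the threshold and comparison rule are illustrative, not the paper's exact selection procedure.

```python
import torch

@torch.no_grad()
def thinking_perplexity(thinking_model, token_ids):
    """Perplexity of a token sequence under the thinking model
    (assumes a Hugging Face causal LM that accepts labels)."""
    out = thinking_model(token_ids, labels=token_ids)
    return torch.exp(out.loss).item()

def perplexity_gate(thinking_model, steered_ids, unsteered_ids, max_ppl=20.0):
    """Keep a steered continuation only if the thinking model does not find
    it clearly less plausible than the unsteered one. max_ppl is an
    illustrative threshold, not a value from the paper."""
    if thinking_perplexity(thinking_model, steered_ids) <= min(
            max_ppl, thinking_perplexity(thinking_model, unsteered_ids)):
        return steered_ids
    return unsteered_ids
```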
Long-Term Applications
- Sector: Cross-Vendor AI Platforms — Standardized Reasoning Control Plane
- Description: An industry standard API exposing “reasoning controls” (activation schedules, mechanism toggles, budgets) across model providers, akin to a hardware abstraction layer for cognition.
- Tools/Workflows: “Reasoning Control API,” portable vector schemas, capability discovery probes.
- Value: Interoperability, composability, and governance over inference-time compute across heterogeneous models.
- Assumptions/Dependencies: Vendor buy-in; safe exposure of activation-level controls; sandboxed interfaces to prevent misuse.
- Sector: Regulation & Governance — Mechanism-Level Assurance and Oversight
- Description: Require or certify “reasoning audit logs” (mechanism activation timelines) for high-stakes decisions; third-party tools to verify that verification/backtracking were applied appropriately.
- Tools/Workflows: Audit frameworks; red-team protocols tied to mechanism coverage; regulatory reporting standards.
- Value: Measurable, testable procedural safeguards beyond outcome metrics.
- Assumptions/Dependencies: Policy consensus that activation logs are meaningful and not privacy-invasive; standards for acceptable mechanism use.
- Sector: Hardware/Systems — Compute-Gated Architectures for Selective Reasoning
- Description: Accelerator support for fast injection of sparse steering vectors and dynamic layer gating; scheduler kernels that apply vectors within token windows.
- Tools/Workflows: “Activation injection primitives” in inference runtimes; co-designed memory layouts for vector banks.
- Value: Lower latency/energy for hybrid reasoning; better QoS under bursty workloads.
- Assumptions/Dependencies: Ecosystem support (compilers, drivers); demand for fine-grained compute governance.
- Sector: Personalized AI — User-Tunable Cognitive Styles
- Description: Profiles that bias toward concise answers vs. thorough verification; adjustable tolerance for backtracking or exploration; per-user “when-to-think” policies.
- Tools/Workflows: Preference learning mapping user goals to mechanism schedules; UX controls for cognition budgets.
- Value: Better alignment with user intent and context (e.g., quick chat vs. deep analysis).
- Assumptions/Dependencies: Safe personalization without enabling evasive or manipulative behaviors; robust defaults.
- Sector: Safety & Alignment — Externalized Reasoning Oversight
- Description: Supervisory systems that watch mechanism activations for warning patterns (e.g., deceptive planning), and intervene at the activation level to suppress or reroute.
- Tools/Workflows: “Cognitive guardrails” tied to sentinel detectors; circuit-level red-teaming and automated rollback.
- Value: Reduced risk from emergent undesirable behaviors; intervention without full retraining.
- Assumptions/Dependencies: Reliable detection of problematic patterns; low false positives; defense against adaptive circumvention.
- Sector: Multi-Agent/Tool Use — Orchestrated “When-to-Think” Across Agents/Tools
- Description: Coordinators allocate deep-reasoning turns to specific agents or trigger external tools only when mechanism detectors flag need (e.g., verification triggers a retrieval or solver).
- Tools/Workflows: “Cognition orchestrator” that binds reasoning mechanisms to tool/action calls; budget-aware scheduling.
- Value: Better throughput and solution quality in complex pipelines (RAG, planning, codegen).
- Assumptions/Dependencies: Stable mappings from mechanisms to tool affordances; latency/availability of tools.
- Sector: Healthcare/Legal/Finance (High-Stakes) — Mechanism-Guided Decision Support
- Description: Embed mandated verification/backtracking steps before recommendations; produce mechanism-level summaries for clinical or legal review.
- Tools/Workflows: Institutional policies encoded as mechanism schedules; dashboards for reviewers showing which checks ran.
- Value: Trustworthy assistance with clear procedural trace; potential to meet audit/compliance needs.
- Assumptions/Dependencies: Domain-specific validation, liability frameworks, and rigorous human oversight; integration with secure data systems.
- Sector: Training Paradigms — “Pretrain What, Post-Train When”
- Description: Curriculum and RLVR designs that explicitly separate capability acquisition (pretraining) from scheduling/orchestration (post-training), potentially reducing RL cost and data needs.
- Tools/Workflows: Small “policy head” that governs activation timing; self-play/self-critique that rewards proper sequence of mechanisms.
- Value: More scalable and interpretable training for reasoning models.
- Assumptions/Dependencies: Stable interfaces between base representations and policy; transfer across tasks and languages.
- Sector: Public Transparency — Consumer-Facing Reasoning Labels
- Description: UIs that display short, non-invasive badges like “Verified,” “Backtracked,” “Decomposed” instead of exposing raw chain-of-thought.
- Tools/Workflows: Lightweight labeling from activation logs; opt-in user controls.
- Value: Increases user trust and comprehension without privacy or IP leakage from full CoT.
- Assumptions/Dependencies: Clear, non-misleading mappings from activations to labels; user comprehension research.
- Sector: Cross-Domain Generalization — Domain-Specific Taxonomies
- Description: Build and maintain SAE-derived taxonomies for law, medicine, finance, engineering; share reusable “vector banks” per domain.
- Tools/Workflows: Domain-tailored SAE training pipelines; versioned vector registries; governance around sensitive capabilities.
- Value: Port the hybrid method beyond math and coding to complex, real-world tasks.
- Assumptions/Dependencies: Availability of high-quality domain traces; careful safety review; legal/IP constraints.
Notes on Feasibility and Risks Across Applications
- The method assumes the base model’s latent mechanisms are steerable and that activation directions are causal; smaller/weaker models may have noisier directions, yielding smaller gains.
- Access to internal activations is required; closed models may need vendor APIs or on-prem open-weight alternatives.
- SAE/LLM-as-a-judge evaluations should be validated with human studies to avoid taxonomy drift or mislabeling.
- Steering can introduce distribution shift if over-applied; perplexity gating and selective windows mitigate this but require tuning.
- Safety: Activation-level control could be misused to elicit undesired capabilities; guardrails and policy constraints must accompany deployment.
These applications leverage the paper’s central insight: pretraining largely learns the “how” of reasoning, while post-training (and inference-time control) can focus on “when” to deploy those mechanisms. This reframing enables immediate cost/performance wins via hybrid steering and sets a path for longer-term standards, governance, and system co-design around controllable reasoning.
Glossary
- Ablation study: An experimental technique that removes or isolates components of a system to assess their contribution to performance. "To assess the hybrid model's components, we ablate three factors: the specificity of the learned steering vectors, the timing of their application, and the contribution of the bias vector."
- Activation space: The high-dimensional vector space formed by a model’s internal activations, where directions can represent concepts or behaviors. "steering vectors: directions in activation space that, when added to intermediate activations, induce target behaviors"
- Bias vector: A learned direction added during generation to capture general rollout-style similarities (e.g., tone or format) across examples. "In addition to category-specific steering vectors, we train a general bias vector using a randomly sampled set of thinking rollouts as the target completion."
- Chain-of-thought prompting: A prompting strategy that encourages the model to generate step-by-step reasoning before an answer. "For both the base-only and hybrid models, we use the same chain-of-thought prompting format (see \cref{app:hybrid-details}), so improvements cannot be attributed to prompting differences."
- Decoder space: The representational subspace used by an autoencoder’s decoder; constraining it can force learning of core features. "Using a restricted decoder space, we force the SAE to learn the subspace components that best explain the variance of our sentence activations"
- Distillation: A training method where a smaller or simpler model is trained to mimic a stronger teacher model. "including models trained with distillation (DeepSeek-R1-Distill series)"
- Elbow (in model selection): A point where increasing the number of clusters/features yields diminishing gains, guiding a practical choice of model size. "we find ``elbow'' scores at cluster sizes between $10$ and $20$, suggesting that reasoning mechanisms are reasonably well represented using $\mathbf{10}$ to $\mathbf{20}$ categories"
- Gap recovery: The fraction of performance improvement achieved by a method relative to the gap between a baseline and a stronger model. "The best gap recovery $(\mathrm{Acc}_{\text{hybrid}} - \mathrm{Acc}_{\text{base}})/(\mathrm{Acc}_{\text{thinking}} - \mathrm{Acc}_{\text{base}})$ achieved by the hybrid model is $82\%$ on GSM8K (Qwen2.5-32B with DeepSeek-R1-Distill) and an impressive $91\%$ on MATH500 (Qwen2.5-32B with QwQ-32B)."
- Grid search: A systematic hyperparameter sweep across predefined settings to find an optimal configuration. "We performed an extensive grid search across these five models, using $6$ distributed layers and cluster sizes (ranging from $5$ to $50$ categories with increments of $5$) to identify the optimal taxonomy configuration."
- Hybrid model: A system that uses a base model for generation while selectively applying steering guided by another model to induce desired behaviors. "Our hybrid model combines the reasoning skills of the base model with the capacity to selectively apply steering vectors at appropriate points in the generation process."
- Inference-time compute: Additional computation performed during generation to improve reasoning quality, independent of training updates. "Thinking models, also known as reasoning models, or models using inference-time compute, are a type of LLM designed to generate long chains of reasoning before arriving at a final answer."
- Latent dimension: The size of the internal feature vector in an autoencoder; controls the number of learned features/mechanisms. "we deliberately restrict the latent dimension to be in the range "
- Linear representation hypothesis: The idea that concepts/behaviors are encoded as linear directions in neural activation space. "This leverages the linear representation hypothesis, which posits that certain concepts and behaviors in neural networks are represented as directions in activation space."
- MATH500: A benchmark of 500 competition-level math problems used to evaluate reasoning capabilities. "We evaluate performance on two mathematical reasoning benchmarks of increasing difficulty: GSM8K \citep{gsm8k} for grade-school math problems and MATH500 \citep{math} for competition-level mathematics."
- Min-max normalization: A scaling method that maps values to a fixed range (often [0,1]) to compare metrics across settings. "For comparison across configurations, we apply min-max normalization within each model."
- MMLU-Pro: A robust, challenging multitask dataset used to elicit and analyze reasoning traces. "We train our Top-K Sparse Autoencoders (SAEs) on sentence-level activations extracted from reasoning traces generated on prompts from MMLU-Pro \citep{wang2024mmluprorobustchallengingmultitask}"
- Perplexity: A measure of how well a probability model predicts a sample; lower perplexity indicates better fit. "select the steered token with the lowest perplexity according to the thinking model"
- Reinforcement Learning with Verifiable Rewards (RLVR): A training method that uses stepwise signals from automated verifiers to shape intermediate reasoning. "Similarly, QwQ-32B is a LLM trained with RLVR, which optimizes the model with stepwise signals from automated verifiers rather than outcome-only rewards, explicitly shaping intermediate reasoning."
- Sentence-level activations: Aggregated or averaged model activations over a sentence, used to analyze higher-level reasoning steps. "We train our Top-K Sparse Autoencoders (SAEs) on sentence-level activations extracted from reasoning traces"
- Sparse Autoencoder (SAE): An autoencoder that enforces sparsity in its latent representation to learn interpretable features. "Sparse Autoencoders (SAEs) \citep{sparseAutoEncoders, efficientSparseCoding} have gained widespread popularity in recent years due to their ability to decompose LLM activations into interpretable features"
- Steering vector: A learned direction added to model activations to causally induce a specific behavior or mechanism. "We control the base model with steering vectors: directions in activation space that, when added to intermediate activations, induce target behaviors"
- Thoughtology: A proposed taxonomy and analysis framework for reasoning building blocks in thinking models. "\citet{marjanović2025deepseekr1thoughtologyletsthink} introduce a ``thoughtology'' of DeepSeek-R1, analyzing reasoning building blocks across chain length and cognitive style"
- Top-K sparsity: A constraint that keeps only the K largest components active in a latent representation to enforce sparsity. "the parameter $k$ in top-$k$ sparsity constrains how many reasoning mechanisms can be simultaneously active in a single sentence."
- Top-K Sparse Autoencoder (Top-K SAE): An SAE variant that enforces sparsity by retaining only the K highest-magnitude latent features. "Top-K SAEs \citep{kSAEs, gao2024scalingSAEs} are a variant that enforces sparsity by keeping only the largest magnitude components of the latent representation, creating a more interpretable and computationally efficient decomposition."