Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Abstract: Despite the growing reasoning capabilities of recent LLMs, their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
Explain it Like I'm 14
What this paper is about (the big idea)
This paper is about peeking inside LLMs to understand how they “think” when solving problems step by step. Instead of guessing what’s happening from the words they write, the authors look directly at the model’s internal signals to find hidden “dials” that control reasoning behaviors like checking work (reflection) or changing plans (backtracking). Even better, they show we can turn those dials up or down to steer how the model reasons—without retraining it.
What questions did the researchers ask?
- Can we discover different reasoning behaviors inside an LLM automatically, without hand-labeling examples like “this is reflection” or “this is backtracking”?
- Do these behaviors live in neat, separate places inside the model’s internal space?
- If we find the right “directions” (reasoning vectors) inside the model, can we push the model to reflect more, reflect less, backtrack more, or backtrack less?
- Are there other useful behaviors we can uncover—like confidence—that aren’t easy to define by looking at words alone?
- Do these discoveries work across different kinds of tasks, not just math?
How did they study it? (simple explanation of the method)
Think of an LLM as having a control room full of sliders and switches (its internal activations). When the model solves a problem step by step, those sliders move in certain patterns.
Here’s what the authors did (minimal code sketches for the main steps follow this list):
- Break reasoning into steps
- They asked a reasoning model to solve problems and wrote down each step of its chain-of-thought (sentences separated by a special marker, "\n\n").
- For each step, they captured the model’s internal state right at that separator—like taking a snapshot of what the model is “thinking” between steps.
- Learn the “parts” of reasoning with a Sparse Autoencoder (SAE)
- An SAE is a special tool that compresses and reconstructs data while using only a few “parts” at a time.
- Imagine you want to build many different LEGO models using the smallest possible set of reusable pieces. The SAE learns those pieces.
- In the model’s control room, each learned “piece” is a direction you can push the sliders—a reasoning vector. Each vector tends to represent a distinct behavior.
- See if the pieces match real behaviors
- They looked at which vectors activate when the model is reflecting or backtracking (using a separate judging model to label steps).
- They visualized the vectors on a 2D map to see if similar behaviors cluster together (they did!).
- Try steering the model
- They edited the model’s internal state by adding or removing specific vectors—like turning a dial up or down.
- For example, subtracting the “reflection” vector reduced reflection; adding it increased reflection. This changed how the model reasoned, but often kept the final answer the same.
- Discover new behaviors (like confidence)
- Instead of labeling “confidence” by words, they searched for vectors that make the model’s predictions more certain (lower entropy).
- They found concentrated “confidence” directions that, when applied, made the model write fewer “Wait” or “Alternatively” phrases and more direct calculations.
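To make the pipeline concrete, here is a minimal sketch of the first step, assuming a Hugging Face causal LM: re-run a question together with its chain-of-thought and read the residual-stream activation at each "\n\n" step delimiter. The model name, layer index, and delimiter-matching logic are illustrative placeholders, not the paper's exact setup.

```python
# Minimal sketch: collect residual-stream states at "\n\n" step boundaries.
# MODEL_NAME and LAYER are placeholder assumptions, not the paper's exact choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"   # placeholder reasoning model
LAYER = 20                                   # assumed mid-to-late layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def step_activations(question: str, cot_response: str) -> torch.Tensor:
    """Return one residual-stream vector per reasoning step (read at each delimiter)."""
    enc = tok(question + cot_response, return_tensors="pt")
    out = model(**enc, output_hidden_states=True)
    hidden = out.hidden_states[LAYER][0]            # (seq_len, d_model)

    # Positions of tokens whose decoded text contains the "\n\n" step delimiter.
    ids = enc["input_ids"][0].tolist()
    positions = [i for i, t in enumerate(ids) if "\n\n" in tok.decode([t])]
    return hidden[positions]                        # (num_steps, d_model)
```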
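The second sketch is a bare-bones sparse autoencoder over those step activations: a ReLU encoder, a linear decoder, and an L1 sparsity term, trained with Adam and cosine annealing as described in the paper. The dictionary size of 2048 matches the reported decoder size; the sparsity coefficient, epoch count, and learning rate are placeholders.

```python
# Minimal sparse autoencoder sketch (ReLU encoder, L1 sparsity penalty).
# Hyperparameters other than the dictionary size are illustrative.
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    def __init__(self, d_model: int, dict_size: int = 2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model, bias=False)

    def forward(self, h):
        z = torch.relu(self.encoder(h))   # sparse latent code
        return self.decoder(z), z         # reconstruction, code

def train_sae(acts: torch.Tensor, l1_coeff: float = 1e-3, epochs: int = 200, lr: float = 1e-3):
    """acts: (num_steps, d_model) activations collected at step delimiters."""
    sae = SparseAutoEncoder(acts.shape[1])
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        recon, z = sae(acts)
        loss = ((recon - acts) ** 2).mean() + l1_coeff * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    # Each decoder column (one per latent unit) is a candidate reasoning vector.
    reasoning_vectors = sae.decoder.weight.detach().T   # (dict_size, d_model)
    return sae, reasoning_vectors
```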
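Finally, steering can be sketched as a forward hook that adds a scaled decoder column to the residual stream during generation. Injecting at the last position of every forward pass is a simplification of the paper's injection at the last token of each reasoning step, and the `model.model.layers[layer]` path assumes a Llama/Qwen-style Hugging Face module layout.

```python
# Sketch of activation steering: add alpha * vector to the residual stream at one layer.
import torch

def make_steering_hook(vector: torch.Tensor, alpha: float):
    """Forward hook adding alpha * vector to the last position of the layer output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] = hidden[:, -1, :] + alpha * vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook

def generate_with_steering(model, tok, prompt: str, vector: torch.Tensor,
                           alpha: float = 4.0, layer: int = 20, max_new_tokens: int = 512):
    """Generate with one behavior vector injected at `layer` (Llama/Qwen-style modules assumed)."""
    handle = model.model.layers[layer].register_forward_hook(make_steering_hook(vector, alpha))
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```

A negative `alpha` applies the same vector in the opposite direction, which is how a behavior such as reflection can be suppressed rather than amplified.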
What did they find, and why is it important?
Here are the main findings:
- Reasoning lives in directions
- The SAE discovers clean, separate directions in the model’s internal space that line up with understandable behaviors like reflection and backtracking.
- You can steer reasoning without retraining
- By adding or removing a behavior’s vector during inference, the model shows more or less of that behavior. For instance:
- Turning down reflection led to fewer self-checking steps and shorter solutions.
- Turning it up led to more self-checking and longer solutions.
- The final answers often stayed correct, suggesting the steering changes reasoning style rather than accuracy.
- Behaviors cluster and show structure
- On a 2D map, vectors for similar behaviors cluster together (see the clustering sketch after this list).
- Mid-to-late layers of the model show the clearest separation between behaviors.
- The model’s internal space also organizes by response length—some vectors align with longer or shorter reasoning.
- New behavior: confidence
- They found “confidence vectors” by optimizing for lower uncertainty (see the scoring sketch after this list).
- Applying these vectors reduced reflection/backtracking words and increased precise math steps.
- This worked across different tasks (math, science, logic), not just the training set.
- Practical gains
- By combining the top confidence vectors and tuning them per question, they improved accuracy and used fewer tokens compared to some existing steering methods.
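A minimal sketch of the clustering analysis mentioned above, assuming the `umap-learn` and `scikit-learn` packages and behavior labels (reflection/backtracking/other) produced by an external judge: embed the SAE decoder columns with UMAP under a cosine metric and quantify separation with a Silhouette score.

```python
# Sketch: embed SAE decoder columns with UMAP (cosine metric) and score how well
# judge-assigned behavior labels separate. Requires umap-learn and scikit-learn.
import numpy as np
import umap
from sklearn.metrics import silhouette_score

def cluster_report(decoder_columns: np.ndarray, labels: np.ndarray):
    """decoder_columns: (dict_size, d_model); labels: (dict_size,) integer behavior ids."""
    embedding = umap.UMAP(n_components=2, metric="cosine", random_state=0).fit_transform(decoder_columns)
    separability = silhouette_score(decoder_columns, labels, metric="cosine")
    return embedding, separability   # scatter-plot `embedding` colored by `labels`
```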
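And a sketch of how confidence-related vectors could be mined without word-level labels: score each decoder column by how much injecting it lowers next-token entropy, then keep the top scorers. It reuses the `make_steering_hook` helper from the steering sketch; the probe prompts, steering strength, and layer are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch: rank SAE decoder columns by how much injecting them lowers next-token entropy.
import torch
import torch.nn.functional as F

@torch.no_grad()
def next_token_entropy(model, tok, prompt: str, vector=None, alpha: float = 4.0, layer: int = 20):
    """Entropy of the next-token distribution, optionally with a vector injected at `layer`."""
    handle = None
    if vector is not None:
        handle = model.model.layers[layer].register_forward_hook(make_steering_hook(vector, alpha))
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        logits = model(ids).logits[0, -1]
        p = F.softmax(logits, dim=-1)
        return float(-(p * torch.log(p + 1e-12)).sum())   # H(p) = -sum_k p_k log p_k
    finally:
        if handle is not None:
            handle.remove()

def rank_confidence_vectors(model, tok, prompts, decoder_columns, top_k: int = 10):
    """Rank candidate vectors by average entropy reduction over a set of probe prompts."""
    base = [next_token_entropy(model, tok, p) for p in prompts]
    scored = []
    for j, v in enumerate(decoder_columns):                # rows of (dict_size, d_model)
        drops = [b - next_token_entropy(model, tok, p, vector=v) for p, b in zip(prompts, base)]
        scored.append((sum(drops) / len(drops), j))
    return sorted(scored, reverse=True)[:top_k]            # largest mean entropy drop first
```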
Why it matters:
- It gives a clearer, testable way to understand how LLMs reason.
- It allows fine-grained control at test time (no extra training), useful for making models faster (less overthinking) or more careful (more checking) depending on the situation.
- It helps us move beyond hand-picked labels to discover new, useful behaviors automatically.
What could this change in the future?
- Smarter, adjustable reasoning: Apps could dial up “careful checking” for safety-critical tasks, or dial it down for speed when the stakes are lower.
- Better transparency: Knowing which internal directions map to which behaviors builds trust and helps diagnose mistakes.
- Efficiency and cost savings: Steering away from unnecessary steps can cut token usage without hurting accuracy.
- New discoveries: The same unsupervised approach could reveal other helpful behaviors we haven’t named yet (like planning quality, doubt handling, or consistency), and let us steer them too.
Key terms in everyday language
- Activation/hidden state: The model’s “brain activity” while thinking.
- Vector/direction: A particular way to nudge that activity—like moving a slider in a specific direction.
- Sparse Autoencoder (SAE): A tool that learns a small set of reusable “thinking pieces,” using only a few at a time so each piece is easier to understand.
- Reflection: The model checks and revisits earlier steps.
- Backtracking: The model abandons a path and tries a new approach.
- Entropy (confidence): Low entropy = high confidence; high entropy = uncertainty.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what remains missing, uncertain, or unexplored in the paper, framed to be actionable for future research.
- Step segmentation robustness: The method assumes sentence-level steps detected via the delimiter token <\n\n>. It is unclear how sensitive discovery and steering are to segmentation choices, tokenization, formatting styles, multilingual text, or models that do not emit such delimiters. Evaluate alternative, model-agnostic step boundary detectors and their impact on SAE features.
- Online applicability: The approach extracts step representations by re-running inference with the full question+response to read delimiter activations. Specify and validate an online/streaming procedure for detecting steps and performing interventions without re-feeding completed outputs, including latency and engineering overhead.
- Layer selection and multi-layer modeling: SAEs are trained on a single chosen layer (often late). Systematically assess which layers best capture behavior, whether combining multi-layer SAEs yields better disentanglement/controllability, and how layer choice affects causal interventions.
- Intervention placement and mechanics: Interventions are injected at the last token of each step into the residual stream. Test different injection sites (pre/post layer norm, attention vs MLP blocks, start vs end of steps, all tokens vs last), quantify stability and off-target effects, and characterize impacts on layer norms and activation scaling.
- SAE hyperparameter sensitivity: The decoder size (D=2048), sparsity strength, optimizer, and ReLU gating are chosen heuristically. Perform sensitivity analyses to quantify effects on reconstruction, sparsity, disentanglement, and controllability; establish principled criteria for hyperparameter selection.
- Theoretical assumptions vs. practice: The identifiability theorem assumes incoherence, strict sparsity, ReLU, and separated coefficients, and uses an idealized sparsity penalty. Verify how closely real hidden states meet these assumptions, clarify the practical sparsity surrogate used in training (e.g., top-k or an L1 relaxation), and empirically test identifiability/stability across seeds and datasets.
- Comparative baselines: The paper does not compare SAEs against other unsupervised methods (e.g., ICA, NMF, PCA, subspace identification). Benchmark these alternatives for behavior disentanglement, causal steerability, and interpretability to confirm SAE necessity.
- Behavior labeling validity: Reflection/backtracking labels rely on an LLM-as-a-judge (GPT-5). Quantify label reliability (e.g., with human annotation, inter-annotator agreement, multi-judge consensus), measure labeling bias, and assess how label noise affects conclusions about decoder column semantics.
- Causal validation beyond token-level cues: Current causal evidence uses step counts and token-frequency shifts (“Wait”, “Alternatively”). Develop stronger causal tests (e.g., randomized intervention schedules, placebo controls, causal mediation analyses) and outcome metrics beyond style (accuracy, calibration, self-consistency, error correction).
- Generalization breadth and depth: SAEs trained on MATH500 were applied to GPQA-Diamond/KnowLogic. Expand evaluation to diverse domains (coding, planning, dialogue, multilingual), more task families, and larger model sizes; report accuracy, efficiency, and calibration under domain shift with statistical significance.
- Confidence vector risks: Minimizing entropy may induce overconfidence or suppress useful metacognition. Measure calibration (ECE, NLL), error rates on hard cases, hallucination risk, and the trade-off between confidence and correction, including tasks where reflection/backtracking is beneficial.
- Distinguishing epistemic confidence from style: Token-level shifts (numbers vs “Wait/Alternatively”) may reflect stylistic changes rather than true confidence. Design probes or behavioral tests that separate epistemic certainty from linguistic style and quantify the relationship to correctness.
- Length geometry and causal control: Length emerges as a structural axis, but causal control of length was not demonstrated. Identify length vectors and test whether steering can compress/extend reasoning without harming accuracy; characterize the trade-offs between length, correctness, and efficiency.
- Column purity and disentanglement: The method filters columns active across multiple behaviors and averages behavior-specific columns. Define and measure column “purity” (e.g., orthogonality, selectivity, mutual information with labels), and develop algorithms to deconfound overlapping features.
- Multi-objective steering: Steering single behaviors can affect others (e.g., confidence overlaps with reflection/backtracking). Study multi-objective control (e.g., Pareto trade-offs), interference between vectors, and methods for constrained or compositional steering.
- Stability of discovered dictionaries: Assess how SAE columns vary across random seeds, training subsets, and annotation protocols; quantify stability and reproducibility by aligning columns across runs and measuring consistency of behavior clusters.
- Scaling and compute: Training SAEs at scale (larger models, more layers, bigger datasets) may be memory-intensive. Characterize computational costs, propose efficient/streaming training, and evaluate performance-compute trade-offs.
- Safety and ethics of steering metacognition: Suppressing reflection/backtracking may reduce self-correction and increase harm. Evaluate safety impacts on adversarial, sensitive, or high-stakes tasks; assess fairness and privacy implications of activation edits.
- Integration with training: Explore whether discovered vectors can guide fine-tuning/RL (e.g., regularizing toward or away from specific behaviors), improve reward modeling, or enable curriculum strategies that align structural and semantic features.
- Multimodal extension: The approach is text-only. Investigate whether SAE-based behavior discovery and steering generalize to multimodal reasoning (vision-language, speech), and how representations and step segmentation should be adapted.
- Metrics beyond UMAP/Silhouette: UMAP visualizations can be misleading. Introduce intrinsic geometry metrics (e.g., representational similarity analysis, linear probes, information-theoretic measures) to validate cluster separability and semantic alignment.
- Production constraints: Document how to deploy RISE in real systems (APIs, streaming inference, latency budgets), including failure modes, monitoring for off-target effects, and fallback strategies when steering degrades performance.
- Code/data release and reproducibility: The paper lacks public artifacts. Release code, SAE checkpoints, labeled steps, and evaluation scripts to enable replication; specify dataset splits, seeds, and exact prompts used for annotation and intervention.
Practical Applications
Below are practical applications that can be derived from the paper’s findings on RISE (Reasoning behavior Interpretability via Sparse auto-Encoder). Each item notes sector relevance, potential tools/workflows, and key assumptions or dependencies that affect feasibility.
Immediate Applications
These can be deployed now with access to model activations and modest engineering.
Cross-sector: Serving-time reasoning control
- Action: Wrap LLM inference with a RISE steering middleware to suppress or amplify specific reasoning behaviors (e.g., reflection, backtracking) and adjust response length and confidence, by injecting SAE-derived vectors at sentence-step delimiters.
- Sectors: Software, education, finance, customer support, research platforms.
- Tools/Workflows: “Meta-cognition knobs” (UI sliders for reflection/backtracking/length/confidence); per-prompt policies (e.g., concise mode, thorough mode); A/B testing harness for accuracy–token cost trade-offs.
- Assumptions/Dependencies: Access to hidden states; ability to hook into residual streams and sentence-step delimiters; per-model/layer SAE training; intervention strength tuning to avoid accuracy loss.
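A hypothetical sketch of such middleware: a per-request "reasoning profile" whose slider values set the steering strength for each behavior vector, all applied during a single generation. The behavior names, the reuse of the `make_steering_hook` helper from the earlier sketch, and the layer choice are assumptions for illustration only.

```python
# Hypothetical serving-time wrapper: per-behavior sliders mapped to steering strengths.
# Reuses the (assumed) make_steering_hook helper; behavior names are illustrative.
from dataclasses import dataclass

@dataclass
class ReasoningProfile:
    reflection: float = 0.0     # > 0 amplifies, < 0 suppresses
    backtracking: float = 0.0
    confidence: float = 0.0

def steered_generate(model, tok, prompt: str, profile: ReasoningProfile,
                     vectors: dict, layer: int = 20, max_new_tokens: int = 512):
    """vectors maps behavior name -> SAE decoder column (tensor of size d_model)."""
    handles = []
    for name, alpha in vars(profile).items():
        if alpha != 0.0 and name in vectors:
            handles.append(model.model.layers[layer].register_forward_hook(
                make_steering_hook(vectors[name], alpha)))
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        for h in handles:
            h.remove()
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```

For example, `ReasoningProfile(reflection=-4.0)` would act as a "concise mode" and `ReasoningProfile(reflection=4.0)` as a "thorough mode"; the strengths would need per-model tuning to avoid accuracy loss.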
Efficiency and token-budget governance
- Action: Reduce overthinking and verbose CoT via negative interventions on reflection/backtracking or by steering toward confidence vectors; enforce token budgets while maintaining accuracy.
- Sectors: Enterprise LLM deployment, cloud inference, mobile/on-device assistants.
- Tools/Workflows: Token Budget Governor; adaptive verbosity controller; automatic early-exit triggers guided by confidence-related vectors.
- Assumptions/Dependencies: Behavior vectors generalize across domains (supported by GPQA/KnowLogic results, but monitor accuracy); telemetry on accuracy/tokens; policy for graceful fallback when steering is counterproductive.
Reasoning analytics and interpretability dashboards
- Action: Provide UMAP-based visualizations of decoder columns, cluster labels (reflection/backtracking/other), length-aware axes, and per-prompt behavior activations for audit and debugging.
- Sectors: Model interpretability teams, academic labs, AI platform providers.
- Tools/Workflows: Reasoning Analytics Dashboard; Silhouette-based separability reports across layers; behavior frequency counters per task.
- Assumptions/Dependencies: Stable SAE geometry per layer; LLM-as-a-judge labeling for validation; access to step-level activations and logs.
Education: Adaptive tutoring styles
- Action: Adjust tutors to be more concise or more self-checking; enable “Try alternatives” mode (backtracking) for exploration vs. “Be decisive” mode (confidence) for mastery checks.
- Sectors: EdTech, institutional learning platforms.
- Tools/Workflows: Student-facing sliders (“Explore vs. Confirm,” “Verbose vs. Concise”); teacher dashboards showing model reflection/backtracking occurrences; metacognitive training modules that expose when and why the model reflects.
- Assumptions/Dependencies: Safety guardrails to prevent overconfident errors; domain-specific tuning of vectors for math vs. writing vs. logic.
Software engineering assistants
- Action: Modulate code assistant behaviors: suppress unnecessary rambling; amplify backtracking when searching alternatives; increase confidence for routine tasks.
- Sectors: Developer tools, CI/CD systems.
- Tools/Workflows: IDE plugins with steering controls; CI “decisiveness profiles” for routine refactors; bug triage modes that increase backtracking to explore more fixes.
- Assumptions/Dependencies: Reliable sentence-step delimitation in code reasoning; continuous monitoring for correctness regressions.
Finance and operations decision support
- Action: Calibrate confidence and reflection based on risk appetite: increase reflection for high-stakes analyses; reduce verbosity to meet SLAs for routine tasks.
- Sectors: Finance, operations, procurement.
- Tools/Workflows: “Risk-aware reasoning profiles” tied to task types; audit logs of reasoning behaviors; compliance-ready reports showing reduced overthinking or increased checks on critical items.
- Assumptions/Dependencies: Policy alignment; human-in-the-loop validation; clear thresholds for steering strength.
Healthcare and clinical documentation (non-diagnostic support)
- Action: Control reasoning verbosity in clinical documentation assistants; increase reflection when summarizing complex charts; reduce meta-cognition in routine notes to save time.
- Sectors: Healthcare IT.
- Tools/Workflows: EHR-integrated reasoning control; behavior logs for compliance audits; profile presets (routine vs. complex cases).
- Assumptions/Dependencies: Strict validation; medical safety constraints; avoid using steering alone for diagnostic decisions.
Research workflows
- Action: Use RISE to mine new behaviors and structure (e.g., length axis) for benchmarking; test causal effects on reasoning styles across tasks and models; curate datasets based on behavior profiles.
- Sectors: Academia, industrial research.
- Tools/Workflows: Behavior mining pipelines; cross-model latent catalogs; per-layer separability analyses.
- Assumptions/Dependencies: Per-model SAE training; robust cross-domain checks; reproducible labeling protocols.
Long-Term Applications
These require further research, scaling, or standardization before broad deployment.
Standardized “Reasoning Control API”
- Action: Define an industry-standard interface for behavior vectors (reflection/backtracking/length/confidence) and steering primitives (projection, addition, strength).
- Sectors: AI platform vendors, open-source ecosystems.
- Tools/Workflows: SDKs for model-agnostic steering; registry of behavior vectors per model/layer; configuration-as-code for reasoning profiles.
- Assumptions/Dependencies: Broad access to activations across providers; consistent step delimiters; governance for safe defaults.
Training-time integration and alignment
- Action: Use SAE-derived behavior vectors as auxiliary objectives or regularizers in fine-tuning/RL to shape models toward concise, confident, or reflective styles as desired.
- Sectors: Model training/finetuning providers.
- Tools/Workflows: Behavior-aware RLHF/RLAIF; curriculum design using discovered latent axes; distillation that transfers desirable vectors from reasoning models to base models.
- Assumptions/Dependencies: Stable identifiability of vectors during training; avoidance of harmful bias (e.g., excessive confidence); evaluation for robustness.
Adaptive compute orchestration and early exit
- Action: Couple confidence vectors with dynamic early-exit policies and per-step compute allocation (think-when-needed); allocate test-time compute based on entropy/behavior signals.
- Sectors: Cloud inference, embedded systems, cost-sensitive deployments.
- Tools/Workflows: Policy engines that gate further steps if confidence is sufficient; elastic CoT controllers; multi-pass strategies that expand steps only when signals indicate need.
- Assumptions/Dependencies: Reliable confidence–accuracy correlation per domain; safeguards against premature exits; latency-accuracy trade-off management.
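A minimal sketch of an entropy-gated controller under these assumptions: decode a fixed-size chunk of reasoning, check next-token entropy, and stop expanding once the model appears confident enough. The chunk size and threshold are placeholders that would need per-domain calibration, and the confidence-accuracy link must be validated as noted above.

```python
# Sketch of a "think-when-needed" controller: keep expanding the chain of thought
# only while next-token entropy stays above a (placeholder) confidence threshold.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_until_confident(model, tok, prompt: str, entropy_threshold: float = 1.0,
                             max_chunks: int = 16, chunk_tokens: int = 64):
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_chunks):
        ids = model.generate(ids, max_new_tokens=chunk_tokens, do_sample=False)
        logits = model(ids).logits[0, -1]               # extra pass to read confidence
        p = F.softmax(logits, dim=-1)
        entropy = float(-(p * torch.log(p + 1e-12)).sum())
        if entropy < entropy_threshold:                 # confident enough: stop expanding
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```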
Safety and governance monitoring
- Action: Continuous auditing of reasoning styles for signs of deceptive or brittle patterns; guardrails that reduce entropy when the model exhibits risky behavior (e.g., backtracking loops).
- Sectors: Policy, compliance, safety engineering.
- Tools/Workflows: Behavior anomaly detectors; compliance dashboards; intervention templates for high-risk prompts.
- Assumptions/Dependencies: Clear operational definitions of “risky” behaviors; approval workflows; evidence that steering mitigates risk without masking errors.
Multimodal and robotics extensions
- Action: Generalize behavior vectors to vision-LLMs and planners to control decisiveness, exploration (backtracking), and plan verbosity; reduce compute in embedded robotics.
- Sectors: Robotics, autonomous systems, industrial automation.
- Tools/Workflows: Planning style regulators; exploration–exploitation knobs; mission-time controllers that enforce concise reasoning under constraints.
- Assumptions/Dependencies: Multimodal adaptation of step delimitation; safe-task validation; hardware-aware latency controls.
Productization in enterprise platforms
- Action: Ship “Reasoning Editor” features across LLM platforms: sliders for reflection/backtracking/length/confidence; per-team profiles; analytics on behavior–outcome relationships.
- Sectors: SaaS AI suites, productivity platforms.
- Tools/Workflows: Organization-level presets; role-specific reasoning styles (analyst vs. support agent); outcome-linked dashboards.
- Assumptions/Dependencies: UX that conveys trade-offs; privacy-compliant logging; cross-task generalization of behavior vectors.
Cross-model latent behavior catalogs
- Action: Build shared catalogs of behavior vectors that transfer across models; map correspondences between layers and latent spaces to enable portability.
- Sectors: Open-source communities, benchmarking consortia.
- Tools/Workflows: Alignment pipelines; vector matching across models; public repositories with metadata (layer, domain, effect sizes).
- Assumptions/Dependencies: Sufficient structural similarity between models; robust mapping methods; licensing and IP considerations.
Sector-specific validated deployments (high stakes)
- Action: Deploy rigorously validated steering in healthcare diagnostics, legal analysis, and regulated finance to balance reflection, decisiveness, and verbosity.
- Sectors: Healthcare, legal, finance.
- Tools/Workflows: Clinical/forensic validation studies; fail-safe thresholds; human-in-the-loop workflows.
- Assumptions/Dependencies: Strong evidence for calibration and safety; regulatory approval; continuous monitoring and rollback mechanisms.
Overall assumptions and dependencies shared across applications:
- Access to internal activations and the ability to inject steering vectors during inference.
- Reliable sentence-step delimitation (the paper uses a delimiter token such as “<\n\n>”).
- Per-model and per-layer SAE training (mid-to-late layers often yield better separability; oversmoothing near final layers may reduce distinctness).
- Steering strength must be tuned to prevent accuracy degradation; behavior effects can be task- and domain-dependent.
- Ethical constraints: avoid amplifying unwarranted confidence; ensure human oversight in high-stakes contexts.
Glossary
- Activation patching: An activation-level editing technique that replaces or modifies internal activations to study or control model behavior. "Activation steering modifies model outputs by editing internal activations, with notable approaches including representation engineering~\citep{zou2310representation}, activation patching~\citep{meng2022locating}, and DiffMean~\citep{marks2023geometry}."
- Activation space: The vector space of internal model activations where directions can encode semantic or behavioral features. "discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors."
- Activation steering: Modifying a model’s output by directly editing its internal activations along targeted directions. "Activation Steering. Activation steering modifies model outputs by editing internal activations, with notable approaches including representation engineering~\citep{zou2310representation}, activation patching~\citep{meng2022locating}, and DiffMean~\citep{marks2023geometry}."
- Adam optimizer: A stochastic optimization method that uses adaptive moment estimates for efficient training. "We use the Adam optimizer~\citep{kingma2014adam} with cosine annealing learning rate decay."
- argmin: The argument (input) that minimizes a given function; commonly used in optimization objectives. "\argmin_{S} \; \mathbb{E}\!\left[- \sum_{k=1}^{|V|} p_k \log p_k \right]"
- Backtracking: A reasoning behavior where the model abandons a current path and attempts an alternative approach. "backtracking (i.e., the model abandons the current reasoning path and pursues an alternative solution)"
- Chain-of-Thought (CoT) prompting: A prompting strategy that elicits multi-step reasoning traces from LLMs. "Chain-of-Thought (CoT) prompting~\citep{wei2022chain}"
- Confidence vectors: Directions in activation space that reduce output entropy and steer the model toward more confident reasoning. "Generalization of confidence vectors across data domains under different steering directions."
- Cosine annealing learning rate decay: A schedule that reduces the learning rate following a cosine curve over training. "We use the Adam optimizer~\citep{kingma2014adam} with cosine annealing learning rate decay."
- Cosine similarity: A metric that measures directional similarity between vectors, often used for embedding comparisons. "We adopt UMAP for visualization because it leverages cosine similarity as the internal metric, emphasizing directional rather than Euclidean differences."
- Decoder column space: The subspace spanned by the columns of an autoencoder’s decoder matrix, interpreted as feature directions. "Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space."
- Decoder columns: Individual columns of the decoder matrix that act as atomic feature directions for reconstruction and steering. "We show that individual decoder columns correspond to interpretable reasoning behaviors (Figure~\ref{fig:method})."
- Decoder matrix: The linear transformation in an autoencoder that reconstructs inputs from latent codes; its columns represent learned features. "recovers a decoder matrix whose columns align with the true dictionary up to a permutation matrix and a scaling diagonal matrix"
- Difference-of-Means (DiffMean): A supervised method that computes a steering direction by subtracting the mean activation of one class from another. "From the perspective of mechanistic interpretability, many studies rely on activation engineering methods, particularly the Difference-of-Means (DiffMean) approach~\citep{marks2023geometry}."
- Dictionary (sparse coding): A set of basis vectors used to represent inputs sparsely via linear combinations. "is a dictionary of latent behavior directions"
- Entropy: A measure of uncertainty in the model’s output distribution; lower entropy indicates higher confidence. "we adopt entropy as our proxy objective"
- Incoherence: A dictionary property limiting pairwise correlations between atoms to enable sparse recovery and identifiability. "Assume: (i) Incoherence: […]"
- k-sparse code: A latent representation with at most k nonzero entries, promoting interpretability via sparsity. "a is a k-sparse code"
- Latent space: The space of learned codes or features where semantic behaviors can be disentangled and manipulated. "Figure~\ref{fig:sae_semantic} illustrates that the SAE decoder columns encode semantically meaningful behaviors in the latent space."
- Linear Representation Hypothesis: The view that abstract features are linearly encoded in model activations and can be isolated as directions. "Building on the Linear Representation Hypothesis~\citep{olah2024linear,park2023linear}"
- LLM-as-a-judge: Using an LLM to label or evaluate model outputs or steps according to specified criteria. "The classification is performed using an LLM-as-a-judge approach"
- Mechanistic interpretability: Methods aiming to explain model behavior via internal mechanisms like circuits, features, and activation geometry. "From the perspective of mechanistic interpretability, many studies rely on activation engineering methods"
- Oversmoothing phenomenon: The tendency for deep transformer layers to make token representations overly similar, reducing separability. "This observation aligns with the oversmoothing phenomenon in current LLMs~\citep{wang2023pangu}"
- Permutation matrix: A matrix that reorders basis vectors; used to express equivalence up to reindexing of decoder columns. "up to a permutation matrix and a scaling diagonal matrix"
- Prefilling stage: The initial phase of inference where context is processed before generating tokens; parameters or biases may be adapted here. "learns a bias term at test time during the prefilling stage"
- ReLU: A non-linear activation function defined as max(0, x), used to induce sparsity in codes. "For a standard SAE~\citep{cunningham2023sparse}, we use ReLU as the non-linear activation function."
- Residual stream: The main vector pathway in transformer layers where token representations accumulate via residual connections. "We specifically use the residual stream representations after each transformer layer"
- Representation engineering: Editing or steering internal representations to control model behavior. "Activation steering modifies model outputs by editing internal activations, with notable approaches including representation engineering~\citep{zou2310representation}"
- RISE: The paper’s framework: Reasoning behavior Interpretability via Sparse auto-Encoder, for unsupervised discovery and control of reasoning vectors. "we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors"
- SAE (Sparse Auto-Encoder): An autoencoder trained with sparsity constraints to learn disentangled features aligned with semantic behaviors. "we employ sparse auto-encoders (SAEs), which learn a dictionary of sparse latent features that reconstruct hidden states while promoting disentanglement."
- Silhouette score: A clustering metric measuring cohesion and separation to quantify cluster quality. "Then, we quantify the above analysis by Silhouette scores~\citep{Rousseeuw87jcam_Silhouettes,shahapure2020cluster}."
- Softmax: A function converting logits into a probability distribution over tokens. "p_k = \mathrm{softmax}\!\big(f_{l \to L}(h + S W_{\mathrm{decoder}})\big)_k"
- Sparsity: The property of having few active features or nonzero code entries; often enforced via L0 penalties. "which encourages the SAE to accurately reconstruct the original activations while enforcing sparsity in the latent codes."
- Steering vector: A direction added or projected in activation space to shift model behavior along a desired axis. "The resulting steering vector can then be applied to shift the response style"
- Test-time scaling: Increasing compute at inference (e.g., more samples or longer reasoning) to boost performance. "test-time scaling~\citep{snell2024scaling}"
- UMAP: Uniform Manifold Approximation and Projection; a dimensionality reduction method emphasizing local/global structure via a graph-based approach. "we apply Uniform Manifold Approximation and Projection (UMAP)~\citep{mcinnes2018umap} to embed the decoder columns into a two-dimensional space."