Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute (2506.15882v1)

Published 18 Jun 2025 in cs.LG, cs.AI, cs.CL, and eess.SP

Abstract: Test-time compute has emerged as a powerful paradigm for improving the performance of LLMs, where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.

Summary

  • The paper introduces Fractional Reasoning, a training-free framework that uses latent steering vectors to dynamically adjust inference-time reasoning intensity, boosting overall performance.
  • It demonstrates how varying the scaling factor enables fine-grained control over reasoning, significantly enhancing results in majority voting and Best-of-N strategies.
  • Experimental results on benchmarks like GSM8K and MATH500 reveal notable accuracy improvements over standard inference methods with minimal extra computation.

This paper introduces "Fractional Reasoning (FR)," a training-free and model-agnostic framework designed to improve the test-time computational efficiency and performance of LLMs. The core problem FR addresses is that existing test-time compute strategies, like Best-of-N, majority voting, and self-reflection, typically apply a uniform level of reasoning intensity across all inputs, regardless of individual problem complexity. This can lead to under-thinking for complex problems or over-thinking and unnecessary computation for simpler ones.

Fractional Reasoning enables continuous control over reasoning intensity at inference time. It operates by first extracting a "latent steering vector" that represents the directional shift in the model's internal representations induced by reasoning-promoting prompts (e.g., "Think step by step" for Chain-of-Thought, or reflection instructions). This vector is derived by contrasting the latent states produced by positive (reasoning-promoting) and negative (direct-answering) prompts on a set of queries. Specifically, the steering vector $\mathbf{h}_\text{steer}$ is the first principal direction of the differences between the latent representations of positive and negative examples:

$$\mathbf{h}_\text{steer} := \arg\max_{\mathbf{h}} \frac{1}{m} \sum_{i=1}^m \left( \mathbf{h}^\top \left( \mathbf{h}(\mathbf{X}_i^\text{pos}) - \mathbf{h}(\mathbf{X}_i^\text{neg}) \right) \right)^2 \quad \text{s.t.} \ \mathbf{h}^\top \mathbf{h} = 1$$
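As a rough sketch of this extraction step (our own PyTorch code, not the authors'), the maximizer is the top right-singular vector of the matrix of latent differences; `pos_states` and `neg_states` are assumed to hold one pooled hidden state per query:

```python
import torch

def extract_steering_vector(pos_states: torch.Tensor,
                            neg_states: torch.Tensor) -> torch.Tensor:
    """First principal direction of the positive-minus-negative latent
    differences: the unit vector maximizing (1/m) * sum_i (h^T d_i)^2.

    pos_states, neg_states: (m, d) tensors, one pooled hidden state per query.
    """
    diffs = pos_states - neg_states                      # d_i, shape (m, d)
    # The top right-singular vector of the difference matrix solves the
    # maximization under the unit-norm constraint.
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[0]                                         # shape (d,), unit norm
```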

Once extracted, this steering vector is reapplied to the latent states $\mathbf{h}_t$ of the query tokens (without the explicit instructional prompt) with a tunable scaling factor $\alpha$:

$$\hat{\mathbf{h}}_t := \mathbf{h}_t + \alpha \cdot \mathbf{h}_\text{steer}$$

The resulting steered latent states $\hat{\mathbf{h}}_t$ are then rescaled to maintain norm stability across layers using $\tilde{\mathbf{h}}_t = \hat{\mathbf{h}}_t \cdot \frac{\|\mathbf{h}_t\|}{\|\hat{\mathbf{h}}_t\|}$. This allows the model to modulate its reasoning depth or reflection strength dynamically without altering the input text or requiring fine-tuning.
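A minimal sketch of this steering-and-rescaling step, assuming `h` holds the per-token hidden states at one layer and `alpha` is the tunable intensity (naming ours):

```python
import torch

def steer_hidden_states(h: torch.Tensor,
                        h_steer: torch.Tensor,
                        alpha: float) -> torch.Tensor:
    """Shift hidden states along the steering direction, then rescale each
    token back to its original norm to keep layer statistics stable.

    h:       (seq_len, d) hidden states of the query tokens
    h_steer: (d,) unit-norm steering vector
    alpha:   scaling factor controlling reasoning intensity
    """
    h_hat = h + alpha * h_steer                              # h_t + alpha * h_steer
    # Rescale: tilde{h}_t = hat{h}_t * ||h_t|| / ||hat{h}_t||  (per token)
    scale = h.norm(dim=-1, keepdim=True) / h_hat.norm(dim=-1, keepdim=True)
    return h_hat * scale
```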

The framework supports two key modes of test-time scaling:

  1. Breadth-based scaling (e.g., Best-of-N, Majority vote): By varying $\alpha$, FR generates a diverse set of outputs with different reasoning intensities. This increases the chance of producing a correct answer, improving success rates with fewer overall samples compared to standard methods.
  2. Depth-based scaling (e.g., self-reflection): FR allows fine-grained control over the strength of reflection, enabling the model to critique and revise its outputs more appropriately, avoiding under- or over-reflection. For reflection, a slightly modified steering vector extraction is used, directly taking the latent states of the input with the reflection prompt as $\mathbf{h}_\text{steer}$ and using a different rescaling: $\tilde{\mathbf{h}}_t = \frac{1}{1+\alpha}(\mathbf{h}_t + \alpha \mathbf{h}_\text{steer})$ (a minimal sketch of this variant follows the list).
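For the reflection variant, here is a minimal sketch of the modified combination (assuming, as above, per-token hidden states `h` and a steering representation `h_steer` taken from the reflection-prompted input; naming ours):

```python
import torch

def steer_for_reflection(h: torch.Tensor,
                         h_steer: torch.Tensor,
                         alpha: float) -> torch.Tensor:
    """Blend the query states with the reflection-prompt states:
    (h_t + alpha * h_steer) / (1 + alpha).

    alpha -> 0 recovers the original states; larger alpha moves the
    representation toward the reflection prompt.
    """
    return (h + alpha * h_steer) / (1.0 + alpha)
```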

Experiments were conducted on mathematical reasoning benchmarks (GSM8K, MATH500) and general reasoning (GPQA) using open-source models like Qwen-2.5-7B-Instruct and LLaMA-3.1-8B-Instruct. FR was evaluated against standard test-time compute methods. For Chain-of-Thought prompting, positive prompts like "Solve the mathematics problem with step-by-step detailed reasoning" and negative prompts like "Solve the mathematics problem with direct answering" were used to derive the steering vector. For evaluation, multiple responses were generated using different $\alpha$ values, and the final answer was selected via majority vote or a Best-of-N approach using an external reward model (RLHFlow/Llama3.1-8B-PRM-Deepseek-Data).
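A hedged sketch of the breadth-based evaluation loop described above: `generate_with_alpha` (a steered generation call) and `extract_answer` (a final-answer parser) are hypothetical helpers, not part of the paper's released code:

```python
from collections import Counter

def majority_vote_with_fr(question: str, alphas: list[float]) -> str:
    """Generate one response per scaling factor, then majority-vote over
    the extracted final answers."""
    answers = []
    for alpha in alphas:
        response = generate_with_alpha(question, alpha)  # hypothetical: steered generation
        answers.append(extract_answer(response))         # hypothetical: parse final answer
    return Counter(answers).most_common(1)[0][0]

# Example usage with an illustrative sweep of reasoning intensities:
# final_answer = majority_vote_with_fr(problem_text, alphas=[0.0, 2.0, 4.0, 6.0])
```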

Results consistently showed that FR enhances the performance of both majority voting and Best-of-N strategies across all benchmarks and models (Table 1). For example, with Llama-3.1-8B-Instruct on GSM8K, Majority vote + FR achieved 89.5% accuracy compared to 86.9% for standard Majority vote. Similarly, Best-of-N + FR achieved 90.3% compared to 79.1% for standard Best-of-N.

In reflection tasks (Table 2), FR again outperformed standard reflection prompting. For instance, Qwen-2.5-7B-Instruct with FR achieved 61.4% on MATH500, up from 59.2% with standard reflection. The paper also demonstrated that:

  • Increasing $\alpha$ leads to more verbose, detailed reasoning (Figure 3), confirming controllable behavior.
  • FR is effective even on models already specialized for reasoning, like DeepSeek-R1-Distill-Qwen-7B (Table 3).
  • FR scales robustly with an increased number of generations, often outperforming baselines more consistently (Figure 5).
  • The framework has potential for finer-grained, sentence-level control of reasoning strength, adapting $\alpha$ dynamically based on feedback signals like internal consistency (Figure 4).

The main contributions are:

  • A general, training-free framework for adaptive reasoning control (Fractional Reasoning).
  • Practical methods for extracting and applying latent steering vectors with tunable strength.
  • Demonstrated effectiveness across multiple models, benchmarks, and test-time scaling strategies.

A limitation noted is that the current approach relies on predefined reasoning directions and does not yet support automatic selection of the optimal scaling factor $\alpha$ per instance or step, highlighting an area for future work on adaptive policies.
