- The paper presents First Finish Search (FFS), a method that selects the shortest reasoning trace to enhance LLM inference accuracy.
- It runs parallel decoding trajectories, in synchronous and asynchronous variants, to cut token usage and latency significantly.
- Empirical results show up to a 45% reduction in token consumption on benchmarks, demonstrating FFS’s practical efficiency gains.
The paper introduces First Finish Search (FFS), a straightforward yet effective test-time scaling (TTS) strategy designed to boost the reasoning performance of LLMs without additional training. Unlike more complex methods that either generate long reasoning paths or rely on sophisticated output aggregation like majority voting or beam search, FFS leverages the empirical observation that shorter generated reasoning traces are much more likely to be correct.
Key Contributions and Ideas
- Observation on Trace Length:
The authors observe that in many reasoning tasks, correct reasoning traces tend to be significantly shorter than incorrect ones. Modeling the trace-length distributions of correct and incorrect outputs as normal distributions, they derive a formula showing that the probability a trace is correct increases as its length decreases (see the probability sketch after this list). This insight motivates the idea of “first-to-finish” selection.
- First Finish Search (FFS) Strategy:
FFS runs n independent decoding trajectories in parallel using standard stochastic sampling (with beam size 1 to maximize diversity) and returns the first output to reach the end-of-sequence (EOS) token. Because the shortest trace wins (effectively maximizing a negative-length reward), FFS tends to select the most concise, and therefore often the most accurate, reasoning path; a minimal decoding sketch follows this list.
- Algorithm Variants:
- Sync-FFS: A synchronous decoder batches the n samples together and advances them in lockstep, one batched forward pass per decoding step; generation stops as soon as any sample emits the EOS token, lowering sequential compute cost on centralized GPUs or servers.
- Async-FFS: An asynchronous variant runs each decoding job in a separate process or on a separate machine, monitors the set of jobs, and terminates the rest as soon as one completes (see the asynchronous sketch after this list). This variant is well suited to distributed or multi-worker environments.
- Theoretical Analysis:
The paper develops a formal expression for the probability that a trace is correct given its length and uses extreme value theory to show that the expected sequential cost (i.e., the number of tokens processed before the first sample finishes) decreases roughly as O(√(log n)) in the number of parallel samples n; a small numerical check appears after this list. This explains how FFS achieves lower latency and token usage than conventional methods that must wait for all samples to finish.
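To make the length-correctness link concrete, here is one way such a formula can arise. This is a hedged reconstruction under the normality assumption above, with π as an assumed prior probability that a sampled trace is correct; the paper's exact derivation and notation may differ.

```latex
% Sketch: posterior probability that a trace of length L is correct,
% assuming correct lengths ~ N(mu_c, sigma^2), incorrect lengths
% ~ N(mu_w, sigma^2) with mu_c < mu_w, and an assumed prior pi.
P(\text{correct} \mid L)
  = \frac{\pi\,\phi(L;\mu_c,\sigma)}
         {\pi\,\phi(L;\mu_c,\sigma) + (1-\pi)\,\phi(L;\mu_w,\sigma)}
```

Because the likelihood ratio φ(L; μ_c, σ) / φ(L; μ_w, σ) is strictly decreasing in L whenever μ_c < μ_w, this posterior decreases with trace length, which is precisely the property the first-to-finish rule exploits.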
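The decoding sketch referenced above, written against a Hugging Face causal LM: a minimal Sync-FFS loop. The function name sync_ffs and the cache-free token loop are illustrative simplifications, not the paper's implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # any causal LM

@torch.no_grad()
def sync_ffs(model, tokenizer, prompt, n=8, max_new_tokens=4096, temperature=0.7):
    # Batch n copies of the prompt; stochastic sampling diversifies the rows.
    ids = tokenizer(prompt, return_tensors="pt").input_ids.repeat(n, 1).to(model.device)
    for _ in range(max_new_tokens):
        # One lockstep decoding step for all n samples. KV caching is omitted
        # for brevity; a real implementation would reuse past key/values.
        logits = model(ids).logits[:, -1, :] / temperature
        next_ids = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_ids], dim=1)
        finished = next_ids.squeeze(1) == tokenizer.eos_token_id
        if finished.any():
            # First finisher wins; the remaining trajectories are discarded.
            winner = int(finished.nonzero()[0])
            return tokenizer.decode(ids[winner], skip_special_tokens=True)
    return tokenizer.decode(ids[0], skip_special_tokens=True)  # budget exhausted
```

A call like sync_ffs(model, tokenizer, question, n=8) then returns the first-finishing trace; larger n trades extra batch compute for a shorter expected wait.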
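For the asynchronous variant, a correspondingly minimal sketch. Here sample_once is a hypothetical blocking call to any sampling worker or hosted API, and cancellation of Python threads is best-effort only.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def async_ffs(sample_once, prompt, n=8):
    # Launch n independent sampling jobs and return whichever finishes first.
    pool = ThreadPoolExecutor(max_workers=n)
    futures = [pool.submit(sample_once, prompt) for _ in range(n)]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # Return without waiting on stragglers (Python 3.9+). Threads cannot be
    # killed, so real workers should also honor an explicit stop signal.
    pool.shutdown(wait=False, cancel_futures=True)
    return next(iter(done)).result()  # ties are broken arbitrarily
```

In a real distributed deployment each job would run on its own worker or machine, and the coordinator would send explicit termination signals, as the summary above describes.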
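The O(√(log n)) claim is easy to sanity-check numerically. The sketch below, using made-up trace-length statistics, compares the empirical expected minimum of n normal samples with the leading-order extreme-value approximation mu - sigma·√(2 ln n); it illustrates the asymptotic behavior and is not the paper's experiment.

```python
import numpy as np

mu, sigma = 8000.0, 2000.0   # hypothetical trace-length stats (tokens); made up
rng = np.random.default_rng(0)

for n in (2, 4, 8, 16, 32, 64):
    lengths = rng.normal(mu, sigma, size=(100_000, n))
    empirical = lengths.min(axis=1).mean()        # expected first-finish length
    approx = mu - sigma * np.sqrt(2 * np.log(n))  # crude for small n, tight asymptotically
    print(f"n={n:2d}  E[min] ~ {empirical:7.0f}  approx {approx:7.0f}")
```

The gap between mu and the expected minimum grows like √(log n), which is why adding parallel samples keeps shortening the wait, though with diminishing returns.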
FFS is tested on several challenging reasoning benchmarks, including AIME24, AIME25 (both the I and II exams), and the GPQA Diamond dataset, across multiple LLMs such as DeepSeek-R1, QwQ-32B, R1-Distill-Qwen, and Phi-4-Reasoning-Plus. Results show that FFS matches or exceeds the accuracy of strong baselines like majority voting, beam search, and budget forcing while reducing total and sequential token usage by up to 45%. Its benefits are most pronounced on models specifically tuned for multi-step reasoning and smaller on non-reasoning counterparts.
FFS is training-free and API-friendly: it requires only standard sampling, with no special tokens or additional logit manipulation. Its simple first-to-finish rule exploits parallel computing efficiently, reducing both total compute cost and latency, and it scales gracefully with increased model capacity and additional compute resources.
Implementation Considerations and Trade-offs
- For centralized implementations, Sync-FFS minimizes memory overhead by batching forward passes, while Async-FFS is the better fit for scalable, distributed setups.
- Since FFS selects the trace that terminates earliest, it substantially reduces inference time in throughput-bound or API-metered environments.
- The method works best on models that already produce robust chain-of-thought reasoning; its benefits diminish on models whose traces are degenerate or poorly structured.
- FFS’s simplicity makes it easy to integrate into existing inference pipelines and to experiment with different numbers of samples (n) depending on the available parallel resources.
Conclusion
The paper demonstrates that simple test-time strategies can unlock notable efficiency gains for LLM reasoning. By prioritizing shorter reasoning traces, the ones most likely to be correct, FFS improves accuracy while spending a smaller sequential token budget. Its favorable scaling, theoretical grounding, and practical API-friendly design make FFS an attractive choice for enhancing LLM inference in real-world applications, particularly when compute budgets and latency are critical constraints.