- The paper presents First Finish Search (FFS), a method that selects the shortest reasoning trace to enhance LLM inference accuracy.
- It runs parallel decoding trajectories, in synchronous and asynchronous variants, to cut token usage and latency significantly.
- Empirical results show up to a 45% reduction in token consumption on benchmarks, demonstrating FFS’s practical efficiency gains.
The paper introduces First Finish Search (FFS), a straightforward yet effective test-time scaling (TTS) strategy designed to boost the reasoning performance of LLMs without additional training. Unlike more complex methods that either generate long reasoning paths or rely on sophisticated output aggregation like majority voting or beam search, FFS leverages the empirical observation that shorter generated reasoning traces are much more likely to be correct.
Key Contributions and Ideas
- Observation on Trace Length:
The authors observe that in many reasoning tasks, correct reasoning traces tend to be significantly shorter than incorrect ones. Modeling the trace-length distributions of correct and incorrect outputs as normal distributions, they derive a formula showing that the probability a trace is correct increases as its length decreases (see the probability sketch after this list). This insight motivates the idea of “first-to-finish” selection.
- First Finish Search (FFS) Strategy:
FFS runs n independent decoding trajectories in parallel using standard stochastic sampling (with beam size 1 to maximize diversity) and returns the first output to reach the end-of-sequence (EOS) token. Because the shortest trace wins (effectively maximizing a negative-length reward), FFS tends to select the most concise, and therefore often the most accurate, reasoning path; a minimal decoding sketch follows this list.
- Algorithm Variants:
- Sync-FFS: A synchronous decoder batches the n samples together and advances them in lockstep, one batched forward pass per decoding step; generation stops as soon as any sample emits the EOS token, lowering sequential compute cost on centralized GPUs or servers.
- Async-FFS: An asynchronous variant runs each decoding job in a separate process or on a separate machine, monitors the set of jobs, and terminates the rest as soon as one completes (see the asynchronous sketch after this list). This variant is well suited to distributed or multi-worker environments.
- Theoretical Analysis:
The paper develops a formal expression for the probability that a trace is correct given its length and uses extreme value theory to show that the expected sequential cost (i.e., the number of tokens processed before the first sample finishes) decreases roughly as O(√(log n)) in the number of parallel samples n; a small numerical check appears after this list. This explains how FFS achieves lower latency and token usage than conventional methods that must wait for all samples to finish.
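To make the length-correctness link concrete, here is one way such a formula can arise. This is a hedged reconstruction under the normality assumption above, with π as an assumed prior probability that a sampled trace is correct; the paper's exact derivation and notation may differ.

```latex
% Sketch: posterior probability that a trace of length L is correct,
% assuming correct lengths ~ N(mu_c, sigma^2), incorrect lengths
% ~ N(mu_w, sigma^2) with mu_c < mu_w, and an assumed prior pi.
P(\text{correct} \mid L)
  = \frac{\pi\,\phi(L;\mu_c,\sigma)}
         {\pi\,\phi(L;\mu_c,\sigma) + (1-\pi)\,\phi(L;\mu_w,\sigma)}
```

Because the likelihood ratio φ(L; μ_c, σ) / φ(L; μ_w, σ) is strictly decreasing in L whenever μ_c < μ_w, this posterior decreases with trace length, which is precisely the property the first-to-finish rule exploits.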
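The decoding sketch referenced above, written against a Hugging Face causal LM: a minimal Sync-FFS loop. The function name sync_ffs and the cache-free token loop are illustrative simplifications, not the paper's implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # any causal LM

@torch.no_grad()
def sync_ffs(model, tokenizer, prompt, n=8, max_new_tokens=4096, temperature=0.7):
    # Batch n copies of the prompt; stochastic sampling diversifies the rows.
    ids = tokenizer(prompt, return_tensors="pt").input_ids.repeat(n, 1).to(model.device)
    for _ in range(max_new_tokens):
        # One lockstep decoding step for all n samples. KV caching is omitted
        # for brevity; a real implementation would reuse past key/values.
        logits = model(ids).logits[:, -1, :] / temperature
        next_ids = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_ids], dim=1)
        finished = next_ids.squeeze(1) == tokenizer.eos_token_id
        if finished.any():
            # First finisher wins; the remaining trajectories are discarded.
            winner = int(finished.nonzero()[0])
            return tokenizer.decode(ids[winner], skip_special_tokens=True)
    return tokenizer.decode(ids[0], skip_special_tokens=True)  # budget exhausted
```

A call like sync_ffs(model, tokenizer, question, n=8) then returns the first-finishing trace; larger n trades extra batch compute for a shorter expected wait.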
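For the asynchronous variant, a correspondingly minimal sketch. Here sample_once is a hypothetical blocking call to any sampling worker or hosted API, and cancellation of Python threads is best-effort only.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def async_ffs(sample_once, prompt, n=8):
    # Launch n independent sampling jobs and return whichever finishes first.
    pool = ThreadPoolExecutor(max_workers=n)
    futures = [pool.submit(sample_once, prompt) for _ in range(n)]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # Return without waiting on stragglers (Python 3.9+). Threads cannot be
    # killed, so real workers should also honor an explicit stop signal.
    pool.shutdown(wait=False, cancel_futures=True)
    return next(iter(done)).result()  # ties are broken arbitrarily
```

In a real distributed deployment each job would run on its own worker or machine, and the coordinator would send explicit termination signals, as the summary above describes.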
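The O(√(log n)) claim is easy to sanity-check numerically. The sketch below, using made-up trace-length statistics, compares the empirical expected minimum of n normal samples with the leading-order extreme-value approximation mu - sigma·√(2 ln n); it illustrates the asymptotic behavior and is not the paper's experiment.

```python
import numpy as np

mu, sigma = 8000.0, 2000.0   # hypothetical trace-length stats (tokens); made up
rng = np.random.default_rng(0)

for n in (2, 4, 8, 16, 32, 64):
    lengths = rng.normal(mu, sigma, size=(100_000, n))
    empirical = lengths.min(axis=1).mean()        # expected first-finish length
    approx = mu - sigma * np.sqrt(2 * np.log(n))  # crude for small n, tight asymptotically
    print(f"n={n:2d}  E[min] ~ {empirical:7.0f}  approx {approx:7.0f}")
```

The gap between mu and the expected minimum grows like √(log n), which is why adding parallel samples keeps shortening the wait, though with diminishing returns.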
FFS is tested on several challenging reasoning benchmarks, including AIME24, AIME25 (both the I and II exams), and the GPQA Diamond dataset, across multiple LLMs such as DeepSeek-R1, QwQ-32B, R1-Distill-Qwen, and Phi-4-Reasoning-Plus. Results show that FFS matches or exceeds the accuracy of strong baselines like majority voting, beam search, and budget forcing while reducing total and sequential token usage by up to 45%. Its benefits are most pronounced on models specifically tuned for multi-step reasoning and smaller on non-reasoning counterparts.
FFS is training-free and API-friendly: it requires only standard sampling, with no special tokens or additional logit manipulation. Its simple first-to-finish rule exploits parallel computing efficiently, reducing both total compute cost and latency, and it scales gracefully with increased model capacity and additional compute resources.
Implementation Considerations and Trade-offs
- For centralized implementations, Sync-FFS minimizes memory overhead by batching forward passes, while Async-FFS is the better fit for scalable, distributed setups.
- Since FFS selects the trace that terminates earliest, it substantially reduces inference time in throughput-bound or API-metered environments.
- The method works best on models that already produce robust chain-of-thought reasoning; its benefits diminish on models whose traces are degenerate or poorly structured.
- FFS’s simplicity makes it easy to integrate into existing inference pipelines and to experiment with different numbers of samples (n) depending on the available parallel resources.
Conclusion
The paper demonstrates that simple test-time strategies can unlock notable efficiency gains for LLM reasoning. By prioritizing shorter reasoning traces, the ones most likely to be correct, FFS improves accuracy while spending a smaller sequential token budget. Its favorable scaling, theoretical grounding, and practical API-friendly design make FFS an attractive choice for enhancing LLM inference in real-world applications, particularly when compute budgets and latency are critical constraints.