Search More, Think Less Paradigm

Updated 1 March 2026

Search More, Think Less (SMTL) is a reasoning paradigm that replaces deep sequential thought with parallel, early-terminated searches to achieve efficient computation.
It is implemented in LLMs, reinforcement learning, agentic workflows, and algorithmic proofs to reduce token usage and latency and to improve predictability.
Empirical benchmarks demonstrate that SMTL improves cost-efficiency and accuracy, challenging traditional deep reasoning methods.

The Search More, Think Less (SMTL) Paradigm

The "Search More, Think Less" (SMTL) paradigm represents a shift in reasoning algorithm design and model deployment, emphasizing parallel evidence acquisition, concise or early-terminated reasoning, and efficiency over the traditional approach of deep, sequential, or verbose reasoning chains. SMTL is instantiated across LLM inference, RL-based policy learning, structured algorithm design for mathematical optimization, and agentic web-based research. This shift has immediate consequences for cost-efficient reasoning, adaptive compute, and theoretical algorithm analysis.

1. Core Definition and Motivation

The SMTL paradigm systematically trades "depth" of sequential, internal reasoning for "breadth"—that is, for searching multiple candidate solutions in parallel but with minimal or early-terminated inner reasoning. Classic reasoning LLM workflows, such as chain-of-thought (CoT) and self-consistency, attempt to elicit correctness via extended autoregressive "thinking chains" per prompt, aggregating over several long outputs. In contrast, SMTL proposes that sampling more (short, diverse, or non-overlapping) reasoning attempts—each likely to halt early, or skip coarse-grained thinking steps—yields superior correctness per compute and per unit latency. Empirical evidence shows that, contrary to intuition, increased thinking depth can actually degrade accuracy and massively inflate cost. This directly challenges the "more-compute-equals-better-reasoning" paradigm that has dominated both LLM and agentic architectures (Hassid et al., 23 May 2025).

2. Paradigms and Algorithmic Instantiations

SMTL has several canonical realizations, notably in LLM inference, RL-based training, retrieval-augmented reasoning, agentic research, and approximation algorithms in theoretical computer science.

2.1 LLM Reasoning: short-m@k and NoThinking

short-m@k: For a prompt $q$, launch $k$ parallel generations. As soon as the $m$ fastest chains complete, terminate all others and aggregate the $m$ answers by majority vote, breaking ties toward the shortest chain. For $m=1$, this selects the single shortest chain; for $m=3$, it aggregates the three shortest. Empirically, the shortest chains are not only faster but also up to 34.5 pp more accurate than the longest chains, and short-m@k outperforms majority@k under limited compute while consuming up to 40% fewer tokens or less wall-clock time. For example, with $k=9$, short-1@9 is more than 55% faster than majority@9 at equal accuracy (Hassid et al., 23 May 2025).
NoThinking: Removes explicit CoT-style "thinking blocks" from LLM prompts, forcing immediate answer generation. Combining NoThinking with parallel sampling and best-of-N aggregation (using verifiers or confidence metrics) further increases efficiency, with 2×–5× fewer tokens for superior pass@k, and up to 9× latency reductions at fixed accuracy. In low-budget settings, NoThinking’s pass@1 is often nearly double that of Thinking (e.g., 51.3% vs 28.9% on AMC 2023 at 700 tokens) (Ma et al., 14 Apr 2025).
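The short-m@k selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the $k$ chains have already been sampled and uses token length as a proxy for completion time, whereas a real deployment would stream the generations and cancel the stragglers. The `Chain` tuple and the function name `short_m_at_k` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical representation of a finished chain: (length in tokens, final answer).
Chain = Tuple[int, str]


def short_m_at_k(chains: List[Chain], m: int) -> str:
    """Return the short-m@k answer for k already-sampled chains."""
    if not 1 <= m <= len(chains):
        raise ValueError("m must be between 1 and k")
    # Assumption: the m shortest chains stand in for the m that finish first.
    finished = sorted(chains, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in finished)
    top = max(votes.values())
    # Tie-break toward the shortest chain: `finished` is sorted by length,
    # so the first answer that reaches the top vote count wins.
    for _, answer in finished:
        if votes[answer] == top:
            return answer
    raise AssertionError("unreachable: at least one answer has the top count")


chains = [(900, "42"), (310, "17"), (450, "42"), (280, "42"), (1200, "17")]
print(short_m_at_k(chains, m=1))  # answer of the single shortest chain
print(short_m_at_k(chains, m=3))  # majority over the three shortest chains
```

Sorting once and scanning in length order keeps the tie-break explicit; with $m=1$ the vote degenerates to picking the shortest chain, as in the description.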

2.2 RL Training: Group Filtering and Token-Efficiency

Group Filtered Policy Optimization (GFPO): At train time, sample $G$ candidate chains per question, filter to the $k$ shortest or most token-efficient (highest $R/\text{tokens}$) chains, and compute policy gradients only for the selected subset. This technique reduces inference chain length by 46–85% (relative ELR) without sacrificing accuracy. Adaptive Difficulty GFPO further tunes $k$ per question based on empirical difficulty, focusing compute on the most challenging examples (Shrivastava et al., 13 Aug 2025).
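The filtering step can be illustrated with a small Python sketch. It is written under stated assumptions rather than taken from Shrivastava et al.: rewards are normalised with the mean and standard deviation of the retained subset only, rejected samples receive an advantage of exactly zero, and the names `Sample` and `gfpo_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, NamedTuple


class Sample(NamedTuple):
    reward: float  # e.g. 1.0 if the final answer is correct, else 0.0
    tokens: int    # response length, assumed to be positive


def gfpo_advantages(group: List[Sample], k: int,
                    by_efficiency: bool = False,
                    eps: float = 1e-6) -> List[float]:
    """Masked, group-normalised advantages for the G samples of one question."""
    if not 1 <= k <= len(group):
        raise ValueError("k must be between 1 and G")
    indices = range(len(group))
    if by_efficiency:
        # Token efficiency R / tokens: higher is better, so sort descending.
        order = sorted(indices, key=lambda i: -group[i].reward / group[i].tokens)
    else:
        # Plain length filter: shortest responses first.
        order = sorted(indices, key=lambda i: group[i].tokens)
    kept = set(order[:k])
    # Assumption: statistics are computed over the retained subset only.
    rewards = [group[i].reward for i in kept]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [
        (group[i].reward - mu) / (sigma + eps) if i in kept else 0.0
        for i in indices
    ]


group = [Sample(1.0, 100), Sample(0.0, 50), Sample(1.0, 400), Sample(0.0, 800)]
print(gfpo_advantages(group, k=2))                      # length filter
print(gfpo_advantages(group, k=2, by_efficiency=True))  # reward-per-token filter
```

Note that when every retained sample has the same reward, the normalised advantages are all zero, so the question contributes no gradient in this sketch.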

2.3 Agentic Search: Plan–Execute–Refine

SMTL restructures agentic workflows around parallelized evidence acquisition. Instead of single-threaded, tool-loop, or deep-coherence chains, SMTL agents decompose tasks into subtasks, fetch or search for evidence in parallel, iteratively refine context, and periodically reset context windows to maintain budget and efficiency. This design enables agents to operate within fixed context limits and universalize across both deterministic QA and deep research tasks (Chen et al., 26 Feb 2026).
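A rough Python sketch of this loop follows: decompose, search in parallel, fold the evidence into a compact plan state, and reset the working context when it exceeds a budget. Everything here is a stand-in: `decompose` and `search` are hypothetical stubs for a planner and a tool call, the character budget replaces a real token budget, and the reset simply replaces the context with a one-line summary.

```python
import asyncio
from typing import Dict, List, Tuple


def decompose(task: str) -> List[str]:
    """Hypothetical planner: split a task into independent subtasks."""
    return [f"{task} / subtask {i}" for i in range(1, 5)]


async def search(query: str) -> str:
    """Hypothetical tool call; a real agent would query a search or fetch API."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"evidence for [{query}]"


async def run_agent(task: str, width: int = 2, max_context_chars: int = 80,
                    max_rounds: int = 10) -> Tuple[Dict[str, str], int]:
    findings: Dict[str, str] = {}  # compact plan state that survives resets
    context: List[str] = []        # working context, kept under the budget
    resets = 0
    for _ in range(max_rounds):
        # Plan: pick up to `width` subtasks that are still unresolved.
        pending = [s for s in decompose(task) if s not in findings][:width]
        if not pending:
            break
        # Execute: issue the tool calls concurrently instead of one by one.
        results = await asyncio.gather(*(search(s) for s in pending))
        # Refine: fold the new evidence into the plan state and the context.
        for subtask, evidence in zip(pending, results):
            findings[subtask] = evidence
            context.append(evidence)
        # Reset: once over budget, replace the context with a short summary.
        if sum(len(c) for c in context) > max_context_chars:
            context = [f"{len(findings)} subtasks resolved"]
            resets += 1
    return findings, resets


findings, resets = asyncio.run(run_agent("compare A and B"))
print(len(findings), "subtasks resolved;", resets, "context resets")
```

Keeping `findings` separate from `context` is the design point: the plan state persists across resets, so truncating the working context does not lose resolved subtasks.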

2.4 Algorithmic Proof Synthesis: Search-and-Mix

In empirical algorithmics, e.g., approximate Nash equilibria, SMTL is realized as a two-phase "search-and-mix": first, search for a finite pool of candidate solutions; then, mix (by solving a compact LP or QCQP) to prove worst-case guarantees. All known 2-player Nash bounds can be immediately recovered by this template, compressing hand proofs into automated LPs, and generalizing to other LP-based algorithm analyses (Deng et al., 2023).

3. Mathematical Formulation and Cost Models

SMTL methods are generally characterized by the following cost and efficiency models:

Tokens and Latency: For $k$ parallel chains with random lengths $L_i$, majority@k costs $k \cdot \mathbb{E}[L]$ tokens and has latency $\mathbb{E}[\max_i L_i]$. In contrast, short-m@k costs $\mathbb{E}\left[\sum_{i=1}^k \min(L_i, T_m)\right]$ tokens and has latency $\mathbb{E}[T_m]$, where $T_m$ is the $m$-th order statistic of the completion times (measured in tokens, so that it is comparable with the $L_i$). For moderate $k$ and $m$, $\mathbb{E}[\min_i L_i] \ll \mathbb{E}[L]$ and $\mathbb{E}[T_m] \ll \mathbb{E}[\max_i L_i]$ (Hassid et al., 23 May 2025).
Group Filtering in RL: Only the advantages for filtered (short/token-efficient) chains are used in policy gradients, with unselected samples contributing to regularization but not learning signal. Filtering by $R(o)/|o|$ encourages brevity only so far as it supports reward maximization (Shrivastava et al., 13 Aug 2025).
Agentic MDP Setting: State space $\mathcal{S}$ is aggregated plan context; actions are parallel tool calls; reward is sparse and final-answer based; context resets enforce $\left|h_t\right| \leq K$ for fixed context window $K$. Parallelism enables rapid evidence collection and lower step count (Chen et al., 26 Feb 2026).
Search-and-Mix Proofs: Worst-case regret for mixing over $(x_i, y_j)$ pairs is formulated as the solution to a small LP in variables $(\alpha, \beta, h)$ with constraints induced by search-phase outcomes; all mixing phases run in polynomial time for small $s, t$ (Deng et al., 2023).
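The token and latency model above can be made concrete for a single draw of chain lengths. The Python sketch below evaluates the quantities inside the expectations for one realisation; it assumes all chains decode at the same rate, so length in tokens doubles as completion time, and the function name `cost_comparison` and the example lengths are illustrative only.

```python
from typing import Dict, List


def cost_comparison(lengths: List[int], m: int) -> Dict[str, int]:
    """Token and latency costs of majority@k vs short-m@k for one draw of lengths."""
    if not 1 <= m <= len(lengths):
        raise ValueError("m must be between 1 and k")
    # T_m: the m-th smallest length, i.e. when the m-th chain finishes.
    t_m = sorted(lengths)[m - 1]
    return {
        # majority@k runs every chain to completion.
        "majority_tokens": sum(lengths),
        "majority_latency": max(lengths),
        # short-m@k truncates every chain at T_m.
        "short_tokens": sum(min(length, t_m) for length in lengths),
        "short_latency": t_m,
    }


lengths = [900, 310, 450, 280, 1200]  # made-up chain lengths in tokens
print(cost_comparison(lengths, m=1))
print(cost_comparison(lengths, m=3))
```

Averaging these per-draw values over many sampled length vectors would estimate the expectations in the cost model.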

4. Empirical Evidence and Comparative Results

SMTL approaches are validated by extensive empirical benchmarks:

Method	Cost Relative to Baseline	Accuracy Gain	Key Setting
short-1@k	-40% tokens, -55% time	Equal or higher accuracy vs majority@k	LN-Super-49B, k=9, math QA
short-3@k	up to -33% wall-time	+2 pp over majority@k	R1-32B, k=5, math QA
NoThinking+best-N	4–9× lower latency	+1.7 pp vs Thinking@16	AMC 2023 math, pass@1
GFPO (RL)	-46–85% chain length	≈ accuracy vs GRPO (no dips)	Phi-4-reasoning, AIME 24/25
SMTL agents	-70.7% reasoning steps	↑ accuracy vs MiroThinker	BrowseComp, GAIA, DeepResearch
Search-and-Mix	Polynomial time (mix phase)	Recovers known NE bounds	Nash eq. algorithms, LP proof

Across diverse metrics—including tokens, latency, wall-clock, excess chain length (ELR), and absolute accuracy—SMTL variants outperform depth-centric baselines, especially when compute budgets, latency, or context constraints are tight (Hassid et al., 23 May 2025, Ma et al., 14 Apr 2025, Shrivastava et al., 13 Aug 2025, Chen et al., 26 Feb 2026, Deng et al., 2023).

5. Training, Data, and Policy Optimization

Data pipelines and RL policy updates reflect the SMTL principle:

RL training under SMTL (GFPO) prioritizes sampling diversity and brevity within batches, with advantage computation and learning signal restricted to the most concise or token-efficient chains.
Agentic SMTL data pipelines blend synthetic hierarchical QA with report-style research queries, enabling agents to generalize cross-domain (Chen et al., 26 Feb 2026).
Curation strictly retains the shortest correct trajectories when duplicates exist, and supervised fine-tuning is always performed with minimal trajectory lengths.
Adaptive filtering (by observed problem difficulty) tunes the degree of selectivity, focusing training signal where longer reasoning is genuinely necessary (Shrivastava et al., 13 Aug 2025).
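Two of the rules above lend themselves to a short Python sketch: keeping only the shortest correct trajectory per question, and choosing the retained-subset size $k$ from observed difficulty. Both functions are illustrative; the `Trajectory` fields, the two-level difficulty split, and the default values of `k_easy`, `k_hard`, and `threshold` are assumptions made for the example, not settings reported in the cited papers.

```python
from typing import Dict, List, NamedTuple


class Trajectory(NamedTuple):
    question_id: str
    steps: int      # trajectory length, e.g. number of tool calls or tokens
    correct: bool


def keep_shortest_correct(trajectories: List[Trajectory]) -> Dict[str, Trajectory]:
    """For each question, retain only the shortest trajectory that is correct."""
    best: Dict[str, Trajectory] = {}
    for t in trajectories:
        if not t.correct:
            continue  # incorrect trajectories are never retained
        current = best.get(t.question_id)
        if current is None or t.steps < current.steps:
            best[t.question_id] = t
    return best


def adaptive_k(rewards: List[float], k_easy: int = 4, k_hard: int = 8,
               threshold: float = 0.5) -> int:
    """Retain more chains for questions whose sampled accuracy is low.

    The thresholds and k values are placeholders chosen for illustration.
    """
    if not rewards:
        raise ValueError("need at least one sampled reward")
    accuracy = sum(rewards) / len(rewards)
    return k_easy if accuracy >= threshold else k_hard


pool = [
    Trajectory("q1", 12, True),
    Trajectory("q1", 7, True),
    Trajectory("q1", 3, False),
    Trajectory("q2", 9, True),
]
print(keep_shortest_correct(pool))
print(adaptive_k([1.0, 1.0, 0.0, 1.0]), adaptive_k([0.0, 0.0, 1.0, 0.0]))
```

A larger $k$ on hard questions filters less aggressively, which leaves room for the longer chains those questions may need.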

6. Broader Implications and Research Directions

The SMTL paradigm has catalyzed research in algorithmic efficiency, practical LLM deployment, and formal algorithm design:

Challenges the premise of fixed compute scaling: "More compute does not uniformly imply better reasoning." Counterintuitively, brevity and early stopping can yield gains in both correctness and economics (Hassid et al., 23 May 2025).
Enables adaptive, dynamic computation policies at both inference and training time: Early-exit mechanisms, difficulty-aware sampling, and per-sample context truncation are broadly generalizable.
Provides automated proof templates for approximation and online algorithms, transforming analytic methodology in theoretical CS (Deng et al., 2023).
SMTL agents facilitate scalable and generalizable cross-domain research, maintaining efficiency under tight context constraints and task heterogeneity (Chen et al., 26 Feb 2026).
In retrieval and retrieval-augmented reasoning (e.g., AutoRefine), SMTL emphasizes interleaving search, context pruning, and evidence distillation rather than accumulating long, unstructured traces (Shi et al., 16 May 2025).
For LLM service providers, SMTL unlocks practical cost reduction, lower inference latency, and superior throughput in reasoning-centric applications.

7. Limitations and Open Challenges

SMTL does not guarantee that the shortest or most concise chains are always correct; certain hard tasks may require deeper reasoning, and selection mechanisms (e.g., confidence scores or verifiers) may underperform in some domains (e.g., code synthesis). Dynamic budgeting and adaptive selection remain active areas for refinement. In algorithm design, the SMTL proof-template approach is limited to LP-relaxable (piecewise-linear-convex) settings but holds the promise of broad automation in proof synthesis (Deng et al., 2023). Agentic SMTL strategies place a premium on plan-centric context pruning but lack content-aware or learning-based compression, motivating further methodological research (Chen et al., 26 Feb 2026).

References: (Hassid et al., 23 May 2025, Ma et al., 14 Apr 2025, Shrivastava et al., 13 Aug 2025, Chen et al., 26 Feb 2026, Deng et al., 2023, Shi et al., 16 May 2025)