- The paper introduces a novel method that combines in-context search with test-time internal scaling to significantly enhance LLM reasoning on difficult tasks.
- It systematically evaluates advanced prompting techniques such as CoT and AoT alongside parallel, sequential, and internal scaling across NP-hard challenges.
- Empirical and theoretical analyses reveal up to a 30-fold improvement in success rates, reshaping evaluation paradigms for large language models.
This paper, "Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling" (2505.22290), investigates the true reasoning capabilities of LLMs on extremely difficult tasks, particularly those previously deemed "unsolvable" (e.g., reported success rates below 5%). The authors argue that existing evaluation paradigms, which often rely on direct prompting, systematically underestimate LLMs' potential. By systematically combining advanced in-context search prompting techniques with test-time scaling strategies, the paper demonstrates significant performance breakthroughs.
The core of the research lies in exploring how fine-grained combinations of two types of methods can enhance LLM reasoning:
- In-Context Search Prompting: Strategies that enable LLMs to simulate search processes internally using learned in-context examples. The paper explores:
- Direct Prompting: Few-shot examples (problem-solution pairs) without explicit reasoning steps.
- Chain-of-Thought (CoT) Prompting: Examples demonstrating a step-by-step, greedy search-like solution path.
- Algorithm-of-Thought (AoT) Prompting: Examples demonstrating structured algorithmic operations (e.g., initialization, expansion, evaluation, backtracking) to guide the LLM through explicit search pathways like Depth-First Search.
- Test-Time Scaling: Techniques that augment LLM reasoning at inference time. The paper experiments with:
- Parallel Scaling: Generating multiple outputs in parallel and aggregating them (specifically, Best-of-N with N=3).
- Sequential Scaling: Iteratively refining solutions (specifically, using Self-Refine); minimal sketches of the parallel and sequential loops follow this list.
- Internal Scaling: Allowing the model to autonomously determine computational effort, often through a learned policy (activated via a "thinking" mode in the chosen models).
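To make the first two strategies concrete, here is a minimal sketch of the Best-of-N and Self-Refine loops. The `generate`, `score`, and `critique` callables are hypothetical stand-ins for model calls and answer parsing, not the paper's implementation; internal scaling, by contrast, is a model-side mode and is sketched in the implementation section below.

```python
def best_of_n(problem, generate, score, n=3):
    """Parallel scaling: sample n candidates independently, keep the best-scoring one."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=score)

def self_refine(problem, generate, critique, max_rounds=3):
    """Sequential scaling: feed the model's own critique back into the next attempt."""
    solution = generate(problem)
    for _ in range(max_rounds):
        feedback = critique(problem, solution)
        if feedback is None:  # critic reports no remaining issues
            return solution
        solution = generate(
            f"{problem}\nPrevious attempt: {solution}\nFeedback: {feedback}"
        )
    return solution
```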
Experimental Setup
The paper uses two challenging task categories, focusing on the most difficult instances (level 10):
- Controlled NP-Hard Tasks: Vertex Cover and 3-Dimensional Matching (3DM), with 100 controlled problem instances generated according to methods in [yang2025nondeterministic].
- Complex Real-World Planning: Trip Planning and Meeting Planning from the Natural Plan benchmark [zheng2024natural], which are natural language-based variations of NP-Hard problems like the Traveling Salesperson Problem (TSP), with 100 instances each.
Two LLMs were used, chosen for their ability to toggle an internal "thinking" mode:
- Qwen3 235B (open-source)
- Claude 3.7 Sonnet (closed-source API)
Performance is measured by Success Rate: the percentage of problem instances for which the LLM generates a complete and verifiably correct solution.
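Because success requires a verifiably correct answer, evaluation reduces to running a task-specific checker over every instance. A minimal sketch for the Vertex Cover task; the instance fields (`edges`, the cover-size budget `k`) and the `solve` callable are illustrative assumptions, not the paper's harness:

```python
def is_vertex_cover(edges, cover):
    """A candidate vertex set is a valid cover iff every edge touches it."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)

def success_rate(instances, solve):
    """Fraction of instances whose parsed LLM answer verifies as correct."""
    solved = 0
    for inst in instances:
        cover = solve(inst)  # LLM call + answer parsing, abstracted away
        if is_vertex_cover(inst["edges"], cover) and len(cover) <= inst["k"]:
            solved += 1  # complete and verifiably correct
    return solved / len(instances)
```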
Empirical Findings (Ablation Studies)
The paper conducts a three-level ablation study:
- Level 1: Test-Time Scaling with Direct Prompting:
- Direct prompting without scaling, with parallel scaling, or with sequential scaling resulted in 0% success for both models on tasks like Trip Planning.
- Direct prompting with internal scaling (Direct-IS) showed a marginal improvement for Claude 3.7 (e.g., 4% on Trip Planning) but 0% for Qwen3.
- Takeaway 1: Direct prompting struggles on hard tasks even with test-time scaling, yielding minimal gains; this aligns with existing literature.
- Level 2: In-Context Search without Scaling:
- CoT and AoT prompting without any test-time scaling (CoT-WS, AoT-WS) showed slight improvements for Claude 3.7 (e.g., 2-3% on Trip Planning) but 0% for Qwen3.
- Takeaway 2: Advanced in-context search prompting alone can improve performance, but its standalone impact on very difficult tasks is marginal.
- Level 3: Combining In-Context Search and Test-Time Scaling:
- Combining CoT/AoT with parallel or sequential scaling (e.g., CoT-PS, AoT-SS) showed further modest improvements, with Qwen3 showing a slight gain for the first time (e.g., 1% with AoT-SS on Trip Planning).
- The most significant breakthroughs occurred when CoT/AoT were combined with Internal Scaling (CoT-IS, AoT-IS).
- On Trip Planning:
- Qwen3: CoT-IS 24%, AoT-IS 30%
- Claude 3.7: CoT-IS 26%, AoT-IS 40%
- This represents up to a 30-fold improvement over configurations typically used in prior evaluations.
- Takeaway 3: Combining advanced in-context search (CoT/AoT) with internal scaling unlocks substantial LLM reasoning potential, significantly outperforming commonly used evaluation methods and challenging the notion of an "unsolvable" ceiling.
The paper notes that Qwen3 struggled more with tasks involving complex numerical abstract reasoning (Vertex Cover, 3DM) but improved on language-based planning tasks (Trip Planning, Meeting Planning) when the numerical input complexity was disentangled.
Theoretical Analysis
The paper provides a theoretical basis for these empirical improvements, focusing on how in-context search and internal scaling extend the complexity class of problems solvable by Transformers:
- Key Definitions:
- In-Context Search Prompting: Formally defines Direct Prompting, CoT (greedy search trace), and AoT (algorithmic search trace with operations like backtracking).
- Internal Scaling: Mechanism allowing models to dynamically allocate computational effort (intermediate tokens), potentially scaling from polynomial T=poly(n) steps ("no-thinking" mode) to exponential T=exp(n) steps ("thinking" mode).
- Theoretical Results (building on [merrill2023expressive] and [sel2023algorithm]):
- Theorem 3.1: CoT(poly(n)) = P. Transformers using a polynomial number of CoT steps decide exactly the languages in P.
- Theorem 3.2: AoT(poly(n)) = NP. Transformers prompted with AoT using polynomial-length traces decide exactly the languages in NP.
- Theorem 3.3: CoT(exp(n)) = EXP. With internal scaling allowing an exponential number of CoT steps, t(n) = exp(n), Transformers decide exactly the languages in EXP.
- Theorem 3.4: AoT(exp(n)) = NEXP. Similarly, with internal scaling allowing exponential-length AoT traces, a(n) = exp(n), Transformers decide exactly the languages in NEXP.
- Theorem 3.5 (Core Reasoning Tokens): The decidable complexity class depends on the length of the core computational trace, k_core(n), not just the total token count, k_total(n). Generating redundant tokens doesn't increase computational power; a super-polynomial core trace is needed for problems beyond P/NP.
This theoretical framework suggests that internal scaling, by allowing the model to generate significantly longer (potentially exponential) chains of thought or algorithmic traces, fundamentally expands the computational power of LLMs, enabling them to tackle problems of higher complexity classes.
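Stated compactly, in this summary's notation (the paper's symbols may differ):

```latex
\mathrm{CoT}(\mathrm{poly}(n)) = \mathsf{P}, \qquad
\mathrm{AoT}(\mathrm{poly}(n)) = \mathsf{NP}, \qquad
\mathrm{CoT}(\exp(n)) = \mathsf{EXP}, \qquad
\mathrm{AoT}(\exp(n)) = \mathsf{NEXP},
\qquad \text{with } \mathsf{P} \subseteq \mathsf{NP} \subseteq \mathsf{EXP} \subseteq \mathsf{NEXP}.
```

Read row-wise: moving from greedy (CoT) to backtracking (AoT) traces adds nondeterminism (P to NP, EXP to NEXP), while internal scaling lengthens the core trace from polynomial to exponential (P to EXP, NP to NEXP), with only k_core(n) counting toward this power (Theorem 3.5).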
Implementation and Practical Applications
To implement these findings:
- Prompt Engineering:
- For CoT, provide few-shot examples that clearly break down the problem into a sequence of logical steps (greedy search).
- For AoT, craft examples that explicitly demonstrate an algorithm (e.g., Depth-First Search). This involves showing:
- State definition (path, current cost, visited nodes).
- Initialization (e.g., picking a start node).
- Expansion (trying next valid moves).
- Evaluation/Pruning (checking constraints, identifying dead ends).
- Backtracking (reverting to previous states to try alternatives).
- Termination (solution found or search space exhausted).
- The paper provides detailed prompt examples for Trip Planning in Appendix B.
```
# Example AoT-like thought step for Trip Planning
# ... previous steps ...
# Step: C4c
# Transition tried: ... -> Riga
# Calendar preview & test: Riga Day 8-10 -> UD = 10
# Outcome: keep
#
# Step: C4c1
# Transition tried: ... -> Brussels
# Calendar preview & test: Brussels Day 10-12; workshop window unmet
# Outcome: prune
# ... (backtrack or try next sibling)
```
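The trace above is exactly the kind of loop an explicit program would run. A minimal, self-contained depth-first-search sketch for a simplified trip-planning constraint; the city names, per-city day costs, direct-flight pairs, and total-day budget are illustrative assumptions, not the paper's task encoding:

```python
def dfs_trip(remaining, durations, flights, budget, path=(), used=0):
    """Depth-first search with the AoT operations: expansion, evaluation/
    pruning, backtracking (via the recursion unwinding), and termination."""
    if not remaining:                          # termination: all cities scheduled
        return list(path)
    last = path[-1] if path else None          # None on the first call (initialization)
    for city in sorted(remaining):             # expansion: candidate next cities
        if last is not None and (last, city) not in flights:
            continue                           # prune: no direct flight
        if used + durations[city] > budget:
            continue                           # prune: day budget exceeded
        found = dfs_trip(remaining - {city}, durations, flights, budget,
                         path + (city,), used + durations[city])
        if found is not None:                  # solution found down this branch
            return found
        # falling through here = backtrack and try the next sibling
    return None

# Hypothetical instance: find a flight-feasible order within 9 travel days.
flights = {("Riga", "Brussels"), ("Brussels", "Oslo"), ("Oslo", "Brussels")}
print(dfs_trip(frozenset({"Riga", "Brussels", "Oslo"}),
               {"Riga": 3, "Brussels": 4, "Oslo": 2}, flights, budget=9))
# -> ['Riga', 'Brussels', 'Oslo']
```

An AoT prompt, in effect, asks the LLM to narrate this loop step by step instead of executing it in code.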
- Model Selection and Configuration:
- Use models that support an "internal scaling" or "thinking" mode if available (like Claude 3.7 Sonnet or Qwen3). This mode allows the model to dedicate more internal computation.
- If such a mode isn't explicit, one might approximate internal scaling by simply allowing the model to generate a much longer response, or by prompting it to "think step-by-step for as long as needed."
- Combining Strategies:
- The key is the combination: use AoT or CoT prompting in conjunction with the internal scaling/thinking mode, so that the structured search process (from AoT/CoT) executes with sufficient computational depth (from internal scaling). A minimal API sketch follows this list.
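As a sketch of that combination, assuming the Anthropic Python SDK's extended-thinking parameter; the model ID, token budgets, and prompt variables are illustrative, and the paper does not publish its exact harness:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

aot_examples = "..."  # few-shot AoT search traces, in the style shown above
problem = "..."       # the new planning instance, stated in natural language

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # a model that exposes a "thinking" mode
    max_tokens=16000,                    # room for the answer after thinking
    thinking={"type": "enabled", "budget_tokens": 8000},  # the internal-scaling knob
    messages=[{"role": "user", "content": f"{aot_examples}\n\n{problem}"}],
)
# Thinking blocks come first in response.content; the final answer text follows.
```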
Deployment Considerations:
- Computational Cost: Internal scaling, especially when leading to exponentially longer thought processes, will increase inference time and cost. However, the paper shows that for practical instances of hard problems, this can still be feasible and yield significant gains.
- Prompt Crafting: AoT, in particular, requires careful crafting of algorithmic examples, which can be domain-specific and labor-intensive.
- Model Capability: The base model's inherent capabilities still matter. Some models might benefit more than others.
Conclusion
The research concludes that LLMs, when prompted with advanced in-context search strategies and augmented with internal test-time scaling, can solve tasks previously considered beyond their reach without external mechanisms or fine-tuning. This challenges prevailing assumptions about LLM limitations and calls for a reassessment of evaluation methodologies to better capture their true operational boundaries.
Limitations and Future Work:
- Theoretical analysis primarily covers internal scaling; extension to parallel/sequential scaling is needed.
- Exploring diverse in-context search algorithms beyond Depth-First Search.
- Developing more automated and efficient reasoning strategies to reduce reliance on hand-crafted prompts.
- Investigating hybrid test-time scaling approaches.
- Understanding interactions with different model architectures and training paradigms.
This work provides strong evidence that the perceived "ceiling" of LLM reasoning is often a byproduct of suboptimal evaluation configurations rather than an intrinsic limitation, opening new avenues for applying LLMs to more complex problem-solving domains.