
Eliciting Reasoning in Language Models with Cognitive Tools (2506.12115v1)

Published 13 Jun 2025 in cs.CL and cs.AI

Abstract: The recent advent of reasoning models like OpenAI's o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains-of-thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for theoretically eliciting reasoning that could help elucidate the underlying mechanisms, as well as providing additional methods that may offer complementary benefits. Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of "cognitive tools" encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our "cognitive tools" to GPT-4.1 increases its pass@1 performance on AIME2024 from 26.7% to 43.3%, bringing it very close to the performance of o1-preview. In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.

This paper, "Eliciting Reasoning in Language Models with Cognitive Tools" (Ebouky et al., 13 Jun 2025), introduces a novel method to enhance the reasoning capabilities of LLMs by endowing them with a set of "cognitive tools." These tools encapsulate specific, modular reasoning operations that the LLM itself executes within an agentic tool-calling framework. The core idea is inspired by cognitive psychology and cognitive architectures, which posit that reasoning arises from the orchestrated, sequential execution of modular cognitive operations.

Problem Addressed:

The paper aims to explore alternative methods to elicit reasoning in LLMs, beyond prevalent techniques like reinforcement learning (RL) fine-tuning. It investigates whether structuring the LLM's internal processing using modular tools can unlock latent reasoning capabilities already present in base models.

Proposed Method: Cognitive Tools

The authors propose a framework where an LLM is provided with a small set of "cognitive tools." Unlike traditional agentic tools that call external APIs (e.g., calculators, search engines), these cognitive tools represent calls to the LLM itself, but with a specific prompt template designed to isolate a particular cognitive operation. The LLM decides when to call a tool, the tool is executed in a sandboxed context (using the same LLM instance), and its output is fed back into the main reasoning loop.
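
As a minimal sketch (the function and variable names here are illustrative, not the paper's code), a cognitive tool amounts to a call back into the same LLM with a dedicated prompt template and a fresh context:

# Illustrative sketch: a "cognitive tool" is a call back into the same LLM,
# executed with a dedicated prompt template in a fresh, isolated context.
def run_cognitive_tool(llm, prompt_template, **fields):
    tool_prompt = prompt_template.format(**fields)  # e.g., fields = {"question": ...}
    return llm.generate(tool_prompt)                # no main-loop history is passed in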

The paper defines four specific cognitive tools:

  1. Understand Question: Prompts the LLM to break down a problem, identify key concepts, extract relevant information, and highlight potentially useful theorems or techniques. This is inspired by goal management in cognitive architectures.
  2. Recall Related: Inspired by analogical reasoning, this tool prompts the LLM to retrieve and present examples of similar, previously solved problems along with their solutions, to guide the current reasoning process.
  3. Examine Answer: Implements a form of self-reflection. The LLM is prompted to check the current reasoning trace for flaws, wrong assumptions, miscalculations, or unaddressed constraints.
  4. Backtracking: When the LLM identifies a flaw or dead-end in its reasoning, this tool prompts it to summarize the current trace, identify the incorrect step, and propose alternative approaches or directions.

A system prompt guides the LLM on how to use these tools, encouraging their use for complex or ambiguous questions and allowing the LLM to decide which tool to call and when. The framework also permits the LLM to generate and use Python code as an additional modular tool.
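
The exact orchestration prompt is given in the paper's Appendix; the snippet below is only a paraphrased sketch of its gist (the variable name system_prompt is reused in the pseudocode further down):

# Paraphrased gist of the orchestration system prompt (not the paper's exact wording).
system_prompt = """You are an expert problem solver with access to these tools:
- understand_question(question): break the problem down and identify key concepts
- recall_related(question): recall similar solved problems and their solutions
- examine_answer(question, current_reasoning): check the current reasoning for flaws
- backtracking(question, current_reasoning): find the wrong step and propose alternatives
- use_code(code): run Python code and return its output

Use these tools when a question is complex or ambiguous. You decide which tool
to call and when. When you are confident in the result, reply with
'ANSWER: <final answer>'.
"""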

Implementation Details:

The pipeline operates like a standard tool-calling system:

  1. The LLM generates a reasoning trace in response to a query.
  2. If a tool call is detected, generation stops.
  3. The specified cognitive tool (which is a dedicated prompt template) is executed by the same LLM in an isolated context.
  4. The tool's output is provided back to the LLM.
  5. The LLM continues its reasoning process, incorporating the tool's output, until it produces a final answer.

The Appendix provides specific prompts used for each cognitive tool and the code execution tool. For instance, the "Understand Question" prompt explicitly asks the LLM not to solve the problem but to analyze it and provide structured steps.
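
The paraphrased sketch below conveys the flavor of that template (it is not the Appendix's exact wording; the name understand_question_prompt_template is the one used in the pseudocode that follows):

# Paraphrased gist of the "Understand Question" tool prompt (not the exact Appendix text).
understand_question_prompt_template = """You are analyzing, NOT solving, the following problem:

{question}

1. Identify the core mathematical concepts involved.
2. Extract the relevant information, symbols, and constraints.
3. Highlight theorems or techniques that could be useful.
4. Propose a structured, step-by-step plan without carrying it out.
"""

The pseudocode below illustrates the full reasoning loop: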

user_query = "Solve the math problem: 'Find the GCD of 3339, 2961, and 1491.'"
current_context = system_prompt + user_query
final_answer = None

# Each cognitive tool maps to its own prompt template; the remaining templates
# are defined analogously to understand_question_prompt_template above.
tool_prompt_templates = {
    "understand_question": understand_question_prompt_template,
    "recall_related": recall_related_prompt_template,
    "examine_answer": examine_answer_prompt_template,
    "backtracking": backtracking_prompt_template,
}

while final_answer is None:
    LLM_response = LLM.generate(current_context)  # LLM generates text, potentially a tool call

    if "ANSWER:" in LLM_response:
        final_answer = parse_answer(LLM_response)
        break

    tool_call = detect_tool_call(LLM_response)  # e.g., "print(understand_question(...))"

    if tool_call:
        tool_name = tool_call.name
        tool_input = tool_call.arguments

        if tool_name == "use_code":
            # Execute the generated Python code and capture its output
            tool_output = execute_python_code(tool_input["code"])
        else:
            # Execute the cognitive tool: another call to the same LLM, run in an
            # isolated (sandboxed) context with the tool's dedicated prompt
            tool_prompt = tool_prompt_templates[tool_name].format(**tool_input)
            tool_output = LLM.generate(tool_prompt)

        # Feed the tool's output back into the main reasoning loop
        current_context += LLM_response + "\nObservation:\n" + tool_output + "\n"
    else:
        # No tool call: just append the reasoning and continue
        current_context += LLM_response + "\n"

print(f"Final Answer: {final_answer}")
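
For the example query above, the use_code tool would ultimately run ordinary Python; a self-contained illustration of what it might execute is shown below (the result can be checked by hand, since 3339 = 21*159, 2961 = 21*141, and 1491 = 21*71):

import math
from functools import reduce

# What a use_code tool call might execute for the example query above.
numbers = [3339, 2961, 1491]
print(reduce(math.gcd, numbers))  # -> 21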

Experiments and Results:

The approach was evaluated on mathematical reasoning benchmarks: AIME 2024, MATH500, AMC, and Smolbenchmark (math task).

  • Models: Open-weight models (Qwen2.5-7B/32B Instruct, Llama3.1-8B Instruct, Llama3.3-70B Instruct) and closed models (GPT-4.1, o1-preview).
  • Baselines: Out-of-the-box model performance and "cognitive prompting" (a monolithic prompt structuring reasoning steps).
  • Metrics: Pass@1 accuracy. For MATH500, an LLM-as-a-judge (GPT-4.1) was used for evaluation (a sketch of such an evaluation loop follows this list).
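
As a rough sketch of how such an evaluation could be implemented (the judge prompt, helper names, and the assumption that pass@1 is computed from a single sampled answer per problem are ours, not the paper's exact protocol):

# Hedged sketch of a pass@1 evaluation loop with an LLM judge;
# names and the judge prompt are illustrative assumptions.
def evaluate_pass_at_1(problems, solve, judge_llm):
    correct = 0
    for problem in problems:
        predicted = solve(problem["question"])  # one sampled answer per problem
        judge_prompt = (
            "Are these two final answers mathematically equivalent? Reply YES or NO.\n"
            f"Reference: {problem['answer']}\nPredicted: {predicted}"
        )
        verdict = judge_llm.generate(judge_prompt)
        correct += verdict.strip().upper().startswith("YES")
    return correct / len(problems)  # pass@1 accuracy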

Key findings:

  • Individual Tool Impact: Each cognitive tool generally improved performance over the baseline, with varying impacts across different models (Table 1). For example, Llama3.3-70B Instruct saw a +26.7% jump on Smolbenchmark with the 'backtracking' tool.
  • Cognitive Tools vs. Cognitive Prompting: The modular cognitive tools consistently outperformed the monolithic cognitive prompting approach (Table 2). For Llama3.3-70B, cognitive tools achieved 80.0% on Smolbenchmark, compared to 66.0% for cognitive prompting and 52.8% for the baseline. This supports the hypothesis that modularity reduces interference and increases flexibility.
  • Main Results: Providing all cognitive tools simultaneously led to consistent and significant improvements across all tested models and benchmarks (Table 3, Figure 2). For Qwen2.5-32B Instruct, AIME 2024 accuracy increased from 17.2% to 32.1%. For Llama3.3-70B Instruct, it increased from 13.1% to 29.8%.
  • Comparison with RL-trained Models: When cognitive tools were added to GPT-4.1, its pass@1 performance on AIME 2024 increased from 26.7% to 43.3%, bringing it very close to the performance of o1-preview (44.6%), a model trained with RL for reasoning (Table 4).

Practical Implications and Contributions:

  1. Improved Reasoning without Retraining: Cognitive tools offer a way to significantly boost LLM reasoning performance on complex tasks like mathematical problem-solving without needing to fine-tune or retrain the base models.
  2. Modularity and Flexibility: The modular nature allows LLMs to focus on specific cognitive operations in isolation, reducing interference. It also provides flexibility, as the LLM decides which tool to use and when, rather than following a rigid, predefined sequence.
  3. Alternative to RL: The results suggest that structured, modular prompting can elicit strong reasoning capabilities inherent in base LLMs, potentially offering a more interpretable and less resource-intensive alternative or complement to RL fine-tuning for reasoning.
  4. Enhanced Transparency: Since reasoning steps can be associated with specific tool calls and their outputs, the process becomes more transparent and explainable.
  5. Bridging Cognitive Science and Agentic AI: The work translates principles from cognitive architectures (modular operations) into a modern agentic tool-calling framework, but for internal reasoning processes rather than external API calls.

Limitations:

  • Manual Tool Definition: The cognitive tools are currently manually defined. Future work could explore automated discovery of such operations.
  • Domain Specificity: Evaluated primarily on mathematical reasoning. The effectiveness and design of tools might need adaptation for other domains.
  • Prompt Engineering: The prompts implementing the cognitive tools might require re-engineering for different model families to achieve optimal performance.

Deployment Considerations:

  • Increased Latency: Each tool call involves an additional LLM inference step (or multiple, if the tool itself is complex), which will increase overall latency compared to a single LLM call.
  • Context Window Management: The history of tool calls and their outputs needs to be managed within the LLM's context window (see the trimming sketch after this list).
  • Computational Cost: More LLM calls mean higher computational cost per query. However, this might be offset by improved accuracy and the avoidance of expensive RL training.
  • Tool Orchestration Logic: The main LLM needs to be capable enough to understand when and how to use the provided tools effectively. The system prompt plays a crucial role here.
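
One simple way to handle the context budget, sketched below under the assumption that the oldest tool observations are the safest to drop (the count_tokens helper is a placeholder, not a specific library call):

# Hedged sketch: keep the system prompt and query, and drop the oldest
# tool interactions once the accumulated context exceeds a token budget.
def trim_context(system_prompt, user_query, turns, count_tokens, budget=32_000):
    # turns: list of (llm_response, observation) string pairs, oldest first
    kept = list(turns)

    def render(items):
        return system_prompt + user_query + "".join(
            r + "\nObservation:\n" + o + "\n" for r, o in items)

    while kept and count_tokens(render(kept)) > budget:
        kept.pop(0)  # discard the oldest tool interaction first
    return render(kept)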

In conclusion, the paper presents a practical and effective method for eliciting and improving reasoning in LLMs by equipping them with self-executable, modular cognitive tools. This approach offers significant performance gains, particularly on challenging mathematical reasoning tasks, and provides insights into the nature of reasoning in LLMs, suggesting that much of this capability may be latent and unlockable via structured interaction.

Authors (3)
  1. Brown Ebouky (2 papers)
  2. Andrea Bartezzaghi (6 papers)
  3. Mattia Rigotti (30 papers)