This paper, "Towards a Deeper Understanding of Reasoning Capabilities in LLMs" (Wong et al., 15 May 2025), investigates the reasoning and adaptive capabilities of LLMs as agents in dynamic environments, contrasting this with their performance on static benchmarks. It systematically evaluates three prompting techniques (self-reflection, heuristic mutation via an Oracle, and planning) using the SmartPlay benchmark, which consists of interactive text-based games requiring skills beyond simple question answering. The authors implement a framework in which an agent interacts with an environment and can optionally receive guidance from Reflection, Oracle, and Planner modules at each step or episode (Figure 1).
Methodology and Implementation:
The core of the evaluation framework is an agent that takes actions based on its prompt, which includes the game manual, task description, history, current state, and possible actions. The prompting techniques explored are the following (a minimal sketch of the combined agent loop appears after this list):
- Reflection: At each timestep, the agent receives a retrospective analysis of its past actions, rewards, and states within the current episode, aimed at identifying potential improvements. This is similar to the Reflexion approach [shinn2023reflexion]. Reflections are reset between episodes.
- Oracle: This module generates and refines general heuristics for the agent's policy across episodes using a (1+1) evolutionary strategy. Based on the trajectory and reflections from the previous episode, the Oracle mutates the current best heuristic set (the parent) to create an offspring. If the offspring performs better in the subsequent episode, it replaces the parent. These heuristics are consistent within an episode but evolve between episodes. This aims to minimize manual prompt engineering for strategy adaptation.
- Planner: This module is forward-looking. At each timestep, it simulates potential action sequences (up to three steps ahead) and recommends the action leading to the highest expected cumulative reward. The Planner uses the game manual, objectives, trajectory, reflection, and current observation as input.
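Taken together, the three modules compose into a single agent loop. Below is a minimal sketch under stated assumptions: `llm(prompt) -> str` is a hypothetical completion function, `env` is a gym-like text environment exposing `reset()`, `step(action)`, and `legal_actions()`, and the prompt wording and helper names are illustrative rather than the authors' actual templates.

```python
# Minimal sketch of the agent loop with optional Reflection, Oracle, and Planner modules.
# All prompts and helper names are illustrative assumptions, not the paper's implementation.

def reflect(llm, trajectory):
    """Reflection: retrospective analysis of the current episode so far."""
    return llm(f"Review this trajectory and suggest improvements:\n{trajectory}")

def mutate_heuristics(llm, parent, trajectory, reflection):
    """Oracle: mutate the parent heuristic set into an offspring ((1+1) evolutionary strategy)."""
    return llm(
        "Refine these heuristics using the last episode's trajectory and reflection.\n"
        f"Heuristics: {parent}\nTrajectory: {trajectory}\nReflection: {reflection}"
    )

def plan(llm, manual, objective, trajectory, reflection, obs, actions):
    """Planner: simulate up to three steps ahead and recommend the best action."""
    return llm(
        f"Manual: {manual}\nObjective: {objective}\nHistory: {trajectory}\n"
        f"Reflection: {reflection}\nObservation: {obs}\nPossible actions: {actions}\n"
        "Simulate up to 3 steps ahead and return the action with the highest "
        "expected cumulative reward."
    )

def run(env, llm, manual, objective, n_episodes=20,
        use_reflection=True, use_oracle=True, use_planner=True):
    parent, parent_score = "", float("-inf")  # best heuristic set found so far
    offspring = parent                        # heuristics evaluated in the next episode
    for _ in range(n_episodes):
        obs = env.reset()
        trajectory, reflection, total, done = [], "", 0.0, False
        while not done:
            if use_reflection:                # reflections are reset between episodes
                reflection = reflect(llm, trajectory)
            suggestion = (plan(llm, manual, objective, trajectory, reflection,
                               obs, env.legal_actions()) if use_planner else "")
            prompt = (f"{manual}\n{objective}\nHeuristics: {offspring}\n"
                      f"Reflection: {reflection}\nPlanner suggestion: {suggestion}\n"
                      f"History: {trajectory}\nObservation: {obs}\n"
                      f"Choose one of: {env.legal_actions()}")
            action = llm(prompt)
            obs, reward, done = env.step(action)
            trajectory.append((action, reward, obs))
            total += reward
        if use_oracle:
            if total > parent_score:          # offspring replaces parent only if it scored higher
                parent, parent_score = offspring, total
            offspring = mutate_heuristics(llm, parent, trajectory, reflection)
    return parent
```

The (1+1) update at the end of each episode is the entire Oracle mechanism: the offspring heuristic set is kept as the new parent only if the episode it guided scored higher than the parent's best episode, and a fresh offspring is then mutated from the current parent for the next episode.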
The authors used several open-source LLMs: Llama3-8B, Mistral-Nemo-12b, DeepSeek-R1-14b, and Llama3.3-70B. Experiments were conducted on NVIDIA RTX 4090, RTX 3060, and A40 GPUs for the smaller models, and on A100 machines for Llama3.3-70B, highlighting the significant computational resources required to evaluate larger models, especially with advanced prompting. The evaluation environments from SmartPlay [wu2023smartplay] included:
- Bandit: A simple exploration-exploitation task.
- Rock Paper Scissors (RPS): Requires learning opponent patterns and probabilistic reasoning.
- Tower of Hanoi: Demands planning, spatial reasoning, and strict instruction following regarding disk placement rules.
- Messenger: Tests long-text interpretation, spatial reasoning, probabilistic thinking, and navigation while avoiding an enemy.
Evaluations ran for 20 episodes, with Messenger extended to a 10-step horizon due to task difficulty. Performance metrics varied by game (e.g., optimal actions, disks moved, reward). Three runs were conducted per model-strategy pair due to computational constraints.
Key Findings and Practical Implications:
The paper's results yield several practical insights for deploying LLMs as agents:
- Model Size Matters: Larger models generally achieve higher performance across tasks, confirming scaling laws [kaplan2020scaling]. The performance gap between smaller and larger models is more pronounced in complex tasks like Hanoi.
- Prompting Effects Vary: Strategic prompting can help smaller/mid-sized models match or exceed the baseline performance of larger models on specific tasks (e.g., Llama3-8B with Reflection + Oracle on RPS, Mistral-Nemo-12b with Reflection + Planner on Messenger).
- Excessive Prompting Can Hurt: For simple reactive tasks like Bandit, adding complex reasoning (Reflection + Planner) degraded the performance of smaller models compared to a simpler base prompt. This aligns with findings that excessive context or reasoning can dilute relevant information and cause 'overthinking' [tworkowski2023focused, liu2023lost, chen2024dont].
- Strategies Benefit Different Skills: Correlation analysis (Figure 3) indicates that strategies generally improve Instruction Following across all models. For smaller and mid-sized models, they also aid Long-Text Understanding. For the largest model (Llama3.3-70B), strategies show stronger correlations with Learning from Interactions and Generalization.
- Advanced Reasoning is Brittle: While advanced strategies can lead to peak performance improvements, they also introduce significant variability (large min-max score ranges in Table 1). A strategy that works well in one run might perform poorly in another, making these approaches unreliable for deployment without extensive testing.
- Sparse Rewards are Challenging: In environments like Hanoi and Messenger with sparse rewards (given only for completing sub-goals or the final task), agents struggled significantly. Failures included invalid moves, repetitive actions, object misidentification, and poor spatial awareness (Example 2). Reward shaping, i.e., providing denser, incremental rewards (see the sketch after this list), improved some aspects such as message pickup rates in Messenger but did not consistently lead to higher goal completion rates or overcome fundamental spatial/planning limitations (Table 2, Table 3). This suggests that LLMs struggle to learn effective long-term strategies solely from sparse environmental feedback.
- Lack of True Emergent Reasoning: Despite using advanced prompting techniques, the agents showed little evidence of true self-learning or emergent reasoning in tasks requiring complex planning and spatial coordination. Common failure modes involved hallucinating invalid actions or getting stuck in loops, indicating limitations not easily overcome by in-context prompting alone.
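To make the reward-shaping idea above concrete, the snippet below layers denser, incremental rewards on top of a sparse-reward environment. This is a generic sketch rather than the paper's exact scheme: the Gymnasium wrapper interface is standard, but the bonus terms and the `info` keys (`picked_up_message`, `distance_to_goal`) are hypothetical.

```python
# Generic reward-shaping wrapper for a sparse-reward environment.
# Bonus terms and info keys are illustrative assumptions, not the paper's scheme.
import gymnasium as gym

class ShapedRewardWrapper(gym.Wrapper):
    def __init__(self, env, pickup_bonus=0.5, progress_scale=0.1):
        super().__init__(env)
        self.pickup_bonus = pickup_bonus
        self.progress_scale = progress_scale
        self._prev_distance = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._prev_distance = info.get("distance_to_goal")
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Sub-goal bonus: reward picking up the message even before delivery.
        if info.get("picked_up_message"):
            reward += self.pickup_bonus
        # Progress bonus: small positive reward for moving closer to the goal.
        dist = info.get("distance_to_goal")
        if dist is not None and self._prev_distance is not None:
            reward += self.progress_scale * (self._prev_distance - dist)
        self._prev_distance = dist
        return obs, reward, terminated, truncated, info
```

Shaping of this kind can raise sub-goal rates (e.g., message pickup) without fixing the underlying spatial and planning failures, which is consistent with the mixed results the paper reports in Tables 2 and 3.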
Implementation Considerations:
For practitioners looking to implement LLM agents for dynamic tasks based on this research, key considerations include:
- Model Selection: Larger models offer better baseline performance, but computational cost increases dramatically. Mid-sized models might be viable with strategic prompting for specific tasks, but their performance is less reliable.
- Prompt Engineering vs. Learned Strategies: The Oracle attempts to automate heuristic generation, reducing manual prompt engineering per episode. However, the overall brittleness of advanced strategies suggests that finding truly effective, robust prompts or learned strategies for complex dynamic tasks remains challenging.
- Context Management: The paper highlights that increasing prompt length with reasoning steps can negatively impact smaller models. Efficient context management and maintaining a high signal-to-noise ratio in the prompt are crucial.
- Environment Design & Reward: For tasks where LLMs struggle with planning and spatial reasoning, modifying the environment or providing denser, task-aligned reward signals can help. However, as shown in the Hanoi experiments, even explicit hints about valid actions or reward shaping don't guarantee success if the core understanding of task constraints is lacking.
- Evaluation Beyond Aggregate Metrics: The high variability in performance across runs for advanced strategies suggests that simply reporting average scores (common in static benchmarks) is insufficient for evaluating agent capabilities in dynamic environments. Minimum, median, and maximum scores, along with qualitative analysis of failure modes, provide a more complete picture.
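A minimal sketch of this reporting protocol follows, assuming a hypothetical `run_episode` callable that stands in for the agent loop sketched earlier; it mirrors the paper's setup of three independent runs of 20 episodes per model-strategy pair and reports minimum, median, and maximum scores rather than only a mean.

```python
# Report min/median/max across independent runs instead of a single average.
# `run_episode` is a hypothetical stand-in for one episode of the agent loop.
import statistics

def evaluate(run_episode, n_runs=3, n_episodes=20):
    run_scores = []
    for seed in range(n_runs):
        episode_scores = [run_episode(seed=seed, episode=ep) for ep in range(n_episodes)]
        run_scores.append(sum(episode_scores) / n_episodes)  # mean score for this run
    return {
        "min": min(run_scores),
        "median": statistics.median(run_scores),
        "max": max(run_scores),
    }
```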
The paper concludes that current LLMs still have fundamental shortcomings in general reasoning, planning, and spatial coordination when faced with dynamic challenges. It argues for moving beyond static benchmarks and calls for future work integrating in-context learning with external memory, symbolic abstractions, and multimodal perception to build more capable and reliable AI agents.