On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models (2405.13966v1)

Published 22 May 2024 in cs.AI and cs.CL

Abstract: The reasoning abilities of LLMs remain a topic of debate. Some methods, such as ReAct-based prompting, have gained popularity for claiming to enhance the sequential decision-making abilities of agentic LLMs. However, it is unclear what the source of this improvement is. In this paper we examine the claims that ReAct-based prompting improves sequential decision-making in agentic LLMs. By introducing systematic variations to the input prompt, we perform a sensitivity analysis along the claims of ReAct and find that performance is minimally influenced by the "interleaving reasoning trace with action execution" or by the content of the generated reasoning traces, contrary to the original claims and common usage. Instead, the performance of LLMs is driven by the similarity between the input example tasks and the queries, which implicitly forces the prompt designer to provide instance-specific examples and significantly increases the cognitive burden on the human. Our investigation shows that the perceived reasoning abilities of LLMs stem from exemplar-query similarity and approximate retrieval rather than any inherent reasoning ability.

Authors (3)
  1. Mudit Verma (25 papers)
  2. Siddhant Bhambri (16 papers)
  3. Subbarao Kambhampati (126 papers)
Citations (3)

Summary

The paper "On the Brittle Foundations of ReAct Prompting for Agentic LLMs" (Verma et al., 22 May 2024 ) critically examines the efficacy and underlying mechanisms of ReAct prompting in enhancing the sequential decision-making abilities of agentic LLMs. Contrary to the prevailing belief that ReAct's interleaving of reasoning traces with action execution is the primary driver of improved performance, the paper posits that the performance of LLMs under ReAct prompting is predominantly influenced by the similarity between the example tasks provided in the prompt and the query task itself. This dependence on exemplar-query similarity questions the purported emergent reasoning abilities of LLMs and places a significant cognitive burden on prompt engineers.

Sensitivity Analysis of ReAct Claims

The authors conduct a meticulous sensitivity analysis by systematically varying the input prompt along multiple dimensions to deconstruct the claims made by ReAct. These variations target three primary aspects: the interleaving of reasoning trace with action execution, the nature of the reasoning trace, and the similarity between the example and the query.
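
To make the setup concrete, here is a minimal sketch of how such a sensitivity study can be organized: each variation is a transformation applied to a base few-shot prompt, and every variant is scored on the same set of query tasks. The function and variable names are hypothetical; the paper's actual harness and prompts are not reproduced here.

```python
# Hypothetical harness for a prompt-sensitivity study (illustrative only).
from typing import Callable, Dict, List

def evaluate(prompt: str, queries: List[str],
             solve: Callable[[str, str], bool]) -> float:
    """Return the success rate of an LLM agent (`solve`) over the query tasks."""
    successes = sum(solve(prompt, query) for query in queries)
    return successes / len(queries)

def run_sensitivity_study(base_prompt: str,
                          variants: Dict[str, Callable[[str], str]],
                          queries: List[str],
                          solve: Callable[[str, str], bool]) -> Dict[str, float]:
    """Apply each prompt transformation and measure its effect on success rate."""
    results = {"ReAct (original)": evaluate(base_prompt, queries, solve)}
    for name, transform in variants.items():
        results[name] = evaluate(transform(base_prompt), queries, solve)
    return results
```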

Interleaving Reasoning Trace with Action Execution (RQ1)

ReAct's original claim emphasizes the importance of interleaving reasoning steps (the "think" steps) with action execution for improved planning. To challenge this claim, the paper experiments with variations where the reasoning trace is collated into a single "think" step before action execution, analogous to Chain-of-Thought prompting ("Exemplar-based CoT" and "Anonymized Exemplar-CoT"). In "Anonymized Exemplar-CoT," the reasoning is further generalized by removing specific object and location references. The findings indicate that LLM performance improves when the reasoning trace is not interleaved with action execution, particularly for GPT models, which contradicts ReAct's core assertion. Claude-Opus showed a slight dip in performance, but its success rate was still reasonably high.
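
To illustrate the distinction, the sketch below contrasts an interleaved ReAct-style exemplar with a collated, CoT-style exemplar. The exemplar text is hypothetical, loosely modeled on AlfWorld heat-style tasks, and is not taken from the paper's prompts.

```python
# Illustrative AlfWorld-style exemplars (hypothetical text, not the paper's prompts).

# ReAct: reasoning ("think") steps are interleaved with individual actions.
REACT_EXEMPLAR = """\
> think: To heat the egg, I first need to find it. It is likely on the countertop.
> go to countertop 1
You see an egg 1.
> take egg 1 from countertop 1
> think: Now I need to heat the egg with the microwave.
> go to microwave 1
> heat egg 1 with microwave 1
"""

# Exemplar-based CoT: the reasoning is collated into a single think step up front,
# followed by the uninterrupted action sequence.
COT_EXEMPLAR = """\
> think: To heat the egg, I need to find it (likely on the countertop), take it,
  go to the microwave, and heat it there.
> go to countertop 1
You see an egg 1.
> take egg 1 from countertop 1
> go to microwave 1
> heat egg 1 with microwave 1
"""
```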

Nature of Reasoning Trace/Guidance Information (RQ2)

ReAct posits that the reasoning trace provides valuable plan guidance, thereby enhancing LLM performance. The paper tests the impact of different types of guidance information within the "think" tags: introducing invalid actions and the simulator's response ("Nothing happens.") into the example prompts; augmenting these failure examples with explanations of why the actions are invalid; reversing the order of subtasks within the reasoning trace; and replacing task-relevant reasoning with a generic prompt-engineering trick ("Take a deep breath and work on this problem step-by-step"). The paper finds that weaker guidance (failure examples) can improve performance and that placebo guidance performs comparably to strong, reasoning-based guidance. Moreover, the ordering of the reasoning trace has little impact. These results suggest that the content of the reasoning trace is not the primary determinant of performance.
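
A minimal sketch of two such guidance substitutions is shown below, assuming exemplars formatted with "> think:" lines as in the previous sketch. The placebo text is the generic instruction reported in the paper; the function names are hypothetical.

```python
import re

PLACEBO = "Take a deep breath and work on this problem step-by-step"

def replace_with_placebo(exemplar: str) -> str:
    """Swap every task-specific 'think' step for the generic placebo guidance."""
    return re.sub(r"(> think:).*", rf"\1 {PLACEBO}", exemplar)

def reverse_think_order(exemplar: str) -> str:
    """Reverse the order of the reasoning ('think') lines while keeping actions fixed."""
    lines = exemplar.splitlines()
    think_idx = [i for i, line in enumerate(lines) if line.startswith("> think:")]
    for i, j in zip(think_idx, reversed(think_idx)):
        if i >= j:  # stop once the swap positions meet in the middle
            break
        lines[i], lines[j] = lines[j], lines[i]
    return "\n".join(lines)
```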

Similarity Between Example and Query (RQ3)

The paper directly examines how the similarity between the example problems in the prompt and the query problem affects LLM performance. The authors introduce variations in the example prompts, including replacing object and location names with synonyms; changing the goal location and adding repetitive, futile actions to the example; using examples from different tasks within the AlfWorld domain (Put, Clean, Heat, Cool, Examine, PutTwo); and providing examples with optimal, shortest-path solutions. The most salient finding is that even minor variations in the example prompt, such as using synonyms or providing examples from different but related tasks, can drastically reduce LLM performance. Instance-specific examples are critical for success, underscoring the LLM's dependence on the similarity of the exemplars to the query task. Variations such as 'Unrolling' and 'Subtask Similarity' also significantly degraded performance.
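
For instance, the synonym variation can be sketched as a purely surface-level rewrite of the exemplar that leaves the query untouched, so that only exemplar-query similarity changes. The synonym map below is hypothetical, not the paper's actual substitution list.

```python
# Hypothetical synonym map for AlfWorld-style object/location names (illustrative).
SYNONYMS = {
    "countertop": "kitchen counter",
    "microwave": "microwave oven",
    "egg": "hen's egg",
    "cabinet": "cupboard",
}

def synonym_swap(exemplar: str, synonyms: dict = SYNONYMS) -> str:
    """Replace object/location names in the exemplar with synonyms, lowering its
    surface similarity to the query while preserving the underlying plan."""
    for word, synonym in synonyms.items():
        exemplar = exemplar.replace(word, synonym)
    return exemplar
```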

Implications for LLM Reasoning

The paper's findings challenge the notion that ReAct-based prompting genuinely enhances the reasoning abilities of LLMs. Instead, the observed performance appears to be driven by pattern matching and approximate retrieval from the prompt, contingent on a high degree of similarity between the example tasks and the query task. This places a considerable burden on prompt engineers to create instance-specific examples, which may not be scalable for complex or diverse domains. The paper casts doubt on claims of enhanced "emergent reasoning" in LLMs through prompt engineering techniques like ReAct, aligning with other research questioning the true reasoning capabilities of these models.

In conclusion, the paper provides a nuanced perspective on the effectiveness of ReAct prompting, highlighting its limitations and underlying dependencies. The sensitivity analysis reveals that the perceived reasoning abilities of LLMs are more attributable to exemplar-query similarity than to the inherent design of ReAct itself. This underscores the need for a more critical evaluation of prompt engineering techniques and their impact on the purported reasoning capabilities of LLMs.