The paper "On the Brittle Foundations of ReAct Prompting for Agentic LLMs" (Verma et al., 22 May 2024 ) critically examines the efficacy and underlying mechanisms of ReAct prompting in enhancing the sequential decision-making abilities of agentic LLMs. Contrary to the prevailing belief that ReAct's interleaving of reasoning traces with action execution is the primary driver of improved performance, the paper posits that the performance of LLMs under ReAct prompting is predominantly influenced by the similarity between the example tasks provided in the prompt and the query task itself. This dependence on exemplar-query similarity questions the purported emergent reasoning abilities of LLMs and places a significant cognitive burden on prompt engineers.
Sensitivity Analysis of ReAct Claims
The authors conduct a meticulous sensitivity analysis by systematically varying the input prompt along multiple dimensions to deconstruct the claims made by ReAct. These variations target three primary aspects: the interleaving of reasoning trace with action execution, the nature of the reasoning trace, and the similarity between the example and the query.
Interleaving Reasoning Trace with Action Execution (RQ1)
ReAct's original claim emphasizes the importance of interleaving reasoning steps (the "think" steps) with action execution for improved planning. To challenge this claim, the paper experiments with variations where the reasoning trace is collated into a single "think" step before action execution, analogous to Chain-of-Thought prompting ("Exemplar-based CoT" and "Anonymized Exemplar-CoT"). In "Anonymized Exemplar-CoT," the reasoning is further generalized by removing specific object and location references. The findings indicate that LLM performance improves when the reasoning trace is not interleaved with action execution, particularly for GPT models, which contradicts ReAct's core assertion. Claude-Opus showed a slight dip in performance, but its success rate was still reasonably high.
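As a rough sketch of this collation, the hypothetical helper below merges every "think" step of an interleaved exemplar into a single reasoning block placed right after the task line and before the first action, assuming exemplars formatted like the trace sketched earlier. The function name and output format are assumptions, not the authors' implementation.

```python
def collate_think_steps(exemplar: str) -> str:
    """Collapse an interleaved ReAct exemplar into a CoT-style one: all "think"
    content is merged into a single reasoning block placed after the task line,
    before any action. The input format (task line, "> think:" lines answered
    with "OK.", then action/observation pairs) is an assumption.
    """
    thoughts: list[str] = []
    kept: list[str] = []
    for line in exemplar.splitlines():
        if line.startswith("> think:"):
            thoughts.append(line.removeprefix("> think:").strip())
        elif line.strip() != "OK.":  # drop the simulator's think acknowledgement
            kept.append(line)
    task_line, actions = kept[0], kept[1:]
    collated = "> think: " + " ".join(thoughts)
    return "\n".join([task_line, collated, "OK.", *actions])

# Example: collate_think_steps(REACT_EXEMPLAR) yields one up-front "think"
# block followed by the same action/observation sequence.
```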
Nature of Reasoning Trace/Guidance Information (RQ2)
ReAct posits that the reasoning trace provides valuable plan guidance, thereby enhancing LLM performance. The paper tests the impact of different types of guidance information within the "think" tags: introducing invalid actions together with the simulator's response ("Nothing happens.") into the example prompts, augmenting these failure examples with explanations of why the actions are invalid, reversing the order of subtasks within the reasoning trace, and replacing task-relevant reasoning with a generic prompt-engineering trick ("Take a deep breath and work on this problem step-by-step"). The results show that weaker guidance (failure examples) can improve performance, that the placebo guidance performs comparably to strong, reasoning-based guidance, and that the ordering of the reasoning trace has little impact. These results suggest that the content of the reasoning trace is not the primary determinant of performance.
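The placebo variation can be pictured as a simple substitution over the exemplar, as in the hedged sketch below: every task-relevant "think" step is swapped for the generic phrase while the actions and observations are left untouched. The helper name and substitution format are assumptions.

```python
PLACEBO = "Take a deep breath and work on this problem step-by-step."

def placebo_guidance(exemplar: str) -> str:
    """Replace the content of every "think" step with the generic placebo
    phrase, keeping the surrounding actions and observations intact. A sketch
    of the placebo-guidance variation; the exact format is an assumption.
    """
    rewritten = []
    for line in exemplar.splitlines():
        if line.startswith("> think:"):
            rewritten.append(f"> think: {PLACEBO}")
        else:
            rewritten.append(line)
    return "\n".join(rewritten)
```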
Similarity Between Example and Query (RQ3)
The paper directly examines how the similarity between the example problems in the prompt and the query problem affects LLM performance. The authors introduce variations in the example prompts: replacing object and location names with synonyms, changing the goal location and adding repetitive, futile actions to the example, using examples from different tasks within the AlfWorld domain (Put, Clean, Heat, Cool, Examine, PutTwo), and providing examples with optimal, shortest-path solutions. The most salient finding is that even minor variations in the example prompt, such as using synonyms or drawing examples from different but related tasks, drastically reduce LLM performance; performance also drops significantly under variations such as "Unrolling" and "Subtask Similarity". Instance-specific examples are critical for success, underscoring the LLM's dependence on the similarity of the exemplars to the query task.
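The synonym perturbation amounts to a surface-level rewrite of the exemplar, as in the sketch below; the synonym map and the `synonym_swap` helper are hypothetical and not drawn from the paper.

```python
import re

# Hypothetical synonym map; the paper's actual substitutions are not reproduced here.
SYNONYMS = {"mug": "cup", "desk": "table", "countertop": "counter"}

def synonym_swap(exemplar: str, synonyms: dict[str, str] = SYNONYMS) -> str:
    """Rewrite object/location names in an exemplar via whole-word synonym
    substitution, a sketch of the RQ3 perturbation that probes how sensitive
    the LLM is to surface-level exemplar-query similarity."""
    for word, replacement in synonyms.items():
        exemplar = re.sub(rf"\b{re.escape(word)}\b", replacement, exemplar)
    return exemplar
```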
Implications for LLM Reasoning
The paper's findings challenge the notion that ReAct-based prompting genuinely enhances the reasoning abilities of LLMs. Instead, the observed performance appears to be driven by pattern matching and approximate retrieval from the prompt, contingent on a high degree of similarity between the example tasks and the query task. This places a considerable burden on prompt engineers to create instance-specific examples, which may not be scalable for complex or diverse domains. The paper casts doubt on claims of enhanced "emergent reasoning" in LLMs through prompt engineering techniques like ReAct, aligning with other research questioning the true reasoning capabilities of these models.
In conclusion, the paper provides a nuanced perspective on the effectiveness of ReAct prompting, highlighting its limitations and underlying dependencies. The sensitivity analysis reveals that the perceived reasoning abilities of LLMs are more attributable to exemplar-query similarity than to the inherent design of ReAct itself. This underscores the need for a more critical evaluation of prompt engineering techniques and their impact on the purported reasoning capabilities of LLMs.