
On the Planning Abilities of Large Language Models: A Critical Investigation (2305.15771v2)

Published 25 May 2023 in cs.AI

Abstract: Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks and (2) the potential of LLMs in LLM-Modulo settings where they act as a source of heuristic guidance for external planners and verifiers. We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of ~12% across the domains. However, the results in the LLM-Modulo setting show more promise. In the LLM-Modulo setting, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners and additionally show that external verifiers can help provide feedback on the generated plans and back-prompt the LLM for better plan generation.

Overview of "On the Planning Abilities of LLMs: A Critical Investigation"

The paper "On the Planning Abilities of LLMs: A Critical Investigation" by Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati aims to rigorously evaluate the planning capabilities of LLMs, specifically examining their effectiveness both in autonomous settings and as heuristic sources for other planning agents.

Objectives and Methodology

The paper primarily examines two questions:

  1. Autonomous Planning: Can LLMs generate executable plans autonomously for commonsense planning tasks?
  2. Heuristic Guidance: Can LLMs, in an LLM-Modulo setting, provide useful heuristic guidance to external planners and verifiers?

For this investigation, the researchers employed a wide range of LLMs, including GPT-4, GPT-3.5, InstructGPT-3.5, InstructGPT-3, GPT-3, and BLOOM. The evaluation utilized domains similar to those in the International Planning Competition (IPC), with a balanced mix of classic planning problems like Blocksworld and more challenging setups involving obfuscated names for actions and objects.
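To make the obfuscated setup concrete, the following is a minimal sketch, assuming an illustrative Blocksworld-style prompt fragment and an `obfuscate` helper of my own naming (not the authors' tooling), of how action and object names can be replaced with opaque symbols before the problem is shown to the model:

```python
import re

# Illustrative Blocksworld-style prompt fragment (an assumption, not the paper's exact prompt).
PROBLEM = (
    "Actions: pick-up, put-down, stack, unstack.\n"
    "Objects: block-a, block-b, block-c.\n"
    "Goal: block-a is on block-b and block-b is on block-c."
)

NAMES = ["unstack", "stack", "pick-up", "put-down", "block-a", "block-b", "block-c"]

def obfuscate(text: str, names: list) -> str:
    """Replace meaningful action/object names with opaque symbols so a model
    cannot lean on familiar Blocksworld vocabulary. Longer names are replaced
    first so that e.g. 'unstack' is not clobbered by the rewrite of 'stack'."""
    mapping = {name: f"sym{i}" for i, name in enumerate(sorted(names, key=len, reverse=True))}
    for name, alias in mapping.items():
        text = re.sub(re.escape(name), alias, text)
    return text

print(obfuscate(PROBLEM, NAMES))
```

Comparing the same model on the original and obfuscated variants of identical problems is what separates genuine planning ability from recall of familiar surface patterns.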

Key Findings

  1. Autonomous Planning:
    • LLMs demonstrate limited success in generating correct plans autonomously. The best model, GPT-4, achieved an average success rate of only ~12% across the evaluated domains.
    • Performance further deteriorates when obfuscated names are used, indicating that LLMs likely rely on pattern-matching rather than robust reasoning.
  2. Heuristic Guidance:
    • The LLM-Modulo setting shows more potential. LLM-generated plans can effectively guide underlying sound planners:
      • When used as seed plans for the LPG local-search planner, GPT-4's plans significantly reduced the number of search steps needed to reach a correct plan compared to empty or random seed plans.
    • Backprompting with feedback from an external verifier (VAL) improved the LLM's plan quality over successive iterations. With feedback, GPT-4 corrected its plans in up to 82% of Blocksworld instances (a minimal sketch of this loop follows the list).
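The sketch below outlines the backprompt loop under stated assumptions: `generate_plan` stands in for an LLM call and `validate` for a wrapper around an external verifier such as VAL; both names, and the default round budget, are illustrative rather than the paper's implementation.

```python
def backprompt_loop(problem: str, generate_plan, validate, max_rounds: int = 15):
    """Iteratively re-prompt the LLM with verifier feedback until a plan
    is accepted or the round budget is exhausted.

    generate_plan(prompt) -> str            # hypothetical LLM call
    validate(problem, plan) -> (bool, str)  # hypothetical verifier wrapper (e.g. around VAL)
    """
    prompt = problem
    plan = generate_plan(prompt)
    for _ in range(max_rounds):
        ok, feedback = validate(problem, plan)
        if ok:
            return plan  # executable, goal-reaching plan
        # Back-prompt: append the verifier's error report (e.g. unmet
        # preconditions or unsatisfied goals) and request a revised plan.
        prompt = (
            f"{prompt}\n\nYour previous plan:\n{plan}\n\n"
            f"Verifier feedback:\n{feedback}\nPlease provide a corrected plan."
        )
        plan = generate_plan(prompt)
    return None  # no valid plan within the budget
```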

Implications and Future Directions

Theoretical Implications

The results reinforce the notion that while LLMs have impressive breadth in pattern recognition due to their extensive training on web data, they fall short on tasks that require deep combinatorial search and logical reasoning. This underscores the current limitations of LLMs in automating planning tasks that have traditionally been handled by well-vetted symbolic planning algorithms.

Practical Implications

Despite their limitations in autonomous modes, LLMs offer promising utility as heuristic sources that can guide more robust planning algorithms. This creates opportunities for hybrid systems where LLMs contribute creatively generated drafts that are then refined by traditional planning systems. Additionally, backprompting mechanisms can be leveraged to iteratively enhance the quality of LLM-generated plans, making them more reliable and applicable in real-world settings where both correct and executable plans are imperative.
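A minimal sketch of this hybrid pattern follows, assuming hypothetical `llm_draft_plan`, `local_search_repair`, and `is_valid` helpers; the paper's experiments use the LPG planner, whose interface is not reproduced here.

```python
def hybrid_plan(problem, llm_draft_plan, local_search_repair, is_valid):
    """Use an LLM-generated draft as the seed plan for a sound local-search
    planner, falling back to an empty seed if the seeded search fails.

    llm_draft_plan(problem) -> list             # hypothetical LLM call
    local_search_repair(problem, seed) -> list  # hypothetical planner wrapper (e.g. around LPG)
    is_valid(problem, plan) -> bool             # hypothetical verifier check
    """
    seed = llm_draft_plan(problem) or []
    plan = local_search_repair(problem, seed)  # planner repairs/extends the draft
    if is_valid(problem, plan):
        return plan
    # If the seeded search does not produce a valid plan, retry from scratch.
    return local_search_repair(problem, [])
```

The intuition, consistent with the paper's LPG results, is that a roughly right draft places the search closer to a solution than an empty or random seed, so fewer repair steps are needed.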

Future Developments

Future research can explore several promising avenues:

  • Enhanced Verification Mechanisms: Feedback loops driven by external verifiers can be streamlined and enhanced, incorporating more sophisticated forms of feedback that not only flag errors but also help optimize the planning process.
  • Domain-Invariant Planning: Further research into making LLMs robust against variations in domain specifications, including obfuscated and randomized domains, could enhance their utility.
  • Domain-Specific Fine-Tuning: Although initial attempts at fine-tuning GPT-3 showed limited success, more nuanced fine-tuning strategies or combinations of multiple models may yield better results.

In conclusion, while LLMs currently lack the internal reasoning mechanisms to reliably generate correct plans in complex domains on their own, they show significant promise as heuristic collaborators within hybrid planning systems. This dual approach combines the strengths of LLMs in creative, broad-stroke generation with the precise, fine-grained search capabilities of traditional planners. The evolving role of LLMs in the planning landscape is thus both a testament to their capabilities and an acknowledgment of their limitations, encouraging a symbiotic integration into automated planning tasks.

Authors (4)
  1. Karthik Valmeekam (17 papers)
  2. Matthew Marquez (6 papers)
  3. Sarath Sreedharan (41 papers)
  4. Subbarao Kambhampati (126 papers)
Citations (170)