Can Large Language Models Really Improve by Self-critiquing Their Own Plans? (2310.08118v1)

Published 12 Oct 2023 in cs.AI

Abstract: There have been widespread claims about LLMs being able to successfully verify or self-critique their candidate solutions in reasoning problems in an iterative mode. Intrigued by those claims, in this paper we set out to investigate the verification/self-critiquing abilities of LLMs in the context of planning. We evaluate a planning system that employs LLMs for both plan generation and verification. We assess the verifier LLM's performance against ground-truth verification, the impact of self-critiquing on plan generation, and the influence of varying feedback levels on system performance. Using GPT-4, a state-of-the-art LLM, for both generation and verification, our findings reveal that self-critiquing appears to diminish plan generation performance, especially when compared to systems with external, sound verifiers. Moreover, the LLM verifiers in that system produce a notable number of false positives, compromising the system's reliability. Additionally, the nature of feedback, whether binary or detailed, showed minimal impact on plan generation. Collectively, our results cast doubt on the effectiveness of LLMs in a self-critiquing, iterative framework for planning tasks.
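For orientation, the sketch below illustrates the kind of generate-verify-backprompt loop the paper evaluates: a generator LLM proposes a plan, a verifier critiques it, and the critique (detailed or merely binary) is fed back until the plan is accepted or an iteration budget is exhausted. The function names, feedback strings, and iteration cap are illustrative assumptions, not the authors' actual implementation; swapping `verify` between an LLM self-critique and an external sound verifier such as VAL corresponds to the two configurations the paper compares.

```python
# Minimal sketch of an iterative plan-generate / verify / backprompt loop.
# All names (generate_plan, verify), the feedback strings, and MAX_ITERATIONS
# are assumptions for illustration, not the paper's implementation.

from typing import Callable, Optional, Tuple

MAX_ITERATIONS = 15  # assumed iteration budget


def iterative_planning(
    generate_plan: Callable[[str, Optional[str]], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    problem: str,
    detailed_feedback: bool = True,
) -> Optional[str]:
    """Generate a candidate plan, have a verifier critique it, and backprompt
    the generator with the critique until the plan is accepted or the budget
    runs out.

    `verify` may be an LLM self-critique (the setting under study) or an
    external sound verifier such as VAL (the baseline it is compared to).
    """
    feedback: Optional[str] = None
    for _ in range(MAX_ITERATIONS):
        plan = generate_plan(problem, feedback)
        is_valid, critique = verify(problem, plan)
        if is_valid:
            # With an LLM verifier this acceptance can be a false positive.
            return plan
        # Binary feedback only signals failure; detailed feedback returns the critique.
        feedback = critique if detailed_feedback else "The plan is invalid. Try again."
    return None  # no accepted plan within the iteration budget
```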

Authors (3)
  1. Karthik Valmeekam (17 papers)
  2. Matthew Marquez (6 papers)
  3. Subbarao Kambhampati (126 papers)
Citations (66)