ACPBench: Reasoning about Action, Change, and Planning (2410.05669v2)

Published 8 Oct 2024 in cs.AI

Abstract: There is an increasing body of work that uses LLMs as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. It is therefore imperative to evaluate LLMs on the core skills that planning requires. In this work, we present ACPBench, a benchmark for evaluating reasoning about action, change, and planning. The benchmark comprises 7 reasoning tasks over 13 planning domains. Because the collection is constructed from planning domains described in a formal language, we can synthesize problems with provably correct solutions across many tasks and domains, and we can scale the collection without additional human effort: many additional problems can be created automatically. Our extensive evaluation of 22 LLMs and the OpenAI o1 reasoning models highlights a significant gap in the reasoning capabilities of current LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains on multiple-choice questions, yet, surprisingly, no notable progress on Boolean questions. The ACPBench collection is available at https://ibm.github.io/ACPBench.
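
The construction recipe the abstract describes, deriving questions whose ground-truth answers follow from a formal domain model, can be illustrated with a small sketch. The Python below is a hypothetical example, not ACPBench's actual generator: it assumes a STRIPS-style grounded representation (a state as a set of facts; an action with preconditions and add/delete effects) and emits a Boolean "applicability" question whose label is correct by construction.

```python
# Hypothetical sketch of synthesizing a question with a provably correct
# answer from a formal (STRIPS-style) planning model. All names and classes
# here are illustrative assumptions, not ACPBench's actual code or API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold for the action to apply
    add_effects: frozenset    # facts made true by applying the action
    del_effects: frozenset    # facts made false by applying the action


def is_applicable(state: frozenset, action: Action) -> bool:
    # In STRIPS semantics, an action is applicable in a state iff
    # all of its preconditions hold in that state.
    return action.preconditions <= state


def make_applicability_question(state: frozenset, action: Action) -> dict:
    # The label is computed from the formal semantics, so it is correct by
    # construction: no human annotation is needed, and generation scales.
    return {
        "question": f"In the current state, can the action "
                    f"'{action.name}' be applied?",
        "answer": is_applicable(state, action),  # Boolean ground truth
    }


if __name__ == "__main__":
    # Toy Blocksworld-like state: block a is clear, on the table, hand empty.
    state = frozenset({"clear(a)", "ontable(a)", "handempty"})
    pickup_a = Action(
        name="pickup(a)",
        preconditions=frozenset({"clear(a)", "ontable(a)", "handempty"}),
        add_effects=frozenset({"holding(a)"}),
        del_effects=frozenset({"clear(a)", "ontable(a)", "handempty"}),
    )
    print(make_applicability_question(state, pickup_a))  # answer: True
```

Because the answer is computed from the model's semantics rather than annotated by hand, arbitrarily many such questions can be generated automatically across tasks and domains, which is the scaling property the abstract highlights.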

