Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games (2310.01468v3)

Published 2 Oct 2023 in cs.CL, cs.AI, cs.HC, and cs.LG

Abstract: LLMs are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking, reasoning, and planning over multiple conversational turns. However, directly measuring this can be challenging. In this paper, we offer a surrogate problem which assesses an LLM's capability to deduce an entity unknown to itself, but revealed to a judge, by asking the judge a series of queries. This entity-deducing game can serve as an evaluation framework to probe the conversational reasoning and planning capabilities of LLMs. We systematically evaluate various LLMs and discover significant differences in their performance on this task. We find that strong LLMs like GPT-4 outperform human players by a large margin. We further employ Behavior Cloning (BC) to examine whether a weaker model can imitate a stronger model and generalize to new data or domains using only the demonstrations from the stronger model. We finally propose to use Reinforcement Learning to enhance the reasoning and planning capacity of Vicuna models through episodes of game playing, which leads to significant performance improvements. We hope that this problem offers insights into how autonomous agents could be trained to behave more intelligently in ambiguous circumstances.

This paper introduces the Entity-Deduction Arena (EDA), a framework based on the 20 Questions (Q20) game, to evaluate the multi-turn conversational planning and reasoning capabilities of LLMs. The motivation stems from the observation that while LLMs excel at answering clear questions, they struggle with ambiguity and often fail to ask clarifying questions, a crucial skill for intelligent agents in real-world scenarios like task completion or conversational search.

The EDA framework involves two players: a "judge" (J) who knows a secret entity and answers questions with "Yes," "No," or "Maybe" (or "Dunno" for the Celebrities dataset), and a "guesser" (G), the LLM being evaluated, which must deduce the entity by asking a series of questions in as few turns as possible (up to 20). Success requires the guesser to perform state tracking (understanding context, history), strategic planning (asking efficient, non-redundant questions), and inductive reasoning (forming hypotheses based on answers).
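As a concrete illustration of this setup, here is a minimal sketch of a single game episode. The `chat(system_prompt, history)` helper and the prompt wording are assumptions standing in for whichever LLM backend plays each role, not the paper's exact implementation:

```python
MAX_TURNS = 20

def play_episode(entity: str, chat) -> dict:
    """Play one entity-deduction game between a judge and a guesser LLM.

    `chat(system_prompt, history)` is a stand-in for an LLM call that sees the
    system prompt plus the (question, answer) history and returns a text reply.
    """
    judge_sys = (f"You are the judge. The secret entity is '{entity}'. "
                 "Answer each question with only 'Yes', 'No', or 'Maybe'.")
    guesser_sys = ("You are the guesser. Deduce the entity the judge is thinking of "
                   "by asking short questions, in as few turns as possible.")

    history = []  # list of (question, answer) turns
    for turn in range(1, MAX_TURNS + 1):
        question = chat(guesser_sys, history)                    # guesser asks
        answer = chat(judge_sys, history + [(question, None)])   # judge answers
        history.append((question, answer))
        # The paper scores a win by exact match on the final guess; a simple
        # substring check stands in for that verification in this sketch.
        if entity.lower() in question.lower():
            return {"won": True, "turns": turn, "history": history}
    return {"won": False, "turns": MAX_TURNS, "history": history}
```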

Implementation Details:

  • Judge: GPT-3.5-turbo is used as the judge, prompted to only provide constrained answers based on its knowledge of the entity. A low temperature (0.2) is used for more deterministic responses. Judge error rates were found to be low (~3%).
  • Guesser: Various LLMs are evaluated as the guesser, using prompts instructing them to deduce the entity via questions. A higher temperature (0.8) is used for diverse outputs.
  • Datasets: Two datasets were created: "Things" (500 common objects/concepts) and "Celebrities" (500 names across nationalities, eras, occupations). Each is split into train/eval/test sets.
  • Evaluation Metrics:
    • #Turns: Average game length (lower is better).
    • Success rate: Percentage of games won (exact match on final guess).
    • #Yes: Average number of "Yes" answers received (indicator of question quality).
    • Score: A combined metric rewarding success and penalizing games longer than 5 turns: $S = 1 - 0.02 \cdot \max(\#\text{Turns} - 5, 0)$ if the game is won, else $S = 0$.
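The Score metric can be transcribed directly; the function name and signature below are only illustrative:

```python
def game_score(won: bool, n_turns: int) -> float:
    """Combined metric: winning within 5 turns scores 1.0, each extra turn
    costs 0.02, and a lost game scores 0."""
    if not won:
        return 0.0
    return 1.0 - 0.02 * max(n_turns - 5, 0)

# Examples: game_score(True, 5) == 1.0, game_score(True, 12) ≈ 0.86,
# game_score(False, 20) == 0.0
```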

Benchmarking Results:

  • Several LLMs (GPT-4, GPT-3.5, Claude-1/2, Vicuna-7B/13B, Mistral-7B) were benchmarked.
  • GPT-4 significantly outperformed other models and human players on both datasets.
  • Stronger models generally showed better planning (fewer redundant questions, better partitioning of possibilities) and reasoning (avoiding inconsistent questions/guesses).
  • Weaker models often failed due to "Early Enumeration," "Redundancy," or "Inconsistency."
  • Open-source models like Vicuna-13B and Mistral-7B showed promise, sometimes outperforming older closed models like Claude-1.

Analysis and Enhancement:

  • Strategy Probing (RQ1): By prompting the guesser for its top-5 entity candidates at each turn, the paper found that strong models like GPT-4 seem to maintain an internal state/taxonomy, asking questions to efficiently partition the likely candidates. They can also backtrack when uncertain or when realizing a category was overlooked.
  • Planning vs. Reasoning (RQ2): An experiment swapping models only for the final guess showed that both strong planning (generating an informative game trajectory) and strong reasoning (making the final correct guess based on the trajectory) are crucial and synergistic. Poor planning makes the final reasoning step very difficult, even for a strong model.
  • Behavior Cloning (BC) (RQ3-5):
    • Fine-tuning Vicuna models on game demonstrations generated by GPT-3.5 significantly improved their performance, showing weaker models can imitate stronger ones (RQ3).
    • Training only on successful demonstrations ("V-FT (Suc.)") yielded better results than training on all demonstrations ("V-FT (All)"), especially on the "Things" dataset (RQ4).
    • Larger models (Vicuna-13B) also improved with BC, but the relative gain was smaller than for the 7B model. Improvements were less pronounced on the "Celebrities" dataset, suggesting imitation might be harder for tasks requiring more specific knowledge or strategy (RQ5).
  • Reinforcement Learning from Game-Play (RLGP):
    • Using Proximal Policy Optimization (PPO) with rewards based on the game score (final reward) and on receiving "Yes" answers (intermediate reward), Vicuna models were further improved (see the reward-shaping sketch after this list).
    • RLGP-trained models (V-RLGP) outperformed their BC-finetuned counterparts on the in-domain "Things" dataset, with V-RLGP 13B matching GPT-3.5 performance.
    • Some generalization improvement was observed on the out-of-domain "Celebrities" dataset.
  • Breakdown Analysis: Models showed varying strengths on different entities. RLGP tended to improve success on items the base model already had some chance on, while BC was better at enabling success on entirely new items.
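For the RLGP item above, here is a hedged sketch of how per-turn rewards might be assembled before being passed to a PPO trainer. The `yes_bonus` weight and the exact shaping are illustrative assumptions; the paper only states that the final game score and intermediate "Yes" answers are rewarded:

```python
def episode_rewards(answers, won, n_turns, yes_bonus=0.1):
    """One reward per guesser turn: a small intermediate bonus whenever the
    judge answered 'Yes', plus the final game score added on the last turn.
    (yes_bonus = 0.1 is an assumed value, not taken from the paper.)"""
    rewards = [yes_bonus if a.strip().lower().startswith("yes") else 0.0
               for a in answers]
    final_score = (1.0 - 0.02 * max(n_turns - 5, 0)) if won else 0.0
    rewards[-1] += final_score
    return rewards
```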

The paper concludes that the EDA benchmark effectively probes LLM planning and reasoning. State-of-the-art LLMs demonstrate these abilities to some extent, and techniques like Behavior Cloning and Reinforcement Learning can enhance these capabilities in open-source models. The code and datasets are released to facilitate future research.

Authors (3)
  1. Yizhe Zhang
  2. Jiarui Lu
  3. Navdeep Jaitly