This paper introduces the Entity-Deduction Arena (EDA), a framework based on the 20 Questions (Q20) game, to evaluate the multi-turn conversational planning and reasoning capabilities of LLMs. The motivation stems from the observation that while LLMs excel at answering clear questions, they struggle with ambiguity and often fail to ask clarifying questions, a crucial skill for intelligent agents in real-world scenarios like task completion or conversational search.
The EDA framework involves two players: a "judge" (J) who knows a secret entity and answers questions with "Yes," "No," or "Maybe" (or "Dunno" for the Celebrities dataset), and a "guesser" (G), the LLM being evaluated, which must deduce the entity by asking a series of questions in as few turns as possible (up to 20). Success requires the guesser to perform state tracking (understanding context, history), strategic planning (asking efficient, non-redundant questions), and inductive reasoning (forming hypotheses based on answers).
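The game mechanics can be summarized with a minimal loop sketch. This is not the authors' implementation; `guesser.ask`, `guesser.final_guess`, and `judge.answer` are hypothetical wrappers around the underlying LLM calls, and the early-stop check is a deliberate simplification of the paper's exact-match judging.

```python
MAX_TURNS = 20

def play_game(entity, guesser, judge):
    """One EDA game; `guesser` and `judge` are hypothetical LLM wrappers."""
    history = []  # (question, answer) pairs forming the guesser's dialogue state
    for turn in range(1, MAX_TURNS + 1):
        question = guesser.ask(history)          # plan the next question from the history so far
        answer = judge.answer(entity, question)  # constrained to "Yes" / "No" / "Maybe"
        history.append((question, answer))
        # Crude early-stop check: a direct guess that names the entity and is
        # confirmed by the judge ends the game early.
        if answer == "Yes" and entity.lower() in question.lower():
            return {"success": True, "turns": turn, "history": history}
    # Out of turns: the guesser commits to a final guess from the accumulated evidence,
    # scored by exact match against the secret entity.
    final_guess = guesser.final_guess(history)
    success = final_guess.strip().lower() == entity.strip().lower()
    return {"success": success, "turns": MAX_TURNS, "history": history}
```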
Implementation Details:
- Judge: GPT-3.5-turbo is used as the judge, prompted to only provide constrained answers based on its knowledge of the entity. A low temperature (0.2) is used for more deterministic responses. Judge error rates were found to be low (~3%).
- Guesser: Various LLMs are evaluated as the guesser, using prompts instructing them to deduce the entity via questions. A higher temperature (0.8) is used for diverse outputs.
- Datasets: Two datasets were created: "Things" (500 common objects/concepts) and "Celebrities" (500 names across nationalities, eras, occupations). Each is split into train/eval/test sets.
- Evaluation Metrics:
  - #Turns: average game length (lower is better).
  - Success rate: percentage of games won (exact match on the final guess).
  - #Yes: average number of "Yes" answers received (an indicator of question quality).
  - Score: a combined metric rewarding success and penalizing slow wins; if the game is won, the score is a maximum value reduced by a fixed per-turn penalty for each turn beyond 5, and it is 0 if the game is lost.
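As a rough sketch of how such per-game records could be aggregated: the snippet below assumes the score tops out at 1 for a quick win, and the `penalty_per_turn` constant is a placeholder for illustration, not the paper's exact value.

```python
def game_score(success: bool, num_turns: int, penalty_per_turn: float = 0.02) -> float:
    """Score metric: 0 for a loss; for a win, full credit minus a per-turn
    penalty for each turn beyond 5. `penalty_per_turn` is a placeholder value."""
    if not success:
        return 0.0
    return max(0.0, 1.0 - penalty_per_turn * max(num_turns - 5, 0))

def aggregate_metrics(games):
    """Average the metrics over per-game records {"success", "turns", "num_yes"}."""
    n = len(games)
    return {
        "#Turns":       sum(g["turns"] for g in games) / n,
        "Success rate": sum(g["success"] for g in games) / n,
        "#Yes":         sum(g["num_yes"] for g in games) / n,
        "Score":        sum(game_score(g["success"], g["turns"]) for g in games) / n,
    }
```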
Benchmarking Results:
- Several LLMs (GPT-4, GPT-3.5, Claude-1/2, Vicuna-7B/13B, Mistral-7B) were benchmarked.
- GPT-4 significantly outperformed other models and human players on both datasets.
- Stronger models generally showed better planning (fewer redundant questions, better partitioning of possibilities) and reasoning (avoiding inconsistent questions/guesses).
- Weaker models often failed due to "Early Enumeration," "Redundancy," or "Inconsistency."
- Open-source models like Vicuna-13B and Mistral-7B showed promise, sometimes outperforming older closed models like Claude-1.
Analysis and Enhancement:
- Strategy Probing (RQ1): By prompting the guesser for its top-5 entity candidates at each turn, the paper found that strong models like GPT-4 seem to maintain an internal state/taxonomy, asking questions to efficiently partition the likely candidates. They can also backtrack when uncertain or when realizing a category was overlooked.
- Planning vs. Reasoning (RQ2): An experiment swapping models only for the final guess showed that both strong planning (generating an informative game trajectory) and strong reasoning (making the final correct guess based on the trajectory) are crucial and synergistic. Poor planning makes the final reasoning step very difficult, even for a strong model.
- Behavior Cloning (BC) (RQ3-5):
- Fine-tuning Vicuna models on game demonstrations generated by GPT-3.5 significantly improved their performance, showing weaker models can imitate stronger ones (RQ3).
- Training only on successful demonstrations ("V-FT (Suc.)") yielded better results than training on all demonstrations ("V-FT (All)"), especially on the "Things" dataset (RQ4).
- Larger models (Vicuna-13B) also improved with BC, but the relative gain was smaller than for the 7B model. Improvements were less pronounced on the "Celebrities" dataset, suggesting imitation might be harder for tasks requiring more specific knowledge or strategy (RQ5).
- Reinforcement Learning from Game-Play (RLGP):
- Using Proximal Policy Optimization (PPO) with a reward combining the final game score (terminal reward) and a bonus for each "Yes" answer received (intermediate reward), Vicuna models were further improved; a reward-assembly sketch follows this list.
- RLGP-trained models (V-RLGP) outperformed their BC-finetuned counterparts on the in-domain "Things" dataset, with V-RLGP 13B matching GPT-3.5 performance.
- Some generalization improvement was observed on the out-of-domain "Celebrities" dataset.
- Breakdown Analysis: Models showed varying strengths on different entities. RLGP tended to improve success on items the base model already had some chance on, while BC was better at enabling success on entirely new items.
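The sketch below shows schematically how the per-turn rewards described for RLGP could be assembled before being handed to a PPO trainer. The `yes_bonus` and `penalty_per_turn` constants are illustrative assumptions rather than the paper's values, and the PPO update itself (standard policy-optimization machinery over the guesser's generated questions) is omitted.

```python
def assemble_rewards(history, success, num_turns, yes_bonus=0.1, penalty_per_turn=0.02):
    """Per-turn rewards for one game trajectory, in the spirit of RLGP:
    a small intermediate bonus whenever the judge answers "Yes", plus the
    final game score added onto the last turn. Constants are illustrative."""
    rewards = [yes_bonus if answer == "Yes" else 0.0 for _, answer in history]
    final_score = max(0.0, 1.0 - penalty_per_turn * max(num_turns - 5, 0)) if success else 0.0
    rewards[-1] += final_score  # terminal reward attached to the final turn
    return rewards
```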
The paper concludes that the EDA benchmark effectively probes LLM planning and reasoning. State-of-the-art LLMs demonstrate these abilities to some extent, and techniques like Behavior Cloning and Reinforcement Learning can enhance these capabilities in open-source models. The code and datasets are released to facilitate future research.