This paper introduces the Entity-Deduction Arena (EDA), a framework based on the 20 Questions (Q20) game, to evaluate the multi-turn conversational planning and reasoning capabilities of LLMs. The motivation stems from the observation that while LLMs excel at answering clear questions, they struggle with ambiguity and often fail to ask clarifying questions, a crucial skill for intelligent agents in real-world scenarios like task completion or conversational search.
The EDA framework involves two players: a "judge" (J) who knows a secret entity and answers questions with "Yes," "No," or "Maybe" (or "Dunno" for the Celebrities dataset), and a "guesser" (G), the LLM being evaluated, which must deduce the entity by asking a series of questions in as few turns as possible (up to 20). Success requires the guesser to perform state tracking (understanding context, history), strategic planning (asking efficient, non-redundant questions), and inductive reasoning (forming hypotheses based on answers).
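The game mechanics can be summarized with a minimal loop sketch. This is not the authors' implementation; `guesser.ask`, `guesser.final_guess`, and `judge.answer` are hypothetical wrappers around the underlying LLM calls, and the early-stop check is a deliberate simplification of the paper's exact-match judging.

```python
MAX_TURNS = 20

def play_game(entity, guesser, judge):
    """One EDA game; `guesser` and `judge` are hypothetical LLM wrappers."""
    history = []  # (question, answer) pairs forming the guesser's dialogue state
    for turn in range(1, MAX_TURNS + 1):
        question = guesser.ask(history)          # plan the next question from the history so far
        answer = judge.answer(entity, question)  # constrained to "Yes" / "No" / "Maybe"
        history.append((question, answer))
        # Crude early-stop check: a direct guess that names the entity and is
        # confirmed by the judge ends the game early.
        if answer == "Yes" and entity.lower() in question.lower():
            return {"success": True, "turns": turn, "history": history}
    # Out of turns: the guesser commits to a final guess from the accumulated evidence,
    # scored by exact match against the secret entity.
    final_guess = guesser.final_guess(history)
    success = final_guess.strip().lower() == entity.strip().lower()
    return {"success": success, "turns": MAX_TURNS, "history": history}
```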
Implementation Details:
- Judge: GPT-3.5-turbo is used as the judge, prompted to only provide constrained answers based on its knowledge of the entity. A low temperature (0.2) is used for more deterministic responses. Judge error rates were found to be low (~3%).
- Guesser: Various LLMs are evaluated as the guesser, using prompts instructing them to deduce the entity via questions. A higher temperature (0.8) is used for diverse outputs.
- Datasets: Two datasets were created: "Things" (500 common objects/concepts) and "Celebrities" (500 names across nationalities, eras, occupations). Each is split into train/eval/test sets.
- Evaluation Metrics:
  - #Turns: average game length (lower is better).
  - Success rate: percentage of games won (exact match on the final guess).
  - #Yes: average number of "Yes" answers received (an indicator of question quality).
  - Score: a combined metric rewarding success and penalizing slow wins; if the game is won, the score is a maximum value reduced by a fixed per-turn penalty for each turn beyond 5, and it is 0 if the game is lost.
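As a rough sketch of how such per-game records could be aggregated: the snippet below assumes the score tops out at 1 for a quick win, and the `penalty_per_turn` constant is a placeholder for illustration, not the paper's exact value.

```python
def game_score(success: bool, num_turns: int, penalty_per_turn: float = 0.02) -> float:
    """Score metric: 0 for a loss; for a win, full credit minus a per-turn
    penalty for each turn beyond 5. `penalty_per_turn` is a placeholder value."""
    if not success:
        return 0.0
    return max(0.0, 1.0 - penalty_per_turn * max(num_turns - 5, 0))

def aggregate_metrics(games):
    """Average the metrics over per-game records {"success", "turns", "num_yes"}."""
    n = len(games)
    return {
        "#Turns":       sum(g["turns"] for g in games) / n,
        "Success rate": sum(g["success"] for g in games) / n,
        "#Yes":         sum(g["num_yes"] for g in games) / n,
        "Score":        sum(game_score(g["success"], g["turns"]) for g in games) / n,
    }
```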
Benchmarking Results:
- Several LLMs (GPT-4, GPT-3.5, Claude-1/2, Vicuna-7B/13B, Mistral-7B) were benchmarked.
- GPT-4 significantly outperformed other models and human players on both datasets.
- Stronger models generally showed better planning (fewer redundant questions, better partitioning of possibilities) and reasoning (avoiding inconsistent questions/guesses).
- Weaker models often failed due to "Early Enumeration," "Redundancy," or "Inconsistency."
- Open-source models like Vicuna-13B and Mistral-7B showed promise, sometimes outperforming older closed models like Claude-1.
Analysis and Enhancement:
- Strategy Probing (RQ1): By prompting the guesser for its top-5 entity candidates at each turn, the paper found that strong models like GPT-4 seem to maintain an internal state/taxonomy, asking questions to efficiently partition the likely candidates. They can also backtrack when uncertain or when realizing a category was overlooked.
- Planning vs. Reasoning (RQ2): An experiment swapping models only for the final guess showed that both strong planning (generating an informative game trajectory) and strong reasoning (making the final correct guess based on the trajectory) are crucial and synergistic. Poor planning makes the final reasoning step very difficult, even for a strong model.
- Behavior Cloning (BC) (RQ3-5):
- Fine-tuning Vicuna models on game demonstrations generated by GPT-3.5 significantly improved their performance, showing weaker models can imitate stronger ones (RQ3).
- Training only on successful demonstrations ("V-FT (Suc.)") yielded better results than training on all demonstrations ("V-FT (All)"), especially on the "Things" dataset (RQ4).
- Larger models (Vicuna-13B) also improved with BC, but the relative gain was smaller than for the 7B model. Improvements were less pronounced on the "Celebrities" dataset, suggesting imitation might be harder for tasks requiring more specific knowledge or strategy (RQ5).
- Reinforcement Learning from Game-Play (RLGP):
- Using Proximal Policy Optimization (PPO) with a reward combining the final game score (terminal reward) and a bonus for each "Yes" answer received (intermediate reward), Vicuna models were further improved; a reward-assembly sketch follows this list.
- RLGP-trained models (V-RLGP) outperformed their BC-finetuned counterparts on the in-domain "Things" dataset, with V-RLGP 13B matching GPT-3.5 performance.
- Some generalization improvement was observed on the out-of-domain "Celebrities" dataset.
- Breakdown Analysis: Models showed varying strengths on different entities. RLGP tended to improve success on items the base model already had some chance on, while BC was better at enabling success on entirely new items.
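The sketch below shows schematically how the per-turn rewards described for RLGP could be assembled before being handed to a PPO trainer. The `yes_bonus` and `penalty_per_turn` constants are illustrative assumptions rather than the paper's values, and the PPO update itself (standard policy-optimization machinery over the guesser's generated questions) is omitted.

```python
def assemble_rewards(history, success, num_turns, yes_bonus=0.1, penalty_per_turn=0.02):
    """Per-turn rewards for one game trajectory, in the spirit of RLGP:
    a small intermediate bonus whenever the judge answers "Yes", plus the
    final game score added onto the last turn. Constants are illustrative."""
    rewards = [yes_bonus if answer == "Yes" else 0.0 for _, answer in history]
    final_score = max(0.0, 1.0 - penalty_per_turn * max(num_turns - 5, 0)) if success else 0.0
    rewards[-1] += final_score  # terminal reward attached to the final turn
    return rewards
```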
The paper concludes that the EDA benchmark effectively probes LLM planning and reasoning. State-of-the-art LLMs demonstrate these abilities to some extent, and techniques like Behavior Cloning and Reinforcement Learning can enhance these capabilities in open-source models. The code and datasets are released to facilitate future research.