TurtleSoup-Bench: Imaginative Reasoning Benchmark
- TurtleSoup-Bench is a benchmark and evaluation framework that tests LLMs on dynamic, iterative hypothesis construction using Turtle Soup puzzles.
- It leverages a bilingual, culturally balanced puzzle dataset with detailed clue annotations to capture diverse reasoning genres and challenges.
- Experimental results show LLMs trail human experts by about 13 percentage points, highlighting the need for enhanced adaptive metacognitive strategies.
TurtleSoup-Bench is a benchmark and evaluation framework designed to systematically probe the imaginative reasoning capability of LLMs through interactive, information-sparse puzzles based on the classic "Turtle Soup" game. Unlike benchmarks focused on static deduction or purely factual question answering, TurtleSoup-Bench evaluates whether LLMs can proactively construct, refine, and test hypotheses by iteratively formulating questions and updating beliefs as new information is revealed (Zhou et al., 14 Aug 2025).
1. Conceptual Foundation and Objectives
TurtleSoup-Bench centers on Turtle Soup puzzles, which present a scenario with a mysterious final outcome ("soup surface"). The solver, whether human or LLM, must uncover the underlying story ("soup bottom") by sequentially posing questions answerable only with yes, no, or unknown, each designed to reduce uncertainty and steer inference. In the classic example that gives the game its name, the surface states only that a man tastes turtle soup in a restaurant and later takes his own life; the solver must reconstruct the hidden backstory that explains why. The benchmark explicitly frames imaginative reasoning as the capacity for dynamic hypothesis construction, iterative testing, and strategic revision in environments where critical information is hidden or distributed. This stands in contrast to static reasoning, retrieval, or adversarial social deduction, and targets the proactive aspects of cognition crucial for open-ended discovery processes.
The benchmark comprises 800 puzzles, built from 400 original Chinese puzzles curated from online sources and expert contributors, professionally translated into English and meticulously adapted to preserve cultural equivalence across the bilingual pairings. Each puzzle is annotated with a Key Clue Library, a structured listing of the information signals essential for unraveling the solution. Puzzle genres span Crime Thriller, Mind Game, Supernatural Fantasy, Constant Change, Clever Logic, and Original designs, ensuring coverage across diverse reasoning templates.
2. Benchmark Structure and Data Annotation
The data structure of TurtleSoup-Bench is distinctive in its comprehensiveness and bilingual pairings. Each of the 800 puzzles is stored with the following:
- Soup Surface (puzzle statement)
- Soup Bottom (full, underlying narrative)
- Key Clue Library (set of pivotal clues)
- Genre Tag (categorizing narrative/reasoning style)
The annotation of key clues is essential for targeted evaluation: clues are classified to indicate their criticality both for logical coherence and for detail completion during the inference process. The genre tagging enables cross-sectional analysis of model strengths and weaknesses, as different genres may encourage different forms of reasoning (e.g., causal construction, metaphorical leaps, rule-based logic).
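To make the record layout concrete, the following is a minimal sketch of how a single puzzle entry might be represented; the class and field names, the clue criticality flags, and the language field are illustrative assumptions rather than the released schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Genre(str, Enum):
    # Genre tags listed in the benchmark description.
    CRIME_THRILLER = "Crime Thriller"
    MIND_GAME = "Mind Game"
    SUPERNATURAL_FANTASY = "Supernatural Fantasy"
    CONSTANT_CHANGE = "Constant Change"
    CLEVER_LOGIC = "Clever Logic"
    ORIGINAL = "Original"


@dataclass
class Clue:
    text: str
    # Illustrative criticality flags: whether the clue is needed for the
    # logical backbone of the solution or for completing narrative detail.
    critical_for_logic: bool = False
    critical_for_detail: bool = False


@dataclass
class PuzzleRecord:
    puzzle_id: str
    soup_surface: str          # public puzzle statement
    soup_bottom: str           # full underlying narrative (hidden from solver)
    key_clue_library: List[Clue] = field(default_factory=list)
    genre: Genre = Genre.ORIGINAL
    language: str = "zh"       # hypothetical field; bilingual versions would share puzzle_id
```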
3. The Mosaic-Agent Framework
The Mosaic-Agent is an agent-based simulation scaffold designed to mediate interactive puzzle solving and automate nuanced evaluation. It comprises the following components (a minimal interaction-loop sketch follows the list):
- Questioner Agent: Generates candidate questions at each turn, informed by local and global puzzle context, recent answer history, extracted key clues, and a dynamically inferred genre.
- Deliberation Agent: Maintains and updates a Belief State summarizing the agent's current logical understanding, aggregated key details, and hypothesized solution.
- Meta-Cognition Agent: Employs a smoothed confidence function with a switch threshold to adjust genre inference and reasoning mode during multi-turn interaction.
- Action Formulation Agent: Constructs a candidate question pool from analysis and proposal sets, then selects for optimal information gain and minimal redundancy.
- Responder Agent: Simulates an oracle whose responses are restricted to “Yes”, “No”, or “Unknown”, optionally flagging the delivery of a key clue based on the question and the solution structure.
- Memory Module: Stores complete interaction chains, cumulative clue extraction, and current agent states, supporting consistency and informed reasoning over multiple rounds.
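A minimal sketch of how these components could be wired into an interaction loop is shown below; the class and method names, the exponential smoothing of genre confidence, and the crude redundancy filter are illustrative assumptions layered on the roles above, not the published implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BeliefState:
    # Maintained by the Deliberation Agent across turns.
    logical_summary: str = ""
    key_details: List[str] = field(default_factory=list)
    hypothesized_solution: str = ""


@dataclass
class MemoryModule:
    # Stores the full interaction chain and cumulatively extracted clues.
    history: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer)
    extracted_clues: List[str] = field(default_factory=list)


class MetaCognitionAgent:
    """Tracks a smoothed confidence in the inferred genre and signals a
    reasoning-mode switch once it crosses a threshold (values are placeholders)."""

    def __init__(self, alpha: float = 0.6, switch_threshold: float = 0.75):
        self.alpha = alpha
        self.switch_threshold = switch_threshold
        self.confidence = 0.0

    def update(self, turn_confidence: float) -> bool:
        # Exponential smoothing of the per-turn genre confidence estimate.
        self.confidence = self.alpha * self.confidence + (1 - self.alpha) * turn_confidence
        return self.confidence >= self.switch_threshold


def solve(surface: str, questioner, deliberator, responder,
          meta: MetaCognitionAgent, max_turns: int = 25) -> BeliefState:
    """Interaction loop: propose a question, query the oracle, update beliefs."""
    belief, memory = BeliefState(), MemoryModule()
    for _ in range(max_turns):
        # Action formulation: keep the first candidate not already asked,
        # a crude stand-in for information-gain / redundancy scoring.
        candidates = questioner.propose(surface, belief, memory)
        asked = {q for q, _ in memory.history}
        question = next((c for c in candidates if c not in asked), None)
        if question is None:
            break
        answer, clue = responder.answer(question)  # "Yes" / "No" / "Unknown", optional key clue
        memory.history.append((question, answer))
        if clue is not None:
            memory.extracted_clues.append(clue)
        belief = deliberator.update(belief, question, answer, memory)
        if meta.update(deliberator.genre_confidence(belief)):
            questioner.set_mode("genre_locked")  # hypothetical mode switch
    return belief
```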
4. Evaluation Protocols and Metrics
The evaluation in TurtleSoup-Bench is multi-dimensional, spanning both the reasoning process and the final solution quality. The scoring protocol comprises three dimensions:
- Logic Accuracy: Assesses whether the agent consistently integrates clues, avoids contradiction, and maintains a coherent narrative trajectory.
- Detail Fidelity: Measures the completeness with which key factual signals are recovered and synthesized in the agent’s belief state.
- Conclusion Match: Evaluates the final summary’s semantic alignment with the soup bottom.
The overall score is a weighted sum of the three dimension scores, with fixed weights assigned to logic accuracy, detail fidelity, and conclusion match.
Semantic matching of logic points and details uses heuristics to calibrate the number of scored elements to candidate response length, enabling robust adaptation across puzzles of varying complexity.
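To make the aggregation step explicit, the sketch below combines the three dimension scores into an overall score; the default weight values are placeholders for illustration, as the protocol's exact constants are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class DimensionScores:
    logic_accuracy: float     # coherent, contradiction-free clue integration
    detail_fidelity: float    # recovery of key factual signals in the belief state
    conclusion_match: float   # semantic alignment of the final summary with the soup bottom


def overall_score(s: DimensionScores,
                  w_logic: float = 0.4,
                  w_detail: float = 0.3,
                  w_conclusion: float = 0.3) -> float:
    """Weighted aggregation of the three evaluation dimensions.

    The default weights are placeholders; the benchmark protocol fixes its own values.
    """
    assert abs(w_logic + w_detail + w_conclusion - 1.0) < 1e-9
    return (w_logic * s.logic_accuracy
            + w_detail * s.detail_fidelity
            + w_conclusion * s.conclusion_match)
```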
5. Experimental Results and Model Capabilities
TurtleSoup-Bench experiments with leading LLMs (claude-3.7-sonnet, gemini-2.5-flash, deepseek-r1, gpt-4o, qwen3-32b, llama3-8b-instruct) demonstrate several capability gaps and failure modes:
- The top-performing LLMs remain approximately 13 percentage points behind human experts in overall score.
- Performance is non-uniform across genres; some models excel in "Constant Change" while others falter in "Clever Logic", suggesting that imaginative reasoning comprises differentiated subskills rather than a unified capacity.
- Qualitative error patterns include Semantic Fixation (a literal fixation on high-frequency terms that overlooks symbolic cues), Context Construction Failures (fragmented synthesis of clues), Logic Blind Spots (inability to propose novel hypotheses that break habitual patterns), and Deductive Pruning Failures (failure to discard already falsified hypotheses).
A plausible implication is that high-fidelity pattern inference is insufficient for dynamic, imaginative reasoning, and that LLMs require more intricate compositional and metacognitive mechanisms to approach human-level performance.
6. Context and Future Directions
TurtleSoup-Bench sets a new standard for evaluating and training LLM exploratory cognition. By establishing a pipeline in which agents must autonomously drive inquiry, synthesize dispersed clues, and iteratively refine hypotheses, the framework offers a granular lens on reasoning strategies, failures, and genre-based variability.
Prospective advances might include targeted architectural enhancements for adaptive metacognition, incorporation of explicit memory and context-consolidation routines, and systematic genre-based curriculum fine-tuning. An additional plausible direction is hybridization with symbolic systems that can more rigorously maintain and prune inference trees given interactive feedback. Expanding the diversity and complexity of puzzles, further refining evaluation heuristics, and benchmarking against more interactive human protocols could sharpen discriminative insights into model strengths and weaknesses.
TurtleSoup-Bench and the Mosaic-Agent framework instantiate a rigorous foundation for research into deep, exploratory, and imaginative reasoning within LLMs, delineating the boundaries between static pattern matching and dynamic cognitive synthesis (Zhou et al., 14 Aug 2025).