Sudoku-Bench: Creative Reasoning Evaluation
- Sudoku-Bench is a curated suite of unconventional Sudoku variants that tests creative, multi-step deductive reasoning in AI models.
- It uses a standardized text-based puzzle format and integrated tools to isolate logical inference from memorization and perceptual errors.
- Baseline evaluations show leading LLMs solve less than 15% of puzzles, highlighting a critical gap in creative problem-solving capabilities.
Sudoku-Bench is a curated, extensible evaluation suite designed to assess creative, multi-step reasoning abilities, particularly in LLMs, through a carefully selected set of unconventional Sudoku variants. Each puzzle in the benchmark introduces novel, interacting logical constraints that preclude solution by memorization and require solvers to discover original deductive "break-ins." The benchmark includes a standardized, text-based puzzle format, integrations with open-source tooling, and comprehensive human solution trace data, supporting both rigorous model evaluation and future research in systematic reasoning.
1. Motivation and Aims
Sudoku-Bench was created to address persistent limitations in existing reasoning benchmarks, which often reward model memorization over genuine reasoning and provide limited challenge in creative deduction. While prior benchmarks such as MATH and GSM8K allow high performance via rote solution pattern recall, and ARC emphasizes non-memorizability but remains accessible to most human solvers, Sudoku-Bench targets the full spectrum of creative reasoning. It does so through the domain of Sudoku variants, where each puzzle's unique or subtly interacting rules require bespoke, multi-stage logical inferences. As new user-generated Sudoku variants are published daily on platforms like Logic Masters Germany, Sudoku-Bench has a renewable supply of conceptually diverse problems, facilitating scalability and ongoing relevance.
2. Benchmark Structure and Puzzle Selection
The core of Sudoku-Bench consists of 100 puzzles, selected and categorized for breadth of complexity and reasoning challenge:
- 15 easy puzzles, enabling progress tracking for smaller or less capable models
- 15 intermediate puzzles, requiring more structured thought
- 70 challenging puzzles, comprising 50 curated variants (including collaborations with expert solvers such as Cracking the Cryptic) and 20 difficult vanilla Sudokus sourced from Nikoli
Variant types within the benchmark include Killer Sudoku, Thermometers, Arrows, Kropki dots, meta-constraint puzzles, minimalist starting clues ("zero-clue" puzzles), and themed or whimsical constructions. Each puzzle was manually vetted for logical depth and for requiring a creative "break-in": an essential, non-obvious deduction that must be found before further progress is possible. This breadth ensures evaluation spans from entry-level reasoning to expert-level creativity, and puzzles are selected to avoid susceptibility to memorization or template-based solutions.
3. Text-Based Representation and Interface
Sudoku-Bench utilizes a standardized, language-based puzzle specification:
- Rules are presented as explicit natural language statements, not as images or grid drawings.
- All constraints (e.g., cage sums, arrows, relationship dots) are described using coordinate-based notation.
- Puzzle size, starting digits, and any custom rules are made explicit.
This design isolates logical reasoning evaluation from computer vision, OCR, or perceptual parsing, ensuring that performance metrics reflect deductive reasoning and not input ambiguity or visual interpretation errors. Extraction tools are provided to convert puzzles from visual applications (such as SudokuPad) into this format, enhancing compatibility and reproducibility.
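To make this concrete, the sketch below shows what a specification in this style might look like for a small hybrid variant. The field names and the rXcY coordinate convention are illustrative assumptions for exposition, not the benchmark's published schema.

```python
# Illustrative sketch of a text-based puzzle specification. Field names
# and the r<row>c<col> coordinate notation are assumptions, not
# Sudoku-Bench's exact published schema.
puzzle_spec = {
    "size": 9,
    "rules": (
        "Normal Sudoku rules apply. Digits in a cage sum to the value "
        "given for that cage and may not repeat. Digits along a "
        "thermometer strictly increase from the bulb."
    ),
    "constraints": [
        {"type": "killer_cage", "cells": ["r1c1", "r1c2", "r2c1"], "sum": 15},
        {"type": "thermometer", "cells": ["r5c5", "r4c5", "r3c5"]},  # bulb first
    ],
    "given": {"r9c9": 5},  # starting digits, keyed by cell coordinate
}
```

Because every rule is stated in language and every constraint is expressed over named cells, a model's output can be checked without any visual parsing step.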
4. Baseline Model Evaluation
The published results demonstrate that state-of-the-art LLMs—including Gemini 2.5 Pro and other leading models—solve fewer than 15% of puzzles overall when unaided. Multi-step and single-shot modes were benchmarked:
- Easy puzzles are commonly solved (e.g., Gemini 2.5 Pro achieves 73.3% in multi-step), but performance drops to zero on intermediate and challenging puzzles.
- Iterative feedback and stepwise evaluation do not significantly improve model performance at higher difficulties.
- Models frequently fail either by proposing incorrect solutions, giving up, incorrectly declaring a contradiction, or omitting reasoning traces.
- The initial creative break-in often represents the largest barrier—models are rarely able to deduce it without explicit guidance.
This suggests that existing LLMs lack not only brute-force search depth but also creative deductive capacity for puzzles that defy rote pattern matching.
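The multi-step mode can be pictured as a simple feedback loop in which the model proposes one placement per turn and receives the updated board in return. The sketch below illustrates the protocol under that assumption; `query_model` and `is_valid` are placeholders, and this is not Sudoku-Bench's actual harness.

```python
# Sketch of a multi-step evaluation loop: the model places one digit per
# turn and sees the updated board as feedback. `query_model` stands in
# for any LLM interface; `is_valid` is a placeholder constraint checker.
def evaluate_multi_step(puzzle, query_model, is_valid, max_turns=200):
    board = dict(puzzle["given"])            # cell -> digit, e.g. "r1c1" -> 4
    for _ in range(max_turns):
        if len(board) == puzzle["size"] ** 2:
            return True                      # grid filled with valid moves: solved
        move = query_model(puzzle["rules"], board)  # -> (cell, digit) or None
        if move is None:
            return False                     # gave up or declared a contradiction
        cell, digit = move
        if not is_valid(puzzle, board, cell, digit):
            return False                     # an incorrect placement ends the attempt
        board[cell] = digit
    return False                             # turn budget exhausted
```

Single-shot mode is the degenerate case: a single query that must return the entire completed grid.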
5. Innovative Features and Resources
Sudoku-Bench introduces several unique contributions:
- Puzzles are selected to require or reward creative reasoning, with many designed so that a unique break-in is both necessary and sufficient for progress.
- The standardized format and associated extraction tools support adaptation to essentially any publicly available logical puzzle, greatly facilitating large-scale evaluation and dataset augmentation.
- Human solution traces are provided in the form of action logs and synchronized reasoning audio from expert solvers (e.g., Cracking the Cryptic), permitting step-level evaluation, imitation learning research, and curriculum design (see the loading sketch after this list).
- The benchmark and tools are fully open-source, with code, data, and solution traces made available via repositories and data hubs.
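A minimal loading sketch using the Hugging Face `datasets` library is shown below; the split name and record fields are assumptions about the dataset layout rather than a documented schema, so inspect the keys before relying on any field name.

```python
# Minimal sketch for pulling the human solution traces from the Hugging
# Face hub. The split name is an assumption about the dataset layout.
from datasets import load_dataset

traces = load_dataset("SakanaAI/Sudoku-CTC-Reasoning", split="train")
print(traces[0].keys())  # inspect available fields before relying on names
```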
6. Impact and Future Directions
Sudoku-Bench foregrounds a key limitation in current reasoning systems: the inability to discover and execute creative, long-horizon logical inferences when solving unfamiliar, rule-rich problems. The suite therefore encourages:
- The development and fine-tuning of agentic models capable of meta-reasoning (i.e., reasoning about which strategies or break-ins to attempt)
- Integration with formal tools, constraint solvers, or code execution environments, extending LLM performance via neuro-symbolic or tool-augmented architectures (see the solver sketch after this list)
- Research into multimodal model development (by eventually linking textual and visual puzzle representations)
- Granular error and performance analysis using trace data, guiding model architecture improvements and evaluation metrics
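As a concrete instance of the second direction, the sketch below shows the kind of exhaustive search a model could delegate to a tool: a plain backtracking solver for vanilla 9x9 grids, leaving variant-specific creative deductions to the model. It is a generic illustration, not a Sudoku-Bench component.

```python
# Generic backtracking solver for a vanilla 9x9 Sudoku, the sort of
# brute-force subroutine a tool-augmented model could call. `grid` is a
# 9x9 list of lists with 0 marking empty cells; solved in place.
def solve(grid):
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if _fits(grid, r, c, d):
                        grid[r][c] = d
                        if solve(grid):
                            return True
                        grid[r][c] = 0       # undo and try the next digit
                return False                 # no digit fits: backtrack
    return True                             # no empty cell remains: solved

def _fits(grid, r, c, d):
    if any(grid[r][j] == d for j in range(9)):    # row clash
        return False
    if any(grid[i][c] == d for i in range(9)):    # column clash
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)           # top-left of the 3x3 box
    return all(grid[br + i][bc + j] != d
               for i in range(3) for j in range(3))
```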
A plausible implication is that, as models begin to close the current creative reasoning gap identified by Sudoku-Bench, advances may transfer to applications involving unfamiliar or creative constraint satisfaction, including scientific hypothesis generation, real-world planning, and strategic theorem proving.
7. Tooling, Extensibility, and Research Environment
Sudoku-Bench provides a modular, research-oriented environment:
- Formatter utilities enable ingestion and text-based encoding of any SudokuPad or Logic Masters Germany puzzle (see the parsing sketch after this list).
- Integrations offer annotation and action-logging for both manual and agentic interactions.
- The design anticipates broader research into agentic reasoning, tool-usage strategies, and curriculum learning.
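To give a flavor of the coordinate-based notation such tooling works with, here is a tiny parser for rXcY cell references; the exact output grammar of the formatter utilities is an assumption here, so adapt the pattern to the real encoding.

```python
import re

# Tiny parser for "rXcY" cell coordinates in the style described above.
# The real formatter grammar may differ; this is illustrative only.
_CELL = re.compile(r"^r(\d+)c(\d+)$", re.IGNORECASE)

def parse_cell(token: str) -> tuple[int, int]:
    """Return a 0-indexed (row, col) pair for a token like 'r4c7'."""
    m = _CELL.match(token.strip())
    if m is None:
        raise ValueError(f"not a cell coordinate: {token!r}")
    return int(m.group(1)) - 1, int(m.group(2)) - 1
```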
All resources, including the full puzzle suite, solution trace data, and codebase, are available at:
- https://github.com/SakanaAI/Sudoku-Bench
- https://huggingface.co/datasets/SakanaAI/Sudoku-CTC-Reasoning
Sudoku-Bench thus serves as both a rigorous quantitative benchmark and a flexible foundation for qualitative research, system development, and the study of creative symbolic reasoning in artificial intelligence systems.