Sudoku-Bench: Evaluating creative reasoning with Sudoku variants (2505.16135v1)

Published 22 May 2025 in cs.AI, cs.CL, and cs.LG

Abstract: Existing reasoning benchmarks for LLMs frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs ("break-ins"). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles -- making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.

Summary

  • The paper demonstrates that state-of-the-art LLMs struggle with creative reasoning, with even leading models solving fewer than 15% of the benchmark's puzzles unaided.
  • The benchmark employs a standardized text-based puzzle format across 4x4, 6x6, and 9x9 grids to isolate logical challenges from visual processing.
  • The study provides expert reasoning traces and tool integration insights, offering actionable pathways for enhancing multi-step, human-like problem solving in AI.

LLMs currently struggle with genuinely creative reasoning, often succeeding on benchmarks by memorizing patterns rather than demonstrating novel problem-solving. The paper "Sudoku-Bench: Evaluating creative reasoning with Sudoku variants" (2505.16135) introduces a benchmark designed to address this limitation by using challenging and unconventional Sudoku variants.

Sudoku variants are highlighted as a particularly effective domain for evaluating creative reasoning because each puzzle introduces unique or subtly interacting constraints. This design makes memorization infeasible and requires solvers to identify novel logical steps or "break-ins" specific to that puzzle. Despite their diversity, these variants maintain a compact and consistent structure (an n × n grid with natural-language rules and visual elements), which allows for clear and standardized evaluation.

The Sudoku-Bench benchmark consists of a carefully curated set of 100 puzzles: 15 4×4, 15 6×6, and 70 9×9 variants, selected to span a range of difficulties and reasoning styles. This includes puzzles curated with expert human solvers to require specific creative "break-ins." The benchmark also incorporates 20 difficult vanilla Sudoku puzzles to provide a baseline against a more familiar domain.

A key practical aspect of Sudoku-Bench is its standardized text-based puzzle representation. Each puzzle's rules, grid size, initial state, and visual elements are described purely in text (e.g., paths defined by coordinates, dots described by adjacent cells). This text format is designed to isolate the logical reasoning challenge from visual processing, making it suitable for evaluation with current LLMs, which often struggle with precisely interpreting visual grid elements in complex puzzles. The paper provides code to extract these text descriptions from puzzles specified in SudokuPad, enabling researchers to apply the harness to other puzzles.
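To make this concrete, a puzzle in such a format might be serialized along the following lines; the field names and rXcY coordinate convention are illustrative assumptions rather than the benchmark's exact schema.

```python
# Illustrative text-only puzzle description in the spirit of Sudoku-Bench.
# Field names and the rXcY coordinate convention are assumptions, not the exact schema.
puzzle = {
    "size": 6,  # 6x6 grid
    "rules": (
        "Normal 6x6 Sudoku rules apply. Digits along an arrow sum to the digit "
        "in that arrow's circle. Cells joined by a white dot contain consecutive digits."
    ),
    # Row-major initial state; '.' marks an empty cell (36 characters for a 6x6 grid).
    "initial_board": "....3." + "." * 24 + ".5....",
    # Visual elements (arrows, dots, lines) are described purely as text via cell coordinates.
    "visual_elements": [
        {"type": "arrow", "circle": "r1c1", "cells": ["r2c1", "r3c2"]},
        {"type": "white_dot", "cells": ["r4c4", "r4c5"]},
    ],
}
```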

Alongside the benchmark, the authors release tools for interacting with the puzzle application SudokuPad, offering an agentic environment. While the primary evaluation in the paper uses text interaction, these tools allow models to use human-like annotation strategies within SudokuPad, like color-coding cells or adding pencil marks. This provides flexibility for future research exploring models that can leverage such tools.
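To illustrate what such annotation strategies might look like from the model's side, the commands below are purely hypothetical and are not the released tools' actual interface.

```python
# Hypothetical annotation actions for an agentic SudokuPad session; the real
# tooling may use different commands and arguments.
actions = [
    {"action": "place_digit", "cell": "r5c5", "digit": 3},
    {"action": "pencil_mark", "cell": "r1c2", "candidates": [1, 2]},  # note candidate digits
    {"action": "color_cell", "cell": "r1c2", "color": "green"},       # track a deduction chain
]
```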

A significant resource released with Sudoku-Bench is a dataset of expert reasoning traces from the YouTube channel Cracking the Cryptic. This dataset includes audio transcripts of human solvers explaining their step-by-step reasoning process, paired with extracted SudokuPad actions taken during the solve. With over 3,000 videos and thousands of hours of content, this dataset offers rich, detailed human reasoning examples, providing a valuable resource for researchers interested in supervised learning or imitation learning to train models to reason in a more "human-like" manner.
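As an illustration, such a trace could be stored as records pairing a transcript segment with the SudokuPad actions taken in that window; the layout below is an assumption, not the dataset's published schema.

```python
# Illustrative reasoning-trace record pairing solver commentary with actions.
# Field names are assumptions; the released dataset's schema may differ.
trace_record = {
    "puzzle_id": "example-puzzle",
    "start_time_s": 842.0,   # window within the solve video
    "end_time_s": 869.5,
    "transcript_segment": (
        "So that cell has to be low... let's pencil in one and two and "
        "colour it green while we chase the arrow."
    ),
    "actions": [
        {"action": "pencil_mark", "cell": "r3c7", "candidates": [1, 2]},
        {"action": "color_cell", "cell": "r3c7", "color": "green"},
    ],
}
```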

The benchmark evaluates models in two configurations:

  1. Single-shot: The model receives the puzzle description and is prompted to output the complete solved grid in a single response. The metric is the solve rate (percentage of puzzles correctly solved).
  2. Multi-step: The model is prompted to analyze the board and provide at least one valid digit placement per turn, after which the user supplies the updated board state. This continues until the puzzle is solved or the model makes an incorrect placement. Metrics are the solve rate and the average number of correct digit placements before an error. Context is managed by keeping the initial puzzle prompt plus the most recent model responses. A minimal sketch of this loop appears after the list.
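The sketch below shows how such a multi-step loop could be run, assuming a `query_model` callable for the LLM, a ground-truth `solution` given as a row-major digit string, and a simple 'r3c5: 7' placement syntax; the released harness may differ in prompt wording, parsing, and context handling.

```python
import re

def render_board(board, size):
    """Render a row-major list of cells as newline-separated rows."""
    return "\n".join("".join(board[r * size:(r + 1) * size]) for r in range(size))

def format_prompt(puzzle):
    """Assemble a text-only prompt from the puzzle description (assumed format)."""
    return (
        f"Solve this {puzzle['size']}x{puzzle['size']} Sudoku variant.\n"
        f"Rules: {puzzle['rules']}\n"
        "Board ('.' = empty):\n"
        + render_board(list(puzzle["initial_board"]), puzzle["size"])
        + "\nGive at least one placement per turn in the form 'r3c5: 7'."
    )

def parse_placements(reply):
    """Extract placements such as 'r3c5: 7'; rows and columns become zero-based."""
    return [(int(r) - 1, int(c) - 1, d)
            for r, c, d in re.findall(r"r(\d+)c(\d+)\s*[:=]\s*(\d)", reply, re.I)]

def run_multi_step(puzzle, solution, query_model, max_turns=200):
    """Run one puzzle until it is solved, a wrong digit is placed, or max_turns."""
    board = list(puzzle["initial_board"])            # row-major, '.' = empty
    size = puzzle["size"]
    messages = [{"role": "user", "content": format_prompt(puzzle)}]
    correct = 0

    for _ in range(max_turns):
        reply = query_model(messages)                # one model turn
        placements = parse_placements(reply)
        if not placements:                           # no move offered: stop
            break
        for row, col, digit in placements:
            idx = row * size + col
            if digit != solution[idx]:               # first wrong digit ends the run
                return {"solved": False, "correct_placements": correct}
            board[idx] = digit
            correct += 1
        if "." not in board:
            return {"solved": True, "correct_placements": correct}
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": "Current board:\n" + render_board(board, size)})
        # In practice, older turns are truncated so that only the initial puzzle
        # prompt plus the most recent model responses remain in context.
    return {"solved": False, "correct_placements": correct}
```

Stopping at the first incorrect placement directly yields the "correct digit placements before an error" metric described above.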

Baseline experiments with state-of-the-art LLMs demonstrate that Sudoku-Bench poses a significant challenge. Even leading models achieved solve rates below 15% on the full benchmark without tool assistance. Performance dropped dramatically with increasing puzzle size and complexity: models achieved reasonable solve rates on 4×4 puzzles (40-73%) but struggled significantly with 6×6 and 9×9 variants, with performance near 0% for many models on the largest grids. The minimal difference between single-shot and multi-step performance on larger puzzles suggests the core difficulty lies in identifying initial logical breakthroughs ("break-ins"), not just incremental deduction.

Analysis of model failures revealed common patterns: models frequently produce confidently incorrect solutions, explicitly surrender, claim missing information (often mistaking novel rules for incomplete puzzles), or mistakenly identify contradictions in valid rules. The "Missing Information" failure mode is particularly notable, suggesting models trained primarily on standard data struggle to interpret the novel constraints common in variants.

The paper illustrates the challenge with examples like "Ascension," a 9×9 puzzle with minimal givens that requires chaining deductions across unusual constraints (knight's moves and arrows). Human solvers identify a subtle, multi-step logical sequence to place the first digits (a "break-in"). Tested LLMs failed to find this logical entry point and resorted to ineffective brute-force attempts. Conversely, an example like "Sumthings" (a 6×6 puzzle) shows that models can succeed by narrowing the search space and then searching through the remaining candidates, a strategy less effective on puzzles requiring specific break-ins.

The authors discuss the role of tool use, noting that while external solvers could solve some puzzles, this would bypass the core creative reasoning challenge the benchmark aims to evaluate. Thus, the primary evaluation focuses on no-tool use, assessing intrinsic reasoning. Future work could explore a separate track with tool use on a different set of puzzles.

In summary, Sudoku-Bench provides a structured, challenging benchmark and associated tools for evaluating and advancing AI's capacity for creative, multi-step logical reasoning beyond memorization. The significant gap in performance between current LLMs and human solvers on this benchmark highlights ample opportunity for future research, potentially leveraging the provided expert reasoning traces for training models that exhibit more human-like problem-solving strategies.
