Knights and Knaves Puzzle Dataset
- A Knights and Knaves Logic Puzzle Dataset is a structured collection of puzzles built around agents who always tell the truth or always lie, enabling rigorous tests of formal reasoning.
- The dataset is constructed via manual curation, semi-automated and synthetic generation, and formal encoding to ensure unique solution validation and scalable complexity.
- It underpins evaluation of computational solvers and large language models by supporting benchmarking tasks like theorem proving, chain-of-thought analysis, and reinforcement learning.
A Knights and Knaves Logic Puzzle Dataset constitutes a curated or algorithmically generated collection of logic puzzles predicated on the classic genre introduced by Raymond Smullyan, where agents—typically denoted knights and knaves—are distinguished solely by their response behavior: knights always tell the truth; knaves always lie. Recent years have seen the systematic construction, annotation, and utilization of such datasets for formal reasoning evaluation, natural language understanding, and machine learning benchmarks, including synthetic generation at scale and formal encoding into logical frameworks. The dataset paradigm has been extended with hybrid agent types, computational solving systems, and deployment in reinforcement learning pipelines.
1. Classical Knights and Knaves Formalism and Puzzle Encoding
The prototypical dataset instance consists of $n$ agents, each characterized as either knight or knave. In the canonical logical encoding, the background knowledge is given by two universal axioms: every agent is exactly one of the two types, $\forall x\,(\mathrm{knight}(x) \leftrightarrow \neg\mathrm{knave}(x))$, and every utterance is true exactly when its speaker is a knight, i.e., for each statement $\varphi$ uttered by agent $a$, $\mathrm{knight}(a) \leftrightarrow \varphi$.
Puzzle-specific statements are then formalized as constraints relating agent types to utterances (e.g., "Sue claims that Rex is a knight and Dave is a knave" becomes $\mathrm{knight}(\mathrm{Sue}) \leftrightarrow (\mathrm{knight}(\mathrm{Rex}) \wedge \mathrm{knave}(\mathrm{Dave}))$). Each instance includes the puzzle text, its logical formalization, and target questions of the form "Is X a knight?", which are classified as entailment, contradiction, or ambiguity depending on the solution space (Szomiu et al., 2021).
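To make the classification concrete, the following is a minimal sketch of this semantics via exhaustive model enumeration (a brute-force stand-in for the formal pipeline, not the cited implementation); the Boolean encoding with True for knight and the example utterance are illustrative assumptions.

```python
from itertools import product

# Minimal sketch: enumerate all knight/knave assignments, keep the models
# that satisfy every utterance, and classify an atomic query as
# entailment, contradiction, or ambiguity.

agents = ["Sue", "Rex", "Dave"]

def consistent(model):
    # "Sue claims that Rex is a knight and Dave is a knave":
    # the claim holds exactly when Sue is a knight.
    claim = model["Rex"] and not model["Dave"]
    return model["Sue"] == claim

models = []
for values in product([True, False], repeat=len(agents)):  # True = knight
    model = dict(zip(agents, values))
    if consistent(model):
        models.append(model)

def classify(query):
    """Label a query relative to all surviving models."""
    truths = {query(m) for m in models}
    if truths == {True}:
        return "entailment"
    if truths == {False}:
        return "contradiction"
    return "ambiguity"

print(classify(lambda m: m["Rex"]))  # "Is Rex a knight?" -> ambiguity here
```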
Automated reasoning tools (e.g., the Mace4 model finder and the Prover9 theorem prover) are used to evaluate puzzle consistency, infer answers to atomic queries, and systematically check solutions for uniqueness or ambiguity (Szomiu et al., 2021, Groza et al., 2021). This formal pipeline underpins both human and machine comprehension benchmarks.
2. Dataset Construction: Sources, Scaling, and Synthesis
Datasets originate from manual curation (e.g., collections of natural language logic puzzles), systematic encoding (human or semi-automated translation into first-order logic), and, more recently, synthetic generation via algorithmic logic engines or Satisfiability Modulo Theories (SMT) solvers (Xiong et al., 21 Aug 2025). For example, one approach constructs the 2,400-puzzle TruthQuest benchmark through combinatorial variation of statement forms, number of agents, and logical complexity (Mondorf et al., 18 Jun 2024). Diversity is further enhanced by synthetic generation of agent types and logical constraints, as well as scalable randomization over puzzle parameters (Xiong et al., 21 Aug 2025). Importance is placed on verifiability: all instances must yield programmatically checkable solutions (unique, or with controlled ambiguity) that are faithful to the underlying logical specification.
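As an illustration of verifiable synthesis, the sketch below randomly samples conjunctive utterances and retains only puzzles whose encoding admits exactly one model; the statement form, parameters, and function names are assumptions, not the TruthQuest or PuzzleClone generators.

```python
import random
from itertools import product

# Hedged sketch of synthetic generation with programmatic verification.
# Statements have the fixed form "speaker says: X is a knight/knave and
# Y is a knight/knave"; only uniquely solvable puzzles are kept.

def random_statement(agents, rng):
    speaker = rng.choice(agents)
    (t1, c1), (t2, c2) = [(rng.choice(agents), rng.choice([True, False])) for _ in range(2)]
    return (speaker, (t1, c1), (t2, c2))   # c = True means "... is a knight"

def solve(agents, statements):
    """All knight/knave assignments (True = knight) consistent with every statement."""
    models = []
    for values in product([True, False], repeat=len(agents)):
        model = dict(zip(agents, values))
        if all(model[s] == ((model[t1] == c1) and (model[t2] == c2))
               for s, (t1, c1), (t2, c2) in statements):
            models.append(model)
    return models

def generate_unique(agents, n_puzzles=10, n_statements=3, seed=0):
    rng, kept = random.Random(seed), []
    while len(kept) < n_puzzles:
        puzzle = [random_statement(agents, rng) for _ in range(n_statements)]
        if len(solve(agents, puzzle)) == 1:    # verifiability: demand a unique model
            kept.append(puzzle)
    return kept

print(generate_unique(["A", "B", "C"])[0])
```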
Broadened datasets also incorporate puzzles with generalized agent types:
- Spies (may lie or answer arbitrarily) (Wildon, 2014)
- Normals (sometimes tell the truth, sometimes lie) (Rakshit et al., 2023)
- Alternators, chameleons, and oscillators (behaviors determined by past interactions or internal state) (Khovanova, 2018)
These variants expand the logical state space and require richer formal encodings, as sketched below.
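A hedged sketch of how such generalized types can be accommodated: a "normal" agent places no constraint on its own utterances, which enlarges the search space from $2^n$ to $3^n$ type assignments (the helper below is hypothetical, not code from the cited papers).

```python
from itertools import product

# Hypothetical extension of model enumeration to a third agent type.
TYPES = ("knight", "knave", "normal")

def models(agents, utterances):
    """utterances: list of (speaker, claim), where claim maps a model to a bool."""
    surviving = []
    for assignment in product(TYPES, repeat=len(agents)):
        model = dict(zip(agents, assignment))
        ok = True
        for speaker, claim in utterances:
            if model[speaker] == "knight" and not claim(model):
                ok = False   # knights must speak the truth
            if model[speaker] == "knave" and claim(model):
                ok = False   # knaves must lie
            # "normal" (and similar arbitrary responders): no constraint
        if ok:
            surviving.append(model)
    return surviving

# "A says: B is a knave"
print(len(models(["A", "B"], [("A", lambda m: m["B"] == "knave")])))  # 6 of 9 assignments survive
```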
3. Logical Frameworks and Benchmark Annotation
Each puzzle is annotated not only with natural language text and solutions but also with its formal logical structure. The logical framework integrates:
- Universal background axioms (see above)
- Puzzle-specific constraints
- Explicit generation and labeling of atomic questions:
  - Entailment: the statement is true in every model
  - Contradiction: the statement is false in every model
  - Ambiguity: not determined (true in some models, false in others)
This formal structure allows the dataset to serve as a benchmark for natural language inference, supporting both symbolic and neural reasoners. Additionally, difficulty metrics (e.g., number of agents, solution space structure) and logical complexity measures (depth of nesting, types of operators: conjunction, implication, biconditional, etc.) are annotated to facilitate stratified analysis (Szomiu et al., 2021, Mondorf et al., 18 Jun 2024).
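For concreteness, a hypothetical record layout is shown below; the field names are illustrative assumptions rather than the schema of any cited dataset, but the layers mirror the annotation described above.

```python
# Hypothetical annotated instance (field names are illustrative only).
instance = {
    "text": "Sue says that Rex is a knight and Dave is a knave.",
    "agents": ["Sue", "Rex", "Dave"],
    "formalization": "knight(Sue) <-> (knight(Rex) & knave(Dave))",
    "questions": [
        {"query": "knight(Sue)", "label": "ambiguity"},
        {"query": "knight(Sue) -> knight(Rex)", "label": "entailment"},
    ],
    "difficulty": {"n_agents": 3, "n_statements": 1,
                   "operators": ["and"], "nesting_depth": 1},
}
```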
4. Computational Solving Algorithms and Evaluation
Datasets serve as ground truth for the development and benchmarking of logic solvers. Algorithmic approaches range from combinatorial truth-table methods to knowledge base construction using symbolic logic operators (And, Or, Not, Implication) (Rakshit et al., 2023). Formally, the solving task can be cast as exhaustively evaluating all assignments subject to the background and puzzle constraints, iterating through possible agent-type assignments and checking model satisfaction.
Python-based frameworks are commonly utilized to parse puzzle statements, construct logical knowledge bases, and deduce solutions. For puzzles with ambiguous outcomes, only the consistent parts of the solutions are retained (Rakshit et al., 2023). Performance is measured by correctness against annotated solutions and by computational efficiency (number of inferences, scaling with n).
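The sketch below illustrates this style of knowledge-base construction with minimal symbolic operator classes and truth-table model checking; it mirrors the operators named above but is not the code of Rakshit et al. (2023).

```python
from itertools import product

# Minimal symbolic operators and truth-table entailment checking.
class Symbol:
    def __init__(self, name): self.name = name
    def eval(self, model):    return model[self.name]

class Not:
    def __init__(self, p):    self.p = p
    def eval(self, model):    return not self.p.eval(model)

class And:
    def __init__(self, *ps):  self.ps = ps
    def eval(self, model):    return all(p.eval(model) for p in self.ps)

class Or:
    def __init__(self, *ps):  self.ps = ps
    def eval(self, model):    return any(p.eval(model) for p in self.ps)

class Implication:
    def __init__(self, p, q): self.p, self.q = p, q
    def eval(self, model):    return (not self.p.eval(model)) or self.q.eval(model)

def model_check(kb, query, symbols):
    """Truth-table entailment: does every model of kb also satisfy query?"""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb.eval(model) and not query.eval(model):
            return False
    return True

# "A says: B is a knave", encoded as knight(A) <-> knave(B), with knave(X) = Not(knight(X)).
A, B = Symbol("A_knight"), Symbol("B_knight")
iff = lambda p, q: And(Implication(p, q), Implication(q, p))
kb = iff(A, Not(B))
print(model_check(kb, Or(A, B), ["A_knight", "B_knight"]))  # True: at least one is a knight
```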
In datasets annotated for ML evaluation, metrics such as overall accuracy, parsing coverage, and robustness to puzzle complexity are reported (Groza et al., 2021). For instance, one system achieves 80.89% accuracy on a set of 382 canonical puzzles, with parsing and coreference modules providing further analytic breakdowns.
5. Benchmarking LLMs and Reasoning Systems
Recent work leverages Knights and Knaves datasets as rigorous testbeds for systematic reasoning evaluation of LLMs. The TruthQuest benchmark (Mondorf et al., 18 Jun 2024) assesses LLM performance on 2,400 unique puzzles of varying complexity, analyzing error types such as conceptual misunderstandings of truth and lying, misapplication of logical operators, and failure to propagate hypothetical truth values. Evaluation protocols incorporate zero-shot, chain-of-thought, and few-shot prompting.
Empirical results indicate persistent difficulties among LLMs—particularly as puzzle size and complexity increase. Error analysis partitions failures into misunderstanding truth/lie mechanisms (TL errors), improper logical operator application (LO errors), and failures in suppositional reasoning (conditional propagation) (Mondorf et al., 18 Jun 2024). This positions Knights and Knaves datasets as platforms for both diagnosing and advancing model reasoning.
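A hedged sketch of such an evaluation harness is given below; `generate` stands in for an arbitrary LLM call, and the prompt templates and naive answer parser are illustrative assumptions rather than the TruthQuest protocol.

```python
import re

# Illustrative evaluation loop for zero-shot vs. chain-of-thought prompting.
ZERO_SHOT = "Solve the puzzle and state each person's type.\n\nPuzzle: {puzzle}\nAnswer:"
CHAIN_OF_THOUGHT = ("Solve the puzzle step by step, then state each person's type.\n\n"
                    "Puzzle: {puzzle}\nReasoning:")

def parse_assignment(reply):
    """Naive extraction of 'Name is a knight/knave' patterns from free text."""
    return {name: role.lower()
            for name, role in re.findall(r"(\w+) is a (knight|knave)", reply, re.I)}

def evaluate(dataset, generate, template=ZERO_SHOT):
    """dataset items: {"puzzle": str, "solution": {name: "knight"|"knave"}}."""
    correct = sum(parse_assignment(generate(template.format(puzzle=item["puzzle"])))
                  == item["solution"]
                  for item in dataset)
    return correct / len(dataset)
```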
6. Datasets for Training, Transfer, and Reinforcement Learning
Knights and Knaves datasets are not merely evaluation sets; they also form the foundation for curriculum learning, chain-of-thought distillation, and reinforcement learning pipelines. For example, Logic-RL (Xie et al., 20 Feb 2025) trains a 7B model on 5,000 synthetic Knights and Knaves puzzles, using structured system prompts and dual reward signals (a format reward for tagged chain-of-thought plus answer correctness). The procedure elicits advanced reasoning skills such as reflection, verification, and exploration, and transfers to harder mathematical reasoning benchmarks (e.g., improved accuracy on AIME and AMC).
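The dual reward can be approximated as in the sketch below; the tag names, reward magnitudes, and answer parser are assumptions, not the exact Logic-RL implementation.

```python
import re

# Hedged sketch of a rule-based dual reward: a format reward for tagged
# chain-of-thought plus a correctness reward for the final assignment.
def reward(completion, gold_assignment):
    fmt = bool(re.fullmatch(r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", completion))
    format_reward = 1.0 if fmt else -1.0      # tagged chain-of-thought present?

    answer = re.search(r"(?s)<answer>(.*)</answer>", completion)
    predicted = dict(re.findall(r"(\w+) is a (knight|knave)", answer.group(1), re.I)) if answer else {}
    predicted = {k: v.lower() for k, v in predicted.items()}
    answer_reward = 2.0 if predicted == gold_assignment else -2.0

    return format_reward + answer_reward

print(reward("<think>B lies...</think><answer>A is a knight, B is a knave.</answer>",
             {"A": "knight", "B": "knave"}))   # 3.0
```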
A two-stage training strategy (Shrestha et al., 19 May 2025) uses Knights and Knaves puzzles as a "warm-up" before reinforcement learning with verifiable rewards (RLVR), showing enhanced sample efficiency and cross-domain generalization. MeRF (Zhang et al., 23 Jun 2025) further augments RLVR by injecting a natural language motivation (a description of the reward function) directly into the prompt, showing that models align their internal reasoning with the optimization objective when the in-context "game rules" are provided.
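A minimal sketch of such motivation injection, assuming a hypothetical reward description and prompt layout (not the MeRF paper's exact wording):

```python
# Hypothetical in-context "game rules" describing the reward function.
MOTIVATION = ("Reward: +1 for wrapping reasoning in <think> tags and the final "
              "assignment in <answer> tags, +2 if the assignment matches the unique solution.")

def build_prompt(puzzle_text):
    # The reward description is placed in-context so the model can align its
    # reasoning with the optimization objective.
    return f"{MOTIVATION}\n\nPuzzle: {puzzle_text}\n"
```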
7. Extensions and Open Problems
The logical structure and generalizability of Knights and Knaves datasets have motivated research into scalable synthesis (e.g., the SMT-based PuzzleClone framework (Xiong et al., 21 Aug 2025)), combinatorial optimization of questioning strategies (majority/minority games (Wildon, 2014)), and the elaboration of agent types (guilt-responsive liars (Chen et al., 2016), alternators and agents with switching belief systems (Khovanova, 2018)). Many open questions remain concerning optimal questioning algorithms, minimal query strategies, and the formal characterization of ambiguous or indeterminate instances (Wildon, 2014). These extensions drive the continued expansion and theoretical depth of Knights and Knaves datasets as both logical testbeds and foundational reasoning resources.
In conclusion, a Knights and Knaves Logic Puzzle Dataset provides a structured, formal, and extensible corpus of reasoning challenges, bridging classical logic, natural language semantics, algorithmic deduction, and large-scale model evaluation and training. Its continued development supports advances in symbolic reasoning, explainable AI, reinforcement learning, and general problem-solving methodology.