GSM8K-Prolog-Prover Dataset
- GSM8K-Prolog-Prover is a dataset that enriches GSM8K math problems with executable SWI-Prolog code, enabling precise symbolic reasoning.
- It employs a rigorous generation and verification pipeline including GPT-driven candidate production, automatic execution, and manual corrections.
- Benchmark results demonstrate that Prolog-based execution significantly boosts math problem solving accuracy compared to chain-of-thought approaches.
The GSM8K-Prolog-Prover dataset is a corpus that enriches the GSM8K elementary math word problem benchmark with Prolog programs structured for automated reasoning and verifiable execution. Leveraging the strengths of symbolic logic programming, this dataset enables precise assessment of LLMs’ abilities to extract, represent, and solve quantitative reasoning problems in a formally executable and auditable framework (Yang et al., 2024, Mellgren et al., 8 Dec 2025, Yang et al., 2023).
1. Dataset Definition and Scope
GSM8K-Prolog-Prover consists of natural-language mathematical problems from the original GSM8K benchmark, each paired with a small SWI-Prolog program utilizing CLP(Q) (Constraint Logic Programming over rationals). Each reference Prolog solution is deterministically executable, yielding the known GSM8K answer via the SWI-Prolog interpreter. The creation and verification process ensures that every Prolog solution exactly replicates the intended numeric answer for each problem (Yang et al., 2024, Mellgren et al., 8 Dec 2025).
Two major versions have been documented:
- Original GSM8K-Prolog: Released by Yang et al., featuring 8,792 examples (7,473 train, 100 validation, 1,319 test) (Yang et al., 2024).
- GSM8K-Prolog-Prover (cleaned): Prepared by Mellgren et al., correcting 15 mismatches for 100% consistency across 7,473 examples (2,500 train, 375 validation, 1,320 test) (Mellgren et al., 8 Dec 2025).
2. Format and Prolog DSL Specification
Each dataset instance comprises the following fields, typically in JSON Lines (JSONL) format (Yang et al., 2024, Mellgren et al., 8 Dec 2025, Yang et al., 2023):
- `instruction`: the prompt to generate Prolog for the problem.
- `input`/`question`: the original math word problem.
- `output`/`prolog_code`: the corresponding SWI-Prolog program under CLP(Q).
- (Optionally) `chain_of_thought` and `correct_answer`.
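Concretely, a single JSONL record with these fields might look like the following sketch (field values are invented for illustration; only the field names follow the schema above):

```python
import json

# Illustrative JSONL record; the values are made up, the keys follow
# the dataset schema (instruction / input / output).
record = {
    "instruction": "Write an SWI-Prolog program using CLP(Q) that solves the problem.",
    "input": "Raymond is 6 years younger than Samantha. ...",
    "output": ":- use_module(library(clpq)).\nsolve(X) :- {X = 31 - (23 - 6)}.",
}

line = json.dumps(record)  # one record per line in a .jsonl file
parsed = json.loads(line)
print(parsed["output"].splitlines()[0])  # :- use_module(library(clpq)).
```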
The Prolog code adheres to a minimal, deterministic, and verifiable DSL. The canonical program structure contains:
- Import of CLP(Q): `:- use_module(library(clpq)).`
- Zero or more facts for problem constants, e.g., `age(raymond, 23).`
- A single public predicate `solve/1` binding the numeric answer.
- All arithmetic and constraints enclosed within `{ ... }` blocks using CLP(Q) syntax, e.g., `{Son_age = R - D}`.
- All identifiers, punctuation, and variable conventions follow SWI-Prolog.
Example (abridged from Yang et al., 2024):

```prolog
:- use_module(library(clpq)).
age_diff(raymond, samantha, 6).
age(raymond, 23).
age(samantha, 31).
solve(Years_ago) :-
    age_diff(raymond, samantha, D),
    age(raymond, R),
    age(samantha, S),
    {Son_age = R - D},
    {Years_ago = S - Son_age}.
```
The grammar is formally specified in minimal BNF (Yang et al., 2024, Mellgren et al., 8 Dec 2025).
3. Construction and Annotation Pipeline
The generation and verification process involves several stages (Yang et al., 2024, Mellgren et al., 8 Dec 2025, Yang et al., 2023):
- Prolog Generation: GPT-4 or GPT-3.5 models are prompted few-shot (using hand-crafted and accumulated correct examples) to generate candidate Prolog for each GSM8K problem.
- Automatic Verification: Each output program is executed via SWI-Prolog (PySwip or equivalent). Only exact matches to the GSM8K solution are retained automatically.
- Manual Correction and Cleaning: Remaining failures (<5% in Yang et al.; specifically 15 mismatches in Mellgren et al.) are manually rectified for syntactic and semantic correctness. This guarantees interpreter-verified correctness for all entries.
- Permutation Augmentation ("PROPER"): The declarative nature of Prolog permits data augmentation through permutation of clause and goal orders. For each original program, a bounded number of permuted variants is sampled (in practice, 10–20, generated on-the-fly), permuting both subgoal order and clause order. Experiments vary the ratio of permuted to original examples to optimize robust learning (Yang et al., 2024).
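The permutation idea can be sketched with a small hypothetical helper, assuming a program's facts and the subgoals of `solve/1` have already been split into lists of strings (this illustrates PROPER, not the authors' implementation):

```python
import random

def permute_program(facts, goals, n_variants=10, seed=0):
    """Sample distinct clause-order / subgoal-order permutations of a solution.

    Because Prolog facts and CLP(Q) constraints are declarative, every
    permutation executes to the same answer; only surface order changes.
    """
    rng = random.Random(seed)
    variants = set()
    attempts = 0
    while len(variants) < n_variants and attempts < 100 * n_variants:
        attempts += 1
        f = tuple(rng.sample(facts, len(facts)))   # permute clause order
        g = tuple(rng.sample(goals, len(goals)))   # permute subgoal order
        variants.add((f, g))
    programs = []
    for f, g in sorted(variants):
        body = ",\n    ".join(g)
        programs.append("\n".join(f) + f"\nsolve(X) :-\n    {body}.")
    return programs
```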
4. Dataset Composition, Statistics, and Accessibility
| Version | Train | Valid | Test | Total | Consistency | Host/License |
|---|---|---|---|---|---|---|
| GSM8K-Prolog (Yang et al.) | 7,473 | 100 | 1,319 | 8,792 | Verified | HF / MIT |
| GSM8K-Prolog-Prover (clean) | 2,500 | 375 | 1,320 | 7,473 | 100% CLP(Q) | HF / MIT |
- The released dataset is on HuggingFace: Thomas-X-Yang/gsm8k-prolog (Yang et al., 2024, Mellgren et al., 8 Dec 2025, Yang et al., 2023).
- Licensing is MIT, inherited from the original GSM8K (Yang et al., 2024, Yang et al., 2023).
5. Usage, Execution, and Evaluation Protocol
Loading the dataset:
Using Hugging Face's `datasets` library:

```python
from datasets import load_dataset

ds = load_dataset("Thomas-X-Yang/gsm8k-prolog")
train = ds["train"]
```
Prolog execution (PySwip):
```python
from pyswip import Prolog
import tempfile

# assertz/1 cannot load directives such as `:- use_module(...)` and handles
# only single clauses, so consult the program from a file instead.
with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
    f.write(code)  # `code` holds the dataset's Prolog program
    path = f.name

prolog = Prolog()
prolog.consult(path)
result = list(prolog.query("solve(X)"))
print(result)  # e.g., [{'X': 14}]
```
Evaluation:
- Execution accuracy is the primary metric: a generated program is considered correct iff its execution yields exactly the GSM8K gold answer. The formula:

$$\text{ExecAcc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\text{exec}(p_i) = a_i\right],$$

where $p_i$ is the program generated for problem $i$, $\text{exec}(\cdot)$ its SWI-Prolog execution result, and $a_i$ the gold answer.
- Additional protocols in recent works introduce structural validity, semantic similarity (SBERT-based), and static code checks (Mellgren et al., 8 Dec 2025).
- Models are compared using standard supervised finetuning and reinforcement learning strategies (e.g., GRPO), with variations in reward composition and inference protocols. Empirically, models fine-tuned to produce correct Prolog under verifiable execution significantly outperform standard chain-of-thought prompting on GSM8K (Yang et al., 2024, Mellgren et al., 8 Dec 2025, Yang et al., 2023).
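Computing execution accuracy over a batch amounts to a small helper (a sketch; names are hypothetical), where programs that crash or time out count as incorrect:

```python
def execution_accuracy(predicted, gold):
    """Fraction of problems whose generated program executed to the gold answer.

    `predicted` holds the numeric result of each program's execution
    (None when execution failed); `gold` holds the GSM8K reference answers.
    """
    assert len(predicted) == len(gold)
    correct = sum(1 for p, g in zip(predicted, gold) if p is not None and p == g)
    return correct / len(gold)

# Example: one exact match, one failed execution, one wrong answer.
acc = execution_accuracy([14, None, 7], [14, 3, 8])  # 1 correct out of 3
```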
6. Empirical Results and Benchmarking
Experiments across Llama-2, CodeLlama, Mistral, and Qwen2.5 demonstrate that Prolog-based program induction and verification outperform chain-of-thought on GSM8K and GSM-HARD:
| Model | CoT (%) | Prolog (%) | PROPER (%) |
|---|---|---|---|
| Llama-2-7B | 37.5 | 55.0 | 59.0 |
| CodeLlama-7B | 58.9 | 66.3 | 70.2 |
| Mistral-7B | 33.8 | 41.5 | 51.0 |
Best Prolog-based models using reinforcement learning with verifiable rewards can achieve up to ~80% execution accuracy on GSM8K, which approaches or surpasses larger models using standard text final-answer evaluation (Mellgren et al., 8 Dec 2025). Permutation augmentation (PROPER) further boosts both mean and best accuracy metrics (Yang et al., 2024).
7. Context, Applications, and Impact
GSM8K-Prolog-Prover operationalizes grade-school math word problem solving as a symbolic code generation and verification task. This enables:
- Auditable, traceable solution derivations as verification steps are executed in a deterministic logic engine.
- Rigorous evaluation of LLMs’ capacity to map natural language reasoning into executable symbolic representations.
- Fine-tuning and RL protocols that leverage multi-granular reward signals, including execution, syntax, and semantic overlap, to shape model learning (Mellgren et al., 8 Dec 2025).
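A minimal sketch of such multi-granular reward shaping (the weights are hypothetical, not taken from the cited papers): execution correctness dominates, while structural validity and semantic overlap contribute smaller shaping terms.

```python
def composite_reward(exec_correct, syntax_valid, semantic_sim,
                     w_exec=1.0, w_syntax=0.2, w_sem=0.1):
    """Combine binary execution/syntax signals with a [0, 1] similarity score.

    Illustrative weights: execution correctness dominates so that the policy
    cannot game the reward with well-formed but numerically wrong programs.
    """
    return (w_exec * float(exec_correct)
            + w_syntax * float(syntax_valid)
            + w_sem * float(semantic_sim))
```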
By tightly coupling LLM outputs with formal verification engines, GSM8K-Prolog-Prover advances reliable, safety-critical mathematical reasoning and fosters research into interpretable and auditable neuro-symbolic systems (Yang et al., 2024, Mellgren et al., 8 Dec 2025, Yang et al., 2023).