PROPER: Prolog Generation & Permutation Augmentation
- The paper introduces a framework that generates symbolic Prolog code from natural language math problems, using LLMs to bridge neural and symbolic reasoning.
- It exploits the permutation invariance of Prolog facts and rule bodies to augment training data, thereby improving model robustness and mitigating fixed output biases.
- Empirical results demonstrate significant accuracy improvements over chain-of-thought methods on benchmarks like GSM8K and GSM-HARD using fine-tuned Llama-2, CodeLlama, and Mistral models.
Prolog Generation with Permutation-Based Data Augmentation (PROPER) is a methodology for arithmetic reasoning using LLMs, in which symbolic Prolog program generation and permutation-based data augmentation are leveraged to improve accuracy and robustness in mathematical problem solving. The core insight is twofold: LLMs can be tasked with extracting and generating semantic Prolog predicates from natural language math questions, and the inherent permutation invariance of Prolog facts and rule bodies can be systematically exploited for data augmentation (Yang et al., 2024).
1. End-to-End Pipeline Architecture
The PROPER framework decomposes the solution process into three sequential stages:
(i) Natural-language to Prolog extraction:
An LLM receives a grade-school word problem in English and produces a corresponding set of Prolog predicates. These predicates include both factual assertions and a solve(...) :- ... rule, collectively encoding the arithmetic relationships and the target quantity.
Example conversion:
- Input: “Raymond and Samantha are cousins. Raymond was born 6 years before Samantha... If Samantha is now 31, how many years ago was Raymond’s son born?”
- Output (ground truth Prolog): a program made of facts and a solve(...) :- ... rule (listing not preserved in this copy).
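To make the shape of such an output concrete, here is a minimal Python sketch that splits a Prolog program into directives, facts, and the goals of its solve rule. The example program is hypothetical (a made-up two-fact problem, not the Raymond example), the constraint syntax `{...}` and the `clpq` directive are assumptions about how arithmetic might be written, and the helper names `split_top_level` and `parse_program` are illustrative only.

```python
def split_top_level(text, sep=","):
    """Split `text` on `sep`, ignoring separators nested inside (), {} or []."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch in "({[":
            depth += 1
        elif ch in ")}]":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_program(program):
    """Return (directives, facts, head, goals); assumes one clause per line
    and no commas inside quoted atoms."""
    directives, facts, head, goals = [], [], None, []
    for line in program.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":-"):
            directives.append(line)
        elif line.startswith("solve(") and ":-" in line:
            head_part, body = line.split(":-", 1)
            head = head_part.strip()
            goals = split_top_level(body.strip().rstrip("."))
        else:
            facts.append(line)
    return directives, facts, head, goals


# Hypothetical program for "Tom has 3 apples, Ann has 5; how many in total?"
EXAMPLE = """
:- use_module(library(clpq)).
apples(tom, 3).
apples(ann, 5).
solve(Total) :- apples(tom, A), apples(ann, B), {Total = A + B}.
"""

directives, facts, head, goals = parse_program(EXAMPLE)
```

Separating facts from goals in this way is what later makes it possible to shuffle each group independently.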
(ii) LLM code generation:
Models such as Llama-2-7B, CodeLlama-7B, and Mistral-7B are fine-tuned or prompt-tuned to emit valid Prolog code from the given problem statement. During inference, beam search (beam size = 4) is used for candidate generation.
(iii) External Prolog execution:
The generated Prolog code is evaluated using PySwip, a SWI-Prolog interface for Python. The program is loaded, queried via ?- solve(X)., and the concrete numeric answer is deterministically extracted.
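The execution stage can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes PySwip and SWI-Prolog are installed, writes the generated program to a temporary file, consults it, and reads the first numeric binding of `X`. The helper names `run_prolog` and `extract_answer` are hypothetical.

```python
import os
import tempfile


def extract_answer(solutions, var="X"):
    """Return the first numeric binding of `var` among query solutions, else None."""
    for solution in solutions:
        value = solution.get(var)
        if isinstance(value, (int, float)):
            return value
    return None


def run_prolog(program_text, query="solve(X)"):
    """Load `program_text` into SWI-Prolog via PySwip and run `query`.

    Assumption: a fresh process per program is preferable in practice,
    because clauses consulted into one Prolog instance accumulate.
    """
    from pyswip import Prolog  # imported lazily so the helper above works without PySwip

    prolog = Prolog()
    with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as handle:
        handle.write(program_text)
        path = handle.name
    try:
        prolog.consult(path)
        return extract_answer(prolog.query(query))
    finally:
        os.remove(path)
```

Returning `None` when no numeric binding is found gives the caller a simple way to count failed or non-terminating generations as incorrect.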
2. Prompt Engineering Strategies
PROPER employs structured prompt designs, each with a concise instruction and input field. Three training prompt regimes are central:
- Chain-of-Thought (CoT) baseline: Requests a detailed stepwise answer terminating with the final answer.
- Prolog generation: Instructs the model to output Prolog code that solves the question.
- Permuted-Prolog generation: Instructs generation of Prolog code in an arbitrary, non-sequential order.
Each regime appends 20 few-shot examples formatted with the same template. For the permuted regime, both the facts and goals are randomly shuffled, and the "non-sequential" instruction clarifies that output order is not fixed.
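The three regimes can be pictured with a small prompt builder. The exact wording of the instructions and the `### Instruction / ### Input / ### Output` layout are assumptions made for illustration; the source only states that each prompt has a concise instruction and an input field.

```python
# Hypothetical instruction wording for each regime (not taken from the paper).
INSTRUCTIONS = {
    "cot": "Solve the problem step by step and end with the final answer.",
    "prolog": "Write Prolog code that solves the given math problem.",
    "permuted": "Write Prolog code that solves the given math problem. "
                "The facts and goals may appear in any, non-sequential order.",
}

# Assumed template layout.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{question}\n\n### Output:\n{output}"


def build_prompt(regime, question, shots=()):
    """Format few-shot (question, output) pairs and the query with one template."""
    instruction = INSTRUCTIONS[regime]
    blocks = [
        TEMPLATE.format(instruction=instruction, question=q, output=o)
        for q, o in shots
    ]
    # The final block leaves the output empty for the model to complete.
    blocks.append(TEMPLATE.format(instruction=instruction, question=question, output=""))
    return "\n\n".join(blocks)


shots = [("Tom has 3 apples and Ann has 5. How many in total?",
          "apples(tom, 3).\napples(ann, 5).\n"
          "solve(T) :- apples(tom, A), apples(ann, B), {T = A + B}.")]
prompt = build_prompt("permuted", "A pen costs 2 dollars. How much do 4 pens cost?", shots)
```

Using one template for the shots and the query keeps the few-shot examples and the final request visually identical, which is the property the text describes.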
3. Permutation-Based Data Augmentation (PROPER)
PROPER exploits the order-invariance (permutation invariance) of Prolog facts and rule bodies to augment the training data:
- For each original sample, let $F$ denote the set of facts/rules and $G$ the ordered list of goals in the solve clause.
- Up to 10 random permutations of $F$ and, independently, up to 10 permutations of $G$ are sampled, yielding a maximum of 100 permuted program variants per sample.
- During training, each original is mixed with $k$ distinct permutations, where $k$ is determined by a permutation ratio $1{:}k$; e.g., $1{:}1$ (one permuted per original) and $1{:}2$ (two permuted per original).
Empirical studies explored the permutation ratios 1:1 and 1:2.
This suggests that permutation-based augmentation pushes the model to become robust to predicate order, improving generalization and discouraging reliance on fixed generation patterns.
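The augmentation step described in this section can be sketched in Python as below. This is an illustrative reading, not the paper's code: the function names are hypothetical, the original ordering may appear among the sampled permutations, and reordering goals is assumed to be safe only when the goals are order-independent (for example, arithmetic written as constraints rather than with `is/2`).

```python
import itertools
import math
import random


def sample_permutations(items, limit, rng):
    """Return up to `limit` distinct orderings of `items`.

    Positions are permuted (not values), so repeated items cannot stall sampling.
    """
    items = list(items)
    n = len(items)
    if math.factorial(n) <= limit:
        index_orders = list(itertools.permutations(range(n)))
    else:
        seen = set()
        while len(seen) < limit:
            seen.add(tuple(rng.sample(range(n), n)))
        index_orders = sorted(seen)
    return [[items[i] for i in order] for order in index_orders]


def permuted_programs(facts, head, goals, limit=10, seed=0):
    """Cross up to `limit` fact orders with up to `limit` goal orders."""
    rng = random.Random(seed)
    fact_orders = sample_permutations(facts, limit, rng)
    goal_orders = sample_permutations(goals, limit, rng)
    programs = []
    for fact_order, goal_order in itertools.product(fact_orders, goal_orders):
        rule = f"{head} :- {', '.join(goal_order)}."
        programs.append("\n".join(fact_order + [rule]))
    return programs


facts = ["price(pen, 2).", "price(book, 7).", "count(pen, 4)."]
goals = ["price(pen, P)", "count(pen, N)", "{Total = P * N}"]
variants = permuted_programs(facts, "solve(Total)", goals)
```

With three facts and three goals there are only 6 orderings of each, so this example yields 36 variants; the cap of 100 is reached once both lists have at least four elements.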
4. Evaluation Criteria and Training Loss
PROPER is evaluated with exact-match accuracy, measuring if the numeric answer produced by the generated Prolog code matches the ground truth when executed in a Prolog interpreter.
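Let $\{(p_i, \hat{p}_i)\}_{i=1}^{N}$ be the evaluation pairs, where $p_i$ is the gold Prolog program and $\hat{p}_i$ the model’s output. Programs are executed with interpreter $\mathcal{I}$, yielding answers $a_i = \mathcal{I}(p_i)$ and $\hat{a}_i = \mathcal{I}(\hat{p}_i)$. Exact-match accuracy is: $\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{a}_i = a_i]$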
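As a small illustration of the metric, the following Python sketch compares executed answers against gold answers. Treating a failed execution as `None` and allowing a tiny absolute tolerance for floating-point results are assumptions of this sketch, not details stated in the source.

```python
import math


def exact_match_accuracy(predicted, gold, tol=1e-6):
    """Fraction of items whose executed answer equals the gold answer.

    `predicted` holds numbers, or None where execution failed (assumed convention).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        1
        for p, g in zip(predicted, gold)
        if p is not None and math.isclose(p, g, rel_tol=0.0, abs_tol=tol)
    )
    return correct / len(gold)


# Two of four answers match: the first and the third.
score = exact_match_accuracy([14, None, 3.0, 7], [14, 5, 3, 8])
```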
During fine-tuning, the objective is token-wise cross-entropy: $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(y_t \mid y_{<t}, x)$, where $x$ is the input and $y = (y_1, \dots, y_T)$ is the target Prolog code sequence.
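For readers who prefer code to notation, here is a dependency-free Python sketch of a token-wise cross-entropy over one target sequence. It assumes the model exposes one vector of logits per target position and sums (rather than averages) over tokens; both are assumptions of this illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def sequence_cross_entropy(step_logits, target_ids):
    """Sum of negative log-probabilities assigned to the target tokens."""
    return -sum(
        log_softmax(logits)[target]
        for logits, target in zip(step_logits, target_ids)
    )


# Uniform logits over a 4-token vocabulary give log(4) per target token.
loss = sequence_cross_entropy([[0.0] * 4] * 3, [1, 0, 3])
```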
5. Empirical Results on Arithmetic Benchmarks
Performance was benchmarked on GSM8K and GSM-HARD across Llama-2-7B, CodeLlama-7B, and Mistral-7B. The table below summarizes exact-match accuracy:
| Method | Llama-2 (GSM8K) | CodeLlama (GSM8K) | Mistral (GSM8K) | Llama-2 (GSM-HARD) | CodeLlama (GSM-HARD) | Mistral (GSM-HARD) |
|---|---|---|---|---|---|---|
| CoT | 33.8% | 37.5% | 58.9% | 12.0% | 13.9% | 30.8% |
| Prolog | 41.5% | 55.0% | 66.3% | 32.4% | 41.6% | 50.6% |
| PROPER (1:1) | 50.9% | 58.7% | 70.2% | 37.4% | 45.9% | 54.4% |
| PROPER (1:2) | 51.0% | 59.0% | 68.8% | 37.4% | 45.9% | 54.4% |
Key empirical findings:
- Prolog program generation surpasses Chain-of-Thought in every column of the table, by 7.4 to 27.7 percentage points.
- Incorporating one permuted variant per original (1:1) increases GSM8K accuracy over plain Prolog generation by 9.4 points for Llama-2, 3.7 for CodeLlama, and 3.9 for Mistral.
- A second permutation (1:2) changes GSM8K accuracy by +0.1 (Llama-2), +0.3 (CodeLlama), and -1.4 (Mistral) points, and leaves the reported GSM-HARD figures unchanged.
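The differences between rows can be recomputed directly from the table. The short Python sketch below does only that arithmetic; the dictionary simply transcribes the table values, and the names `ACCURACY` and `gains` are illustrative.

```python
COLUMNS = [
    "Llama-2 (GSM8K)", "CodeLlama (GSM8K)", "Mistral (GSM8K)",
    "Llama-2 (GSM-HARD)", "CodeLlama (GSM-HARD)", "Mistral (GSM-HARD)",
]

# Exact-match accuracy (%) transcribed from the table above.
ACCURACY = {
    "CoT": [33.8, 37.5, 58.9, 12.0, 13.9, 30.8],
    "Prolog": [41.5, 55.0, 66.3, 32.4, 41.6, 50.6],
    "PROPER (1:1)": [50.9, 58.7, 70.2, 37.4, 45.9, 54.4],
    "PROPER (1:2)": [51.0, 59.0, 68.8, 37.4, 45.9, 54.4],
}


def gains(baseline, method):
    """Per-column difference in percentage points, rounded to one decimal."""
    return [
        round(after - before, 1)
        for before, after in zip(ACCURACY[baseline], ACCURACY[method])
    ]


prolog_vs_cot = gains("CoT", "Prolog")
proper_vs_prolog = gains("Prolog", "PROPER (1:1)")
second_permutation = gains("PROPER (1:1)", "PROPER (1:2)")

for name, delta in zip(COLUMNS, prolog_vs_cot):
    print(f"{name}: {delta:+.1f}")
```

From these table values, Prolog generation gains between 7.4 and 27.7 points over CoT, and the 1:1 setting adds between 3.7 and 9.4 points over plain Prolog generation.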
6. Dataset Construction and Generation Workflows
Datasets were constructed as follows:
GSM8K-Prolog Generation (Algorithm 1):
- Start with GSM8K QA pairs and a Prolog interpreter.
- Manually create 10 canonical few-shot Prolog solutions.
- For each QA pair, prompt GPT-4 with the few-shot seeds to draft Prolog code.
- Accept samples where the Prolog program’s output matches the chain-of-thought answer; otherwise iteratively re-prompt or manually correct.
- Repeat augmentation until all items are processed or the time budget is exhausted.
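The accept-or-retry loop in these steps might look like the following Python sketch. The callables `draft_fn` (standing in for a GPT-4 request with the few-shot seeds) and `execute_fn` (standing in for the Prolog interpreter) are hypothetical placeholders supplied by the caller, and the retry limit is an assumed parameter; items that never verify are set aside for manual correction.

```python
def build_dataset(qa_pairs, draft_fn, execute_fn, max_attempts=3):
    """Keep drafted programs whose executed result matches the reference answer."""
    accepted, needs_review = [], []
    for question, gold_answer in qa_pairs:
        program = None
        for attempt in range(max_attempts):
            candidate = draft_fn(question, attempt)
            try:
                result = execute_fn(candidate)
            except Exception:
                result = None  # a crash counts as a failed draft
            if result is not None and result == gold_answer:
                program = candidate
                break
        if program is None:
            needs_review.append(question)
        else:
            accepted.append(
                {"question": question, "program": program, "answer": gold_answer}
            )
    return accepted, needs_review


# Stub callables used only to exercise the loop.
def fake_draft(question, attempt):
    return "good" if (question == "q1" and attempt == 1) else "bad"


def fake_execute(program):
    if program == "good":
        return 5
    raise RuntimeError("syntax error")


accepted, needs_review = build_dataset([("q1", 5), ("q2", 9)], fake_draft, fake_execute)
```

Passing the model and the interpreter in as functions keeps the loop testable without network access or a Prolog installation.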
Permutation Procedure:
For each code sample:
- Generate up to 10 random permutations of the goals and, independently, up to 10 random permutations of the facts.
- For each goals–facts combination, assemble and store as an augmented variant (capped at 100 per sample).
- During training, randomly sample a specified number of permutations per original, controlled by the permutation ratio.
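The final sampling step can be sketched as below. The record layout (`question`, `program`, `variants`) and the function name are assumptions for illustration, as is the choice to drop variants identical to the original before sampling; `k` is the number of permuted variants per original, so `k=1` corresponds to the 1:1 ratio and `k=2` to 1:2.

```python
import random


def build_training_set(samples, k, seed=0):
    """Mix each original program with up to `k` distinct permuted variants."""
    rng = random.Random(seed)
    training = []
    for sample in samples:
        training.append((sample["question"], sample["program"]))
        # Deduplicate while keeping order, and skip copies of the original.
        candidates = [
            v for v in dict.fromkeys(sample["variants"]) if v != sample["program"]
        ]
        for variant in rng.sample(candidates, min(k, len(candidates))):
            training.append((sample["question"], variant))
    rng.shuffle(training)
    return training


samples = [
    {"question": "q1", "program": "a. b. r.", "variants": ["b. a. r.", "a. b. r.", "b. a. r2."]},
    {"question": "q2", "program": "c. d. s.", "variants": ["d. c. s.", "d. c. s2.", "c. d. s2."]},
]
one_to_one = build_training_set(samples, k=1)
one_to_two = build_training_set(samples, k=2)
```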
7. Significance, Intuition, and Implications
PROPER demonstrates that LLMs trained to generate symbolic Prolog predicates from natural language descriptions can avoid the cascading arithmetic errors that often affect stepwise Chain-of-Thought approaches. Offloading arithmetic to a deterministic Prolog interpreter means the numeric computation is exact whenever the generated program is correct. The permutation-based augmentation relies on the order-invariance of facts and rule bodies in logic programming and makes generation less sensitive to output order. A plausible implication is that this approach encourages the model to internalize mathematical relationships semantically rather than overfit to canonical output forms. These findings position PROPER as a principled bridge between neural LLMs and symbolic deductive reasoning for arithmetic problem-solving (Yang et al., 2024).