SLR-Bench: Automated ILP Benchmarking
- SLR-Bench is a benchmark for automated inductive logic programming tasks that employs a contamination-free, fully synthetic data generation process.
- It utilizes the LogicBencher pipeline to synthesize tasks through rule formalism, background fact generation, and deterministic Prolog-based validation.
- Empirical results show that logic-tuning with SLR-Bench notably enhances LLMs’ logical reasoning ability while exposing trade-offs in compute efficiency.
SLR-Bench is a systematically generated benchmark for evaluating and training LLMs on fully automated inductive logic programming (ILP) tasks with precisely controlled difficulty. Constructed via the SLR (Scalable Logical Reasoning) framework’s end-to-end LogicBencher pipeline, SLR-Bench comprises over 19,000 prompts distributed across 20 curriculum levels, each designed to probe progressively complex forms of relational, arithmetic, and recursive reasoning in LLMs. SLR-Bench is contamination-free by construction, requires no human annotation, and supports both zero-shot evaluation and systematic training of LLMs to advance their logical reasoning capabilities (Helff et al., 18 Jun 2025).
1. Generation Pipeline: Task Synthesis and Rule Formalism
SLR-Bench is instantiated from the LogicBencher pipeline, an end-to-end process consisting of three principal phases:
- Task Specification: Each ILP task is defined over a language given by a pair of a vocabulary and a grammar: the vocabulary provides the symbols (constants, predicates, and functions), and the grammar enforces semantic validity on generated ground atoms. A task configuration gives fine-grained control over rule sampling (uniform or LLM-guided), rule body length, background data distribution, and positive/negative label balance.
- ILP Task Synthesizer: Algorithm 1 synthesizes each task using three steps:
  - Rule Synthesis: Generates a definite Horn clause of the form $h \leftarrow b_1, \ldots, b_n$, with a single head atom $h$ and body atoms $b_i$.
  - Background Synthesis: Samples background facts and assigns each sample a positive or negative label through entailment checking.
  - Output Assembly: Returns the latent ground-truth rule, a validation program (encoding all facts and query atoms in Prolog), and a task prompt (in Prolog code or natural language).
- Training & Evaluation: Employs a Symbolic Judge that loads the validation program into a Prolog interpreter for deterministic hypothesis verification. Evaluation is based on two scores:
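  - Overall score: 1 if the hypothesis, together with the background facts, entails every positive example and no negative example, and 0 otherwise.
  - Partial score: the fraction of positive and negative examples that the hypothesis classifies correctly.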
The latent rules conform to definite clause logic, with predicates covering relational, arithmetic, and recursive constructs as complexity increases. Argument types include TRAIN, CAR, NUM, etc., grounded over finite discrete sets.
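The three synthesizer steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LogicBencher's implementation: the toy vocabulary, the one-literal rule shape, the predicate names (`eastbound`, `has_car`, `car_color`, `car_length`), and every function name are hypothetical, and a plain Python check stands in for Prolog entailment.

```python
import random
from dataclasses import dataclass

# Hypothetical toy vocabulary; the benchmark's actual predicates may differ.
COLORS = ["red", "blue", "green"]
LENGTHS = ["short", "long"]


@dataclass(frozen=True)
class Car:
    color: str
    length: str


@dataclass(frozen=True)
class Rule:
    """A one-literal rule: eastbound(T) :- has_car(T, C), car_<attr>(C, <value>)."""
    attr: str
    value: str

    def to_prolog(self) -> str:
        return f"eastbound(T) :- has_car(T, C), car_{self.attr}(C, {self.value})."


def sample_rule(rng: random.Random) -> Rule:
    # Step 1, rule synthesis: uniform sampling over the tiny rule space.
    attr = rng.choice(["color", "length"])
    value = rng.choice(COLORS if attr == "color" else LENGTHS)
    return Rule(attr, value)


def sample_train(rng: random.Random, n_cars: int = 2) -> tuple:
    return tuple(Car(rng.choice(COLORS), rng.choice(LENGTHS)) for _ in range(n_cars))


def entails(rule: Rule, train: tuple) -> bool:
    # Stand-in for Prolog entailment: some car satisfies the body literal.
    return any(getattr(car, rule.attr) == rule.value for car in train)


def synthesize_task(seed: int, n_pos: int = 2, n_neg: int = 2, max_tries: int = 10_000) -> dict:
    rng = random.Random(seed)
    rule = sample_rule(rng)
    pos, neg = [], []
    tries = 0
    # Step 2, background synthesis: label each sampled train by entailment
    # and keep sampling until both label buckets are full.
    while len(pos) < n_pos or len(neg) < n_neg:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not balance positive and negative examples")
        train = sample_train(rng)
        bucket, cap = (pos, n_pos) if entails(rule, train) else (neg, n_neg)
        if len(bucket) < cap:
            bucket.append(train)
    # Step 3, output assembly: the latent rule plus the labelled examples.
    return {"rule": rule, "positives": pos, "negatives": neg}


task = synthesize_task(seed=0)
print(task["rule"].to_prolog())
```

Because labels come from the entailment check rather than from annotation, every generated task is consistent with its latent rule by construction.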
2. Benchmark Construction and Curriculum Design
SLR-Bench is fully synthesized on demand and designed to guarantee novelty with negligible pre-training overlap:
- Contamination-Free Generation: Both rule and data generation are symbolic. Excluding from the test set any ground-truth rule seen in training ensures statistical novelty. No human annotation is involved at any step.
- Twenty-Level Curriculum: The benchmark is structured into 20 levels grouped into four complexity tiers—basic (1–5), easy (6–10), medium (11–15), and hard (16–20). Each level’s configuration is fixed via:
  - Number of constants,
  - Number of predicates,
  - Number of positive/negative examples,
  - Background sampling policy (mirror or uniform),
  - Rule body length,
  - Rule sampling strategy (uniform or partially LLM-guided).
Curriculum progression increases vocabulary size, the size of the combinatorial task space from level 1 to level 20, and logical depth, adding arithmetic and recursive reasoning at higher levels.
| Level Range | Tier | Key Parameter Evolutions |
|---|---|---|
| 1–5 | Basic | Small vocabulary, mirror backgrounds |
| 6–10 | Easy | LLM-guided rule generation (30%), vocabulary expansion |
| 11–15 | Medium | Arithmetic operations, more constants and predicates |
| 16–20 | Hard | Recursion, rule body length up to 5, uniform backgrounds |
Each level has approximately 1,000 training, 10 development, and 50 test prompts; early levels have fewer because their task space is limited.
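One way to represent the per-level configuration and the level-to-tier mapping is sketched below. The field names are illustrative assumptions, and the values in the usage example are placeholders chosen only to exercise the code; they are not the benchmark's actual settings (only the five-levels-per-tier split and the 30% LLM-guided share for the easy tier come from the text).

```python
from dataclasses import dataclass

TIERS = ("basic", "easy", "medium", "hard")


def tier_of(level: int) -> str:
    """Map a curriculum level (1-20) to its tier, five levels per tier."""
    if not 1 <= level <= 20:
        raise ValueError(f"level must be in 1..20, got {level}")
    return TIERS[(level - 1) // 5]


@dataclass(frozen=True)
class LevelConfig:
    # Hypothetical field names mirroring the parameters listed above.
    level: int
    n_constants: int
    n_predicates: int
    n_examples: int              # positive/negative examples per task
    background_policy: str       # "mirror" or "uniform"
    rule_length: int
    llm_guided_fraction: float   # share of rules sampled with LLM guidance

    def __post_init__(self):
        if self.background_policy not in ("mirror", "uniform"):
            raise ValueError("background_policy must be 'mirror' or 'uniform'")
        if not 0.0 <= self.llm_guided_fraction <= 1.0:
            raise ValueError("llm_guided_fraction must lie in [0, 1]")

    @property
    def tier(self) -> str:
        return tier_of(self.level)


# Placeholder values, for illustration only.
cfg = LevelConfig(level=7, n_constants=2, n_predicates=5, n_examples=4,
                  background_policy="mirror", rule_length=1,
                  llm_guided_fraction=0.3)
print(cfg.level, cfg.tier)
```

Freezing the dataclass matches the idea that each level's configuration is fixed once and reused for every task generated at that level.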
3. Validation, Evaluation Metrics, and Symbolic Judging
Validation in SLR-Bench is fully deterministic:
- Validation Programs: Each task includes a Prolog program with all background and query facts.
- Symbolic Judge: Hypotheses are tested for entailment using Prolog; success requires all positives entailed and all negatives refuted.
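The judging logic can be sketched as follows. This is a simplified stand-in, not the benchmark's judge: a Python predicate plays the role of the Prolog hypothesis, the function name `judge` is hypothetical, and the partial score is assumed here to be the fraction of correctly classified examples.

```python
from typing import Callable, Iterable, TypeVar

Example = TypeVar("Example")


def judge(hypothesis: Callable[[Example], bool],
          positives: Iterable[Example],
          negatives: Iterable[Example]) -> dict:
    """Score a hypothesis against labelled examples.

    overall: 1 only if every positive is entailed and every negative refuted.
    partial: fraction of examples classified correctly (an assumption).
    """
    pos, neg = list(positives), list(negatives)
    total = len(pos) + len(neg)
    if total == 0:
        raise ValueError("need at least one example")
    correct = sum(1 for e in pos if hypothesis(e)) + sum(1 for e in neg if not hypothesis(e))
    return {"overall": int(correct == total), "partial": correct / total}


# Toy data: each train is a tuple of car colours.
positives = [("red", "blue"), ("green", "red")]
negatives = [("blue", "green")]

good = judge(lambda train: "red" in train, positives, negatives)
bad = judge(lambda train: "blue" in train, positives, negatives)
print(good, bad)
```

The all-or-nothing score mirrors the success criterion above, while a graded score of this kind gives a denser signal when a hypothesis is only partly right.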
Evaluation employs comprehensive metrics:
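- Logical-Reasoning Level (LRL): $\mathrm{LRL} = \sum_{\ell=1}^{20} \mathrm{Acc}_\ell$, the sum of per-level accuracies $\mathrm{Acc}_\ell \in [0, 1]$ over the 20 curriculum levels (max 20).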
- Syntax Score: Proportion of outputs parsable as valid Prolog clauses.
- Tiered Logical-Reasoning Accuracy: Accuracy is disaggregated by tier.
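As a sanity check on how these metrics relate, the sketch below assumes that LRL is the sum of per-level accuracies (consistent with its maximum of 20) and that each tier accuracy is the mean over that tier's five levels, so that LRL equals five times the sum of the four tier accuracies. Both assumptions and the function names are mine; the input numbers are the Llama results reported in Section 5.

```python
LEVELS_PER_TIER = 5  # 20 levels split evenly into 4 tiers


def lrl_from_levels(level_accuracies):
    """Sum of per-level accuracies, each in [0, 1]; the maximum is 20."""
    if len(level_accuracies) != 20:
        raise ValueError("expected 20 per-level accuracies")
    return sum(level_accuracies)


def lrl_from_tiers(tier_accuracies):
    """LRL implied by four tier means (assumed to average five levels each)."""
    if len(tier_accuracies) != 4:
        raise ValueError("expected 4 tier accuracies")
    return LEVELS_PER_TIER * sum(tier_accuracies)


# (tier accuracies [basic, easy, medium, hard], LRL stated in the text)
reported = {
    "zero-shot": ([0.82, 0.17, 0.01, 0.01], 5.0),
    "FFT": ([0.92, 0.77, 0.17, 0.02], 9.4),
    "LoRA": ([0.95, 0.57, 0.15, 0.01], 8.4),
}
for name, (tiers, stated_lrl) in reported.items():
    print(f"{name}: computed {lrl_from_tiers(tiers):.2f}, stated {stated_lrl}")
```

Under these assumptions the computed values agree with the stated LRL figures to within rounding of the tier percentages.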
4. Large-Scale LLM Benchmarking and Empirical Findings
Seventeen LLMs, ranging from general-purpose models (GPT-4o, Llama-3-70B) to reasoning-specialized variants (OpenAI o1/o3/o4-mini, DeepSeek-R1), are evaluated zero-shot on a 600-task subset. Empirical results include:
- Syntax Mastery: Most models achieve >85% syntactic validity, indicating strong acquisition of rule formatting.
- Semantic Reasoning Deficit: On medium/hard tiers, generic LLMs fall below 20% accuracy; specialized models sustain >60% on easy and ~40% on medium but also drop below 20% on hard.
- Compute-Efficiency Tradeoff: Reasoning-optimized models generate up to 15,000 completion tokens per query (vs. ~500 for base models), yielding LRL improvements of 5–8 at 20× computational cost. Diminishing returns are observed for inference costs beyond ~5,000 tokens.
5. Logic-Tuning Effects in Llama-3-8B
Extensive logic-tuning experiments are conducted on Llama-3.1-8B using the SLR-Bench training set (16,000 tasks, disjoint from test). Two regimes are analyzed: full fine-tuning (FFT) and parameter-efficient LoRA.
- Pre-Tuning (Zero-Shot): LRL=5.0, syntax=87%, accuracy by tier (basic, easy, medium, hard): [82%, 17%, 1%, 1%].
- Post-FFT: LRL=9.4 (+4.4), syntax=100%, accuracy: [92%, 77%, 17%, 2%].
- Post-LoRA: LRL=8.4 (+3.4), syntax=100%, accuracy: [95%, 57%, 15%, 1%].
- Token Efficiency: Inference requires ~0.04M tokens per query (<1% of o3’s 4.3M; ~1% of Gemini-Flash-Thinking) while matching their LRL (9.4 vs 8.6).
- Generalization: FFT boosts easy-tier accuracy by 60 percentage points (17%→77%), indicating systematic generalization to unseen ground-truth rules.
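The per-tier gains and the token ratio quoted above can be recomputed directly from the listed figures. The snippet is plain arithmetic on the reported numbers; the variable and function names are illustrative.

```python
TIERS = ("basic", "easy", "medium", "hard")

# Tier accuracies in percent, as listed above.
pre = (82, 17, 1, 1)
fft = (92, 77, 17, 2)
lora = (95, 57, 15, 1)


def gains_pp(before, after):
    """Absolute gain per tier, in percentage points."""
    return {tier: a - b for tier, b, a in zip(TIERS, before, after)}


fft_gain = gains_pp(pre, fft)
lora_gain = gains_pp(pre, lora)

# Token budgets in millions, as quoted: tuned Llama vs. o3.
token_ratio = 0.04 / 4.3

print(fft_gain)
print(lora_gain)
print(f"token ratio: {token_ratio:.2%}")
```

Reporting gains in percentage points avoids confusion with relative improvement: the easy tier rises by 60 points, which corresponds to roughly a 4.5-fold relative increase from 17%.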
6. Significance, Impact, and Implications
SLR-Bench, instantiated from the LogicBencher pipeline, establishes a systematic, scalable, contamination-free environment for both benchmarking and training LLMs on complex, curriculum-driven logical reasoning tasks. The benchmark’s automation, lack of human annotation, and guaranteed novelty position it as a rigorous diagnostic of inductive reasoning beyond rote pattern completion. Evaluation on contemporary LLMs reveals substantial gaps between syntactic rule generation and robust logical inference, even for models advertised as “reasoning-specialized.” The improvement in accuracy and efficiency realized through SLR-driven logic-tuning, particularly for Llama-3-8B, suggests that structured logic curricula may be essential for bridging the symbolic-reasoning gap in large generative models (Helff et al., 18 Jun 2025).