SLR-Bench: Automated ILP Benchmarking

Updated 9 May 2026

SLR-Bench is a benchmark for automated inductive logic programming tasks that employs a contamination-free, fully synthetic data generation process.
It utilizes the LogicBencher pipeline to synthesize tasks through rule formalism, background fact generation, and deterministic Prolog-based validation.
Empirical results show that logic-tuning with SLR-Bench notably enhances LLMs’ logical reasoning ability while exposing trade-offs in compute efficiency.

SLR-Bench is a systematically generated benchmark for evaluating and training LLMs on fully automated inductive logic programming (ILP) tasks with precisely controlled difficulty. Constructed via the SLR (Scalable Logical Reasoning) framework’s end-to-end LogicBencher pipeline, SLR-Bench comprises over 19,000 prompts distributed across 20 curriculum levels, each designed to probe progressively complex forms of relational, arithmetic, and recursive reasoning in LLMs. SLR-Bench is contamination-free by construction, requires no human annotation, and supports both zero-shot evaluation and systematic training of LLMs to advance their logical reasoning capabilities (Helff et al., 18 Jun 2025).

1. Generation Pipeline: Task Synthesis and Rule Formalism

SLR-Bench is instantiated from the LogicBencher pipeline, an end-to-end process consisting of three principal phases:

Task Specification: Each ILP task is defined by the tuple $L = (V, G)$, where $V$ provides the symbols (constants, predicates, and functions) and $G$ is a grammar enforcing semantic validity on generated ground atoms. A configuration $\Theta = \langle R_\text{sample}, R_\text{len}, B_\pi, \kappa \rangle$ gives fine-grained control over rule sampling (uniform or LLM-guided), rule body length, background data distribution, and positive/negative label balance.
ILP Task Synthesizer: Algorithm 1 synthesizes each task using three steps:
- Rule Synthesis: Generates a definite Horn clause $R^*$ of the form $h(\cdots) \mathrel{\text{:-}} b_1(\cdots), \ldots, b_{R_\text{len}}(\cdots)$.
- Background Synthesis: Samples background facts and assigns positive or negative labels ($\kappa_+, \kappa_-$) through entailment checking.
- Output Assembly: Returns the latent ground-truth rule $R^*$, a validation program $P_\text{val}$ (encoding all facts and query atoms in Prolog), and a task prompt (in Prolog code or natural language).
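Training & Evaluation: Employs a Symbolic Judge that loads $P_\text{val}$ into a Prolog interpreter for deterministic hypothesis verification. Evaluation is based on two scores:
- an overall score, which is 1 if the hypothesis entails every positive example and no negative example, and 0 otherwise;
- a partial score, the fraction of positive and negative examples that the hypothesis classifies correctly.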

The latent rules $R^*$ conform to definite clause logic, with predicates covering relational, arithmetic, and recursive constructs as complexity increases. Argument types include TRAIN, CAR, NUM, etc., grounded over finite discrete sets.
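The three synthesizer steps can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the predicates `has_car` and `car_color`, the target `eastbound`, the colour constants, and the latent rule itself are hypothetical, and a plain Python check stands in for the Prolog entailment test used by the actual pipeline.

```python
import random

COLORS = ["red", "blue", "green"]  # hypothetical constants

# Hypothetical latent rule R*, kept as Prolog text for the task record.
LATENT_RULE = "eastbound(T) :- has_car(T, C), car_color(C, red)."


def sample_train(idx, rng):
    """Background synthesis: one train with 1-3 cars, each with a colour."""
    n_cars = rng.randint(1, 3)
    cars = {f"car{idx}_{j}": rng.choice(COLORS) for j in range(n_cars)}
    return {"name": f"train{idx}", "cars": cars}


def rule_holds(train):
    """Python stand-in for checking whether R* entails eastbound(train)."""
    return "red" in train["cars"].values()


def background_facts(train):
    """Render one train's background knowledge as Prolog-style facts."""
    facts = []
    for car, color in train["cars"].items():
        facts.append(f"has_car({train['name']}, {car}).")
        facts.append(f"car_color({car}, {color}).")
    return facts


def synthesize_task(kappa_pos, kappa_neg, seed=0):
    """Sample trains until kappa_pos positives and kappa_neg negatives exist."""
    rng = random.Random(seed)
    positives, negatives, facts = [], [], []
    idx = 0
    while len(positives) < kappa_pos or len(negatives) < kappa_neg:
        train = sample_train(idx, rng)
        idx += 1
        if rule_holds(train):
            bucket, quota = positives, kappa_pos
        else:
            bucket, quota = negatives, kappa_neg
        if len(bucket) < quota:  # surplus samples are discarded
            bucket.append(f"eastbound({train['name']})")
            facts.extend(background_facts(train))
    return {
        "rule": LATENT_RULE,
        "background": facts,
        "positives": positives,
        "negatives": negatives,
    }


task = synthesize_task(kappa_pos=3, kappa_neg=2, seed=1)
print(len(task["positives"]), len(task["negatives"]))
```

Rejection sampling is used here only because it is the simplest way to hit a fixed label balance; the source does not state how the pipeline reaches its $\kappa_+$ and $\kappa_-$ targets.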

2. Benchmark Construction and Curriculum Design

SLR-Bench is fully synthesized on demand and designed to guarantee novelty with negligible pre-training overlap:

Contamination-Free Generation: Both rule and data generation are symbolic. The exclusion of any ground-truth rule $R^*$ seen in training from the test set ensures statistical novelty. No human annotation is involved at any step.
Twenty-Level Curriculum: The benchmark is structured into 20 levels grouped into four complexity tiers—basic (1–5), easy (6–10), medium (11–15), and hard (16–20). Each level’s configuration is fixed via:
- Number of constants,
- Number of predicates,
- Number of positive/negative examples ($\kappa$),
- Background sampling policy ($B_\pi$, mirror or uniform),
- Rule body length ($R_\text{len}$),
- Rule sampling strategy ($R_\text{sample}$, uniform or partially LLM-guided).

Curriculum progression increases vocabulary size, task combinatorics (the number of possible tasks grows from level 1 to level 20), and logical depth, adding arithmetic and recursive reasoning at higher levels.

Level Range	Tier	Key Parameter Evolutions
1–5	Basic	Small vocabulary, mirror backgrounds
6–10	Easy	LLM-guided rule generation (30%), vocabulary expansion
11–15	Medium	Arithmetic operations, more constants/predicates
16–20	Hard	Recursion, rule body length $R_\text{len}$ up to 5, uniform backgrounds

Prompts per level: ~1,000 training, 10 development, 50 test. Early levels have fewer due to limited task space.
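One way to picture the per-level configuration is as a small record type. The sketch below is illustrative only: the class `LevelConfig`, its field names, and the example values are hypothetical and not taken from the benchmark; only the tier boundaries and the per-level prompt counts come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelConfig:
    """Hypothetical container for one curriculum level's fixed parameters."""
    level: int
    n_constants: int
    n_predicates: int
    kappa: tuple      # (number of positives, number of negatives)
    b_pi: str         # background sampling policy: "mirror" or "uniform"
    r_len: int        # rule body length
    r_sample: str     # rule sampling: "uniform" or "llm_guided"


def tier_of(level):
    """Map a level in 1..20 to its complexity tier."""
    if not 1 <= level <= 20:
        raise ValueError("level must be between 1 and 20")
    return ("basic", "easy", "medium", "hard")[(level - 1) // 5]


def max_prompts(levels=20, train=1000, dev=10, test=50):
    """Upper bound on the prompt count if every level were fully populated."""
    return levels * (train + dev + test)


# Placeholder values, chosen only to show the shape of a configuration.
example = LevelConfig(level=1, n_constants=5, n_predicates=5, kappa=(1, 1),
                      b_pi="mirror", r_len=1, r_sample="uniform")
print(tier_of(example.level), max_prompts())
```

The upper bound of 21,200 prompts is consistent with the reported total of over 19,000, given that early levels contain fewer prompts.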

3. Validation, Evaluation Metrics, and Symbolic Judging

The validation for SLR-Bench is performed in a fully deterministic manner:

Validation Programs: Each task includes a Prolog program $P_\text{val}$ with all background and query facts.
Symbolic Judge: Hypotheses are tested for entailment using Prolog; success requires all positives entailed and all negatives refuted.
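The success criterion can be sketched in Python as follows. This is a stand-in, not the benchmark's judge: the real judge runs the hypothesis in a Prolog interpreter, whereas here a hypothetical callable `entails` plays that role, and the partial-credit value returned alongside the binary result is an assumption about how a graded score could be computed.

```python
def judge(entails, positives, negatives):
    """Score a hypothesis against labelled examples.

    entails: callable returning True if the hypothesis (together with the
             background knowledge) derives the given example.
    Returns (overall, partial): overall is 1 only if every positive is
    entailed and every negative is refuted; partial is the fraction of
    examples classified correctly.
    """
    total = len(positives) + len(negatives)
    if total == 0:
        raise ValueError("at least one example is required")
    correct_pos = sum(1 for q in positives if entails(q))
    correct_neg = sum(1 for q in negatives if not entails(q))
    overall = int(correct_pos == len(positives) and correct_neg == len(negatives))
    partial = (correct_pos + correct_neg) / total
    return overall, partial


# Hypothetical examples: each train is described by its car colours.
positives = [{"colors": ["red", "blue"]}, {"colors": ["red"]}]
negatives = [{"colors": ["blue"]}]

def exact_rule(train):
    return "red" in train["colors"]   # matches the labels exactly

def overly_general_rule(train):
    return True                       # accepts every train

print(judge(exact_rule, positives, negatives))
print(judge(overly_general_rule, positives, negatives))
```

An overly general hypothesis still earns partial credit for the positives it covers, but fails the overall check because it does not refute the negative example.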

Evaluation employs comprehensive metrics:

Logical-Reasoning Level (LRL): $\text{LRL} = \sum_{\ell=1}^{20} \frac{\#\text{solved}_\ell}{\#\text{tasks}_\ell}$ (max 20), the sum of per-level solve rates over the 20 curriculum levels.
Syntax Score: Proportion of outputs parsable as valid Prolog clauses.
Tiered Logical-Reasoning Accuracy: Accuracy is disaggregated by tier.
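A minimal sketch of how these three metrics could be computed from per-task results is given below. It assumes that LRL is the sum of per-level solve rates, which is consistent with the stated maximum of 20; the function names and the `(level, solved)` input format are hypothetical.

```python
from collections import defaultdict

TIERS = ("basic", "easy", "medium", "hard")


def logical_reasoning_level(results):
    """Sum of per-level solve rates; results is a list of (level, solved)."""
    solved, total = defaultdict(int), defaultdict(int)
    for level, ok in results:
        total[level] += 1
        solved[level] += int(ok)
    return sum(solved[lvl] / total[lvl] for lvl in total)


def syntax_score(valid_flags):
    """Proportion of outputs that parsed as valid Prolog clauses."""
    return sum(valid_flags) / len(valid_flags)


def tier_accuracy(results):
    """Accuracy per tier, with levels 1-5, 6-10, 11-15, 16-20."""
    solved, total = defaultdict(int), defaultdict(int)
    for level, ok in results:
        tier = TIERS[(level - 1) // 5]
        total[tier] += 1
        solved[tier] += int(ok)
    return {tier: solved[tier] / total[tier] for tier in total}


# Toy results: level 1 fully solved, level 2 half solved, level 6 unsolved.
results = [(1, True), (1, True), (2, True), (2, False), (6, False)]
print(logical_reasoning_level(results))
print(tier_accuracy(results))
print(syntax_score([True, True, False, True]))
```

Because LRL adds solve rates level by level, a model that solves every task up to level k and nothing beyond scores exactly k, which makes the number readable as "the level the model has reached".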

4. Large-Scale LLM Benchmarking and Empirical Findings

Seventeen LLMs, ranging from general-purpose models (GPT-4o, Llama-3-70B) to reasoning-specialized variants (OpenAI o1/o3/o4-mini, DeepSeek-R1), are evaluated zero-shot on a 600-task subset. Empirical results include:

Syntax Mastery: Most models achieve >85% syntactic validity, indicating strong acquisition of rule formatting.
Semantic Reasoning Deficit: On medium/hard tiers, generic LLMs fall below 20% accuracy; specialized models sustain >60% on easy and ~40% on medium but also drop below 20% on hard.
Compute-Efficiency Tradeoff: Reasoning-optimized models generate up to 15,000 completion tokens per query (vs. ~500 for base models), yielding LRL improvements of 5–8 at 20× computational cost. Diminishing returns are observed for inference costs beyond ~5,000 tokens.

5. Logic-Tuning Effects in Llama-3-8B

Extensive logic-tuning experiments are conducted on Llama-3.1-8B using the SLR-Bench training set (16,000 tasks, disjoint from test). Two regimes are analyzed: full fine-tuning (FFT) and parameter-efficient LoRA.

Pre-Tuning (Zero-Shot): LRL=5.0, syntax=87%, accuracy (basic, easy, medium, hard): [82%, 17%, 1%, 1%].
Post-FFT: LRL=9.4 (+4.4), syntax=100%, accuracy: [92%, 77%, 17%, 2%].
Post-LoRA: LRL=8.4 (+3.4), syntax=100%, accuracy: [95%, 57%, 15%, 1%].
Token Efficiency: Inference requires ~0.04M tokens per query (<1% of o3’s 4.3M; ~1% of Gemini-Flash-Thinking) while matching their LRL (9.4 vs 8.6).
Generalization: FFT boosts easy-tier accuracy by 60 percentage points (17%→77%), indicating systematic generalization to unseen ground-truth rules.
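The reported figures can be cross-checked with a few lines of arithmetic. The sketch assumes that LRL is the sum of per-level accuracies and that each tier accuracy is the mean over its five equally sized levels, so that LRL is roughly five times the sum of the four tier accuracies; both are assumptions, not statements from the source.

```python
# Reported (LRL, [basic, easy, medium, hard] accuracy) per regime.
reported = {
    "zero-shot": (5.0, [0.82, 0.17, 0.01, 0.01]),
    "FFT": (9.4, [0.92, 0.77, 0.17, 0.02]),
    "LoRA": (8.4, [0.95, 0.57, 0.15, 0.01]),
}


def lrl_from_tiers(tier_acc, levels_per_tier=5):
    """Reconstruct LRL from tier accuracies under the equal-size assumption."""
    return levels_per_tier * sum(tier_acc)


for name, (lrl, tiers) in reported.items():
    print(f"{name}: reported {lrl}, reconstructed {lrl_from_tiers(tiers):.2f}")

# Token budget of the tuned model relative to o3 (0.04M vs 4.3M tokens).
token_ratio = 0.04 / 4.3
print(f"token ratio: {token_ratio:.2%}")

# Easy-tier gain from full fine-tuning, in percentage points.
easy_gain = round((reported["FFT"][1][1] - reported["zero-shot"][1][1]) * 100)
print(f"easy-tier gain: {easy_gain} points")
```

Under these assumptions the reconstructed values agree with the reported LRLs to within rounding, and the token ratio comes out just under 1%, matching the figure quoted above.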

6. Significance, Impact, and Implications

SLR-Bench, instantiated from the LogicBencher pipeline, establishes a systematic, scalable, contamination-free environment for both benchmarking and training LLMs on complex, curriculum-driven logical reasoning tasks. The benchmark’s automation, lack of human annotation, and guaranteed novelty position it as a rigorous diagnostic of inductive reasoning beyond rote pattern completion. Evaluation on contemporary LLMs reveals substantial gaps between syntactic rule generation and robust logical inference, even for models advertised as “reasoning-specialized.” The dramatic improvement in accuracy and efficiency realized through SLR-driven logic-tuning, particularly for Llama-3-8B, suggests that structured logic curricula may be essential for bridging the symbolic-reasoning gap in large generative models (Helff et al., 18 Jun 2025).

References

Helff et al. (2025). SLR: An Automated Synthesis Framework for Scalable Logical Reasoning.
