
SLR-Bench: Automated ILP Benchmarking

Updated 9 May 2026
  • SLR-Bench is a benchmark for automated inductive logic programming tasks that employs a contamination-free, fully synthetic data generation process.
  • It utilizes the LogicBencher pipeline to synthesize tasks through rule formalism, background fact generation, and deterministic Prolog-based validation.
  • Empirical results show that logic-tuning with SLR-Bench notably enhances LLMs’ logical reasoning ability while exposing trade-offs in compute efficiency.

SLR-Bench is a systematically generated benchmark for evaluating and training LLMs on fully automated inductive logic programming (ILP) tasks with precisely controlled difficulty. Constructed via the SLR (Scalable Logical Reasoning) framework’s end-to-end LogicBencher pipeline, SLR-Bench comprises over 19,000 prompts distributed across 20 curriculum levels, each designed to probe progressively complex forms of relational, arithmetic, and recursive reasoning in LLMs. SLR-Bench is contamination-free by construction, requires no human annotation, and supports both zero-shot evaluation and systematic training of LLMs to advance their logical reasoning capabilities (Helff et al., 18 Jun 2025).

1. Generation Pipeline: Task Synthesis and Rule Formalism

SLR-Bench is instantiated from the LogicBencher pipeline, an end-to-end process consisting of three principal phases:

  1. Task Specification: Each ILP task is defined by the tuple $L = (V, G)$, where $V$ provides symbols (constants, predicates, and functions) and $G$ serves as a grammar enforcing semantic validity on generated ground atoms. Configurations parameterized as $\Theta = \langle R_\text{sample}, R_\text{len}, B_\pi, \kappa \rangle$ allow fine-grained control over rule sampling (uniform or LLM-guided), rule body length, background data distribution, and positive/negative label balance.
  2. ILP Task Synthesizer: Algorithm 1 synthesizes each task using three steps:
    • Rule Synthesis: Generates a definite Horn clause $R^*$ of the form $h(\cdots) \mathrel{:-} b_1(\cdots), \ldots, b_{R_\text{len}}(\cdots)$.
    • Background Synthesis: Samples background facts and assigns positive or negative labels ($\kappa_+$, $\kappa_-$) through entailment checking.
    • Output Assembly: Returns the latent ground-truth rule $R^*$, a validation program $P_\text{val}$ (encoding all facts and query atoms in Prolog), and a task prompt (in Prolog code or natural language).
  3. Training & Evaluation: Employs a Symbolic Judge that loads $P_\text{val}$ into a Prolog interpreter for deterministic hypothesis verification. Evaluation is based on two scores:
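    • Overall score: 1 if the hypothesis, together with the background knowledge, entails every positive example and no negative example, and 0 otherwise.
    • Partial score: the fraction of positive and negative examples that the hypothesis classifies correctly.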
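
The three synthesizer steps can be illustrated with a minimal Python sketch. It is not the LogicBencher implementation: the toy vocabulary (`eastbound`, `has_car`, `car_color`, `car_len`), the helper names, and the rejection-sampling loop are all assumptions made for illustration, and entailment is reduced to a simple attribute check instead of a Prolog query.

```python
import random
from dataclasses import dataclass

# Hypothetical toy vocabulary (an assumption, not the benchmark's actual one).
ATTRIBUTES = {"car_color": ["red", "blue", "green"], "car_len": ["short", "long"]}


@dataclass(frozen=True)
class Rule:
    """Latent rule: eastbound(T) :- has_car(T, C), attr(C, value), ..."""
    conditions: tuple  # ((predicate, value), ...) constraints on the car

    def to_prolog(self) -> str:
        body = ["has_car(T, C)"] + [f"{p}(C, {v})" for p, v in self.conditions]
        return "eastbound(T) :- " + ", ".join(body) + "."

    def holds_for(self, car: dict) -> bool:
        # Stand-in for entailment checking: every condition must match.
        return all(car[p] == v for p, v in self.conditions)


def synthesize_rule(rng, n_conditions):
    """Step 1, rule synthesis: sample distinct attribute conditions uniformly."""
    preds = rng.sample(sorted(ATTRIBUTES), n_conditions)
    return Rule(tuple((p, rng.choice(ATTRIBUTES[p])) for p in preds))


def synthesize_task(seed, n_conditions=1, n_pos=2, n_neg=2):
    if not 1 <= n_conditions <= len(ATTRIBUTES):
        raise ValueError("n_conditions must be between 1 and len(ATTRIBUTES)")
    rng = random.Random(seed)
    rule = synthesize_rule(rng, n_conditions)

    # Step 2, background synthesis: sample cars and label them by the rule
    # until the requested numbers of positives and negatives are reached.
    pos, neg = [], []
    while len(pos) < n_pos or len(neg) < n_neg:
        car = {p: rng.choice(vals) for p, vals in sorted(ATTRIBUTES.items())}
        bucket, quota = (pos, n_pos) if rule.holds_for(car) else (neg, n_neg)
        if len(bucket) < quota:
            bucket.append(car)

    # Step 3, output assembly: ground facts plus labeled query atoms.
    facts, examples = [], []
    labeled = [(c, True) for c in pos] + [(c, False) for c in neg]
    for i, (car, label) in enumerate(labeled):
        facts.append(f"has_car(train{i}, car{i}).")
        facts += [f"{p}(car{i}, {v})." for p, v in sorted(car.items())]
        examples.append((f"eastbound(train{i})", label))
    return rule, facts, examples


rule, facts, examples = synthesize_task(seed=0)
print(rule.to_prolog())
print("\n".join(facts))
print(examples)
```

Seeding the generator makes each task reproducible, which mirrors the deterministic character of the pipeline described above.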

The latent rules $R^*$ conform to definite clause logic, with predicates covering relational, arithmetic, and recursive constructs as complexity increases. Argument types include TRAIN, CAR, NUM, etc., grounded over finite discrete sets.

2. Benchmark Construction and Curriculum Design

SLR-Bench is fully synthesized on demand and designed to guarantee novelty with negligible pre-training overlap:

  • Contamination-Free Generation: Both rule and data generation are symbolic. Any ground-truth rule $R^*$ seen in training is excluded from the test set, which ensures that test tasks are novel. No human annotation is involved at any step.
  • Twenty-Level Curriculum: The benchmark is structured into 20 levels grouped into four complexity tiers—basic (1–5), easy (6–10), medium (11–15), and hard (16–20). Each level’s configuration is fixed via:
    • Number of constants,
    • Number of predicates,
    • Number of positive/negative examples ($\kappa$),
    • Background sampling policy ($B_\pi$, mirror or uniform),
    • Rule body length ($R_\text{len}$),
    • Rule sampling strategy ($R_\text{sample}$, uniform or partially LLM-guided).
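
A per-level configuration of this kind could be represented as a small validated record. The sketch below is an assumption-laden illustration: the class name `LevelConfig`, its field names, and the example values are hypothetical and are not taken from the benchmark's actual configuration files. Only the tier boundaries (1–5, 6–10, 11–15, 16–20) and the allowed policy names come from the text.

```python
from dataclasses import dataclass

TIERS = ("basic", "easy", "medium", "hard")


def tier_of(level: int) -> str:
    """Map a curriculum level (1..20) to its complexity tier."""
    if not 1 <= level <= 20:
        raise ValueError(f"level must be in 1..20, got {level}")
    return TIERS[(level - 1) // 5]


@dataclass(frozen=True)
class LevelConfig:
    """Hypothetical container for one level's fixed parameters."""
    level: int
    n_constants: int
    n_predicates: int
    kappa: tuple            # (positive examples, negative examples)
    background_policy: str  # "mirror" or "uniform"
    rule_len: int           # rule body length
    rule_sampling: str      # "uniform" or "llm_guided"

    def __post_init__(self):
        tier_of(self.level)  # raises if the level is out of range
        if self.background_policy not in ("mirror", "uniform"):
            raise ValueError("unknown background policy")
        if self.rule_sampling not in ("uniform", "llm_guided"):
            raise ValueError("unknown rule sampling strategy")
        if self.rule_len < 1 or min(self.kappa) < 1:
            raise ValueError("rule_len and kappa entries must be positive")

    @property
    def tier(self) -> str:
        return tier_of(self.level)


# Illustrative values only; the real level-1 settings are not given here.
example = LevelConfig(level=1, n_constants=1, n_predicates=5, kappa=(1, 1),
                      background_policy="mirror", rule_len=1,
                      rule_sampling="uniform")
print(example.tier)
```

Freezing the dataclass keeps a level's configuration immutable once constructed, matching the statement that each level's configuration is fixed.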

Curriculum progression increases vocabulary size, the combinatorial size of the task space, and logical depth from level 1 to level 20, adding arithmetic and recursive reasoning at higher levels.

Level Range | Tier   | Key Parameter Evolutions
----------- | ------ | ------------------------
1–5         | Basic  | Small vocabulary, mirror backgrounds
6–10        | Easy   | LLM-guided rule generation (30%), vocabulary expansion
11–15       | Medium | Arithmetic operations, more constants and predicates
16–20       | Hard   | Recursion, rule body length $R_\text{len}$ up to 5, uniform backgrounds

Prompts per level: ~1,000 training, 10 development, 50 test. Early levels have fewer due to limited task space.

3. Validation, Evaluation Metrics, and Symbolic Judging

Validation in SLR-Bench is fully deterministic:

  • Validation Programs: Each task includes a Prolog program $P_\text{val}$ with all background and query facts.
  • Symbolic Judge: Hypotheses are tested for entailment using Prolog; success requires all positives entailed and all negatives refuted.
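
The pass/fail criterion can be sketched without a Prolog interpreter for the simplest case. The code below is a toy stand-in, not the benchmark's Symbolic Judge: it handles only a single non-recursive definite clause over ground facts, and the atom encoding `(predicate, args)`, the function names, and the extra "fraction correct" field are assumptions made for illustration. A real judge would load the validation program into Prolog instead.

```python
def is_var(term: str) -> bool:
    """Prolog convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def solve(body, facts, binding):
    """Yield variable bindings that satisfy every body atom against ground facts."""
    if not body:
        yield binding
        return
    (pred, args), rest = body[0], body[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        new = dict(binding)
        for a, f in zip(args, fact_args):
            if is_var(a):
                if new.setdefault(a, f) != f:  # conflicting binding
                    break
            elif a != f:                       # constant mismatch
                break
        else:
            yield from solve(rest, facts, new)


def entails(rule, facts, query) -> bool:
    """True if the single clause `rule` plus `facts` proves the ground `query`."""
    (head_pred, head_args), body = rule
    query_pred, query_args = query
    if head_pred != query_pred or len(head_args) != len(query_args):
        return False
    binding = {}
    for a, q in zip(head_args, query_args):
        if is_var(a):
            if binding.setdefault(a, q) != q:
                return False
        elif a != q:
            return False
    return next(solve(body, facts, binding), None) is not None


def judge(rule, facts, positives, negatives) -> dict:
    """Success requires all positives entailed and all negatives refuted."""
    correct = sum(entails(rule, facts, q) for q in positives)
    correct += sum(not entails(rule, facts, q) for q in negatives)
    total = len(positives) + len(negatives)
    return {"success": correct == total, "fraction_correct": correct / total}


facts = [("has_car", ("t1", "c1")), ("car_color", ("c1", "red")),
         ("has_car", ("t2", "c2")), ("car_color", ("c2", "blue"))]
hypothesis = (("eastbound", ("T",)),
              [("has_car", ("T", "C")), ("car_color", ("C", "red"))])
print(judge(hypothesis, facts, [("eastbound", ("t1",))], [("eastbound", ("t2",))]))
```

Because the check is a pure function of the hypothesis and the facts, repeated evaluations of the same hypothesis always give the same verdict.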

Evaluation reports three metrics:

  • Logical-Reasoning Level (LRL): $\text{LRL} = \sum_{\ell=1}^{20} \text{Acc}_\ell$, the sum of per-level accuracies $\text{Acc}_\ell$ over all 20 levels (max 20).
  • Syntax Score: Proportion of outputs parsable as valid Prolog clauses.
  • Tiered Logical-Reasoning Accuracy: Accuracy is disaggregated by tier.
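
These metrics are simple aggregates and can be sketched in a few lines of Python. The sketch assumes that LRL is the sum of per-level accuracies, which is consistent with its stated maximum of 20, and that each tier's accuracy is the mean over its five levels. The function names and the sample data are hypothetical.

```python
TIERS = ("basic", "easy", "medium", "hard")


def logical_reasoning_level(acc_by_level: dict) -> float:
    """Sum of per-level accuracies; 20 levels at 100% would give 20."""
    return sum(acc_by_level.values())


def tiered_accuracy(acc_by_level: dict) -> dict:
    """Mean accuracy per tier, with levels 1-5, 6-10, 11-15, 16-20."""
    buckets = {tier: [] for tier in TIERS}
    for level, acc in acc_by_level.items():
        buckets[TIERS[(level - 1) // 5]].append(acc)
    return {tier: sum(v) / len(v) for tier, v in buckets.items() if v}


def syntax_score(is_valid_clause: list) -> float:
    """Proportion of outputs that parsed as valid Prolog clauses."""
    return sum(is_valid_clause) / len(is_valid_clause)


# Made-up accuracies for illustration: perfect on basic, 50% on easy, 0 beyond.
acc = {lvl: 1.0 for lvl in range(1, 6)}
acc.update({lvl: 0.5 for lvl in range(6, 11)})
acc.update({lvl: 0.0 for lvl in range(11, 21)})

print(logical_reasoning_level(acc))
print(tiered_accuracy(acc))
print(syntax_score([True, True, False, True]))
```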

4. Large-Scale LLM Benchmarking and Empirical Findings

Seventeen LLMs, ranging from general-purpose models (GPT-4o, Llama-3-70B) to reasoning-specialized variants (OpenAI o1/o3/o4-mini, DeepSeek-R1), are evaluated zero-shot on a 600-task subset. Empirical results include:

  • Syntax Mastery: Most models achieve >85% syntactic validity, indicating strong acquisition of rule formatting.
  • Semantic Reasoning Deficit: On medium/hard tiers, generic LLMs fall below 20% accuracy; specialized models sustain >60% on easy and ~40% on medium but also drop below 20% on hard.
  • Compute-Efficiency Tradeoff: Reasoning-optimized models generate up to 15,000 completion tokens per query (vs. ~500 for base models), yielding LRL improvements of 5–8 at 20× computational cost. Diminishing returns are observed for inference costs beyond ~5,000 tokens.

5. Logic-Tuning Effects in Llama-3-8B

Extensive logic-tuning experiments are conducted on Llama-3.1-8B using the SLR-Bench training set (16,000 tasks, disjoint from test). Two regimes are analyzed: full fine-tuning (FFT) and parameter-efficient LoRA.

  • Pre-Tuning (Zero-Shot): LRL=5.0, syntax=87%, accuracy by tier (basic, easy, medium, hard): [82%, 17%, 1%, 1%].
  • Post-FFT: LRL=9.4 (+4.4), syntax=100%, accuracy: [92%, 77%, 17%, 2%].
  • Post-LoRA: LRL=8.4 (+3.4), syntax=100%, accuracy: [95%, 57%, 15%, 1%].
  • Token Efficiency: Inference requires ~0.04M tokens per query (<1% of o3’s 4.3M; ~1% of Gemini-Flash-Thinking) while matching their LRL (9.4 vs 8.6).
  • Generalization: FFT boosts easy-tier accuracy by 60 percentage points (17%→77%), indicating systematic generalization to unseen ground-truth rules.
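
The reported LRL values can be cross-checked against the tier accuracies with a short calculation. It assumes that LRL is the sum of per-level accuracies and that each tier's figure is the mean over its five levels, so that LRL equals five times the sum of the four tier accuracies. Both assumptions are inferred here and are not spelled out in the list above.

```python
# Tier accuracies (basic, easy, medium, hard) and LRL as reported above.
reported = {
    "zero-shot": ([0.82, 0.17, 0.01, 0.01], 5.0),
    "FFT": ([0.92, 0.77, 0.17, 0.02], 9.4),
    "LoRA": ([0.95, 0.57, 0.15, 0.01], 8.4),
}


def lrl_from_tiers(tier_acc, levels_per_tier=5):
    """LRL under the assumption of equal-sized tiers of five levels each."""
    return levels_per_tier * sum(tier_acc)


for name, (tier_acc, lrl) in reported.items():
    # Print the recomputed value next to the reported one for comparison.
    print(name, round(lrl_from_tiers(tier_acc), 2), lrl)

# Easy-tier gain from full fine-tuning, in percentage points.
easy_gain = (reported["FFT"][0][1] - reported["zero-shot"][0][1]) * 100
print(round(easy_gain))
```

Under these assumptions the recomputed values agree with the reported LRL figures to within rounding of the tier percentages.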

6. Significance, Impact, and Implications

SLR-Bench, instantiated from the LogicBencher pipeline, establishes a systematic, scalable, contamination-free environment for both benchmarking and training LLMs on complex, curriculum-driven logical reasoning tasks. The benchmark’s automation, lack of human annotation, and guaranteed novelty position it as a rigorous diagnostic of inductive reasoning beyond rote pattern completion. Evaluation on contemporary LLMs reveals substantial gaps between syntactic rule generation and robust logical inference, even for models advertised as “reasoning-specialized.” The improvement in accuracy and token efficiency realized through SLR-driven logic-tuning, particularly for Llama-3-8B, suggests that structured logic curricula may be essential for bridging the symbolic-reasoning gap in large generative models (Helff et al., 18 Jun 2025).
