LogicLLM: Logical Reasoning for LLMs
- LogicLLM is a self-supervised framework that boosts logical reasoning in large language models by mining and reconstructing logically consistent pairs from unstructured text.
- It employs a generative auto-regressive objective with counterfactual data augmentation to enable multi-step deduction without the need for annotated logic data.
- Experiments show improved accuracy on benchmarks like ReClor and LogiQA-v2, while preserving overall language understanding across standard tasks.
LogicLLM is a self-supervised, post-training framework designed to improve the logical reasoning capacity of LLMs such as FLAN-T5 and LLaMA, using only unannotated, naturally occurring text. Unlike approaches that depend on supervised fine-tuning or logic-specific annotation, LogicLLM relies on mining logically consistent patterns from unstructured corpora (notably Wikipedia) and training LLMs using a generative, auto-regressive objective. Its efficacy is demonstrated through substantial gains on logic benchmarks (ReClor, LogiQA-v2) without negative impact on general language understanding tasks.
1. Motivation and Conceptual Design
LogicLLM addresses the inherent deficits of LLMs in formal logical reasoning. Traditional pretraining and instruction tuning have not conferred strong abilities in multi-step deduction or composition of relational knowledge, and models show poor sample efficiency on logical reasoning benchmarks relative to their performance on general NLP tasks. To improve compositional generalization without sacrificing language skills, LogicLLM introduces a self-supervised task that directly targets the backbone of symbolic reasoning: the ability to relate and generate logically consistent statements across direct and indirect paraphrasing in unstructured text.
2. Self-Supervised Logic-Oriented Data Mining
The core insight is that corpora such as Wikipedia, with their dense relational content, often encode the same real-world relationship in several forms:
- Direct relations: Statements where a relationship between two entities $e_1$ and $e_2$ is stated outright.
- Indirect relations: Multi-hop chains connecting $e_1$ to $e_2$ via one or more intermediary entities.
LogicLLM systematically mines these cases, constructing training pairs where the same entity tuple $(e_1, e_2)$ appears both as a direct mention and as a composition over intermediate entities. Fuzzy logical consistency is assumed (i.e., multiple surface expressions can approximate the same inference). A counterfactual data augmentation step then replaces entities in the relations with alternatives, producing logic-structure-preserving but content-shuffled examples that enforce logic-based rather than entity-based generalization.
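The sketch below illustrates this pipeline in miniature. It is a simplified, hypothetical stand-in for the paper's Wikipedia mining stage (the triples, sentence templates, and entity pool are invented for illustration): it pairs a direct statement with a multi-hop statement over the same entity tuple, then applies the counterfactual entity swap.

```python
import random

# Toy stand-in for the mined corpus: triples of
# (head_entity, sentence_template, tail_entity).
DIRECT = [
    ("Marie Curie", "{h} was awarded the Nobel Prize in {t}.", "Physics"),
]
INDIRECT = [
    # A two-hop chain linking the same head and tail via an intermediary.
    ("Marie Curie",
     "{h} discovered polonium. Polonium research earned her the Nobel "
     "Prize in {t}.",
     "Physics"),
]
ENTITY_POOL = ["Albert Einstein", "Niels Bohr", "Ada Lovelace"]

def mine_pairs(direct, indirect):
    """Pair each indirect (multi-hop) statement with a direct statement
    that connects the same (head, tail) entity tuple."""
    index = {(h, t): s for h, s, t in direct}
    pairs = []
    for h, s_ind, t in indirect:
        if (h, t) in index:
            pairs.append((index[(h, t)].format(h=h, t=t),
                          s_ind.format(h=h, t=t)))
    return pairs

def counterfactual(pair, head, pool):
    """Swap the shared head entity for a random alternative, preserving
    logical structure while breaking entity-level memorization."""
    swap = random.choice([e for e in pool if e != head])
    return tuple(s.replace(head, swap) for s in pair)

pairs = mine_pairs(DIRECT, INDIRECT)
augmented = [counterfactual(p, "Marie Curie", ENTITY_POOL) for p in pairs]
```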
3. Training Objective and Procedure
LogicLLM utilizes an auto-regressive generative objective, diverging from contrastive or discriminative paradigms (e.g., MERIt). For each logic-consistent pair $(x, y)$, the model is trained to generate $y$ given $x$, and vice versa, using standard next-token prediction:

$$\mathcal{L}_{\text{logic}} = -\log P_\theta(y \mid x) - \log P_\theta(x \mid y),$$

with explicit token-level unrolling:

$$-\log P_\theta(y \mid x) = -\sum_{t=1}^{|y|} \log P_\theta(y_t \mid y_{<t}, x).$$

To prevent catastrophic forgetting of general language knowledge, the total loss also includes standard language modeling,

$$\mathcal{L} = \mathcal{L}_{\text{logic}} + \mathcal{L}_{\text{LM}},$$

where batches interleave logic-augmented and ordinary samples.
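A minimal sketch of this combined objective, assuming a causal LM and the Hugging Face transformers API (gpt2 stands in for FLAN-T5/LLaMA; the separator and example sentences are illustrative, not from the paper):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def logic_loss(x, y):
    """Next-token loss for generating y conditioned on x. Tokens of the
    conditioning text x are masked with -100 so only y contributes."""
    ids_x = tok(x + " ", return_tensors="pt").input_ids
    ids_y = tok(y, return_tensors="pt").input_ids
    input_ids = torch.cat([ids_x, ids_y], dim=1)
    labels = input_ids.clone()
    labels[:, : ids_x.shape[1]] = -100
    return model(input_ids=input_ids, labels=labels).loss

s_dir = "Marie Curie was awarded the Nobel Prize in Physics."
s_ind = ("Marie Curie discovered polonium. Polonium research earned "
         "her the Nobel Prize in Physics.")

# Both directions of the pair, plus an ordinary LM loss so that
# logic-augmented and plain samples are interleaved.
plain = tok("Ordinary corpus text preserves general language skills.",
            return_tensors="pt").input_ids
lm_loss = model(input_ids=plain, labels=plain).loss
total = logic_loss(s_dir, s_ind) + logic_loss(s_ind, s_dir) + lm_loss
total.backward()
```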
4. Integration and Model Compatibility
Training is implemented as a continual meta-learning step atop models such as FLAN-T5 and LLaMA, with parameter sizes from 3B to 33B and beyond (QLoRA is used for memory-efficient training on large models). No task-specific annotation, logic templates, or external supervision is required. The only input is a natural text corpus, from which logical pairs are extracted. The learning process is robust to additional instruction tuning data: multitask setups combining LogicLLM and instruction data further enhance logic-specific and general capabilities.
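For concreteness, a hypothetical QLoRA configuration for this continual training stage might look as follows, using the transformers and peft libraries; the base checkpoint, rank, and target modules are illustrative defaults, not hyperparameters reported for LogicLLM.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",               # stand-in LLaMA checkpoint
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)        # only adapter weights train
model.print_trainable_parameters()
```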
5. Experimental Results and Ablation Findings
LogicLLM demonstrates clear improvements on logic benchmarks: on ReClor, FLAN-T5-11B rises from 59.9% to 61.1% test accuracy, and LLaMA-13B from 33.5% to 36.3%. On LogiQA-v2, improvements are smaller but consistent. No regression is observed on language understanding benchmarks (RACE, MMLU, BBH), and in some cases performance improves.
Key ablation findings are:
- The generative, auto-regressive objective is crucial; contrastive objectives as in MERIt confer negligible benefit in this (in-context) setting.
- Removing counterfactual augmentation reduces performance gains; entity-swapping supports stronger logic-based generalization.
- LogicLLM synergizes with instruction tuning: combining meta-training with instruction datasets yields further improvements, particularly for natural language-heavy logic benchmarks.
- LogicLLM enhances robustness to superficial input perturbations (e.g., shuffling of answer choices), indicating less reliance on positional or lexical artifacts; a minimal perturbation sketch follows this list.
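The helper below sketches the choice-shuffling probe; the function and its interface are illustrative, not taken from the paper. Comparing accuracy before and after the permutation isolates reliance on option position.

```python
import random

def shuffle_choices(question, choices, answer_idx, seed=0):
    """Permute the answer options of a multiple-choice item and return
    the new gold index, so accuracy can be compared before and after
    the perturbation."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return question, shuffled, order.index(answer_idx)

q, opts, gold = shuffle_choices(
    "Which option follows from the premises?",
    ["A only", "B only", "A and B", "Neither"],
    answer_idx=2,
)
```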
6. Activation of Logical Reasoning and In-Context Use
LogicLLM-trained models are evaluated by providing one half of a logic-consistent pair as in-context demonstration, requiring the model to generate the other—effectively demanding logical composition and paraphrase. This activates logical knowledge via in-context learning, reflecting the nature of many real reasoning tasks (e.g., reconstructing implicit relationships, abstracting over multi-step inferences).
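A sketch of how such a one-shot pair-completion prompt could be assembled (the template wording is an illustrative assumption, not the paper's exact format):

```python
def build_prompt(demo_pair, query_statement):
    """One-shot prompt: show a direct/indirect logic-consistent pair as
    the demonstration, then ask the model to produce the consistent
    multi-hop form of a new statement."""
    demo_direct, demo_indirect = demo_pair
    return (
        f"Statement: {demo_direct}\n"
        f"Consistent reasoning: {demo_indirect}\n\n"
        f"Statement: {query_statement}\n"
        f"Consistent reasoning:"
    )

prompt = build_prompt(
    ("Marie Curie was awarded the Nobel Prize in Physics.",
     "Marie Curie discovered polonium. Polonium research earned her "
     "the Nobel Prize in Physics."),
    "Alan Turing formalized computation.",
)
```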
Empirically, models so trained are less susceptible to semantic drift and more able to handle logic-specific prompt requirements. On the practical side, the method does not degrade general capabilities—unlike some prior domain-specialized fine-tuning.
7. Broader Implications and Context in Logic-Enhanced LLMs
LogicLLM exemplifies a new class of logic-enhanced, annotation-free training techniques that:
- Scale naturally with increasing model and data size, since no labeled data or specially curated logic benchmarks are needed for training.
- Are compatible with both open- and closed-source LLMs as long as their training APIs support continual or parameter-efficient adaptation.
- Can be layered with instruction or multi-task fine-tuning, permitting modular upgrades to logical reasoning without compromising existing skills.
Comparatively, LogicLLM outperforms contrastive logic representation learning (MERIt) and offers a task-agnostic, compositional backbone for logic capability, complementary to logic-specific approaches that rely on deductive proofs or synthetic logic corpora.
Summary Table: LogicLLM Key Features and Results
| Feature | Description |
|---|---|
| Training setup | Self-supervised (Wikipedia/mined corpora, no annotation) |
| Objective | Generative, auto-regressive; reconstruct logical paraphrase given context |
| Data augmentation | Entity counterfactuals to enforce logic over memorization |
| Model families supported | FLAN-T5, LLaMA, others with autoregressive LM interface |
| Evaluation benchmarks | ReClor, LogiQA-v2, RACE, MMLU, BBH |
| Main empirical result | Consistent gains on logic benchmarks (e.g., +1.2 points on ReClor for FLAN-T5-11B, +2.8 for LLaMA-13B); competitive with ChatGPT |
| Effect on general ability | Neutral or slight improvement |
LogicLLM demonstrates that purely self-supervised, structurally motivated training can yield substantial improvements in the logical reasoning of LLMs, provided the learning objective matches both the generative and compositional nature of real-world logic. The approach establishes a scalable, annotation-free route toward robust logic capabilities in foundation models, directly leveraging the richness of raw text corpora to instill genuine reasoning ability.