
LogiCoT: Logical Chain-of-Thought Instruction-Tuning

  • The paper presents a novel instruction-tuning dataset that leverages GPT-4 chain-of-thought exemplars to boost multi-step logical reasoning, achieving a +16 pp gain on LogiEval.
  • The methodology trains a decoder-only transformer (LLaMA-7b) using structured symbolic prompts to generate explicit, rule-linked inference chains.
  • The evaluation shows significant improvements on benchmarks such as LogiQA and MMLU, reflecting enhanced formal inference and reasoning capabilities.

LogiCoT (Logical Chain-of-Thought Instruction-Tuning) is a large-scale instruction-tuning methodology and dataset designed to enhance the logical reasoning abilities of LLMs. Unlike generic self-instruction pipelines that optimize for broad proficiency in everyday natural language tasks, LogiCoT explicitly targets multi-step, formally grounded inference. By harvesting chain-of-thought (CoT) reasoning exemplars from GPT-4 and applying them to open-source models via targeted instruction sets, LogiCoT substantially advances community-accessible LLM competence in logical domains (Liu et al., 2023).

1. Motivation and Background

Existing self-instruction approaches, exemplified by Alpaca and Vicuna, use “instruction → response” pairs for generic tasks such as summarization and paraphrasing. Although these datasets lead to community models displaying GPT-3.5-level performance on broadly scoped benchmarks, they are inadequate for tasks demanding structured, multi-step reasoning, entailment, or symbolic logic. Community-tuned models exhibit marked deficits on puzzles, explicit entailment chains, and formal logic problems, even under explicit step-by-step prompting (e.g., “Let’s think step by step”). This deficiency stems from the absence of deeply elaborated CoT exemplars in their training data, a gap that LogiCoT directly remedies.

GPT-4’s robust zero- and few-shot reasoning capabilities, especially when prompted to generate detailed CoT rationales for logic, mathematics, and inference chains, underpin LogiCoT’s data enrichment strategy. The pivotal insight is that such high-fidelity CoT rationales can be harvested and used to elevate the reasoning capabilities of smaller, open-source transformer models (Liu et al., 2023).

2. Dataset Construction

LogiCoT is constructed by curating and expanding existing logic and comprehension corpora, formulating specialized instructions for GPT-4, and harvesting its stepwise rationale outputs. Primary data sources include:

  • LogicInference (propositional and first-order logic chains)
  • EntailmentBank (science questions with human entailment trees)
  • FOLIO (expert-written first-order logic annotations)
  • ReClor and LogiQA (LSAT-style verbal reasoning)

Instruction types span symbolic and natural inference:

  • Language→Logic (semantic formalization, e.g., “∀x (Square(x) → FourSides(x))”)
  • One-Step Inference (deduce all possible conclusions from premises, name each inference rule)
  • Inference Chain (construct an explicit stepwise proof, identifying applied rules)
  • Multi-choice MRC (argument assumption, strengthening/weakening reasoning, error analysis)

Prompts are assembled into JSON records with an “instruction” field, an “input” field, and (optionally) gold labels for the output, as in the sketch below. GPT-4 responses are filtered manually to remove incomplete or erroneous rationales.
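As a concrete illustration, a single record might look like the following. This is a hypothetical example assembled from the schema described above; the "output" key name and the exact wording are assumptions, not verbatim dataset content.

```python
import json

# Hypothetical LogiCoT-style record. "instruction" and "input" are the field
# names described above; "output" as the gold-label key is an assumption.
record = {
    "instruction": "Deduce what can be inferred from the premises in one "
                   "step, naming the inference rule you apply.",
    "input": "∀x (Square(x) → FourSides(x)). Square(a).",
    "output": "FourSides(a), by universal instantiation followed by modus ponens.",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```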

Dataset Statistics:

  • Total instances: 68,983
  • Language→Logic: 13,206
  • One-Step Inference: 23,943
  • Inference Chain: 26,228
  • Multi-Choice MRC: 5,606

Quality audits by human annotators yield high scores: relevance 4.9, coherence 4.7, completeness 4.5, faithfulness 4.5 (all on 1–5 scales), and Cohen's κ = 0.87 for inter-annotator agreement (Liu et al., 2023).

3. Instruction-Tuning Methodology

The LogiCoT instruction-tuning procedure deploys a decoder-only transformer, specifically LLaMA-7b, targeting the joint generation of rationale and answer tokens conditioned on the concatenation of instruction and input. Training minimizes the negative log-likelihood over these outputs using a standard causal language-modeling objective, as sketched below.
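A minimal sketch of this objective, assuming a Hugging Face-style causal LM; the checkpoint name, prompt template, and example strings are placeholders, not the paper's exact preprocessing:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder decoder-only checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Instruction: ...\nInput: ...\nResponse: "            # instruction + input
target = "From p → q and ¬q, by modus tollens, ¬p. Answer: ¬p"  # rationale + answer

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt",
                       add_special_tokens=False).input_ids

input_ids = torch.cat([prompt_ids, target_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # supervise only rationale + answer tokens

loss = model(input_ids=input_ids, labels=labels).loss  # mean NLL over target tokens
loss.backward()
```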

Key training hyperparameters:

  • Batch size: 4
  • Learning rate: 2 × 10⁻⁵
  • Epochs: 2
  • Compute: 2 × NVIDIA A100 GPUs (with DeepSpeed)
  • Training duration: ~4 days
  • Model output: LLaMA-7b-logicot
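
These settings map naturally onto a standard Hugging Face trainer configuration. The sketch below is an assumption about that mapping, not the authors' actual training script; paths and precision settings are placeholders:

```python
from transformers import TrainingArguments

# Sketch: the reported hyperparameters expressed as TrainingArguments.
args = TrainingArguments(
    output_dir="llama-7b-logicot",   # reported output model name, reused as a path
    per_device_train_batch_size=4,   # batch size: 4
    learning_rate=2e-5,              # learning rate: 2 × 10⁻⁵
    num_train_epochs=2,              # epochs: 2
    # deepspeed="ds_config.json",    # placeholder config for the 2 × A100 setup
)
```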

Prompts and responses maintain rigorous symbolic notation (∀x (P(x) ↔ Q(x)), p → q, ¬r) and explicitly cite reasoning rules such as modus tollens or biconditional elimination (Liu et al., 2023).

4. Benchmark Evaluation and Results

Evaluation employs the LogiEval suite (eight logical reasoning and multi-choice NLI/MRC tasks: LogiQA 2.0 EN/zh, ReClor, AR-LSAT, LogiQA OOD, ConTRoL, HELP, and TaxiNLI) and the general-task MMLU benchmark (57 subjects).
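For context on how accuracy is computed on such multi-choice tasks, a toy scorer is sketched below; the "Answer: X" extraction pattern is an assumption, not LogiEval's documented parsing rule.

```python
import re

# Toy accuracy scorer for multi-choice tasks such as those in LogiEval.
# Assumes generations end with a line like "Answer: B" (illustrative only).
def extract_choice(generation: str) -> str | None:
    match = re.search(r"Answer:\s*([A-E])", generation)
    return match.group(1) if match else None

def accuracy(generations: list[str], gold: list[str]) -> float:
    correct = sum(extract_choice(g) == y for g, y in zip(generations, gold))
    return 100.0 * correct / len(gold)

print(accuracy(["... Answer: B", "... Answer: D"], ["B", "C"]))  # prints 50.0
```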

Model Performance (Accuracy, %):

Model                  LogiEval   MMLU
LLaMA-7b-base            20.22    31.8
LLaMA-7b-logicot         40.69    43.3
LLaMA-30b-supercot       24.78    —
Falcon-40b-instruct      23.63    —
ChatGPT                  47.46    —
GPT-4                    60.56    —

(— = not reported)

Ablation studies confirm substantial degradation when any instruction type is removed (e.g., omitting “Inference Chain” reduces LogiEval accuracy to 30.8%). Gains are observed across both logical and general benchmarks, with LogiCoT yielding a +16 pp gain over the best prior open-source result on LogiEval and a +11.5 pp gain over the base model on MMLU (Liu et al., 2023).

5. Qualitative Analysis of Reasoning

LogiCoT-tuned models generate multi-step, explicitly rule-linked rationales for logical and argumentative tasks. Outputs preserve all symbolic notation and procedural naming found in the exemplars. In contrast, earlier instruction-tuned open models (SuperCoT, Falcon-instruct) often deliver only surface-level inferences, omit intermediate steps, or fail to identify reasoning rules.

Correct LogiCoT responses demonstrate chain-of-thought competence in strengthening arguments (e.g., referencing both the key assumption and passage content) and in constructing multi-step logical inference chains. Failure modes include occasional hallucinated inference steps or invented constraints not present in the problem (reflected in the 4.5/5 faithfulness score), with notable weaknesses on passage-level Chinese MRC and five-way AR-LSAT tasks (Liu et al., 2023).

6. Impact, Limitations, and Future Directions

LogiCoT represents, per the original authors, the first major instruction dataset specifically devoted to logical chain-of-thought processes combining formal logic and argumentative reasoning. When applied to a 7B-parameter base model, LogiCoT-tuning yields marked performance improvements on specialist benchmarks and robust gains on broad task suites.

Limitations remain, including trailing proprietary GPT-4 on certain benchmarks, limited faithfulness of some harvested GPT-4 chains, and challenging adaptation to multi-choice structures outside the training distribution. Planned enhancements include improved data curation (automated filtering, human-in-the-loop), scaling to larger models (LLaMA-13B/30B), extension to dialogue or interactive chain-of-thought settings, and integration with symbolic solvers to further ground model reasoning in formal logic (Liu et al., 2023).

All LogiCoT data and trained models are publicly released to facilitate continued research into high-fidelity logical proficiency for open-access LLMs.

References

  • Liu, H., Teng, Z., Cui, L., Zhang, C., Zhou, Q., and Zhang, Y. (2023). LogiCoT: Logical Chain-of-Thought Instruction Tuning. Findings of EMNLP 2023. arXiv:2305.12147.