LogiEval: Benchmark for LLM Logical Reasoning
- LogiEval is a domain-agnostic benchmark that rigorously evaluates logical reasoning in LLMs using tasks derived from high-stakes human examinations.
- It categorizes reasoning into deductive, inductive, analogical, and abductive types with diverse formats to prevent overfitting and promote generalization.
- LogiEval-Hard isolates scale-invariant challenges by identifying items where even strong small LLMs fail, highlighting inherent reasoning bottlenecks.
LogiEval is a holistic benchmark for evaluating the logical reasoning abilities of LLMs in a domain-agnostic fashion. Designed to rigorously assess models' capabilities beyond domain-specific reasoning, LogiEval comprises a diverse set of reasoning types and task formats derived from high-quality human examination materials. It evaluates deductive, inductive, analogical, and abductive reasoning through a large-scale, curated item bank, and introduces LogiEval-Hard—a diagnostic subset for isolating scale-invariant reasoning bottlenecks in state-of-the-art LLMs. This benchmark provides unprecedented coverage of logical inference patterns and offers nuanced comparative results between machines and human solvers (Liu et al., 17 May 2025).
1. Motivation and Design Rationale
The core objective of LogiEval is to provide a comprehensive and domain-independent testbed for logical reasoning in LLMs—a need unmet by traditional benchmarks focused on mathematics, coding, world knowledge, or narrowly scoped inference patterns (e.g., Modus Ponens alone). Unlike these, LogiEval draws its task pool directly from human high-stakes examinations such as the LSAT, GMAT, Chinese Civil Service, and standardized IQ/aptitude tests, ensuring the benchmark—by construction—cannot be solved through domain-specific heuristics or memorized world facts. This design separates "true" logical reasoning from rote pattern-matching, pushing models to generalize across diverse question styles, languages, and challenge types. Broad reasoning coverage minimizes the risk of superficially high scores due to overfitting on single-format or domain datasets.
2. Reasoning Taxonomy and Format Diversity
LogiEval organizes its tasks along both reasoning category and task format axes. The four canonical types assessed are:
- Deductive reasoning: inferring conclusions that must be true given the premises. Formally, for premises $P_1, \dots, P_n$ and conclusion $C$, if $P_1 \wedge \cdots \wedge P_n \models C$, then $C$ is a valid deduction (a worked sketch follows this list).
- Inductive reasoning: extrapolating general rules from observed instances, e.g., learning statistical regularities or clusters. Expressed as $O_1, O_2, \dots, O_n \Rightarrow R$, where $R$ is a rule and $O_1, \dots, O_n$ are the observed instances.
- Analogical reasoning: establishing relational correspondences between distinct domains via mapping functions $f: S \to T$ that preserve relations over objects, i.e., $R(a, b) \Rightarrow R(f(a), f(b))$.
- Abductive reasoning: inferring the most plausible hypothesis $H$ for observed evidence $E$, i.e., an $H$ such that $H \wedge B \models E$ given background knowledge $B$.
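To ground the deductive definition, here is a minimal, illustrative sketch (not part of LogiEval's released tooling) that checks $P_1 \wedge \cdots \wedge P_n \models C$ for propositional premises by brute-force truth-table enumeration:

```python
from itertools import product

def entails(premises, conclusion, variables):
    """Return True iff every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([True, False], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises hold but the conclusion fails
    return True

# Modus Ponens: {rain -> wet, rain} |= wet
premises = [lambda e: (not e["rain"]) or e["wet"],  # rain -> wet
            lambda e: e["rain"]]                     # rain
print(entails(premises, lambda e: e["wet"], ["rain", "wet"]))  # True: valid deduction
```

Enumeration is exponential in the number of atoms, so this only illustrates validity rather than offering a general proof procedure; the inductive, analogical, and abductive categories have no comparably simple mechanical check.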
Each category is distributed across ten core question formats, including argument analysis (multi-choice), syllogism (multi-choice), artificial language coding/decoding, situational judgment, and others. LogiEval covers a spectrum from two-choice textual entailment to five- and six-choice analogical and abductive questions, counterbalancing formats to avoid gaming via template recognition.
| Reasoning Category | Major Formats (Count) | Sample Choice Counts |
|---|---|---|
| Deductive | textual entailment, syllogism | 2/3/4/5/6-choice |
| Inductive | definition match, odd one out | 3/4-choice |
| Analogical | artificial language, sequences | 4/5-choice |
| Abductive | argument analysis, situational | 4/5-choice |
See (Liu et al., 17 May 2025), Table 1 for full format and item distributions.
3. Dataset Construction, Annotation, and Calibration
LogiEval collects 6,235 instances from the publicly available practice pools of seven competitive human examinations. The selection pipeline:
- Task selection and deduplication: Extract all logic-focused questions, remove near-duplicate/redundant and non-logic items (e.g., math, vocabulary).
- Categorical annotation: Assign reasoning type and task format using a hybrid strategy—automatic prediction via a small reasoning model (Qwen3-30B-A3B), confirmed or corrected by three expert annotators (inter-rater agreement: Fleiss’ κ = 0.72).
- Metadata and difficulty calibration: Retain original human test metadata (item difficulty, passing rates, gold-standard explanations) to facilitate model–human performance gap quantification.
Preservation of original exam language (bilingual English/Chinese) ensures the dataset is not culturally or linguistically skewed. Item-level annotations enable granular, format- and difficulty-stratified evaluation.
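The annotation and calibration fields described above map naturally onto a per-item record. The following sketch is a hypothetical schema; the field names are illustrative and are not taken from the released dataset:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LogiEvalItem:
    """Hypothetical record for one benchmark item (field names illustrative)."""
    item_id: str
    question: str
    choices: List[str]               # 2 to 6 options, depending on format
    answer: str                      # gold label, e.g. "B"
    reasoning_type: str              # deductive | inductive | analogical | abductive
    task_format: str                 # e.g. "syllogism", "argument analysis"
    language: str                    # "en" or "zh"
    source_exam: str                 # e.g. "LSAT", "GMAT", "Chinese Civil Service"
    human_pass_rate: Optional[float] = None   # retained human-exam metadata
    difficulty: Optional[float] = None
    explanation: Optional[str] = None          # gold-standard rationale, if available
```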
4. Evaluation Protocol and Performance Metrics
The benchmark is administered with minimal, template-based prompts (temperature 0.7, 16k-token limit), and answers are extracted via regular-expression matching on model outputs. Primary metrics:
- Accuracy: $\mathrm{Acc} = \frac{N_{\text{correct}}}{N_{\text{total}}}$
- Human-model gap: $\Delta = \mathrm{Acc}_{\text{human}} - \mathrm{Acc}_{\text{model}}$
- Statistical testing: Fisher’s exact test for model–human differences; Wilson score confidence intervals for accuracy (a minimal protocol sketch follows this list).
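The sketch below illustrates this protocol under stated assumptions: `generate(prompt)` is a generic hook wrapping the model call, and the prompt template, answer-extraction regex, and Wilson-interval helper are illustrative rather than the authors' exact implementation:

```python
import math
import re

ANSWER_RE = re.compile(r"answer\s*(?:is|:)?\s*\(?([A-F])\)?", re.IGNORECASE)

def extract_answer(model_output: str):
    """Pull the final choice letter out of a free-form completion (illustrative regex)."""
    matches = ANSWER_RE.findall(model_output)
    return matches[-1].upper() if matches else None

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """95% Wilson score interval for an accuracy estimate."""
    if total == 0:
        return (0.0, 0.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - half, center + half)

def evaluate(items, generate, human_accuracy=None):
    """Run a template-prompt evaluation and report accuracy plus the gap vs. humans."""
    correct = 0
    for item in items:
        prompt = (
            f"{item['question']}\n"
            + "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(item["choices"]))
            + "\nAnswer with a single letter."
        )
        # generate() is an assumed hook (temperature 0.7, 16k-token limit per the protocol)
        pred = extract_answer(generate(prompt))
        correct += int(pred == item["answer"])
    acc = correct / len(items)
    gap = None if human_accuracy is None else human_accuracy - acc
    return {"accuracy": acc, "wilson_95ci": wilson_interval(correct, len(items)), "human_model_gap": gap}
```

Because sampling at temperature 0.7 makes per-item correctness stochastic, interval estimates and repeated trials (as used in the hardness screening described in Section 5) are informative.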
Model evaluations on the full test set show leading LLMs achieving 78.7–81.4% overall accuracy, with DeepSeek-R1 topping the list at 81.41%. Crucially, performance is format-dependent: argument analysis (4-choice) is close to saturation (around 81% for the best models, against a human average of 85.2%), while syllogisms and certain analogical formats remain much harder (best syllogism accuracy: 73.4%; others near random).
An error cluster appears in abductive formats and situational judgment tasks, where more than 18% of items are missed by every model. Model–human differences are also non-monotonic: items that confound humans are sometimes solved correctly by all models, and vice versa.
5. LogiEval-Hard: Hardness Screening and Diagnostic Subset
LogiEval-Hard is defined as the subset of items where a strong small LLM (Qwen3-30B-A3B) fails in at least 2 out of 3 independent inference trials. This set comprises 1,617 items (≈26% of the main testbed), with a roughly proportional breakdown across reasoning types and formats.
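The screening rule translates directly into a filter. The sketch below is illustrative and assumes a hypothetical `run_trial(model, item)` hook that samples one response and reports whether it was correct:

```python
def is_hard(item, model, run_trial, n_trials: int = 3, fail_threshold: int = 2):
    """An item enters LogiEval-Hard if the screening model fails it in
    at least `fail_threshold` of `n_trials` independent sampled runs."""
    failures = sum(1 for _ in range(n_trials) if not run_trial(model, item))
    return failures >= fail_threshold

def build_hard_subset(items, model, run_trial):
    """Filter the full item bank down to the scale-invariant hard subset."""
    return [item for item in items if is_hard(item, model, run_trial)]
```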
When larger LLMs (e.g., Gemini2.0-Flash) are evaluated on LogiEval-Hard, a precipitous drop is observed: accuracy averages 37.97% (near random guessing), with key failures in syllogism (16.00%), blood relations (22.73%), and textual entailment (28.00%). These results are consistent across contemporary architectures and highlight generalization bottlenecks that persist even at scale. Analysis indicates that LogiEval-Hard functions as a diagnostic tool for identifying fundamental model weaknesses unrelated to dataset memorization or superficial pattern-matching.
| Task Format / Category | Gemini2.0-Flash Accuracy (LogiEval-Hard) |
|---|---|
| Syllogism (100) | 16.00% |
| Blood Relations | 22.73% |
| Essential Part | 100% |
| Deductive | 35.66% |
| Abductive | 45.61% |
Data excerpted from (Liu et al., 17 May 2025), Table 3.
6. Comparative Perspective and Benchmark Positioning
LogiEval distinguishes itself from benchmarks like ReClor (LSAT multi-choice), CLUTRR (kinship/relational reasoning), PrOntoQA and LogicNLI (synthetic entailment), and LogicBench (manual rule-pattern enumeration) by aggregating a wide swath of real-world exam logic questions spanning reasoning categories, languages, and structures. Unlike single-type datasets, LogiEval’s format coverage and bilingual sourcing prevent overfitting to stylized logic templates and domain-specific knowledge artifacts.
Moreover, LogiEval-Hard is unique among logical reasoning benchmarks in its explicit identification of items that elude both small and large model architectures, serving as an empirical lower-bound on current LLM reasoning capacities.
7. Prospects and Further Directions
LogiEval signals several clear research avenues:
- Expanding coverage: Extending to multimodal, visually grounded logical reasoning to probe cross-modal transfer.
- Process-level evaluation: Developing process-aware metrics, such as proof verification and chain-of-thought (CoT) validation, rather than relying solely on final-answer accuracy.
- Curriculum learning applications: Leveraging LogiEval-Hard as a scale-invariant diagnostic for curriculum construction and targeted model improvement.
- Disentangling pattern recognition from true inference: model overperformance on “essential part” identification suggests that future metrics should stress items that cannot be solved by pattern- or frequency-matching alone.
By combining cross-category coverage, exam-grade authenticity, and explicit difficulty stratification, LogiEval and LogiEval-Hard constitute pivotal resources for the scientific study of logical reasoning in LLMs and are set to become a gold standard for assessing genuine advances in model inference capabilities (Liu et al., 17 May 2025).