PrOntoQA: Synthetic QA for Formal Reasoning
- PrOntoQA is a synthetic QA dataset that uses well-defined first-order logic ontologies to rigorously evaluate chain-of-thought reasoning in language models.
- Its construction employs algorithmic ontology sampling and systematic proof generation to produce unambiguous FOL proofs and control deduction complexity.
- Empirical analyses show that while LLMs excel at local deductive steps, they struggle with global proof planning in multi-hop reasoning scenarios.
PrOntoQA is a synthetic question-answering dataset purpose-built for the systematic, formal evaluation of LLM reasoning via interpretable first-order logic (FOL) ontologies. Introduced in "Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought" (Saparov & He, 2022), PrOntoQA provides a controlled environment in which each example is generated from a well-defined symbolic world model, enabling rigorous, stepwise analysis of chain-of-thought (CoT) reasoning in LLMs. Distinct from common-sense and real-world knowledge benchmarks, PrOntoQA emphasizes formal deductive reasoning, proof planning, and the diagnosis of error modes in LLM-generated proofs.
1. Formal Ontological World Model
At the core of PrOntoQA lies a compact, synthetic FOL ontology. Each ontology is defined by the following components:
- Signature: A countable set of individual constants (e.g., fae, sally, alex) and a finite set of unary predicate symbols (e.g., $\text{cat}$, $\text{carnivore}$, $\text{herbivorous}$), together with their negations $\lnot P$.
- Logical Language: Formula shapes are restricted to atomic membership assertions $P(c)$ and universal implications ("subtype rules") of the form $\forall x\, (P(x) \to Q(x))$.
- Ontology as a Directed Acyclic Graph (DAG): Nodes are predicates; directed edges encode subtype axioms ($\forall x\, (P(x) \to Q(x))$). Edges may optionally encode negated properties ($\forall x\, (P(x) \to \lnot Q(x))$).
- Linear Ontologies: Experimental ontologies are restricted to linear paths of fixed length (3–10), so that proof complexity can be modulated by path length.
- Inference Rules: Only two are permitted: (i) Ax, which asserts a context axiom $P(c)$ directly, and (ii) Hop, which from $P(c)$ and $\forall x\, (P(x) \to Q(x))$ infers $Q(c)$. All valid proofs are monotonic sequences of Hops on a single individual constant.
This synthetic structure allows for algorithmically generated, unambiguous chains of deductive reasoning, contrasting sharply with knowledge-based or ambiguous naturalistic benchmarks.
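The Ax/Hop machinery above is small enough to sketch directly. The helper name `hop` and the tuple encoding below are illustrative choices, not taken from the paper's code:

```python
def hop(fact, rule):
    """Hop: from P(c) and forall x. P(x) -> (not-)Q(x), infer (not-)Q(c).
    fact = (predicate, constant, negated); rule = (antecedent, consequent, negated)."""
    pred, const, neg = fact
    ante, cons, rule_neg = rule
    if neg or pred != ante:
        return None                 # Hop only fires on a positive membership fact
    return (cons, const, rule_neg)

# Ax asserts a context fact; chaining Hops walks up the linear ontology:
fact = ("cat", "fae", False)                            # Ax: cat(fae)
for rule in [("cat", "carnivore", False),
             ("carnivore", "reptile", False),
             ("reptile", "herbivorous", True)]:         # every reptile is NOT herbivorous
    fact = hop(fact, rule)
print(fact)   # -> ('herbivorous', 'fae', True), i.e. not-herbivorous(fae)
```

Because every proof is a monotonic chain of Hops on one constant, this single loop is the entire deduction engine a verifier needs.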
2. Dataset Construction and Data Generation Protocol
PrOntoQA instances are constructed in a multi-stage, fully algorithmic workflow:
- Ontology Sampling: Linear paths of predicates are sampled at a fixed target length. With probability $1/2$ per edge, negative property axioms ($\forall x\, (P(x) \to \lnot Q(x))$) are injected. Concepts are named either from a small "true" vocabulary or generated as fictional names, with distractor concepts inserted to block lexical shortcuts.
- Proof Generation: For a given path, a random individual constant $c$ is asserted at the tail (an atomic fact $P_0(c)$). Hops are applied in order to propagate membership up the chain, with the final query targeting either a positive or a negated property after $k$ steps.
- Natural-Language Realization: Axioms and proof steps are programmatically rendered in plain English. Context sentences present the ontology ("Every cat is a carnivore"; "Every reptile is not herbivorous"); queries and CoTs reflect the proof trajectory, with queries randomly choosing positive or negative forms.
- Chain-of-Thought Annotation: Each CoT is a stepwise natural-language realization of the proof, one sentence per deduction step, directly aligned with the formal sequence of Hops leading to the query conclusion.
Such full control and transparency enable precise experimental manipulations, including ontology vocabulary (true/fictional), proof depth, and context-traversal order.
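The generation protocol above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the helper `generate_instance`, its English templates, and the choice to negate only the final property edge are all assumptions made for brevity.

```python
import random

def generate_instance(path, constant="fae", seed=0):
    """Sketch of PrOntoQA-style generation (hypothetical helper): renders a
    positive subtype chain in English and derives the gold label by
    forward-chaining Hops; the final edge becomes a negative property
    axiom with probability 1/2."""
    rng = random.Random(seed)
    negate_last = rng.random() < 0.5
    context = [f"{constant.capitalize()} is a {path[0]}."]
    for p, q in zip(path, path[1:]):
        neg = negate_last and q == path[-1]
        context.append(f"Every {p} is {'not' if neg else 'a'} {q}.")
    query = f"True or false: {constant.capitalize()} is a {path[-1]}."
    # Forward-chaining the Hops proves (not-)path[-1](constant):
    label = not negate_last
    return context, query, label

ctx, query, label = generate_instance(["cat", "carnivore", "reptile"], seed=1)
```

Because generation is seeded and deterministic, the gold label falls out of the same symbolic machinery that produced the context, which is what makes every instance unambiguous.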
3. Chain-of-Thought Parsing and Proof Reconstruction
A defining methodological advance is the bidirectional mapping between natural-language CoTs and symbolic FOL proofs:
- Semantic Parsing: Each CoT sentence is parsed via a deterministic grammar to its unique FOL formula (e.g., "Every reptile is not herbivorous" $\mapsto \forall x\, (\text{reptile}(x) \to \lnot \text{herbivorous}(x))$).
- Proof Validation: Using the context axioms and previously derived facts, each proof step is rigorously evaluated:
- Validity: Step is strictly-valid (due only to Ax or Hop), broadly-valid (permitted by transitivity, not explicitly in gold ontology), or invalid (unlicensed).
- Atomicity: Step is atomic (one Hop) or non-atomic (skips intermediates).
- Utility: Step is correct (progresses toward gold conclusion) or misleading (diverges to dead ends).
- Proof/Label Metrics: Overall proof quality is scored along axes of strict proof-accuracy (all steps atomic, strictly-valid, non-misleading), skip accuracy (allows non-atomic, strictly-valid steps), broad accuracy, and valid accuracy (any strictly or broadly valid steps).
This infrastructure supports precise ablation analyses of LLM reasoning properties at a granular proof-step level.
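The validity/atomicity taxonomy above can be approximated symbolically. The sketch below is a simplification with hypothetical names (the paper's pipeline classifies parsed English sentences, and its "broadly-valid" category is defined relative to the gold ontology; here transitive Hop closure stands in for it):

```python
def classify_step(step, facts, rules):
    """Classify a candidate proof step (sketch).
    step and facts are (predicate, constant, negated) triples;
    rules are (antecedent, consequent, negated) triples."""
    if step in facts:
        return "strictly-valid"                        # Ax: step is a context fact
    for ante, cons, rneg in rules:                     # one Hop from a known fact
        if (ante, step[1], False) in facts and (cons, rneg) == (step[0], step[2]):
            return "strictly-valid"
    derived = set(facts)                               # transitive closure of Hops
    changed = True
    while changed:
        changed = False
        for ante, cons, rneg in rules:
            for pred, const, neg in list(derived):
                if pred == ante and not neg and (cons, const, rneg) not in derived:
                    derived.add((cons, const, rneg))
                    changed = True
    return "broadly-valid" if step in derived else "invalid"

facts = {("cat", "fae", False)}
rules = [("cat", "carnivore", False), ("carnivore", "reptile", False)]
print(classify_step(("reptile", "fae", False), facts, rules))  # -> broadly-valid (skips a Hop)
```

A step that skips an intermediate predicate is reachable only through the closure, so it is classified broadly-valid (non-atomic) rather than strictly-valid, mirroring the distinction drawn above.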
4. Dataset Organization, Experimentation, and Complexity Controls
PrOntoQA is structured to systematically probe LLM deductive capabilities under varying reasoning demands:
| Experimental Variable | Values | Description |
|---|---|---|
| Ontology Type | Fictional, “true”, “false” | Fictional, consistent with real-world knowledge (“true”), or contradicting it (“false”) |
| Number of Hops ($k$) | 1, 3, 5 | Controls proof depth, from immediate to multi-step deduction |
| Context Order | Bottom-up, Top-down | Whether context sentences follow proof order or are reversed |
| Prompting Regime | 8-shot in-context learning | Each input: 8 labeled training examples plus 1 new test instance |
| Path/Distractor Structure | Linear chain plus distractor | Ensures no trivial string-matching shortcuts |
| Total Settings | 48 | All combinations of above |
| Test Instances per Setting | 400 | Uniform sampling |
| Total Test Instances | 48 × 400 = 19,200 | Full test set size |
| Predicate Vocabulary Size | Up to 2,000 (fictional), 3 (true) | Varies per ontology, per path length |
Proof structures are strictly linear chains of Hops, with hop counts $k \in \{1, 3, 5\}$ uniformly distributed across the dataset. Controlled complexity is achieved by manipulating $k$ and context order, spanning from trivial (depth-1, bottom-up) to challenging (depth-5, top-down, fictional names). Each instance ensures unambiguous FOL structure with deterministic ground truth.
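A quick sanity check on the table's counts, using only the values stated there:

```python
settings = 48           # experimental settings (from the table)
per_setting = 400       # test instances sampled per setting
total = settings * per_setting
print(total)            # -> 19200 total test instances
```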
5. Illustrative Worked Example
A canonical example demonstrates the stepwise structure and evaluation methodology:
3-Hop Fictional Ontology (Bottom-Up Order):
- Context:
  1. Fae is a cat.
  2. Cats are carnivores.
  3. Carnivores are reptiles.
  4. Reptiles are not herbivorous.
- Distractor: Zebrites are herbivorous.
- Query: True or false: Fae is herbivorous.
- Gold CoT:
- a. Fae is a cat.
- b. Every cat is a carnivore.
- c. Fae is a carnivore.
- d. Every carnivore is a reptile.
- e. Fae is a reptile.
- f. Every reptile is not herbivorous.
- g. Fae is not herbivorous.
- Label: False
- Formal Proof:
  1. Ax: $\text{cat}(\text{fae})$
  2. Hop: $\text{cat}(\text{fae})$, $\forall x\, (\text{cat}(x) \to \text{carnivore}(x)) \vdash \text{carnivore}(\text{fae})$
  3. Hop: $\text{carnivore}(\text{fae})$, $\forall x\, (\text{carnivore}(x) \to \text{reptile}(x)) \vdash \text{reptile}(\text{fae})$
  4. Hop: $\text{reptile}(\text{fae})$, $\forall x\, (\text{reptile}(x) \to \lnot \text{herbivorous}(x)) \vdash \lnot \text{herbivorous}(\text{fae})$
Each instance is thus fully specified and unambiguous with respect to both language and formal logic.
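The worked example can be checked mechanically by forward chaining; the tuple encoding below is an illustrative sketch, mirroring the formal proof step by step:

```python
facts = {("cat", "fae", False)}                       # Ax: cat(fae)
rules = [("cat", "carnivore", False),
         ("carnivore", "reptile", False),
         ("reptile", "herbivorous", True)]            # reptiles are NOT herbivorous
for ante, cons, neg in rules:                         # three Hops, in proof order
    for pred, const, n in list(facts):
        if pred == ante and not n:
            facts.add((cons, const, neg))
# The query "Fae is herbivorous" is true iff the positive fact was derived:
label = "True" if ("herbivorous", "fae", False) in facts else "False"
print(label)   # -> False, matching the gold label
```

The chain derives $\lnot \text{herbivorous}(\text{fae})$ rather than the queried positive fact, so the instance's gold label is False.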
6. Model Performance Patterns and Error Analysis
Empirical analysis using GPT-3 variants and InstructGPT uncovers key LLM reasoning signatures:
- Only the largest model tested (InstructGPT text-davinci-002) reliably outperforms chance; smaller models frequently default to random guessing.
- On "true" ontologies, models exhibit high accuracy (≈85–90% at shallow proof depths), plausibly reflecting memorization and the shortcutting of deductive steps.
- Deeper or fictional/inconsistent ontologies degrade performance: accuracy drops from ≈90% at $k=1$ to ≈50–60% at $k=5$, particularly under top-down context order.
- Most generated CoT steps (>90%) are strictly-valid modus ponens hops, regardless of whether the entire proof is correct.
- Error analysis reveals that the dominant failure mode is strictly-valid but misleading atomic steps: models "greedily" choose a next step and cannot backtrack if the path is suboptimal (proof planning bottleneck).
- Neither self-consistency sampling nor explicit in-context demonstration of depth-first search mitigates planning errors; LLMs rarely explore alternative deduction branches.
This suggests that contemporary LLMs are highly effective at local deduction but lack global proof-planning competence, particularly when multiple valid inference paths exist.
7. Significance and Research Implications
PrOntoQA advances benchmark methodology for formal reasoning evaluation in several respects:
- Every example admits a programmatically generated, exact FOL proof, allowing complete transparency and unambiguous stepwise error attribution.
- The dataset supports granular metrics (atomicity, validity, and utility per step), exposing differentiated error sources—notably step-level vs. planning-level failures.
- By varying ontology vocabulary and order, PrOntoQA systematically isolates memorization, lexical shortcutting, and true deductive ability.
- The findings reveal a dissociation between LLMs’ local deduction accuracy and global proof-plan construction, establishing a new controlled paradigm for future interventions in model reasoning architecture and training (Saparov & He, 2022).
A plausible implication is that further improvements in formal reasoning will require architectures or training regimes that explicitly model proof search, rather than incrementally accumulating strictly valid but potentially misleading steps.
References
- Saparov, A., & He, H. (2022). "Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought."