PrOntoQA: Synthetic QA for Formal Reasoning
- PrOntoQA is a synthetic QA dataset that uses well-defined first-order logic ontologies to rigorously evaluate chain-of-thought reasoning in language models.
- Its construction employs algorithmic ontology sampling and systematic proof generation to produce unambiguous FOL proofs and control deduction complexity.
- Empirical analyses show that while LLMs excel at local deductive steps, they struggle with global proof planning in multi-hop reasoning scenarios.
PrOntoQA is a synthetic question-answering dataset purpose-built for the systematic, formal evaluation of LLM reasoning via interpretable first-order logic (FOL) ontologies. Introduced in "Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought" (Saparov & He, 2022), PrOntoQA provides a controlled environment in which each example is generated from a well-defined symbolic world model, enabling rigorous, stepwise analysis of chain-of-thought (CoT) reasoning in LLMs. Distinct from common-sense and real-world knowledge benchmarks, PrOntoQA emphasizes formal deductive reasoning, proof planning, and the diagnosis of error modes in LLM-generated proofs.
1. Formal Ontological World Model
At the core of PrOntoQA lies a compact, synthetic FOL ontology. Each ontology is defined by the following components:
- Signature: A countable set of individual constants (e.g., fae, sally, alex) and a finite set of unary predicate symbols (e.g., $\text{cat}$, $\text{carnivore}$, $\text{herbivorous}$), together with their negations $\lnot P$.
- Logical Language: Formula shapes are restricted to atomic membership assertions $P(c)$ and universal implications ("subtype rules") of the form $\forall x\, (P(x) \to Q(x))$.
- Ontology as a Directed Acyclic Graph (DAG): Nodes are predicates; directed edges encode subtype axioms ($\forall x\, (P(x) \to Q(x))$). Edges may optionally encode negated properties ($\forall x\, (P(x) \to \lnot Q(x))$).
- Linear Ontologies: Experimental ontologies are restricted to linear paths of fixed length (3–10), so that proof complexity can be modulated by path length.
- Inference Rules: Only two are permitted: (i) Ax, which asserts a context axiom $P(c)$ directly, and (ii) Hop, which from $P(c)$ and $\forall x\, (P(x) \to Q(x))$ infers $Q(c)$. All valid proofs are monotonic sequences of Hops on a single individual constant.
This synthetic structure allows for algorithmically generated, unambiguous chains of deductive reasoning, contrasting sharply with knowledge-based or ambiguous naturalistic benchmarks.
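The Ax/Hop machinery above is small enough to sketch directly. The helper name `hop` and the tuple encoding below are illustrative choices, not taken from the paper's code:

```python
def hop(fact, rule):
    """Hop: from P(c) and forall x. P(x) -> (not-)Q(x), infer (not-)Q(c).
    fact = (predicate, constant, negated); rule = (antecedent, consequent, negated)."""
    pred, const, neg = fact
    ante, cons, rule_neg = rule
    if neg or pred != ante:
        return None                 # Hop only fires on a positive membership fact
    return (cons, const, rule_neg)

# Ax asserts a context fact; chaining Hops walks up the linear ontology:
fact = ("cat", "fae", False)                            # Ax: cat(fae)
for rule in [("cat", "carnivore", False),
             ("carnivore", "reptile", False),
             ("reptile", "herbivorous", True)]:         # every reptile is NOT herbivorous
    fact = hop(fact, rule)
print(fact)   # -> ('herbivorous', 'fae', True), i.e. not-herbivorous(fae)
```

Because every proof is a monotonic chain of Hops on one constant, this single loop is the entire deduction engine a verifier needs.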
2. Dataset Construction and Data Generation Protocol
PrOntoQA instances are constructed in a multi-stage, fully algorithmic workflow:
- Ontology Sampling: Linear paths of predicates are sampled at a fixed target length. With probability $1/2$ per edge, negative property axioms ($\forall x\, (P(x) \to \lnot Q(x))$) are injected. Concepts are named either from a small "true" vocabulary or generated as fictional names, with distractor concepts inserted to block lexical shortcuts.
- Proof Generation: For a given path, a random individual constant $c$ is asserted at the tail (an atomic fact $P_0(c)$). Hops are applied in order to propagate membership up the chain, with the final query targeting either a positive or a negated property after $k$ steps.
- Natural-Language Realization: Axioms and proof steps are programmatically rendered in plain English. Context sentences present the ontology ("Every cat is a carnivore"; "Every reptile is not herbivorous"); queries and CoTs reflect the proof trajectory, with queries randomly choosing positive or negative forms.
- Chain-of-Thought Annotation: Each CoT is a stepwise natural-language realization of the proof, one sentence per deduction step, directly aligned with the formal sequence of Hops leading to the query conclusion.
Such full control and transparency enable precise experimental manipulations, including ontology vocabulary (true/fictional), proof depth, and context-traversal order.
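The generation protocol above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the helper `generate_instance`, its English templates, and the choice to negate only the final property edge are all assumptions made for brevity.

```python
import random

def generate_instance(path, constant="fae", seed=0):
    """Sketch of PrOntoQA-style generation (hypothetical helper): renders a
    positive subtype chain in English and derives the gold label by
    forward-chaining Hops; the final edge becomes a negative property
    axiom with probability 1/2."""
    rng = random.Random(seed)
    negate_last = rng.random() < 0.5
    context = [f"{constant.capitalize()} is a {path[0]}."]
    for p, q in zip(path, path[1:]):
        neg = negate_last and q == path[-1]
        context.append(f"Every {p} is {'not' if neg else 'a'} {q}.")
    query = f"True or false: {constant.capitalize()} is a {path[-1]}."
    # Forward-chaining the Hops proves (not-)path[-1](constant):
    label = not negate_last
    return context, query, label

ctx, query, label = generate_instance(["cat", "carnivore", "reptile"], seed=1)
```

Because generation is seeded and deterministic, the gold label falls out of the same symbolic machinery that produced the context, which is what makes every instance unambiguous.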
3. Chain-of-Thought Parsing and Proof Reconstruction
A defining methodological advance is the bidirectional mapping between natural-language CoTs and symbolic FOL proofs:
- Semantic Parsing: Each CoT sentence is parsed via a deterministic grammar to its unique FOL formula (e.g., "Every reptile is not herbivorous" $\mapsto \forall x\, (\text{reptile}(x) \to \lnot \text{herbivorous}(x))$).
- Proof Validation: Using the context axioms and previously derived facts, each proof step is rigorously evaluated:
- Validity: Step is strictly-valid (due only to Ax or Hop), broadly-valid (permitted by transitivity, not explicitly in gold ontology), or invalid (unlicensed).
- Atomicity: Step is atomic (one Hop) or non-atomic (skips intermediates).
- Utility: Step is correct (progresses toward gold conclusion) or misleading (diverges to dead ends).
- Proof/Label Metrics: Overall proof quality is scored along axes of strict proof-accuracy (all steps atomic, strictly-valid, non-misleading), skip accuracy (allows non-atomic, strictly-valid steps), broad accuracy, and valid accuracy (any strictly or broadly valid steps).
This infrastructure supports precise ablation analyses of LLM reasoning properties at a granular proof-step level.
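The validity/atomicity taxonomy above can be approximated symbolically. The sketch below is a simplification with hypothetical names (the paper's pipeline classifies parsed English sentences, and its "broadly-valid" category is defined relative to the gold ontology; here transitive Hop closure stands in for it):

```python
def classify_step(step, facts, rules):
    """Classify a candidate proof step (sketch).
    step and facts are (predicate, constant, negated) triples;
    rules are (antecedent, consequent, negated) triples."""
    if step in facts:
        return "strictly-valid"                        # Ax: step is a context fact
    for ante, cons, rneg in rules:                     # one Hop from a known fact
        if (ante, step[1], False) in facts and (cons, rneg) == (step[0], step[2]):
            return "strictly-valid"
    derived = set(facts)                               # transitive closure of Hops
    changed = True
    while changed:
        changed = False
        for ante, cons, rneg in rules:
            for pred, const, neg in list(derived):
                if pred == ante and not neg and (cons, const, rneg) not in derived:
                    derived.add((cons, const, rneg))
                    changed = True
    return "broadly-valid" if step in derived else "invalid"

facts = {("cat", "fae", False)}
rules = [("cat", "carnivore", False), ("carnivore", "reptile", False)]
print(classify_step(("reptile", "fae", False), facts, rules))  # -> broadly-valid (skips a Hop)
```

A step that skips an intermediate predicate is reachable only through the closure, so it is classified broadly-valid (non-atomic) rather than strictly-valid, mirroring the distinction drawn above.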
4. Dataset Organization, Experimentation, and Complexity Controls
PrOntoQA is structured to systematically probe LLM deductive capabilities under varying reasoning demands:
| Experimental Variable | Values | Description |
|---|---|---|
| Ontology Type | Fictional, “true”, “false” | Fictional, consistent with real-world knowledge (“true”), or contradicting it (“false”) |
| Number of Hops ($k$) | 1, 3, 5 | Controls proof depth, from immediate to multi-step deduction |
| Context Order | Bottom-up, Top-down | Whether context sentences follow proof order or are reversed |
| Prompting Regime | 8-shot in-context learning | Each input: 8 labeled training examples plus 1 new test instance |
| Path/Distractor Structure | Linear chain plus distractor | Ensures no trivial string-matching shortcuts |
| Total Settings | 48 | All combinations of above |
| Test Instances per Setting | 400 | Uniform sampling |
| Total Test Instances | 48 × 400 = 19,200 | Full test set size |
| Predicate Vocabulary Size | Up to 2,000 (fictional), 3 (true) | Varies per ontology, per path length |
Proof structures are strictly linear chains of Hops, with hop counts $k \in \{1, 3, 5\}$ uniformly distributed across the dataset. Controlled complexity is achieved by manipulating $k$ and context order, spanning from trivial (depth-1, bottom-up) to challenging (depth-5, top-down, fictional names). Each instance ensures unambiguous FOL structure with deterministic ground truth.
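A quick sanity check on the table's counts, using only the values stated there:

```python
settings = 48           # experimental settings (from the table)
per_setting = 400       # test instances sampled per setting
total = settings * per_setting
print(total)            # -> 19200 total test instances
```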
5. Illustrative Worked Example
A canonical example demonstrates the stepwise structure and evaluation methodology:
3-Hop Fictional Ontology (Bottom-Up Order):
- Context:
  1. Fae is a cat.
  2. Cats are carnivores.
  3. Carnivores are reptiles.
  4. Reptiles are not herbivorous.
- Distractor: Zebrites are herbivorous.
- Query: True or false: Fae is herbivorous.
- Gold CoT:
- a. Fae is a cat.
- b. Every cat is a carnivore.
- c. Fae is a carnivore.
- d. Every carnivore is a reptile.
- e. Fae is a reptile.
- f. Every reptile is not herbivorous.
- g. Fae is not herbivorous.
- Label: False
- Formal Proof:
  1. Ax: $\text{cat}(\text{fae})$
  2. Hop: $\text{cat}(\text{fae})$, $\forall x\, (\text{cat}(x) \to \text{carnivore}(x)) \vdash \text{carnivore}(\text{fae})$
  3. Hop: $\text{carnivore}(\text{fae})$, $\forall x\, (\text{carnivore}(x) \to \text{reptile}(x)) \vdash \text{reptile}(\text{fae})$
  4. Hop: $\text{reptile}(\text{fae})$, $\forall x\, (\text{reptile}(x) \to \lnot \text{herbivorous}(x)) \vdash \lnot \text{herbivorous}(\text{fae})$
Each instance is thus fully specified and unambiguous with respect to both language and formal logic.
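The worked example can be checked mechanically by forward chaining; the tuple encoding below is an illustrative sketch, mirroring the formal proof step by step:

```python
facts = {("cat", "fae", False)}                       # Ax: cat(fae)
rules = [("cat", "carnivore", False),
         ("carnivore", "reptile", False),
         ("reptile", "herbivorous", True)]            # reptiles are NOT herbivorous
for ante, cons, neg in rules:                         # three Hops, in proof order
    for pred, const, n in list(facts):
        if pred == ante and not n:
            facts.add((cons, const, neg))
# The query "Fae is herbivorous" is true iff the positive fact was derived:
label = "True" if ("herbivorous", "fae", False) in facts else "False"
print(label)   # -> False, matching the gold label
```

The chain derives $\lnot \text{herbivorous}(\text{fae})$ rather than the queried positive fact, so the instance's gold label is False.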
6. Model Performance Patterns and Error Analysis
Empirical analysis using GPT-3 variants and InstructGPT uncovers key LLM reasoning signatures:
- Only the largest model tested (InstructGPT text-davinci-002) reliably outperforms chance; smaller models frequently default to random guessing.
- On "true" ontologies, models exhibit high accuracy (≈85–90% at shallow proof depths), plausibly reflecting memorization and the shortcutting of deductive steps.
- Deeper or fictional/inconsistent ontologies degrade performance: accuracy drops from ≈90% at $k=1$ to ≈50–60% at $k=5$, particularly under top-down context order.
- Most generated CoT steps (>90%) are strictly-valid modus ponens hops, regardless of whether the entire proof is correct.
- Error analysis reveals that the dominant failure mode is strictly-valid but misleading atomic steps: models "greedily" choose a next step and cannot backtrack if the path is suboptimal (proof planning bottleneck).
- Neither self-consistency sampling nor explicit in-context demonstration of depth-first search mitigates planning errors; LLMs rarely explore alternative deduction branches.
This suggests that contemporary LLMs are highly effective at local deduction but lack global proof-planning competence, particularly when multiple valid inference paths exist.
7. Significance and Research Implications
PrOntoQA advances benchmark methodology for formal reasoning evaluation in several respects:
- Every example admits a programmatically generated, exact FOL proof, allowing complete transparency and unambiguous stepwise error attribution.
- The dataset supports granular metrics (atomicity, validity, and utility per step), exposing differentiated error sources—notably step-level vs. planning-level failures.
- By varying ontology vocabulary and order, PrOntoQA systematically isolates memorization, lexical shortcutting, and true deductive ability.
- The findings reveal a dissociation between LLMs’ local deduction accuracy and global proof-plan construction, establishing a new controlled paradigm for future interventions in model reasoning architecture and training (Saparov & He, 2022).
A plausible implication is that further improvements in formal reasoning will require architectures or training regimes that explicitly model proof search, rather than incrementally accumulating strictly valid but potentially misleading steps.
References
- Saparov, A., & He, H. (2022). "Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought."