- The paper introduces a two-phase pipeline that extracts events from historical narratives and formalizes them with the Coq proof assistant.
- It demonstrates that pure base generation outperforms RAG-enhanced approaches for stronger models, while weaker models require external scaffolding.
- The work overcomes RDF/OWL limitations by converting the extracted RDF representations into Coq specifications, enabling multi-step causal reasoning and formal verification.
Reasoning with RAGged Events: RAG-Enhanced Event Knowledge Base Construction and Reasoning with Proof-Assistants
This paper (arXiv:2506.07042) addresses the challenges of extracting structured representations of historical events from narrative text and reasoning about them. It introduces an approach that leverages LLMs, enhanced with knowledge graph information and retrieval-augmented generation (RAG), to automatically construct historical event knowledge bases. The extracted RDF representations are then translated into Coq proof assistant specifications, enabling higher-order reasoning.
Methodology and Experimental Setup
The authors implement a two-phase pipeline. Phase 1 performs semantic event extraction from unstructured historical narratives, encompassing event boundary detection, agent identification, geographical entity resolution, temporal expression normalization, outcome extraction, and RDF knowledge graph construction. Phase 2 covers RDF-to-Coq inductive type conversion, higher-order temporal logic implementation, causal inference framework integration, and proof-assistant compatibility for formal verification. The methodology uses historical texts from Thucydides' History of the Peloponnesian War as a controlled domain. Three LLMs (GPT-4o, Claude-3.5 Sonnet, and Llama 3.2) are each evaluated under three enhancement strategies: base generation, knowledge graph enhancement, and RAG. External knowledge is retrieved from the Wikidata and DBpedia SPARQL endpoints and the ConceptNet API.
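To make Phase 1 concrete, the sketch below shows how a normalized event record might be serialized as RDF/Turtle triples. This is a minimal illustration, not the authors' implementation: the `Event` record, the `ex:` namespace, and all property names are hypothetical assumptions standing in for the paper's actual schema.

```python
# Illustrative Phase 1 output assembly: one extracted, normalized event
# record is serialized as RDF/Turtle. Schema and names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Event:
    event_id: str          # stable identifier for the event node
    agents: list           # identified agents (polities, commanders, ...)
    place: str             # resolved geographical entity
    year: int              # normalized temporal expression (negative = BCE)
    outcome: str           # extracted outcome description

def to_turtle(ev: Event) -> str:
    """Serialize one event as RDF/Turtle triples (hypothetical ex: schema)."""
    lines = [f"ex:{ev.event_id} a ex:HistoricalEvent ;"]
    for agent in ev.agents:
        lines.append(f"    ex:hasAgent ex:{agent} ;")
    lines.append(f"    ex:occurredAt ex:{ev.place} ;")
    lines.append(f'    ex:year "{ev.year}"^^xsd:integer ;')
    lines.append(f'    ex:outcome "{ev.outcome}" .')
    return "\n".join(lines)

ev = Event("battle_of_sybota", ["Corinth", "Corcyra"], "Sybota", -433,
           "indecisive naval engagement")
print(to_turtle(ev))
```

In the paper's pipeline, records like this would be produced by the LLM-driven extraction steps (boundary detection, agent identification, entity resolution) before the resulting graph is handed to Phase 2.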
Key Findings and the Inverse Calibration Principle
The paper reveals that enhancement strategies optimize different performance dimensions rather than providing universal improvements. An "inverse calibration principle" is observed, in which enhancement effectiveness inversely correlates with model capability. Stronger models like GPT-4o and Claude-3.5 Sonnet achieve superior performance through pure base generation, while weaker models like Llama 3.2 require external scaffolding but exhibit extreme sensitivity to implementation quality. Base generation excels at comprehensive historical coverage, while RAG enhancement improves coordinate accuracy and metadata completeness, trading breadth for technical precision. The Coq formalization validates that RAG-discovered event types represent legitimate domain-specific semantic structures.
Limitations of RDF/OWL Systems and the Coq Translation
The authors highlight the computational limitations of RDF/OWL systems, which are constrained to decidable subsets of first-order logic, limiting their ability to express and verify complex historical relationships. To overcome these limitations, they develop an automated translation pipeline that converts extracted RDF/Turtle representations into formal specifications for the Coq proof assistant. This translation unlocks analytical capabilities impossible within RDF frameworks, such as multi-step causal reasoning and formal verification of historical propositions.
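The shape of this translation step can be sketched as a small code generator: event-type IRIs from the RDF graph become constructors of a Coq inductive type, and causal links become Coq statements that a proof assistant can then reason about. Everything here is an assumption for illustration (the function names, the `causes` relation, and the emitted Coq schema); the paper's actual translation pipeline is not published in this summary.

```python
# Hypothetical sketch of RDF-to-Coq conversion: event types extracted from
# the RDF graph are emitted as constructors of a Coq inductive type, and
# causal links as Coq axioms. The `causes` relation is assumed, not the
# paper's actual encoding.

def rdf_types_to_coq(event_types: list) -> str:
    """Emit a Coq inductive type whose constructors are the event types."""
    ctors = "\n".join(f"  | {t}" for t in event_types)
    return f"Inductive EventType : Type :=\n{ctors}.\n"

def causal_axiom(cause: str, effect: str) -> str:
    """Emit a Coq axiom asserting a causal link between two event nodes."""
    return f"Axiom {cause}_causes_{effect} : causes {cause} {effect}."

print(rdf_types_to_coq(["Battle", "Siege", "Alliance", "Revolt"]))
print(causal_axiom("corcyra_dispute", "peloponnesian_war"))
```

Once such specifications are loaded into Coq, multi-step causal reasoning becomes a matter of proving derived propositions (e.g. chaining `causes` facts through a transitivity lemma), which is exactly the kind of higher-order inference that decidable RDF/OWL fragments cannot express.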
Implications and Future Directions
The paper challenges the assumption that more comprehensive retrieval necessarily leads to better performance, demonstrating that optimal RAG design requires careful evaluation of whether external enhancement is necessary. The discovery that pure inferential generation achieves superior overall performance compared to enhanced RAG configurations has significant implications for the field. Future work should explore generalization across domains and historical periods, investigate hybrid approaches, and develop accessible interfaces for formal verification.