
Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants (2506.07042v2)

Published 8 Jun 2025 in cs.CL

Abstract: Extracting structured computational representations of historical events from narrative text remains computationally expensive when constructed manually. While RDF/OWL reasoners enable graph-based reasoning, they are limited to fragments of first-order logic, preventing deeper temporal and semantic analysis. This paper addresses both challenges by developing automatic historical event extraction models using multiple LLMs (GPT-4, Claude, Llama 3.2) with three enhancement strategies: pure base generation, knowledge graph enhancement, and Retrieval-Augmented Generation (RAG). We conducted comprehensive evaluations using historical texts from Thucydides. Our findings reveal that enhancement strategies optimize different performance dimensions rather than providing universal improvements. For coverage and historical breadth, base generation achieves optimal performance with Claude and GPT-4 extracting comprehensive events. However, for precision, RAG enhancement improves coordinate accuracy and metadata completeness. Model architecture fundamentally determines enhancement sensitivity: larger models demonstrate robust baseline performance with incremental RAG improvements, while Llama 3.2 shows extreme variance from competitive performance to complete failure. We then developed an automated translation pipeline converting extracted RDF representations into Coq proof assistant specifications, enabling higher-order reasoning beyond RDF capabilities including multi-step causal verification, temporal arithmetic with BC dates, and formal proofs about historical causation. The Coq formalization validates that RAG-discovered event types represent legitimate domain-specific semantic structures rather than ontological violations.

Summary

  • The paper introduces a two-phase pipeline that extracts events from historical narratives and formalizes them in the Coq proof assistant.
  • It demonstrates that pure base generation outperforms RAG-enhanced approaches for stronger models while weaker models need external scaffolding.
  • The work overcomes RDF/OWL limitations by converting extracted RDF representations into Coq, enabling multi-step causal reasoning and formal verification.

This paper (2506.07042) addresses the challenges of extracting structured representations of historical events from narrative text and reasoning about them. It introduces an approach that leverages LLMs, enhanced with knowledge graph information and RAG, to automatically construct historical event knowledge bases. The extracted RDF representations are then translated into Coq proof assistant specifications, enabling higher-order reasoning.

Methodology and Experimental Setup

The authors implement a two-phase pipeline. Phase 1 performs semantic event extraction from unstructured historical narratives: event boundary detection, agent identification, geographical entity resolution, temporal expression normalization, outcome extraction, and RDF knowledge graph construction. Phase 2 converts the RDF output into Coq inductive types, implements higher-order temporal logic, integrates a causal inference framework, and ensures proof-assistant compatibility for formal verification. Thucydides' History of the Peloponnesian War serves as a controlled domain. Three LLMs are evaluated (GPT-4o, Claude-3.5 Sonnet, and Llama 3.2), each under three enhancement strategies: base generation, knowledge graph enhancement, and RAG. External knowledge is retrieved from the Wikidata and DBpedia SPARQL endpoints and the ConceptNet API.

Key Findings and the Inverse Calibration Principle

The paper reports that enhancement strategies optimize different performance dimensions rather than providing universal improvements. The authors observe an "inverse calibration principle": enhancement effectiveness inversely correlates with model capability. Stronger models such as GPT-4o and Claude-3.5 perform best overall with pure base generation, while weaker models such as Llama 3.2 require external scaffolding but are extremely sensitive to implementation quality, ranging from competitive performance to complete failure. Base generation excels in comprehensive historical coverage, while RAG enhancement improves coordinate accuracy and metadata completeness, trading breadth for technical precision. The Coq formalization validates that RAG-discovered event types represent legitimate domain-specific semantic structures rather than ontological violations.
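
To make the breadth-versus-precision trade-off concrete, here is a minimal sketch of how two such dimensions could be scored. Both metric definitions and the `REQUIRED_FIELDS` list are assumptions for illustration; this summary does not state how the paper computes coverage or metadata completeness.

```python
from typing import Dict, Iterable, List

# Hypothetical list of metadata fields an event record is expected to carry.
REQUIRED_FIELDS = ("date", "place", "latitude", "longitude", "agents")


def coverage(extracted_ids: Iterable[str], reference_ids: Iterable[str]) -> float:
    """Fraction of reference events that appear among the extracted events."""
    reference = set(reference_ids)
    if not reference:
        return 0.0
    return len(reference & set(extracted_ids)) / len(reference)


def completeness(events: List[Dict[str, object]]) -> float:
    """Fraction of required metadata fields that are filled across all events."""
    if not events:
        return 0.0
    filled = sum(
        1
        for event in events
        for name in REQUIRED_FIELDS
        # Missing keys, None, empty strings and empty lists count as unfilled;
        # a numeric 0.0 (e.g. a latitude on the equator) counts as filled.
        if event.get(name) not in (None, "", [])
    )
    return filled / (len(events) * len(REQUIRED_FIELDS))
```

Under definitions like these, a strategy can score high on `coverage` by extracting many events while scoring low on `completeness` if those events lack coordinates, which is the kind of trade-off the findings describe.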

Limitations of RDF/OWL Systems and the Coq Translation

The authors highlight the computational limitations of RDF/OWL systems, which are constrained to decidable subsets of first-order logic, limiting their ability to express and verify complex historical relationships. To overcome these limitations, they develop an automated translation pipeline that converts extracted RDF/Turtle representations into formal specifications for the Coq proof assistant. This translation unlocks analytical capabilities impossible within RDF frameworks, such as multi-step causal reasoning and formal verification of historical propositions.
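
The sketch below illustrates, under stated assumptions, two ingredients of such a translation: arithmetic on BC dates and emission of a small Coq specification. It maps historical years to astronomical year numbering (1 BC becomes 0, 2 BC becomes -1) so that intervals can be computed by plain subtraction, then generates an inductive `Event` type and a `year` function as Coq source text. The function names, the generated Coq layout, and the choice of astronomical numbering are all hypothetical; the paper's actual translation pipeline is not reproduced here.

```python
from typing import Dict, Tuple

HistoricalYear = Tuple[int, str]  # (year, "BC" or "AD"), e.g. (431, "BC")


def astronomical_year(year: int, era: str) -> int:
    """Map a historical year to astronomical numbering (there is no year 0 in BC/AD)."""
    if year < 1:
        raise ValueError("historical years start at 1")
    if era == "BC":
        return 1 - year
    if era == "AD":
        return year
    raise ValueError(f"unknown era: {era!r}")


def years_between(start: HistoricalYear, end: HistoricalYear) -> int:
    """Signed number of years from start to end."""
    return astronomical_year(*end) - astronomical_year(*start)


def emit_coq(events: Dict[str, HistoricalYear]) -> str:
    """Emit Coq source declaring the events and their years as integers (Z).

    Keys must already be valid Coq identifiers; this sketch does not sanitize them.
    """
    ctors = "\n".join(f"  | {name}" for name in events)
    cases = "\n".join(
        f"  | {name} => ({astronomical_year(*when)})" for name, when in events.items()
    )
    return (
        "Require Import ZArith.\n"
        "Open Scope Z_scope.\n\n"
        f"Inductive Event : Type :=\n{ctors}.\n\n"
        "Definition year (e : Event) : Z :=\n"
        "  match e with\n"
        f"{cases}\n"
        "  end.\n"
    )


if __name__ == "__main__":
    war = {
        "OutbreakOfWar": (431, "BC"),
        "SicilianExpedition": (415, "BC"),
        "SurrenderOfAthens": (404, "BC"),
    }
    print(years_between(war["OutbreakOfWar"], war["SurrenderOfAthens"]))
    print(emit_coq(war))
```

Encoding years as integers is a design choice that keeps ordering and interval lemmas within reach of Coq's standard integer arithmetic; a fuller translation would also need to carry agents, places, and causal links.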

Implications and Future Directions

The paper challenges the assumption that more comprehensive retrieval necessarily leads to better performance, arguing that RAG design should begin by evaluating whether external enhancement is needed at all. The finding that pure base generation outperforms enhanced RAG configurations overall suggests that retrieval is best applied selectively, for dimensions such as coordinate accuracy and metadata completeness, rather than by default. Future work should explore generalization across domains and historical periods, investigate hybrid approaches, and develop accessible interfaces for formal verification.