- The paper presents a hybrid methodology that integrates LLMs with symbolic reasoning to construct verifiable expert systems.
- It introduces a recursive prompt-chaining process that converts JSON responses into Prolog facts and rules, ensuring transparency.
- Empirical evaluation shows factual accuracy over 99% across diverse domains, demonstrating reliable and explainable AI outputs.
Integrating LLMs and Symbolic Reasoning: A Hybrid Approach to Expert System Construction
The paper "GOFAI meets Generative AI: Development of Expert Systems by means of LLMs" (2507.13550) presents a systematic methodology for constructing expert systems by leveraging the generative capabilities of LLMs and encoding their outputs into symbolic, logic-based representations. The approach is motivated by the persistent challenge of hallucinations and unverifiable outputs in LLMs, particularly in domains where factual accuracy and explainability are paramount.
Methodological Framework
The core contribution is a pipeline that extracts structured, domain-constrained knowledge from LLMs using recursive, prompt-based querying. The extracted knowledge is formalized as Prolog facts and rules, enabling transparent, interpretable, and verifiable expert systems. The process is parameterized by two hyperparameters: horizontal breadth (number of related concepts per node) and vertical depth (maximum expansion depth in the conceptual graph). The system operates as follows (a minimal code sketch follows the list):
- Prompt Chaining and Extraction: For a given root concept, the system recursively queries the LLM for semantically related concepts and relations, using structured prompts tailored to the domain.
- Symbolic Encoding: The LLM's JSON-formatted responses are parsed and translated into Prolog predicates, with both core (e.g., concept/1, related_to/2, implies/2, causes/2) and domain-specific relations (e.g., written_by/2, developed_by/2).
- Deduplication and Validation: Lexical normalization and hashing ensure uniqueness of facts. Syntax validation is performed using SWI-Prolog, guaranteeing that the generated knowledge bases are executable.
- Human-in-the-Loop Verification: The symbolic representation allows domain experts to audit, correct, and extend the knowledge base, mitigating the risk of LLM hallucinations.
- Visualization: An automated tool renders the Prolog knowledge base as a directed graph, supporting qualitative inspection and structural validation.
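To make the pipeline concrete, the expansion loop can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: `ask_llm` is a hypothetical callable standing in for the model API, and the JSON shape (`{"related": [{"concept": ..., "relation": ...}]}`) is an assumed schema for the example.

```python
import hashlib
import json
import re

def normalize(term: str) -> str:
    """Lexical normalization: lowercase and convert to a snake_case Prolog atom."""
    return re.sub(r"\W+", "_", term.strip().lower()).strip("_")

def to_fact(predicate: str, *args: str) -> str:
    """Render a Prolog fact, e.g. related_to(plato, theory_of_forms)."""
    return f"{predicate}({', '.join(normalize(a) for a in args)})."

def expand(root: str, breadth: int, depth: int, ask_llm) -> list[str]:
    """Recursively query the LLM up to `depth` levels, keeping at most `breadth`
    related concepts per node, and emit deduplicated Prolog facts.
    `ask_llm(concept, breadth)` is a hypothetical callable returning the model's
    JSON reply to a structured, domain-constrained prompt."""
    facts, seen = [], set()

    def add(fact: str) -> None:
        # Deduplication: hash the normalized fact and keep only the first occurrence.
        h = hashlib.sha256(fact.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            facts.append(fact)

    frontier = [(root, 0)]
    while frontier:
        concept, d = frontier.pop()
        add(to_fact("concept", concept))          # concept/1
        if d >= depth:
            continue
        reply = json.loads(ask_llm(concept, breadth))
        for item in reply.get("related", [])[:breadth]:
            # related_to/2 by default, or a domain-specific relation named by the model
            add(to_fact(item.get("relation", "related_to"), concept, item["concept"]))
            frontier.append((item["concept"], d + 1))
    return facts
```

Calling `expand("Plato", breadth=5, depth=2, ask_llm=my_client)` (with `my_client` supplied by the user) would return a flat list of facts ready to be written to a .pl file; syntax validation and expert review then proceed as described above.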
Experimental Evaluation
The system was empirically evaluated using Claude Sonnet 3.7 and GPT-4.1, focusing on both factual accuracy and structural expressiveness.
Quantitative Verification
A statistical hypothesis test was conducted on 250 randomly sampled Prolog assertions across 25 topics in history, literature, and philosophy. The results are as follows:
- Claude Sonnet 3.7: Achieved 99.2% factual accuracy (±1.1%), with a p-value of 1.6e-14, allowing strong rejection of the null hypothesis that accuracy is ≤80%.
- GPT-4.1: Achieved 99.6% factual accuracy (±0.8%), with a p-value of 4.9e-15. No statistically significant difference was found between the two models (p=0.56).
These results demonstrate that, when constrained by domain and validated by experts, LLM-derived knowledge bases can achieve high factual reliability.
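The reported figures are consistent with a one-sided one-proportion z-test against the null value of 80% accuracy, combined with a 95% Wald interval, assuming 250 sampled assertions per model (248 and 249 judged correct, respectively). The snippet below is a reconstruction under those assumptions, not the authors' code:

```python
from math import erfc, sqrt

def proportion_test(correct: int, n: int, p0: float = 0.80):
    """One-sided z-test of H0: accuracy <= p0, plus a 95% Wald margin of error."""
    p_hat = correct / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 0.5 * erfc(z / sqrt(2))              # upper-tail normal probability
    margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)  # 95% Wald margin
    return p_hat, margin, p_value

# Assumed counts: 248/250 (Claude Sonnet 3.7) and 249/250 (GPT-4.1) correct.
for model, correct in [("Claude Sonnet 3.7", 248), ("GPT-4.1", 249)]:
    p_hat, margin, p = proportion_test(correct, 250)
    print(f"{model}: {p_hat:.1%} (±{margin:.1%}), p ≈ {p:.1e}")
```

Under these assumptions the script prints roughly 99.2% (±1.1%) with p ≈ 1.6e-14 and 99.6% (±0.8%) with p ≈ 5e-15, matching the reported magnitudes; the exact test used in the paper may differ in detail.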
Qualitative Analysis
The system's behavior was analyzed across varying expansion depths using the topic "Plato" as a case study. The resulting knowledge graphs exhibited coherent ontological structures, capturing both core philosophical concepts and biographical data. Comparative analysis with different LLMs (e.g., Grok 3) revealed both overlap and divergence in extracted concepts, highlighting the potential for ensemble approaches or cross-model validation.
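The directed-graph rendering used for this kind of inspection can be approximated in a few lines of Python; the paper's own visualization tool is not detailed here, so the networkx-based sketch below is an assumed stand-in that plots only binary relations.

```python
import re

import matplotlib.pyplot as plt
import networkx as nx

def draw(facts: list[str], filename: str = "kb.png") -> None:
    """Render binary Prolog facts (e.g. related_to(plato, theory_of_forms).)
    as a labeled directed graph; unary facts such as concept/1 are skipped."""
    g = nx.DiGraph()
    for fact in facts:
        m = re.match(r"(\w+)\((\w+),\s*(\w+)\)\.", fact)
        if m:
            relation, source, target = m.groups()
            g.add_edge(source, target, label=relation)
    pos = nx.spring_layout(g, seed=0)  # deterministic layout for reproducible figures
    nx.draw_networkx(g, pos, node_color="lightgray", font_size=8)
    nx.draw_networkx_edge_labels(
        g, pos, edge_labels=nx.get_edge_attributes(g, "label"), font_size=7
    )
    plt.axis("off")
    plt.savefig(filename, dpi=150)
```

Combined with the earlier sketch, draw(expand("Plato", breadth=5, depth=2, ask_llm=my_client)) would produce the kind of graph inspected in the qualitative analysis.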
Executability
All generated Prolog files were successfully loaded and queried in SWI-Prolog, confirming the syntactic and semantic robustness of the pipeline across diverse topics and LLMs.
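A minimal way to reproduce such a check is to consult a generated file with SWI-Prolog from a script. The knowledge-base fragment below is an illustrative stand-in for pipeline output (loosely modeled on the Plato example), and the invocation assumes a standard swipl binary on the PATH:

```python
import subprocess
import tempfile

# Illustrative knowledge-base fragment; actual pipeline output will differ.
KB = """\
concept(plato).
concept(theory_of_forms).
concept(the_republic).
related_to(plato, theory_of_forms).
written_by(the_republic, plato).
"""

with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
    f.write(KB)
    path = f.name

# Consult the file and run a sample query: -q suppresses the banner,
# -g executes a goal after loading, -t halt exits instead of starting a toplevel.
result = subprocess.run(
    ["swipl", "-q", "-g", "forall(written_by(W, plato), writeln(W))", "-t", "halt", path],
    capture_output=True, text=True,
)
print("exit code:", result.returncode)  # 0 if the file loaded and the goal succeeded
print(result.stdout.strip())            # expected: the_republic
```

A non-zero exit code or a loader warning would flag a file that fails to parse, which is precisely the property the executability check is meant to guarantee.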
Implications and Future Directions
The hybrid methodology outlined in this work has several significant implications:
- Practical Deployment in Sensitive Domains: By combining the recall and generative breadth of LLMs with the precision and transparency of symbolic systems, the approach is well-suited for applications in medicine, law, education, and other fields where explainability and verifiability are non-negotiable.
- Human-AI Collaboration: The symbolic encoding facilitates expert oversight, enabling iterative refinement and correction of the knowledge base, and supporting deterministic, auditable inference.
- Scalability and Modularity: The system's parameterization allows for flexible adaptation to different domains and knowledge granularities, supporting both lightweight and richly structured expert systems.
- Mitigation of LLM Hallucinations: By constraining the domain and enforcing human validation, the approach directly addresses the epistemic limitations of LLMs, providing a practical pathway to trustworthy AI.
Potential avenues for future research include:
- Cross-Model Knowledge Aggregation: Quantifying the incremental value and diversity of knowledge extracted from different LLMs, and developing methods for entropy-based selection or fusion.
- Automated Fact-Checking and Correction: Integrating external knowledge sources and automated verification tools to further reduce the human validation burden.
- Extension to Other Symbolic Formalisms: Adapting the pipeline to support alternative logic programming languages or ontological frameworks, broadening applicability.
Conclusion
This work demonstrates that the integration of LLMs and symbolic reasoning systems is not only feasible but also effective for constructing reliable, interpretable expert systems. The empirical results—showing factual accuracy exceeding 99% after expert validation—underscore the viability of this hybrid paradigm. The methodology provides a scalable, transparent, and domain-adaptable framework for deploying AI in contexts where correctness and explainability are essential, and sets a foundation for further advances in the intersection of generative and symbolic AI.