
GOFAI meets Generative AI: Development of Expert Systems by means of Large Language Models (2507.13550v1)

Published 17 Jul 2025 in cs.AI, cs.CL, and cs.SC

Abstract: The development of LLMs has successfully transformed knowledge-based systems such as open domain question answering, which can automatically produce vast amounts of seemingly coherent information. Yet, those models have several disadvantages like hallucinations or confident generation of incorrect or unverifiable facts. In this paper, we introduce a new approach to the development of expert systems using LLMs in a controlled and transparent way. By limiting the domain and employing a well-structured prompt-based extraction approach, we produce a symbolic representation of knowledge in Prolog, which can be validated and corrected by human experts. This approach also guarantees interpretability, scalability and reliability of the developed expert systems. Via quantitative and qualitative experiments with Claude Sonnet 3.7 and GPT-4.1, we show strong adherence to facts and semantic coherence on our generated knowledge bases. We present a transparent hybrid solution that combines the recall capacity of LLMs with the precision of symbolic systems, thereby laying the foundation for dependable AI applications in sensitive domains.

Summary

  • The paper presents a hybrid methodology that integrates LLMs with symbolic reasoning to construct verifiable expert systems.
  • It introduces a recursive prompt-chaining process that converts JSON responses into Prolog facts and rules, ensuring transparency.
  • Empirical evaluation shows factual accuracy over 99% across diverse domains, demonstrating reliable and explainable AI outputs.

Integrating LLMs and Symbolic Reasoning: A Hybrid Approach to Expert System Construction

The paper "GOFAI meets Generative AI: Development of Expert Systems by means of LLMs" (2507.13550) presents a systematic methodology for constructing expert systems by leveraging the generative capabilities of LLMs and encoding their outputs into symbolic, logic-based representations. The approach is motivated by the persistent challenge of hallucinations and unverifiable outputs in LLMs, particularly in domains where factual accuracy and explainability are paramount.

Methodological Framework

The core contribution is a pipeline that extracts structured, domain-constrained knowledge from LLMs using recursive, prompt-based querying. The extracted knowledge is formalized as Prolog facts and rules, enabling transparent, interpretable, and verifiable expert systems. The process is parameterized by two hyperparameters: horizontal breadth (number of related concepts per node) and vertical depth (maximum expansion depth in the conceptual graph). The system operates as follows (minimal code sketches of these steps appear after the list):

  • Prompt Chaining and Extraction: For a given root concept, the system recursively queries the LLM for semantically related concepts and relations, using structured prompts tailored to the domain.
  • Symbolic Encoding: The LLM's JSON-formatted responses are parsed and translated into Prolog predicates, with both core (e.g., concept/1, related_to/2, implies/2, causes/2) and domain-specific relations (e.g., written_by/2, developed_by/2).
  • Deduplication and Validation: Lexical normalization and hashing ensure uniqueness of facts. Syntax validation is performed using SWI-Prolog, guaranteeing that the generated knowledge bases are executable.
  • Human-in-the-Loop Verification: The symbolic representation allows domain experts to audit, correct, and extend the knowledge base, mitigating the risk of LLM hallucinations.
  • Visualization: An automated tool renders the Prolog knowledge base as a directed graph, supporting qualitative inspection and structural validation.
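
The recursive extraction step can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the paper does not publish its implementation, `query_llm` stands in for whatever chat-completion client is used, and the prompt wording is invented to match the JSON structure described above.

```python
import json

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around a Claude/GPT client; returns raw JSON text."""
    raise NotImplementedError("plug in an actual LLM client here")

PROMPT_TEMPLATE = (
    "You are building a knowledge base for the domain '{domain}'.\n"
    "List up to {breadth} concepts directly related to '{concept}', with the "
    "relation that links them.\n"
    'Answer only with JSON of the form '
    '{{"relations": [{{"predicate": "...", "subject": "...", "object": "..."}}]}}'
)

def extract(root: str, domain: str, breadth: int = 5, depth: int = 2):
    """Expand the conceptual graph around `root`, bounded by horizontal
    breadth (relations per node) and vertical depth (expansion levels)."""
    frontier, triples, seen = [(root, 0)], [], {root}
    while frontier:
        concept, level = frontier.pop(0)
        if level >= depth:                       # vertical depth bound
            continue
        reply = query_llm(PROMPT_TEMPLATE.format(
            domain=domain, breadth=breadth, concept=concept))
        for rel in json.loads(reply)["relations"][:breadth]:  # horizontal breadth bound
            triples.append((rel["predicate"], rel["subject"], rel["object"]))
            if rel["object"] not in seen:
                seen.add(rel["object"])
                frontier.append((rel["object"], level + 1))
    return triples
```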
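
The symbolic encoding and deduplication steps can be sketched in the same spirit: each extracted triple becomes one Prolog fact, terms are lexically normalized into valid atoms, and a hash of the normalized fact filters duplicates. The specific normalization rules below are assumptions; the paper only states that lexical normalization and hashing are applied.

```python
import hashlib
import re

def normalize(term: str) -> str:
    """Lowercase, strip punctuation, and replace whitespace with underscores
    so the term is a valid Prolog atom (assumed normalization scheme)."""
    term = re.sub(r"[^\w\s]", "", term.strip().lower())
    return re.sub(r"\s+", "_", term)

def to_prolog(triples):
    """Translate (predicate, subject, object) triples into unique Prolog facts."""
    facts, seen_hashes = [], set()
    for pred, subj, obj in triples:
        fact = f"{normalize(pred)}({normalize(subj)}, {normalize(obj)})."
        h = hashlib.sha1(fact.encode()).hexdigest()
        if h not in seen_hashes:                 # hash-based deduplication
            seen_hashes.add(h)
            facts.append(fact)
    return facts

# Example: ("written_by", "The Republic", "Plato") becomes
#   written_by(the_republic, plato).
print("\n".join(to_prolog([("written_by", "The Republic", "Plato")])))
```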

Experimental Evaluation

The system was empirically evaluated using Claude Sonnet 3.7 and GPT-4.1, focusing on both factual accuracy and structural expressiveness.

Quantitative Verification

A statistical hypothesis test was conducted on 250 randomly sampled Prolog assertions across 25 topics in history, literature, and philosophy. The results are as follows:

  • Claude Sonnet 3.7: Achieved 99.2% factual accuracy (±1.1%), with a p-value of 1.6e-14, allowing strong rejection of the null hypothesis that accuracy is ≤80%.
  • GPT-4.1: Achieved 99.6% factual accuracy (±0.8%), with a p-value of 4.9e-15. No statistically significant difference was found between the two models (p=0.56).

These results demonstrate that, when constrained by domain and validated by experts, LLM-derived knowledge bases can achieve high factual reliability.
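
The paper's exact test construction is not restated here, but a one-sided exact binomial test against the null hypothesis of accuracy ≤ 80% can be run as follows; the counts are illustrative placeholders rather than the paper's raw data.

```python
from scipy.stats import binomtest

n, k = 250, 248  # illustrative: k verified-correct assertions out of n sampled

# H0: true accuracy <= 0.80 vs. H1: accuracy > 0.80 (one-sided exact test)
result = binomtest(k, n, p=0.80, alternative="greater")
print(f"observed accuracy = {k / n:.3f}, p-value = {result.pvalue:.2e}")
```

A vanishingly small p-value licenses rejecting the 80% null, which is the shape of the argument made for both models above.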

Qualitative Analysis

The system's behavior was analyzed across varying expansion depths using the topic "Plato" as a case study. The resulting knowledge graphs exhibited coherent ontological structures, capturing both core philosophical concepts and biographical data. Comparative analysis with different LLMs (e.g., Grok 3) revealed both overlap and divergence in extracted concepts, highlighting the potential for ensemble approaches or cross-model validation.

Executability

All generated Prolog files were successfully loaded and queried in SWI-Prolog, confirming the syntactic and semantic robustness of the pipeline across diverse topics and LLMs.
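
Such an executability check can be scripted, for instance by loading a generated file into SWI-Prolog from the command line. The file name `kb.pl` and the query over `related_to/2` below are assumptions chosen to match the predicates mentioned earlier; `swipl` must be on the PATH.

```python
import subprocess

# Quietly load kb.pl, print every X with related_to(plato, X), then halt.
proc = subprocess.run(
    ["swipl", "-q", "-l", "kb.pl",
     "-g", "forall(related_to(plato, X), (write(X), nl))",
     "-t", "halt"],
    capture_output=True, text=True)

print(proc.stdout)
if proc.returncode != 0:
    print("kb.pl failed to load or the query raised an error:", proc.stderr)
```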

Implications and Future Directions

The hybrid methodology outlined in this work has several significant implications:

  • Practical Deployment in Sensitive Domains: By combining the recall and generative breadth of LLMs with the precision and transparency of symbolic systems, the approach is well-suited for applications in medicine, law, education, and other fields where explainability and verifiability are non-negotiable.
  • Human-AI Collaboration: The symbolic encoding facilitates expert oversight, enabling iterative refinement and correction of the knowledge base, and supporting deterministic, auditable inference.
  • Scalability and Modularity: The system's parameterization allows for flexible adaptation to different domains and knowledge granularities, supporting both lightweight and richly structured expert systems.
  • Mitigation of LLM Hallucinations: By constraining the domain and enforcing human validation, the approach directly addresses the epistemic limitations of LLMs, providing a practical pathway to trustworthy AI.

Potential avenues for future research include:

  • Cross-Model Knowledge Aggregation: Quantifying the incremental value and diversity of knowledge extracted from different LLMs, and developing methods for entropy-based selection or fusion.
  • Automated Fact-Checking and Correction: Integrating external knowledge sources and automated verification tools to further reduce the human validation burden.
  • Extension to Other Symbolic Formalisms: Adapting the pipeline to support alternative logic programming languages or ontological frameworks, broadening applicability.

Conclusion

This work demonstrates that the integration of LLMs and symbolic reasoning systems is not only feasible but also effective for constructing reliable, interpretable expert systems. The empirical results—showing factual accuracy exceeding 99% after expert validation—underscore the viability of this hybrid paradigm. The methodology provides a scalable, transparent, and domain-adaptable framework for deploying AI in contexts where correctness and explainability are essential, and sets a foundation for further advances in the intersection of generative and symbolic AI.
