
ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models

Published 13 Oct 2023 in cs.CL and cs.AI | (2310.08975v3)

Abstract: Knowledge Base Question Answering (KBQA) aims to answer natural language questions over large-scale knowledge bases (KBs), which can be summarized into two crucial steps: knowledge retrieval and semantic parsing. However, three core challenges remain: inefficient knowledge retrieval, mistakes of retrieval adversely impacting semantic parsing, and the complexity of previous KBQA methods. To tackle these challenges, we introduce ChatKBQA, a novel and simple generate-then-retrieve KBQA framework, which proposes first generating the logical form with fine-tuned LLMs, then retrieving and replacing entities and relations with an unsupervised retrieval method, to improve both generation and retrieval more directly. Experimental results show that ChatKBQA achieves new state-of-the-art performance on standard KBQA datasets, WebQSP, and CWQ. This work can also be regarded as a new paradigm for combining LLMs with knowledge graphs (KGs) for interpretable and knowledge-required question answering. Our code is publicly available.

Citations (25)

Summary

  • The paper introduces a generate-then-retrieve paradigm that leverages fine-tuned LLMs to significantly improve KBQA accuracy and reduce error propagation.
  • By generating logical forms prior to retrieval, the framework avoids noisy upstream extraction and simplifies the semantic parsing process.
  • Experimental results on WebQSP and ComplexWebQuestions demonstrate state-of-the-art F1 scores and robust logical form alignment with KB entities.

ChatKBQA: A Generate-then-Retrieve KBQA Framework with Fine-Tuned LLMs

Problem Formulation and Motivation

Knowledge Base Question Answering (KBQA) is concerned with answering natural language queries using large-scale structured knowledge bases (KBs), typically organized as knowledge graphs (KGs). The standard KBQA workflow is bifurcated into knowledge retrieval (identifying relevant entities, relations, and triples) and semantic parsing (translating questions into executable logical forms, e.g., S-expressions or SPARQL queries). Traditionally, most approaches enforce a retrieve-then-generate pipeline, where the retrieved knowledge elements drive semantic parsing. However, these suffer from three key limitations: poor retrieval efficiency, error propagation from noisy retrieval into parsing, and increased architectural complexity due to cascaded subtasks.
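To make the parsing target concrete, the following is a minimal illustration of what an S-expression logical form and a roughly equivalent SPARQL query might look like for a toy question. The entity ID and relation name below are hypothetical Freebase-style identifiers chosen for illustration, not examples taken from the paper.

```python
# Illustrative only: "m.012abc" and "people.person.spouse_s" are
# hypothetical Freebase-style identifiers, not drawn from the paper.
question = "Who is the spouse of Barack Obama?"

# S-expression logical form: a function-style composition over KB relations.
s_expression = "(JOIN (R people.person.spouse_s) m.012abc)"

# A roughly equivalent SPARQL query over a Freebase-style graph.
sparql = """
SELECT DISTINCT ?x WHERE {
  ns:m.012abc ns:people.person.spouse_s ?x .
}
"""
print(s_expression)
```

Executing the SPARQL form against the KB would return the answer entity bound to `?x`; the S-expression is the intermediate representation the LLM is trained to produce.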

The ChatKBQA framework directly addresses these challenges by reversing the standard paradigm. Adopting a generate-then-retrieve approach, it exploits the capacity of fine-tuned LLMs to predict logical forms without dependency on upstream retrieval. This results in streamlined KBQA pipelines, improved retrieval efficacy, and more robust logical form generation.

ChatKBQA Architecture

ChatKBQA comprises four phases:

  1. Instruction-based LLM Fine-Tuning: Open-source LLMs (e.g., Llama-2, ChatGLM2, Baichuan2) are instruction-tuned on (question, logical form) pairs. Instead of using opaque entity IDs, surface forms (natural labels) are inserted into logical forms, aligning the format with LLM pretraining data and improving parsing performance. Parameter-Efficient Fine-Tuning (PEFT) techniques—LoRA, QLoRA, P-Tuning v2, Freeze—are used for tuning, permitting rapid adaptation with minimal resource consumption.
  2. Logical Form Generation: At inference, the fine-tuned LLM produces a logical form (S-expression) for a given question, typically as a "skeleton" with labeled placeholders rather than actual KB entity and relation identifiers. Empirically, the EM (exact match) between generated logical forms and ground-truth logical forms reached 63%. Beam search increased ground-truth inclusion to 74%, and when assessed on skeleton-level correspondences, it exceeded 91%.
  3. Unsupervised Retrieval and Alignment: Given the generated logical form, an unsupervised semantic retrieval process (e.g., using SimCSE, Contriever, or BM25) matches the placeholder entity/relation mentions in the candidate logical forms to entries in the KB's entity/relation sets. This occurs via phrase-level semantic similarity, with flexible top-k and thresholding strategies to ensure high-confidence alignments. This phase involves permutation, re-ranking, and replacement, with execution attempted sequentially over the logical form variants produced.
  4. Interpretable Query Execution: The resulting logical forms, which are now fully instantiated with KB-compatible entity and relation identifiers, are converted to SPARQL and executed against the KB. The first valid executable query provides the final answer and a fully interpretable reasoning trace.
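The four phases above can be sketched end to end as follows. This is a toy illustration, not the paper's implementation: the `skeleton` dict stands in for a fine-tuned LLM's output, the tiny KB vocabularies are invented, and `difflib.SequenceMatcher` replaces the unsupervised embedding models (SimCSE/Contriever/BM25) used for phrase-level similarity.

```python
from difflib import SequenceMatcher

# Hypothetical miniature KB vocabularies (the Freebase-style IDs are
# illustrative stand-ins, not taken from the paper).
KB_ENTITIES = {"Barack Obama": "m.02mjmr", "Michelle Obama": "m.025s5v9"}
KB_RELATIONS = ["people.person.spouse_s", "people.person.children"]

def best_match(mention, candidates):
    """Phrase-level similarity match; a crude stand-in for unsupervised
    retrieval with SimCSE/Contriever/BM25 embeddings."""
    return max(candidates,
               key=lambda c: SequenceMatcher(None, mention.lower(),
                                             c.lower()).ratio())

# Phase 2: the fine-tuned LLM emits a skeleton with surface-form
# placeholders (possibly noisy, e.g., a misspelled entity mention).
skeleton = {"relation": "person spouse", "entity": "Barak Obama"}

# Phase 3: retrieve and replace placeholders with KB identifiers.
relation = best_match(skeleton["relation"], KB_RELATIONS)
entity_label = best_match(skeleton["entity"], list(KB_ENTITIES))
logical_form = f"(JOIN (R {relation}) {KB_ENTITIES[entity_label]})"

# Phase 4: the instantiated form would be converted to SPARQL and
# executed against the KB; the first valid query yields the answer.
print(logical_form)
```

Note how the noisy mention "Barak Obama" still aligns to the correct KB entity via similarity, which is the property that lets generation precede retrieval without a brittle exact-match step.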

Experimental Results

Comprehensive experiments were conducted on WebQSP and ComplexWebQuestions (CWQ)—canonical KBQA benchmarks built atop Freebase. ChatKBQA was systematically compared to recent KBQA systems spanning IR-based, SP-based, and hybrid methods, including RnG-KBQA, DECAF, GMT-KBQA, TIARA, FC-KBQA, and several methods leveraging open models and instruction tuning.

ChatKBQA achieved state-of-the-art results by a significant margin: on WebQSP, F1 rose to 79.8 (non-oracle) and 83.5 (oracle entities), with Hits@1 and strict accuracy commensurately superior. On CWQ, F1 rose to 77.8 (non-oracle) and 81.3 (oracle entities). This represents a 4–16 point improvement over previously leading systems across several metrics, especially on the challenging multi-hop CWQ dataset.

Ablation studies verified:

  • Logical form skeleton generation quality scales with training data volume.
  • Retrieval post-generation (rather than pre-generation) delivers superior downstream match, particularly as retrieval recall/precision increases with larger beams.
  • Removing the entity retrieval (ER) or relation retrieval (RR) stages led to measurable performance drops, with ER being more critical.

Generate-then-Retrieve vs. Retrieve-then-Generate

A key empirical finding is that retrieval prior to generation (the retrieve-then-generate methodology) degrades performance due to two factors: (1) erroneous retrieval "poisons" semantic parsing, and (2) the need to encode large sets of retrieved triples expands context length, overwhelming LLMs and inducing catastrophic forgetting. In contrast, the generate-then-retrieve approach leverages the LLM's latent schema knowledge for structure prediction, then aligns entities/relations after the fact—both more efficient and more robust, as shown by higher EM and skeleton match ratios.

Model Selection, Plug-and-Play Extensibility, and Retrieval Efficiency

ChatKBQA is demonstrated to be model-agnostic in both its LLM and retrieval components. Tuning Llama-2, Baichuan2, ChatGLM2, etc., or swapping SimCSE for Contriever/BM25 as the retrieval backend, yielded only modest performance variation. Efficient tuning mechanisms (LoRA, QLoRA) proved particularly beneficial, enabling the deployment of powerful 13B-parameter LLMs on single-node hardware.

Retrieval from generated logical forms (AG-R) produced both higher semantic similarity on relevant entities/relations and reduced solution space—contrasting sharply with traditional natural language retrieval (NL-R), which was hampered by ambiguity and the need for discrete entity/relation detection.
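The AG-R vs. NL-R contrast can be sketched with a small example. This is a simplification under stated assumptions: the relation names are hypothetical, and `difflib.SequenceMatcher` again stands in for the embedding-based similarity the paper's retrieval modules would compute. The point is that matching against a single relation slot extracted from the generated logical form is far more focused than matching against the entire question.

```python
from difflib import SequenceMatcher

def sim(a, b):
    """String-similarity stand-in for embedding cosine similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical Freebase-style relation candidates (illustrative only).
relations = ["people.person.spouse_s", "people.person.place_of_birth"]

# NL-R: match the whole question against candidates. The question
# mentions both "spouse" and "born", so the signal is ambiguous.
question = "where was the spouse of the president born"
nl_scores = {r: sim(question, r) for r in relations}

# AG-R: match only the relation slot the LLM placed at this position
# in the logical form, which isolates the intended relation.
slot = "place of birth"
ag_scores = {r: sim(slot, r) for r in relations}

print(max(ag_scores, key=ag_scores.get))
```

Under AG-R the slot-level match cleanly prefers the birth-place relation, whereas the whole-question scores stay close together—mirroring the paper's observation that retrieval from generated logical forms shrinks the solution space relative to natural language retrieval.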

Implications, Theoretical Significance, and Future Directions

The ChatKBQA results validate several broad hypotheses about model-architecture co-design in hybrid neural-symbolic systems. By decoupling the symbolic and retrieval subtasks, the framework not only improves KBQA accuracy and robustness but also yields interpretable and executable outputs, addressing transparency desiderata in neuro-symbolic QA. The flexibility to update, exchange, and upgrade the LLM, tuning, and retrieval modules independently positions ChatKBQA as a platform for rapid prototyping and further progress in the KG-augmented LLM landscape.

The generate-then-retrieve principle may inspire similar architectures in other knowledge-grounded neural reasoning domains, including program synthesis, code retrieval, and multimodal KGQA. The modularity and direct LLM involvement in producing logical forms also facilitate integration with advanced KG-enhanced prompting, in-context learning, and continual learning protocols.

Conclusion

ChatKBQA introduces a paradigm shift in KBQA, empirically demonstrating that generation-first logical form prediction, combined with unsupervised semantic retrieval, enables substantial advances in accuracy, efficiency, and architectural simplicity. This work establishes new performance bounds for interpretable, KG-grounded question answering, while providing a modular, extensible framework for LLM-based symbolic reasoning research (2310.08975).
