- The paper introduces a generate-then-retrieve paradigm that leverages fine-tuned LLMs to significantly improve KBQA accuracy and reduce error propagation.
- By generating logical forms prior to retrieval, the framework avoids noisy upstream extraction and simplifies the semantic parsing process.
- Experimental results on WebQSP and ComplexWebQuestions demonstrate state-of-the-art F1 scores and robust logical form alignment with KB entities.
ChatKBQA: A Generate-then-Retrieve KBQA Framework with Fine-Tuned LLMs
Knowledge Base Question Answering (KBQA) is concerned with answering natural language questions over large-scale structured knowledge bases (KBs), typically organized as knowledge graphs (KGs). The standard KBQA workflow divides into knowledge retrieval (identifying relevant entities, relations, and triples) and semantic parsing (translating questions into executable logical forms, e.g., S-expressions or SPARQL queries). Most prior approaches follow a retrieve-then-generate pipeline, in which retrieved knowledge elements guide semantic parsing. These pipelines suffer from three key limitations: poor retrieval efficiency, error propagation from noisy retrieval into parsing, and increased architectural complexity due to cascaded subtasks.
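To make the two target formalisms concrete, a question such as "What is the capital of France?" could be expressed as an S-expression and an equivalent SPARQL query along the following lines (a schematic illustration; the entity ID shown is a placeholder in the Freebase MID style, not a verified identifier):

```
Question:      What is the capital of France?
S-expression:  (JOIN (R location.country.capital) [ France ])
SPARQL:        SELECT DISTINCT ?x WHERE { ns:m.0f8l9s ns:location.country.capital ?x . }
```

Here `(R r)` reverses relation `r`, so the S-expression denotes the set of objects reachable from the France entity via `location.country.capital`.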
The ChatKBQA framework directly addresses these challenges by reversing the standard paradigm. Adopting a generate-then-retrieve approach, it exploits the capacity of fine-tuned LLMs to predict logical forms without dependency on upstream retrieval. This results in streamlined KBQA pipelines, improved retrieval efficacy, and more robust logical form generation.
ChatKBQA Architecture
ChatKBQA comprises four phases:
- Instruction-based LLM Fine-Tuning: Open-source LLMs (e.g., Llama-2, ChatGLM2, Baichuan2) are instruction-tuned on (question, logical form) pairs. Instead of opaque entity IDs, surface forms (natural-language labels) are inserted into the logical forms, aligning the target format with the LLM's pretraining data and improving parsing performance. Parameter-Efficient Fine-Tuning (PEFT) techniques—LoRA, QLoRA, P-Tuning v2, and Freeze—are used for tuning, permitting rapid adaptation with minimal resource consumption.
- Logical Form Generation: At inference, the fine-tuned LLM produces a logical form (S-expression) for a given question, typically as a "skeleton" with labeled placeholders rather than actual KB entity and relation identifiers. Empirically, the exact-match (EM) rate between generated and ground-truth logical forms reached 63%; beam search raised ground-truth inclusion to 74%, and skeleton-level (structure-only) match exceeded 91%.
- Unsupervised Retrieval and Alignment: Given the generated logical form, an unsupervised semantic retrieval process (e.g., using SimCSE, Contriever, or BM25) matches the placeholder entity/relation mentions in the candidate logical forms to entries in the KB's entity/relation sets. This occurs via phrase-level semantic similarity, with flexible top-k and thresholding strategies to ensure high-confidence alignments. This phase involves permutation, re-ranking, and replacement, with execution attempted sequentially over the logical form variants produced.
- Interpretable Query Execution: The resulting logical forms, which are now fully instantiated with KB-compatible entity and relation identifiers, are converted to SPARQL and executed against the KB. The first valid executable query provides the final answer and a fully interpretable reasoning trace.
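Phase 1's data preparation can be sketched as follows. This is a minimal illustration, not the paper's exact template: the instruction wording and record schema are assumptions, but the key idea from the paper—placing surface forms such as `[ France ]` rather than KB IDs in the target—is preserved.

```python
import json

# Hypothetical instruction template; ChatKBQA's exact prompt wording may differ.
INSTRUCTION = "Translate the question into an S-expression logical form."

def to_record(question: str, logical_form: str) -> dict:
    """Serialize one (question, logical form) pair as an
    instruction-tuning record for PEFT fine-tuning."""
    return {
        "instruction": INSTRUCTION,
        "input": question,
        # Surface forms ("[ France ]") stand in for opaque KB IDs,
        # matching the natural-language distribution of pretraining data.
        "output": logical_form,
    }

pairs = [
    ("what is the capital of france",
     "(JOIN (R location.country.capital) [ France ])"),
]
records = [to_record(q, lf) for q, lf in pairs]
print(json.dumps(records[0], indent=2))
```

Records in this shape can be fed directly to common instruction-tuning pipelines (e.g., a LoRA adapter over a frozen base model).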
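Phases 3 and 4 can be sketched together: extract the placeholder mentions from a generated skeleton, retrieve top-k KB candidates for each, and enumerate fully instantiated logical forms best-first for sequential execution. Everything here is a toy stand-in: `difflib` string similarity replaces the paper's unsupervised semantic retrievers (SimCSE/Contriever/BM25), the two-entry `ENTITIES` map replaces Freebase, and the entity IDs are illustrative.

```python
import itertools
import re
from difflib import SequenceMatcher

# Toy label -> ID map standing in for the KB's entity vocabulary.
ENTITIES = {"France": "m.0f8l9s", "Paris": "m.05qtj"}

def top_k(mention: str, labels: dict, k: int = 2) -> list:
    """Rank KB labels by similarity to a placeholder mention.
    String similarity is a stand-in for embedding-based retrieval."""
    scored = sorted(labels, key=lambda l: SequenceMatcher(
        None, mention.lower(), l.lower()).ratio(), reverse=True)
    return [(l, labels[l]) for l in scored[:k]]

def candidates(skeleton: str, k: int = 2):
    """Yield instantiated logical forms: each "[ mention ]" placeholder
    is replaced by one of its top-k retrieved KB IDs, best-first."""
    mentions = re.findall(r"\[ (.*?) \]", skeleton)
    options = [top_k(m, ENTITIES, k) for m in mentions]
    for combo in itertools.product(*options):
        form = skeleton
        for mention, (label, kb_id) in zip(mentions, combo):
            form = form.replace(f"[ {mention} ]", kb_id, 1)
        yield form

skeleton = "(JOIN (R location.country.capital) [ France ])"
for form in candidates(skeleton):
    # Phase 4 would convert each candidate to SPARQL and execute it,
    # returning the answer of the first query that succeeds.
    print(form)
```

The best-first ordering is what makes sequential execution cheap: the highest-similarity instantiation is tried first, and lower-ranked permutations are only reached if earlier queries fail.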
Experimental Results
Comprehensive experiments were conducted on WebQSP and ComplexWebQuestions (CWQ)—canonical KBQA benchmarks built atop Freebase. ChatKBQA was systematically compared to recent KBQA systems spanning IR-based, SP-based, and hybrid methods, including RnG-KBQA, DECAF, GMT-KBQA, TIARA, FC-KBQA, and several methods leveraging open models and instruction tuning.
ChatKBQA achieved state-of-the-art results by a significant margin. On WebQSP, F1 reached 79.8 (non-oracle) and 83.5 (oracle entities), with Hits@1 and strict accuracy improving commensurately. On CWQ, F1 rose to 77.8 (non-oracle) and 81.3 (oracle entities). This represents a 4–16 point improvement over previously leading systems across several metrics, especially on the challenging multi-hop CWQ dataset.
Ablation studies verified:
- Logical form skeleton generation quality scales with training data volume.
- Retrieval post-generation (rather than pre-generation) delivers superior downstream match, particularly as retrieval recall/precision increases with larger beams.
- Removing the entity retrieval (ER) or relation retrieval (RR) stages led to measurable performance drops, with ER being more critical.
Generate-then-Retrieve vs. Retrieve-then-Generate
A key empirical finding is that retrieval prior to generation (the retrieve-then-generate methodology) degrades performance for two reasons: (1) erroneous retrieval "poisons" semantic parsing, and (2) encoding large sets of retrieved triples expands context length, overwhelming LLMs and inducing catastrophic forgetting. In contrast, the generate-then-retrieve approach leverages the LLM's latent schema knowledge for structure prediction, then aligns entities and relations after the fact—both more efficient and more robust, as shown by higher EM and skeleton match ratios.
Model Selection, Plug-and-Play Extensibility, and Retrieval Efficiency
ChatKBQA is demonstrated to be model-agnostic in both its LLM and retrieval components. Tuning Llama-2, Baichuan2, or ChatGLM2, or swapping SimCSE for Contriever or BM25 as the retrieval backend, yielded only modest performance variation. Parameter-efficient tuning mechanisms (LoRA, QLoRA) proved particularly beneficial, enabling 13B-parameter LLMs to be fine-tuned and deployed on single-node hardware.
Retrieval from generated logical forms (AG-R) produced both higher semantic similarity on relevant entities/relations and reduced solution space—contrasting sharply with traditional natural language retrieval (NL-R), which was hampered by ambiguity and the need for discrete entity/relation detection.
Implications, Theoretical Significance, and Future Directions
The ChatKBQA results validate several broad hypotheses about model-architecture co-design in hybrid neural-symbolic systems. By decoupling the symbolic and retrieval subtasks, the framework not only improves KBQA accuracy and robustness but also yields interpretable and executable outputs, addressing transparency desiderata in neuro-symbolic QA. The flexibility to update, exchange, and upgrade the LLM, tuning, and retrieval modules independently positions ChatKBQA as a platform for rapid prototyping and further progress in the KG-augmented LLM landscape.
The generate-then-retrieve principle may inspire similar architectures in other knowledge-grounded neural reasoning domains, including program synthesis, code retrieval, and multimodal KGQA. The modularity and direct LLM involvement in producing logical forms also facilitate integration with advanced KG-enhanced prompting, in-context learning, and continual learning protocols.
Conclusion
ChatKBQA introduces a paradigm shift in KBQA, empirically demonstrating that generation-first logical form prediction, combined with unsupervised semantic retrieval, enables substantial advances in accuracy, efficiency, and architectural simplicity. This work establishes new performance bounds for interpretable, KG-grounded question answering, while providing a modular, extensible framework for LLM-based symbolic reasoning research (2310.08975).