A Self-enhancement Approach for Domain-specific Chatbot Training via Knowledge Mining and Digest (2311.10614v1)

Published 17 Nov 2023 in cs.CL and cs.AI

Abstract: LLMs, despite their great power in language generation, often encounter challenges when dealing with intricate and knowledge-demanding queries in specific domains. This paper introduces a novel approach to enhance LLMs by effectively extracting relevant knowledge from domain-specific textual sources and adaptively training a chatbot on domain-specific inquiries. Our two-step approach begins by training a knowledge miner, namely LLMiner, which autonomously extracts question-answer pairs from relevant documents through a chain-of-thought reasoning process. Subsequently, we blend the mined QA pairs with a conversational dataset to fine-tune the LLM as a chatbot, thereby enriching its domain-specific expertise and conversational capabilities. We also developed a new evaluation benchmark comprising four domain-specific text corpora and associated human-crafted QA pairs for testing. Our model shows remarkable performance improvement over a generally aligned LLM and surpasses domain-adapted models directly fine-tuned on the domain corpus. In particular, LLMiner achieves this with minimal human intervention, requiring only 600 seed instances, thereby providing a pathway toward self-improvement of LLMs through model-synthesized training data.
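The two-step pipeline described in the abstract can be sketched as follows. This is a minimal illustrative stand-in, not the authors' implementation: the real LLMiner prompts an LLM with chain-of-thought reasoning to mine QA pairs, whereas the toy `mine_qa_pairs` below fakes that step with one pair per sentence; all function and field names are hypothetical.

```python
def mine_qa_pairs(document: str) -> list[dict]:
    """Step 1 (stand-in for LLMiner): extract QA pairs from a document.
    The paper uses chain-of-thought prompting of an LLM; this toy
    version just turns each sentence into a synthetic QA pair."""
    pairs = []
    for sent in filter(None, (s.strip() for s in document.split("."))):
        pairs.append({
            "question": f"What does the source say about: {sent[:40]}?",
            "answer": sent,
        })
    return pairs


def build_finetune_set(documents: list[str],
                       conversational_data: list[dict]) -> list[dict]:
    """Step 2: blend mined QA pairs with a conversational dataset,
    producing the mixture used to fine-tune the chatbot."""
    mined = [pair for doc in documents for pair in mine_qa_pairs(doc)]
    return mined + list(conversational_data)


docs = ["Transformers use attention. Attention weighs token relevance."]
chat = [{"question": "Hi!", "answer": "Hello, how can I help?"}]
dataset = build_finetune_set(docs, chat)  # 2 mined pairs + 1 chat turn
```

The blended `dataset` would then be passed to a standard supervised fine-tuning loop; the key design point from the paper is that the mined pairs supply domain knowledge while the conversational data preserves chat ability.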

Authors (9)
  1. Ruohong Zhang (11 papers)
  2. Luyu Gao (26 papers)
  3. Chen Zheng (52 papers)
  4. Zhen Fan (21 papers)
  5. Guokun Lai (16 papers)
  6. Zheng Zhang (486 papers)
  7. Fangzhou Ai (8 papers)
  8. Yiming Yang (151 papers)
  9. Hongxia Yang (130 papers)
Citations (2)