Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering (2305.11541v3)

Published 19 May 2023 in cs.CL and cs.AI

Abstract: LLM has gained popularity and achieved remarkable results in open-domain tasks, but its performance in real industrial domain-specific scenarios is average due to its lack of specific domain knowledge. This issue has attracted widespread attention, but there are few relevant benchmarks available. In this paper, we provide a benchmark Question Answering (QA) dataset named MSQA, centered around Microsoft products and IT technical problems encountered by customers. This dataset contains industry cloud-specific QA knowledge, an area not extensively covered in general LLMs, making it well-suited for evaluating methods aiming to enhance LLMs' domain-specific capabilities. In addition, we propose a new model interaction paradigm that can empower LLM to achieve better performance on domain-specific tasks where it is not proficient. Extensive experiments demonstrate that the approach following our method outperforms the commonly used LLM with retrieval methods. We make our source code and sample data available at: https://aka.ms/Microsoft_QA.

Introduction

The paper addresses the challenges LLMs face when dealing with domain-specific problems. Despite their broad knowledge and remarkable performance on open-domain tasks, these models often fall short on domain-specific question answering (QA) because specialized knowledge is underrepresented in their pretraining data. This performance gap has driven growing interest in methods that adapt and improve LLMs for such contexts.

MSQA Dataset Creation

The researchers introduce a benchmark dataset called MSQA, centered on Microsoft products and the IT technical issues customers encounter. The dataset contains 32,000 QA pairs and is designed to test and enhance LLMs' domain-specific abilities. Because this industry cloud-specific knowledge is sparsely covered by general LLMs, MSQA is well suited for evaluating methods that target industrial-domain question answering. The paper also notes the high cost of fine-tuning LLMs and the risk of data leakage, since access to domain-specific data is often limited and confidential.
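
For experimentation with the released sample data, a minimal loader might look like the sketch below; the JSONL layout and the question/answer field names are assumptions for illustration, not the dataset's documented schema.

```python
# Illustrative loader for QA pairs stored one JSON object per line.
# File name and field names are assumed, not the dataset's official format.
import json
from pathlib import Path

def load_qa_pairs(path: str):
    """Yield (question, answer) tuples from a JSON Lines file."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        yield record["question"], record["answer"]

# Example usage (hypothetical file name):
# pairs = list(load_qa_pairs("msqa_sample.jsonl"))
# print(len(pairs), pairs[0])
```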

Methodology

The proposed approach first pre-trains a smaller LLM on domain documentation to instill domain-specific knowledge. The model is then instruction-tuned with an emphasis on QA tasks so that it can apply this knowledge. At runtime, the fine-tuned domain-specific model assists the general LLM by supplying relevant domain-specific information. This interaction paradigm circumvents traditional retrieval pipelines, making it easier to preserve privacy while keeping domain knowledge up to date.
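
A minimal sketch of this two-stage interaction, assuming a Hugging Face-style domain model: the checkpoint name and the call_general_llm helper are placeholders for illustration, not the authors' released artifacts.

```python
# Sketch of the interaction paradigm: a small domain-tuned model supplies
# knowledge, and a general LLM composes the final answer from it.
# "your-org/domain-llm-msqa" and call_general_llm are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

DOMAIN_MODEL = "your-org/domain-llm-msqa"  # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(DOMAIN_MODEL)
domain_model = AutoModelForCausalLM.from_pretrained(DOMAIN_MODEL)

def domain_hint(question: str, max_new_tokens: int = 256) -> str:
    """Ask the small domain-specific model for background knowledge on the question."""
    prompt = f"Provide domain knowledge relevant to the question.\nQuestion: {question}\nKnowledge:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = domain_model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

def answer(question: str, call_general_llm) -> str:
    """Compose the final answer with a general LLM, conditioned on the domain hint."""
    hint = domain_hint(question)
    prompt = (
        "Use the following domain knowledge to answer the question.\n"
        f"Domain knowledge: {hint}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return call_general_llm(prompt)  # e.g. a thin wrapper around any chat-completion API
```

Unlike a retrieval pipeline, no document index has to be maintained; updating the domain model's weights refreshes the knowledge it can supply.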

Experiment and Results

In comprehensive experiments, the proposed model interaction paradigm outperformed traditional retrieval-based methods on both standard and newly introduced evaluation metrics. The authors also propose metrics tailored to long-form QA that align better with human evaluations. Notably, the method produces significantly more contextually accurate domain-specific answers. The researchers have released the source code and sample data to foster further research on empowering LLMs within specific industrial domains.
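
For reference, standard overlap metrics such as ROUGE and BERTScore (both cited in the paper) can be computed with the Hugging Face evaluate library, as in the hedged sketch below; the paper's own human-aligned metrics are not reproduced here, and the example strings are invented.

```python
# Scoring long-form answers with standard metrics via the `evaluate` library
# (requires the rouge_score and bert_score packages). Example strings are made up.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["Clear the cached credentials, then restart the sync service and retry."]
references = ["Restart the sync service after clearing cached credentials, then retry the sign-in."]

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```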

Authors (9)
  1. Fangkai Yang (45 papers)
  2. Pu Zhao (82 papers)
  3. Zezhong Wang (30 papers)
  4. Lu Wang (329 papers)
  5. Jue Zhang (43 papers)
  6. Mohit Garg (15 papers)
  7. Qingwei Lin (81 papers)
  8. Saravan Rajmohan (85 papers)
  9. Dongmei Zhang (193 papers)
Citations (39)