BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models (2403.18365v1)

Published 27 Mar 2024 in cs.CL

Abstract: LLMs like ChatGPT and GPT-4 are versatile and capable of addressing a diverse range of tasks. However, general LLMs, which are developed on open-domain data, may lack the domain-specific knowledge essential for tasks in vertical domains, such as legal, medical, etc. To address this issue, previous approaches either conduct continuous pre-training with domain-specific data or employ retrieval augmentation to support general LLMs. Unfortunately, these strategies are either cost-intensive or unreliable in practical applications. To this end, we present a novel framework named BLADE, which enhances Black-box LLMs with small Domain-spEcific models. BLADE consists of a black-box LLM and a small domain-specific LM. The small LM preserves domain-specific knowledge and offers specialized insights, while the general LLM contributes robust language comprehension and reasoning capabilities. Specifically, our method involves three steps: 1) pre-training the small LM with domain-specific data, 2) fine-tuning this model using knowledge instruction data, and 3) joint Bayesian optimization of the general LLM and the small LM. Extensive experiments conducted on public legal and medical benchmarks reveal that BLADE significantly outperforms existing approaches. This shows the potential of BLADE as an effective and cost-efficient solution in adapting general LLMs for vertical domains.

Enhancing General LLM Performance on Domain-Specific Tasks with BLADE

Introduction to BLADE

The rapid evolution of pretrained language models, from BERT and GPT-3 to ChatGPT and GPT-4, has significantly advanced natural language processing. Despite their ability to generalize across a wide range of tasks, these models often falter on domain-specific tasks that require specialized knowledge. BLADE (enhancing Black-box LLMs with small Domain-spEcific models) addresses this challenge. Unlike conventional approaches, which either demand costly continued pre-training and fine-tuning or rely on retrieval augmentation that can be unreliable in practice, BLADE provides a framework for injecting domain-specific knowledge without these drawbacks.

Core Components of BLADE

BLADE blends the domain-specific knowledge of a small language model (LM) with the broad reasoning abilities of a general black-box LLM through three steps:

  1. Domain-specific Pretraining (DP): The small LM is pretrained on a domain-specific corpus, infusing it with specialized knowledge of the target domain.
  2. Knowledge Instruction Tuning (KIT): After pretraining, the small LM is fine-tuned to generate precise, question-tailored knowledge, using pseudo instruction data that is produced and then refined with knowledge instructions.
  3. Bayesian Prompted Optimization (BPO): The final step uses Bayesian optimization over soft embeddings to align the knowledge produced by the small domain-specific LM with the general black-box LLM, so that the specialized knowledge is used effectively; a minimal sketch of such a gradient-free loop follows this list.
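
Because the general LLM is accessible only through forward queries, this step must work without gradients. The snippet below is a minimal sketch of that idea: a simple hill-climbing search over a low-dimensional soft embedding stands in for the paper's Bayesian optimizer, and the small_lm.generate_knowledge and black_box_llm.answer interfaces are hypothetical placeholders rather than the authors' implementation.

```python
import numpy as np

def score_embedding(z, dev_set, small_lm, black_box_llm):
    """Fraction of dev questions answered correctly when candidate soft
    embedding z is passed along with the small LM's generated knowledge."""
    correct = 0
    for question, gold in dev_set:
        knowledge = small_lm.generate_knowledge(question)          # assumed interface
        prediction = black_box_llm.answer(z, knowledge, question)  # assumed interface
        correct += int(prediction.strip() == gold)
    return correct / len(dev_set)

def optimize_soft_embedding(dev_set, small_lm, black_box_llm,
                            dim=16, n_trials=100, seed=0):
    """Search a low-dimensional soft embedding using only forward LLM calls."""
    rng = np.random.default_rng(seed)
    best_z, best_score = np.zeros(dim), -1.0
    for _ in range(n_trials):
        z = best_z + rng.normal(scale=0.02, size=dim)  # perturb the current best
        score = score_embedding(z, dev_set, small_lm, black_box_llm)
        if score > best_score:
            best_z, best_score = z, score
    return best_z, best_score
```

In practice a Bayesian optimizer or an evolutionary strategy such as CMA-ES would replace the naive perturbation loop, since each candidate evaluation costs many LLM calls.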

Together, these steps yield a framework in which detailed domain knowledge is integrated with the broad reasoning capabilities of a general LLM, as the inference-time sketch below illustrates.
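
To make the division of labor concrete, here is a minimal inference-time sketch. It assumes two generic text-completion callables, small_lm(prompt) and black_box_llm(prompt); the prompt wording and function names are illustrative rather than taken from the paper, which additionally conditions the LLM on the optimized soft embedding.

```python
def blade_answer(question: str, small_lm, black_box_llm) -> str:
    """Answer a domain question by pairing a small domain LM with a general LLM."""
    # 1) The domain-specific small LM produces question-tailored knowledge,
    #    the behavior that Knowledge Instruction Tuning trains for.
    knowledge = small_lm(
        "Provide the domain knowledge needed to answer the question.\n"
        f"Question: {question}\nKnowledge:"
    )

    # 2) The general black-box LLM reasons over that knowledge to answer.
    prompt = (
        "Use the following domain knowledge to answer the question.\n"
        f"Knowledge: {knowledge}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return black_box_llm(prompt)
```

Everything domain-specific lives in the small model, so adapting to a new vertical only requires retraining that component, not the black-box LLM.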

Performance Evaluation and Results

Empirical results demonstrate BLADE's effectiveness in the legal and medical domains, evaluated on the public benchmarks JEC-QA (legal question answering) and MLEC-QA (medical question answering).

Applied to general LLMs, BLADE consistently outperformed existing domain-adaptation methods across both legal and medical benchmarks. These results highlight the framework's ability to exploit domain-specific knowledge within the constraints of a black-box LLM.

Theoretical and Practical Implications

BLADE marks a shift in how domain adaptation for LLMs can be approached. Theoretically, it decouples domain knowledge from the general model: the domain-specific LM is not merged into or fine-tuned within the large LLM, but is coupled to it through generated knowledge and optimized soft embeddings. Practically, it offers a cost-efficient alternative to extensive pre-training or re-training of LLMs on domain-specific datasets. By relying on a small domain-specific model, BLADE reduces the risk of overfitting and the computational cost of directly adapting an LLM.

Future Prospects in LLM Domain Adaptation

The success of BLADE suggests several directions for future work: strengthening the knowledge-generation capacity of the small LM, improving the interplay between the domain-specific LM and the general LLM through more refined optimization techniques, and extending the framework to a wider range of domains and tasks to further test its adaptability and scalability.

Conclusion

BLADE represents a notable advance in integrating domain-specific knowledge into general LLMs. Its three-step framework improves LLM performance on domain-specific tasks without resource-intensive fine-tuning or unreliable retrieval augmentation, and the promising results on legal and medical benchmarks mark it as a viable approach to domain adaptation.

Authors (8)
  1. Haitao Li
  2. Qingyao Ai
  3. Jia Chen
  4. Qian Dong
  5. Zhijing Wu
  6. Yiqun Liu
  7. Chong Chen
  8. Qi Tian