From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT (2405.11040v1)
Abstract: Large language models (LLMs) have achieved remarkable progress, yet their application in specialized fields, such as medical physics, remains challenging due to the need for domain-specific knowledge. This study introduces ARCoT (Adaptable Retrieval-based Chain of Thought), a framework designed to enhance the domain-specific accuracy of LLMs without requiring fine-tuning or extensive retraining. ARCoT integrates a retrieval mechanism to access relevant domain-specific information and employs step-back and chain-of-thought prompting techniques to guide the LLM's reasoning process, ensuring more accurate and context-aware responses. Benchmarked on a medical physics multiple-choice exam, our model outperformed standard LLMs and the reported average human performance, demonstrating improvements of up to 68% and achieving a high score of 90%. This method reduces hallucinations and increases domain-specific performance. The versatility and model-agnostic nature of ARCoT make it easily adaptable to various domains, showcasing its significant potential for enhancing the accuracy and reliability of LLMs in specialized fields.
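To make the pipeline described in the abstract concrete, here is a minimal sketch of an ARCoT-style flow: retrieve a few domain passages, wrap the question in a step-back (abstraction) instruction plus a chain-of-thought cue, and send the combined prompt to an LLM. This is not the authors' implementation; the names (`retrieve`, `build_arcot_prompt`, `generate`), the toy corpus, and the bag-of-words similarity are illustrative placeholders, and `generate` is a stub standing in for a real model API call.

```python
# Sketch of an ARCoT-style pipeline: retrieval + step-back + chain-of-thought prompting.
# All identifiers below are hypothetical; swap `generate` for an actual LLM call and the
# bag-of-words retriever for a proper embedding-based retriever in a real system.

from collections import Counter
from math import sqrt

# Toy stand-in for a medical-physics reference corpus.
CORPUS = [
    "The linear attenuation coefficient describes photon beam attenuation per unit path length.",
    "TG-51 is the AAPM protocol for clinical reference dosimetry of external photon and electron beams.",
    "The half-value layer (HVL) is the material thickness that reduces beam intensity by half.",
]

def _bow(text: str) -> Counter:
    """Bag-of-words vector; a production system would use dense embeddings instead."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the question."""
    q = _bow(question)
    return sorted(CORPUS, key=lambda doc: _cosine(q, _bow(doc)), reverse=True)[:k]

def build_arcot_prompt(question: str, context: list[str]) -> str:
    """Combine a step-back instruction, retrieved context, and a chain-of-thought cue."""
    ctx = "\n".join(f"- {passage}" for passage in context)
    return (
        "First, state the general physics principle behind this question (step back).\n"
        f"Relevant reference material:\n{ctx}\n"
        f"Question: {question}\n"
        "Then reason step by step before giving the final answer."
    )

def generate(prompt: str) -> str:
    """Placeholder for an LLM API call; returns a dummy string so the sketch runs."""
    return f"[LLM response to a {len(prompt)}-character ARCoT prompt]"

if __name__ == "__main__":
    question = "What thickness of material reduces a photon beam's intensity by half?"
    answer = generate(build_arcot_prompt(question, retrieve(question)))
    print(answer)
```

Because the retrieval, prompting, and generation steps are decoupled, the same scaffold can be pointed at a different corpus or a different underlying model, which is the model-agnostic property the abstract emphasizes.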
- Jace Grandinetti
- Rafe Mcbeth