Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine (2311.16452v1)

Published 28 Nov 2023 in cs.CL

Abstract: Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.

In the field of artificial intelligence, there is a growing interest in whether AI models that are not specifically trained for a particular domain can perform as well as, or even outperform, those that are. A notable area of this investigation is the medical domain, where reliable and accurate information is critical.

The paper shows that a generalist model like GPT-4 can exhibit expertise in a specialized domain such as medicine through innovative "prompt engineering": the practice of carefully designing queries to elicit the most knowledgeable and specific responses from a model.

The authors report that GPT-4 achieves state-of-the-art results on medical question-answering benchmarks, answering medical exam-style questions, akin to those on the United States Medical Licensing Examination (USMLE), with remarkably high accuracy. These results are surprising because it has traditionally been assumed that models like GPT-4 cannot demonstrate specialist capabilities without either being fine-tuned on domain-specific data or being given prompts crafted by domain experts.
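The abstract's headline numbers fit together arithmetically: a 27% relative reduction in error rate over a prior best accuracy $a$ yields a new error of roughly $(1 - 0.27)(1 - a)$. As a rough check, assuming a prior specialist best of about 86.5% on MedQA (the score commonly cited for Med-PaLM 2; this figure is an assumption here, not quoted above):

$$
\text{new error} \approx 0.73 \times (1 - 0.865) \approx 0.099,
\qquad
\text{new accuracy} \approx 1 - 0.099 \approx 90.1\%,
$$

which is consistent with the claim that GPT-4 with Medprompt "surpasses a score of 90%".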

The paper introduces a novel prompting strategy called Medprompt, which combines three components. First, for each test question, the model retrieves the training examples most similar to the question at hand (dynamic few-shot selection). Then, GPT-4 itself generates the chain-of-thought reasoning steps for those examples, leading logically to an answer; these model-generated rationales proved more effective than reasoning chains crafted by human medical experts. Finally, the model's answers are aggregated with choice-shuffling ensembling, a method that reshuffles the answer choices across several runs and takes a majority vote, counteracting position bias and making the output more consistent.
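A minimal sketch of these three steps is given below. It is an illustration of the pipeline as summarized above, not the authors' implementation: `embed_text` and `llm_complete` are hypothetical placeholders for an embedding model and a GPT-4-style completion call, and the prompt wording, `k`, and `n_votes` values are arbitrary illustrative choices.

```python
# Sketch of the three Medprompt components described above:
# (1) kNN-based dynamic few-shot selection, (2) model-generated chain-of-thought
# exemplars, and (3) choice-shuffle ensembling with a majority vote.
# `embed_text` and `llm_complete` are hypothetical stand-ins, not a real API.

import random
from collections import Counter

import numpy as np


def embed_text(text: str) -> np.ndarray:
    """Hypothetical embedding call; replace with a real embedding model."""
    raise NotImplementedError


def llm_complete(prompt: str) -> str:
    """Hypothetical GPT-4-style completion call; replace with a real client."""
    raise NotImplementedError


def select_few_shot(question: str, train_set: list[dict], k: int = 5) -> list[dict]:
    """Pick the k training questions most similar to the test question (cosine similarity)."""
    q_vec = embed_text(question)
    scored = []
    for ex in train_set:
        v = embed_text(ex["question"])
        sim = float(np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v)))
        scored.append((sim, ex))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]


def generate_cot(example: dict) -> str:
    """Ask the model itself to write the reasoning chain for a training example."""
    prompt = (
        f"Question: {example['question']}\nChoices: {example['choices']}\n"
        "Explain step by step, then state the answer."
    )
    return llm_complete(prompt)


def answer_with_shuffle_ensemble(question: str, choices: list[str],
                                 few_shot: list[dict], n_votes: int = 5) -> str:
    """Run the prompt several times with shuffled answer options and majority-vote."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nReasoning: {generate_cot(ex)}" for ex in few_shot
    )
    votes = []
    for _ in range(n_votes):
        shuffled = random.sample(choices, len(choices))  # re-order to counter position bias
        prompt = (
            f"{shots}\n\nQuestion: {question}\nChoices: {shuffled}\n"
            "Reason step by step, then answer with the text of one choice."
        )
        reply = llm_complete(prompt)
        # Map the free-text reply back to the original (unshuffled) choice it names.
        votes.append(next((c for c in choices if c.lower() in reply.lower()), shuffled[0]))
    return Counter(votes).most_common(1)[0][0]
```

The shuffle-and-vote step is what addresses position bias in multiple-choice answering: if the model favors a particular option slot, reshuffling the options across runs decorrelates that bias from the final majority vote.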

The remarkable aspect of the findings is not only GPT-4's high performance in the medical field but also that the prompting methods used are general-purpose and not designed specifically for medicine. Further tests showed that the same methods apply to other domains, including electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology, improving over basic zero-shot querying.

Despite its groundbreaking results, the paper emphasizes caution. Strong benchmark performance by models trained on massive, indiscriminately sourced data may not translate directly into real-world applicability or accuracy. There also remains the risk of the model providing incorrect information, or 'hallucinations', a risk that better prompting strategies might even amplify; this is a critical consideration in high-stakes fields such as medicine.

In conclusion, the paper offers an intriguing insight into the capabilities of generalist AI models. With sophisticated prompting strategies, GPT-4's latent specialist knowledge can be unlocked, potentially reducing or even eliminating the need for expensive domain-specific training or the manual crafting of prompts by specialists. The methodology has broad implications for applying AI across many fields beyond medicine. However, deploying such techniques requires careful consideration and thorough examination of their limitations, especially in contexts where accuracy is of utmost concern.

Authors (18)
  1. Harsha Nori (23 papers)
  2. Yin Tat Lee (102 papers)
  3. Sheng Zhang (212 papers)
  4. Dean Carignan (4 papers)
  5. Richard Edgar (7 papers)
  6. Nicolo Fusi (26 papers)
  7. Nicholas King (4 papers)
  8. Jonathan Larson (23 papers)
  9. Yuanzhi Li (119 papers)
  10. Weishung Liu (3 papers)
  11. Renqian Luo (19 papers)
  12. Scott Mayer McKinney (8 papers)
  13. Robert Osazuwa Ness (6 papers)
  14. Hoifung Poon (61 papers)
  15. Tao Qin (201 papers)
  16. Naoto Usuyama (22 papers)
  17. Chris White (7 papers)
  18. Eric Horvitz (76 papers)
Citations (231)