Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains (2311.14966v1)
Abstract: High-risk domains pose unique challenges that require LLMs to provide accurate and safe responses. Despite the great success of LLMs, such as ChatGPT and its variants, their performance in high-risk domains remains unclear. Our study delves into an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. To comprehensively assess the capabilities of LLMs, we conduct experiments on six NLP datasets including question answering and summarization tasks within two high-risk domains: legal and medical. Further qualitative analysis highlights the existing limitations inherent in current LLMs when evaluating in high-risk domains. This underscores the essential nature of not only improving LLM capabilities but also prioritizing the refinement of domain-specific metrics, and embracing a more human-centric approach to enhance safety and factual reliability. Our findings advance the field toward the concerns of properly evaluating LLMs in high-risk domains, aiming to steer the adaptability of LLMs in fulfilling societal obligations and aligning with forthcoming regulations, such as the EU AI Act.
- Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo. https://github.com/nomic-ai/gpt4all. Accessed: 2023-06-03.
- Ai chatbots not yet ready for clinical use. Frontiers in Digital Health, 5:60.
- Improving language models by retrieving from trillions of tokens. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 2206–2240. PMLR.
- A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109.
- Next steps for human-centered generative ai: A technical perspective. arXiv preprint arXiv:2306.15774.
- Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
- Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092.
- Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.
- On the limitations of reference-free evaluations of generated text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10960–10977, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- SafetyKit: First aid for measuring safety in open-domain conversational systems. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4113–4133, Dublin, Ireland. Association for Computational Linguistics.
- Calibrating factual knowledge in pretrained language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5937–5947, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- European Commission. 2021. Proposal for a REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL LAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE (ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION LEGISLATIVE ACTS, COM/2021/206 final. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206; https://eur-lex.europa.eu/resource.html?uri=cellar:e0649735-a372-11eb-9585-01aa75ed71a1.0001.02/DOC_1&format=PDF; https://eur-lex.europa.eu/resource.html?uri=cellar:e0649735-a372-11eb-9585-01aa75ed71a1.0001.02/DOC_2&format=PDF. Accessed: 2023-06-22.
- QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601, Seattle, United States. Association for Computational Linguistics.
- Datasheets for datasets. Communications of the ACM, 64(12):86–92.
- Dr. llama: Improving small language models in domain-specific qa via generative data augmentation. arXiv preprint arXiv:2305.07804.
- InstructDial: Improving zero and few-shot generalization in dialogue through instruction tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 505–525, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
- Laura Hanu and Unitary team. 2020. Detoxify. https://github.com/unitaryai/detoxify. Accessed: 2023-06-15.
- Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset. Advances in Neural Information Processing Systems, 35:29217–29234.
- Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933.
- State-of-the-art generalisation research in NLP: a taxonomy and review. CoRR.
- Multi-dimensional evaluation of text summarization with in-context learning. arXiv preprint arXiv:2306.01200.
- Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).
- What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14).
- PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.
- Challenges and applications of large language models. arXiv preprint arXiv:2307.10169.
- shs-nlp at RadSum23: Domain-adaptive pre-training of instruction-tuned LLMs for radiology report impression generation. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 550–556, Toronto, Canada. Association for Computational Linguistics.
- Large language models are zero-shot reasoners. In ICML 2022 Workshop on Knowledge Retrieval and Language Models.
- Anastassia Kornilova and Vladimir Eidelman. 2019. BillSum: A corpus for automatic summarization of US legislation. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48–56, Hong Kong, China. Association for Computational Linguistics.
- Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197.
- Parameter-efficient legal domain adaptation. In Proceedings of the Natural Legal Language Processing Workshop 2022, pages 119–129, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Biogpt: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6):bbac409.
- Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland. Association for Computational Linguistics.
- Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pages 220–229.
- Team MosaicML NLP. 2023. Introducing mpt-7b: A new standard for open-source, commercially usable llms. www.mosaicml.com/blog/mpt-7b. Accessed: 2023-06-03.
- Large language models as tax attorneys: A case study in legal capabilities emergence. arXiv preprint arXiv:2306.07075.
- Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.
- OpenAI. 2022. chatgpt. https://openai.com/blog/chatgpt. Accessed: 2023-06-23.
- Improving language understanding by generative pre-training.
- Sandeep Reddy. 2023. Evaluating large language models for use in healthcare: A framework for translational value assessment. Informatics in Medicine Unlocked, page 101304.
- Malik Sallam. 2023. Chatgpt utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. In Healthcare, volume 11, page 887. MDPI.
- Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
- Fine-tuned language models are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6107–6122, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Ben Shneiderman. 2020. Human-centered artificial intelligence: Reliable, safe & trustworthy. International Journal of Human–Computer Interaction, 36(6):495–504.
- Large language models encode clinical knowledge. Nature, pages 1–9.
- Generative artificial intelligence through chatgpt and other large language models in ophthalmology: Clinical applications and challenges. Ophthalmology Science, page 100394.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Generating (factual?) narrative summaries of rcts: Experiments with neural multi-document summarization. AMIA Annual Symposium, abs/2008.11293.
- Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax. Accessed: 2023-06-03.
- Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922.
- Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
- Emergent abilities of large language models. Transactions on Machine Learning Research. Survey Certification.
- Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564.
- Pixiu: A large language model, instruction data and evaluation benchmark for finance. arXiv preprint arXiv:2306.05443.
- Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070.
- Explainability for large language models: A survey. arXiv preprint arXiv:2309.01029.
- When does pretraining help? assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, ICAIL ’21, page 159–168, New York, NY, USA. Association for Computing Machinery.
- Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.
- Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.