SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization (2402.13919v4)

Published 21 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs such as GPT & Llama have demonstrated significant achievements in summarization tasks but struggle with factual inaccuracies, a critical issue in clinical NLP applications where errors can lead to serious consequences. To counter the high cost and limited availability of expert-annotated data for factual alignment, this study introduces an innovative pipeline that uses >100B-parameter GPT variants like GPT-3.5 & GPT-4 as synthetic experts to generate high-quality synthetic feedback aimed at enhancing factual consistency in clinical note summarization. Our research focuses on edit feedback generated by these synthetic feedback experts without additional human annotations, mirroring the practical scenario in which medical professionals refine AI system outputs. Although such 100B+ parameter GPT variants have demonstrated expertise in various clinical NLP tasks, such as the Medical Licensing Examination, there is scant research on their capacity to act as synthetic feedback experts and deliver expert-level edit feedback for improving the generation quality of weaker (<10B-parameter) LLMs like GPT-2 (1.5B) & Llama 2 (7B) in the clinical domain. In this work, we leverage 100B+ GPT variants as synthetic feedback experts offering expert-level edit feedback, which is used to reduce hallucinations and align weaker (<10B-parameter) LLMs with medical facts using two distinct alignment algorithms (DPO & SALT), narrowing the divide between AI-generated content and factual accuracy. Our results highlight the substantial potential of LLM-based synthetic edits for improving factual alignment in clinical summarization.
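
The abstract sketches a two-stage recipe: a 100B+ GPT variant edits the weaker model's draft summary into a more factual version, and each (draft, edit) pair then serves as a (rejected, chosen) preference pair for an alignment objective such as DPO. As an illustration of that second stage only, the sketch below implements the standard DPO loss over such pairs; it is a minimal version under our own assumptions, and every name in it (dpo_loss, beta, the tensor arguments) is ours rather than the paper's.

    # Minimal sketch (PyTorch) of the DPO objective over synthetic-edit pairs.
    # Assumes each example pairs the synthetic expert's factually edited summary
    # ("chosen") against the weak model's original draft ("rejected"), and that
    # per-sequence log-probabilities are already summed over summary tokens.
    # All names here are illustrative, not taken from the paper.
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        # Log-ratios of the trainable policy against the frozen reference model.
        chosen_ratio = policy_chosen_logps - ref_chosen_logps
        rejected_ratio = policy_rejected_logps - ref_rejected_logps
        # Reward preferring the expert-edited summary over the hallucinated
        # draft; beta controls how sharply the preference is enforced.
        logits = beta * (chosen_ratio - rejected_ratio)
        return -F.logsigmoid(logits).mean()

    # Toy usage with fabricated log-probabilities for a batch of three pairs.
    loss = dpo_loss(torch.tensor([-10.0, -12.0, -9.0]),
                    torch.tensor([-11.0, -15.0, -9.5]),
                    torch.tensor([-10.5, -13.0, -9.2]),
                    torch.tensor([-10.8, -14.0, -9.4]))

The frozen reference model anchors the policy to its pretrained distribution, so the update only shifts probability mass between the paired summaries rather than degrading general fluency.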

Authors (7)
  1. Prakamya Mishra (7 papers)
  2. Zonghai Yao (33 papers)
  3. Parth Vashisht (2 papers)
  4. Beining Wang (6 papers)
  5. Vidhi Dhaval Mody (1 paper)
  6. Hong Yu (114 papers)
  7. Feiyun Ouyang (5 papers)
Citations (3)