IryoNLP at MEDIQA-CORR 2024: Tackling the Medical Error Detection & Correction Task On the Shoulders of Medical Agents (2404.15488v1)
Abstract: In natural language processing applied to the clinical domain, utilizing LLMs has emerged as a promising avenue for error detection and correction in clinical notes, a knowledge-intensive task for which annotated data is scarce. This paper presents MedReAct'N'MedReFlex, which leverages a suite of four LLM-based medical agents. The MedReAct agent initiates the process by observing, analyzing, and taking action, generating trajectories that guide the search toward a potential error in the clinical note. Subsequently, the MedEval agent employs five evaluators to assess the targeted error and the proposed correction. In cases where MedReAct's actions prove insufficient, the MedReFlex agent intervenes, engaging in reflective analysis and proposing alternative strategies. Finally, the MedFinalParser agent formats the final output, preserving the original style while ensuring the integrity of the error correction process. One core component of our method is our RAG pipeline based on our ClinicalCorp corpora. Alongside other well-known sources of clinical guidelines and information, we preprocess and release the open-source MedWiki dataset for clinical RAG applications. Our results demonstrate the central role of our RAG approach with ClinicalCorp leveraged through the MedReAct'N'MedReFlex framework, which ranked ninth on the MEDIQA-CORR 2024 final leaderboard.
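To make the abstract's control flow concrete, below is a minimal sketch of how the four agents could be chained: MedReAct proposes a correction grounded in retrieved ClinicalCorp passages, MedEval scores it, MedReFlex reflects on rejected attempts, and MedFinalParser emits the final note. The agent names come from the paper, but the function signatures, the `retrieve` callback, the acceptance `threshold`, the `max_rounds` budget, and all placeholder bodies are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of the MedReAct'N'MedReFlex loop described in the abstract.
# Agent internals, the retriever, and the threshold are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Correction:
    error_sentence: Optional[str]        # sentence flagged as erroneous (None = no error found)
    corrected_sentence: Optional[str]    # proposed replacement sentence
    score: float                         # aggregate score from the MedEval evaluators


def med_react(note: str, retrieve: Callable[[str], List[str]], hint: str = "") -> Correction:
    """Observe the note, query ClinicalCorp via RAG, and propose a targeted correction."""
    evidence = retrieve(note + " " + hint)   # RAG over ClinicalCorp (e.g., MedWiki)
    # ... an LLM call would reason over `note` and `evidence` here (omitted) ...
    return Correction(error_sentence=None, corrected_sentence=None, score=0.0)


def med_eval(note: str, candidate: Correction) -> float:
    """Five evaluators score the flagged error and proposed fix; return the aggregate."""
    scores = [0.0] * 5                       # placeholder for the five evaluator calls
    return sum(scores) / len(scores)


def med_reflex(note: str, failed: Correction) -> str:
    """Reflect on a rejected attempt and return a revised search strategy as a hint."""
    return "reconsider medication dosage and diagnosis sentences"   # illustrative only


def med_final_parser(note: str, accepted: Correction) -> str:
    """Format the final output while preserving the original note's style."""
    if accepted.error_sentence is None or accepted.corrected_sentence is None:
        return note                          # note judged correct as-is
    return note.replace(accepted.error_sentence, accepted.corrected_sentence)


def correct_note(note: str, retrieve: Callable[[str], List[str]],
                 max_rounds: int = 3, threshold: float = 0.5) -> str:
    """Run MedReAct, score with MedEval, and fall back to MedReFlex when needed."""
    hint = ""
    best = Correction(None, None, 0.0)
    for _ in range(max_rounds):
        candidate = med_react(note, retrieve, hint)
        candidate.score = med_eval(note, candidate)
        if candidate.score >= threshold:     # MedEval accepts the correction
            return med_final_parser(note, candidate)
        best = max(best, candidate, key=lambda c: c.score)
        hint = med_reflex(note, candidate)   # MedReFlex proposes an alternative strategy
    return med_final_parser(note, best)      # budget exhausted: keep the best attempt
```

Under these assumptions, a caller would supply its own retriever, e.g. `correct_note(clinical_note, retrieve=my_clinicalcorp_search)`; the retry-with-reflection structure mirrors the ReAct and Reflexion patterns the agent names allude to.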