Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation (2204.00447v1)

Published 1 Apr 2022 in cs.CL

Abstract: In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet there is little work on how to properly evaluate the generated notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this, we present an extensive human evaluation study of consultation notes in which 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study between 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore. All our findings and annotations are open-sourced.
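
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of a character-based Levenshtein distance between a generated note and a reference note. It is illustrative only and not the authors' implementation: the function names `levenshtein` and `normalized_similarity`, and the choice to normalize by the longer string's length, are assumptions made here.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: the minimum number of insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_similarity(generated: str, reference: str) -> float:
    """Hypothetical normalization (an assumption, not taken from the paper):
    1.0 for identical strings, 0.0 when every character must change."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(generated, reference) / longest
```

In a correlation study of the kind the abstract describes, a score like this would be computed for each generated note against a clinician-written or post-edited note and then compared with the human judgements.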

Authors (8)
  1. Francesco Moramarco (8 papers)
  2. Alex Papadopoulos Korfiatis (6 papers)
  3. Mark Perera (3 papers)
  4. Damir Juric (15 papers)
  5. Jack Flann (3 papers)
  6. Ehud Reiter (31 papers)
  7. Anya Belz (17 papers)
  8. Aleksandar Savkov (10 papers)
Citations (43)