Fine-grained evaluation of long-form biomedical answer generation

Investigate the detailed capacities of long-form answer generation for biomedical and clinical question answering by assessing whether generated responses contain accurate rationales, how much they hallucinate, whether they include crucial claims, and how fluent they are, and identify evaluation approaches that capture these aspects more comprehensively than existing metrics.

Background

In their long-form QA experiments, the authors evaluate generated answers with ROUGE and BERTScore but observe that these overlap-based metrics cannot capture key qualitative properties: rationale accuracy, the extent of hallucination, inclusion of crucial claims, and fluency.
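The blindness of overlap metrics to factual correctness is easy to demonstrate. The sketch below implements a simplified ROUGE-1 F1 (unigram overlap) in pure Python; the clinical sentences are illustrative examples, not from the paper. An answer that contradicts the reference on a single critical word still scores nearly as high as a verbatim match.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "aspirin increases bleeding risk in patients on warfarin"
faithful  = "aspirin increases bleeding risk in patients on warfarin"
# Contradicts the reference on the one word that matters clinically:
contradictory = "aspirin decreases bleeding risk in patients on warfarin"

print(rouge1_f1(reference, faithful))       # 1.0
print(rouge1_f1(reference, contradictory))  # 0.875 — near-perfect despite the error
```

The contradictory answer keeps 7 of 8 unigrams and scores 0.875, which is why lexical metrics alone cannot flag hallucinated or inverted claims.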

They explicitly defer a deeper investigation of these long-text generation capacities to future work, highlighting the need for more informative evaluation methods in biomedical long-form generation.
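One form such a finer-grained evaluation could take is checking whether each crucial claim appears in the generated answer. The sketch below is a naive token-overlap proxy of my own construction (not a method from the paper); the claim list, threshold, and example answer are all hypothetical, and a realistic version would use entailment models rather than token matching.

```python
def claim_coverage(answer: str, crucial_claims: list[str], threshold: float = 0.6) -> float:
    """Fraction of crucial claims whose tokens appear (above a threshold) in the answer.

    A crude lexical proxy: a claim counts as covered when at least `threshold`
    of its unique tokens occur in the answer.
    """
    answer_tokens = set(answer.lower().split())
    covered = 0
    for claim in crucial_claims:
        claim_tokens = set(claim.lower().split())
        if len(claim_tokens & answer_tokens) / len(claim_tokens) >= threshold:
            covered += 1
    return covered / len(crucial_claims)

claims = ["warfarin interacts with aspirin", "monitor INR regularly"]
answer = "patients should know that warfarin interacts with aspirin"
print(claim_coverage(answer, claims))  # 0.5 — one of the two claims is covered
```

Even this crude score is more informative for the properties above than ROUGE or BERTScore, since it is computed per claim rather than over the whole surface string; replacing the token test with an NLI entailment check would make it robust to paraphrase.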

References

However, these scores cannot measure whether a model has generated answers with accurate rationale, how much hallucination occurs, how much it includes crucial claims, or whether it has generated answers fluently. We leave an investigation about detailed capacities related to long-text generation for future works.

Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models (2401.15269 - Jeong et al., 2024) in Results and Analysis, Experimental Results