Fine-grained evaluation of long-form biomedical answer generation
Investigate the detailed capabilities of long-form answer generation for biomedical and clinical question answering by assessing whether generated responses contain accurate rationales, how much hallucination occurs, whether crucial claims are included, and whether the answers are fluent; and determine evaluation approaches that capture these aspects more comprehensively than existing metrics.
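Two of the dimensions above, inclusion of crucial claims and extent of hallucination, could be operationalized at the claim level. The sketch below (an illustration, not a method from the cited paper) assumes the generated answer has already been decomposed into atomic claims and scores it against a reference set of crucial claims, using simple token overlap as a hypothetical stand-in for the entailment model a real evaluator would use.

```python
# Minimal sketch of claim-level evaluation for long-form answers.
# Assumption: token-set overlap approximates claim matching; a real
# system would use an NLI/entailment model instead.

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two claims."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def evaluate_claims(generated: list[str], crucial: list[str],
                    threshold: float = 0.5) -> dict:
    """Score a decomposed answer against reference crucial claims.

    crucial_recall: fraction of crucial claims matched by some generated claim.
    hallucination_rate: fraction of generated claims matching no crucial claim.
    """
    matched_crucial = sum(
        any(token_overlap(g, c) >= threshold for g in generated)
        for c in crucial
    )
    unsupported = sum(
        all(token_overlap(g, c) < threshold for c in crucial)
        for g in generated
    )
    return {
        "crucial_recall": matched_crucial / len(crucial) if crucial else 0.0,
        "hallucination_rate": unsupported / len(generated) if generated else 0.0,
    }

scores = evaluate_claims(
    generated=["metformin lowers hepatic glucose production",
               "metformin cures type 1 diabetes"],
    crucial=["metformin lowers hepatic glucose production"],
)
print(scores)  # → {'crucial_recall': 1.0, 'hallucination_rate': 0.5}
```

A fuller evaluator would add a fluency score (e.g., from a language-model perplexity or a learned quality model) and replace the overlap heuristic with entailment checks against source evidence.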
References
However, these scores cannot measure whether a model has generated answers with accurate rationale, how much hallucination occurs, how much it includes crucial claims, or whether it has generated answers fluently. We leave an investigation about detailed capacities related to long-text generation for future works.
— Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models
(2401.15269 - Jeong et al., 2024) in Results and Analysis, Experimental Results