Suitability of ECE and Brier Score for Calibration in Natural Language Generation
Determine whether the Expected Calibration Error (ECE) and the Brier Score are suitable metrics for measuring the calibration of confidence scores for natural language outputs, such as generated audio captions, given that these metrics are traditionally designed and validated for classification tasks.
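To make concrete what these metrics measure in their native classification setting, here is a minimal sketch of the Brier Score and a standard equal-width-binned ECE for binary predictions. This is an illustrative implementation, not code from the cited paper; the function names and the choice of 10 bins are assumptions.

```python
import numpy as np

def brier_score(confidences, outcomes):
    # Mean squared error between predicted confidence and the binary outcome.
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((confidences - outcomes) ** 2))

def expected_calibration_error(confidences, outcomes, n_bins=10):
    # ECE: weighted average of |mean confidence - accuracy| over confidence bins.
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi], with the first bin closed at 0.
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            mask |= confidences == 0.0
        if mask.sum() == 0:
            continue
        avg_conf = confidences[mask].mean()
        accuracy = outcomes[mask].mean()
        ece += (mask.sum() / n) * abs(avg_conf - accuracy)
    return float(ece)
```

Both metrics assume each prediction has a discrete correct/incorrect outcome, which is exactly the assumption that is unclear for free-form natural language outputs such as audio captions, where correctness is graded rather than binary.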
References
The Expected Calibration Error and Brier Score are well-suited to measure the quality of calibration of confidences for classification tasks. Their suitability for measuring the calibration of natural language outputs has yet to be evaluated independently.
— Resource-Efficient Reference-Free Evaluation of Audio Captions
(arXiv:2409.08489, Mahfuz et al., 2024), Section "Limitations"