Suitability of ECE and Brier Score for Calibration in Natural Language Generation

Determine whether the Expected Calibration Error (ECE) and the Brier Score are suitable metrics for measuring the calibration of confidence scores for natural language outputs, such as generated audio captions, given that these metrics are traditionally designed and validated for classification tasks.

Background

The paper evaluates lightweight, reference-free confidence metrics for audio captioning by measuring their calibration against correctness measures using two standard calibration statistics: Expected Calibration Error (ECE) and Brier Score. These statistics are widely used in classification settings to quantify how well predicted probabilities align with observed outcomes.
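To make the two statistics concrete, here is a minimal sketch of how ECE (with equal-width confidence bins) and the Brier Score are typically computed against binary correctness labels. This is a generic illustration, not the paper's implementation; the function names and the choice of 10 bins are assumptions.

```python
def brier_score(confidences, labels):
    """Mean squared error between predicted confidence (in [0, 1])
    and binary correctness labels (0 or 1)."""
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(confidences)


def expected_calibration_error(confidences, labels, n_bins=10):
    """Equal-width-binning ECE: the weighted average, over bins, of
    |bin accuracy - bin mean confidence|, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; the first bin additionally includes 0.0.
        in_bin = [(c, y) for c, y in zip(confidences, labels)
                  if (c > lo or (b == 0 and c == lo)) and c <= hi]
        if in_bin:
            avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
            accuracy = sum(y for _, y in in_bin) / len(in_bin)
            ece += (len(in_bin) / n) * abs(accuracy - avg_conf)
    return ece
```

Note what the sketch presupposes: each generated caption must be reduced to a single scalar confidence and a binary (or at least bounded) correctness label. For classification that reduction is natural; for free-form text the "correct/incorrect" label itself comes from a thresholded correctness measure, which is exactly the step whose validity the open question concerns.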

Despite using ECE and Brier Score in their experiments, the authors explicitly note that these measures were developed for classification tasks and have not been independently validated for natural language generation. Establishing their suitability (or identifying alternatives) for text-based outputs would strengthen the methodological foundation for calibration in NLG contexts.

References

The Expected Calibration Error and Brier Score are well-suited to measuring the quality of calibration of confidences for classification tasks. Their suitability for measuring calibration of natural language outputs has yet to be evaluated independently.

Resource-Efficient Reference-Free Evaluation of Audio Captions  (2409.08489 - Mahfuz et al., 2024) in Section "Limitations"