Reliable objective metrics for perceived audio quality
Develop objective evaluation metrics that correlate reliably with human judgments of perceived audio quality across architectures and training objectives. Existing metrics such as VisQOL and MOSNet align poorly with subjective MUSHRA ratings, and a useful metric would need to close that gap.
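One common way to quantify how well an objective metric tracks listener ratings is a rank correlation between per-system metric scores and per-system MUSHRA scores. The sketch below is illustrative only and is not taken from the cited paper: the function names (average_ranks, pearson, spearman) are hypothetical helpers, and the score lists are made-up placeholder values, not measurements. It assumes one aggregated MUSHRA score and one metric score per system.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied entries their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of entries tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for five hypothetical systems (NOT real measurements).
mushra_scores = [82.0, 64.5, 71.0, 55.0, 90.0]  # subjective ratings, 0-100 scale
metric_scores = [3.1, 3.4, 2.9, 3.0, 3.3]       # some objective metric

rho = spearman(metric_scores, mushra_scores)
print(f"Spearman rank correlation: {rho:.2f}")
```

A value near 1 would mean the metric orders systems the same way listeners do, while a value near 0 would mean it carries little information about perceived quality. With only a handful of systems such an estimate is noisy, so in practice it would be reported with a confidence interval or computed over many systems and conditions.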
References
This observation underscores the open challenge of designing reliable objective proxies for perceived quality.
Défossez et al., "Moshi: a speech-text foundation model for real-time dialogue" (arXiv:2410.00037, 17 Sep 2024), Section 5.2, Audio Tokenization (Discussion)