Reliable objective metrics for perceived audio quality

Develop objective evaluation metrics that reliably correlate with human judgments of perceived audio quality across architectures and training objectives, overcoming the poor alignment observed between existing metrics such as VisQOL and MOSNet and subjective MUSHRA ratings.

Background

The Moshi paper (Défossez et al., 2024) evaluates Mimi, a neural audio codec, using both automatic metrics (VisQOL, MOSNet) and human MUSHRA listening tests. The authors find striking discrepancies: training with only adversarial losses substantially improves perceived quality as measured by MUSHRA while degrading VisQOL, which indicates that these automatic metrics correlate weakly with human perception.

They explicitly characterize this gap as an open challenge, highlighting the need for better metrics that track perceived quality when architectures or training losses change.
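One common way to quantify how well an objective metric tracks listener ratings is a rank correlation between the two score lists across systems. The sketch below is illustrative only: it computes Spearman's rho using the standard library, and the score values are hypothetical placeholders, not results from the paper. The function names (`ranks`, `pearson`, `spearman`) are likewise assumptions introduced for this example.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores for four codec variants (NOT from the paper).
metric_scores = [2.1, 2.4, 1.9, 2.6]      # e.g. an automatic metric
mushra_scores = [60.0, 55.0, 80.0, 50.0]  # e.g. mean listener ratings

rho = spearman(metric_scores, mushra_scores)
print(f"Spearman rho = {rho:.2f}")
```

The placeholder values are deliberately constructed so that the two rankings are exactly reversed, which is the extreme form of the misalignment described above. A reliable proxy for perceived quality would instead yield a rho close to +1 across architectures and training losses.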

References

This observation underscores the open challenge of designing reliable objective proxies for perceived quality.

— Moshi: a speech-text foundation model for real-time dialogue (2410.00037 - Défossez et al., 17 Sep 2024) in Section 5.2, Audio Tokenization (Discussion)
