Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound (2502.05139v1)

Published 7 Feb 2025 in cs.SD, cs.LG, and eess.AS

Abstract: The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics

Summary

  • The paper introduces a unified framework, Meta Audiobox Aesthetics, for automatic quality assessment across speech, music, and sound, decomposing quality into four interpretable axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness.
  • Utilizing a novel annotation protocol and a Transformer-based model, the framework is trained on a large dataset of 97k human-annotated audio samples and demonstrates competitive performance against existing speech quality predictors on benchmark datasets.
  • The aesthetic predictor is successfully applied to improve text-to-speech, text-to-music, and text-to-audio generative models: incorporating aesthetic scores as control tokens (prompting) significantly enhances subjective output quality while maintaining objective metrics. The models and a new dataset are released for further research.

The paper presents a unified framework for automatic audio aesthetic assessment across diverse modalities—speech, music, and sound effects—by decomposing the subjective notion of audio quality into four interpretable axes. These axes are defined as follows:

  • Production Quality (PQ): Concentrates on objective technical aspects such as clarity, fidelity, dynamics, frequency balance, and spatialization.
  • Production Complexity (PC): Measures the number and diversity of audio components and modalities present in a given sample.
  • Content Enjoyment (CE): Captures subjective appreciation related to emotional impact, artistic expression, and overall appeal.
  • Content Usefulness (CU): Evaluates the potential utility of audio as a source for creative content generation.

A novel annotation protocol is introduced to collect scores for these axes. Experts and high-quality raters are recruited under strict qualification criteria (e.g., Pearson correlation > 0.7 on the objective axes) to limit inter-rater variability. Approximately 97k audio samples (roughly 500 hours) are human-annotated across multiple open-source and licensed datasets, ensuring broad coverage of real-world acoustic conditions. The paper also describes how loudness normalization, stratified sampling across modalities, and multiple ratings per sample are used to control for biases and confounding factors.
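
As a concrete illustration of the qualification step, a candidate rater's scores on a small calibration set can be correlated against trusted reference scores for the same clips. The sketch below assumes exactly this setup; the 0.7 threshold comes from the paper, while the reference-score comparison and all names are illustrative rather than the authors' actual tooling.

```python
import numpy as np
from scipy.stats import pearsonr

def rater_qualifies(rater_scores: np.ndarray,
                    reference_scores: np.ndarray,
                    threshold: float = 0.7) -> bool:
    """Screen a rater by correlating their calibration-set scores (on an
    objective axis such as Production Quality) against reference scores."""
    r, _ = pearsonr(rater_scores, reference_scores)
    return r > threshold

# Hypothetical calibration round with 8 clips scored on a 1-10 scale.
rater = np.array([4.0, 7.5, 6.0, 9.0, 3.5, 8.0, 5.0, 7.0])
reference = np.array([4.5, 7.0, 6.5, 9.5, 3.0, 8.5, 5.5, 6.5])
print(rater_qualifies(rater, reference))  # True: correlation well above 0.7
```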

The aesthetic evaluation model, termed Audiobox-Aesthetics, leverages a Transformer-based architecture. The audio encoder is architecturally based on WavLM and comprises 12 Transformer layers with a hidden dimension of 768. A learnable weighted aggregation over layers and timesteps is performed to extract an embedding vector $e \in \mathbb{R}^d$, computed as follows:

$$z_{l} = w_{l}, \quad \hat{e} = \sum_{t=1}^{T} \sum_{l=1}^{L} h_{l,t}\, z_{l}, \quad e = \frac{\hat{e}}{\lVert \hat{e} \rVert_{2}}$$

  • $h_{l,t}$: Transformer hidden state at layer $l$ and time step $t$
  • $w_{l}$: Learnable scalar weight for layer $l$
  • $T$: Total number of time steps
  • $L$: Total number of Transformer layers
  • $e$: L2-normalized aggregated embedding
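
A minimal PyTorch sketch of this aggregation, assuming the hidden states are stacked into a single (L, T, d) tensor; the tensor layout and names such as `hidden_states` and `layer_weights` are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn.functional as F

def aggregate_embedding(hidden_states: torch.Tensor,
                        layer_weights: torch.Tensor) -> torch.Tensor:
    """Weighted aggregation over layers and timesteps with L2 normalization.

    hidden_states: (L, T, d) stack of Transformer hidden states h_{l,t}
    layer_weights: (L,) learnable scalars w_l
    returns: (d,) L2-normalized embedding e
    """
    z = layer_weights.view(-1, 1, 1)             # z_l = w_l, shaped for broadcasting
    e_hat = (hidden_states * z).sum(dim=(0, 1))  # sum over layers l and timesteps t
    return F.normalize(e_hat, p=2, dim=0)        # e = e_hat / ||e_hat||_2
```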

This embedding is then fed through multiple MLP blocks—each including linear layers, layer normalization, and GeLU activations—to regress predicted scores for the four axes. The loss function combines mean absolute error (MAE) and mean squared error (MSE) across the axes, ensuring both robustness and sensitivity in the regression.
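
A hedged sketch of such a regression head and the combined loss; the number of blocks and hidden width below are assumptions (only the Linear/LayerNorm/GELU structure, the four output axes, and the MAE + MSE combination come from the paper).

```python
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    """Illustrative MLP head predicting the four axis scores (PQ, PC, CE, CU)."""

    def __init__(self, dim: int = 768, hidden: int = 256, num_axes: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.LayerNorm(hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.GELU(),
            nn.Linear(hidden, num_axes),
        )

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        return self.mlp(e)  # (batch, 4) predicted axis scores

def combined_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # MAE adds robustness to outlier ratings, MSE adds sensitivity near the target;
    # equal weighting of the two terms is assumed for illustration.
    return nn.functional.l1_loss(pred, target) + nn.functional.mse_loss(pred, target)
```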

Comprehensive evaluations are performed on public speech datasets (VMC22-main and VMC22-OOD) and on small benchmark sets for sound and music (PAM-sound and PAM-music). The proposed AES predictors for production quality, enjoyment, and usefulness show competitive (and in some cases superior) performance to existing speech-oriented quality predictors (e.g., DNSMOS, SQUIM, and UTMOSv2) as measured by utterance-level Pearson correlation coefficients (utt-PCC) and system-level Spearman’s rank correlation coefficients (sys-SRCC). Notably, the production complexity predictor is largely decoupled from conventional quality metrics, underscoring its potential to provide novel insights that are not captured by overall MOS scores.
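
For reference, the two reported correlation metrics can be computed roughly as follows (a sketch using their conventional definitions; the array-based interface and per-system averaging are not code from the paper).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def utt_pcc(pred: np.ndarray, mos: np.ndarray) -> float:
    """Utterance-level Pearson correlation between predicted scores and human MOS."""
    return pearsonr(pred, mos)[0]

def sys_srcc(pred: np.ndarray, mos: np.ndarray, system_ids: np.ndarray) -> float:
    """System-level Spearman rank correlation: average per system, then correlate."""
    systems = np.unique(system_ids)
    pred_means = np.array([pred[system_ids == s].mean() for s in systems])
    mos_means = np.array([mos[system_ids == s].mean() for s in systems])
    return spearmanr(pred_means, mos_means)[0]
```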

The paper additionally introduces the AES-Natural dataset, which comprises around 2,950 audio samples covering speech, sound, and music, with detailed annotations across the four axes. This dataset facilitates further benchmarking and research on audio aesthetics.

A key strength of the work lies in its demonstration of downstream applications. The paper applies the aesthetic predictor in text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA) generation tasks. Two strategies are investigated:

  • Filtering Strategy: Low-scoring samples (below a percentile threshold) are omitted during model training, which improves quality but shrinks the effective training set.
  • Prompting Strategy: Aesthetic scores are incorporated directly as control tokens in the text prompt (“Audio quality: $y$”), thereby steering the generative model (see the sketch after this list). This method yields better subjective quality without sacrificing objective alignment metrics such as word error rate (WER) or CLAP scores. For example, subjective evaluations reveal net win rates for prompting over filtering that exceed 35% in some cases.
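
A minimal sketch of the prompting strategy, under the assumption that the aesthetic score is rounded to an integer bucket and prepended to the caption; the exact token wording and discretization are assumptions beyond the “Audio quality: $y$” pattern described above.

```python
def with_quality_token(caption: str, aes_score: float) -> str:
    """Prepend a discretized aesthetic score as a control token to the text prompt."""
    y = round(aes_score)  # assumed bucketing of the 1-10 axis score into an integer token
    return f"Audio quality: {y}. {caption}"

# Training: tag each caption with the score the aesthetic predictor assigns to its audio.
train_prompt = with_quality_token("a warm acoustic guitar melody", aes_score=7.8)

# Inference: request the top quality bucket to steer the generative model.
test_prompt = with_quality_token("a warm acoustic guitar melody", aes_score=10.0)
```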

The results suggest that although filtering can improve certain objective measures, prompting provides a more balanced approach, preserving data alignment while enhancing perceptual quality. Objective metrics and extensive pairwise human evaluations across speech, sound, and music indicate that integrating aesthetic cues can significantly refine generative audio outputs.

In summary, the work proposes a detailed, multi-axis evaluation framework for audio aesthetics and demonstrates both its validity and practical utility in improving generative audio models. The open-source release of the models and AES-Natural dataset is expected to facilitate further research and benchmarking in this area.