PAM: Prompting Audio-Language Models for Audio Quality Assessment (2402.00282v1)

Published 1 Feb 2024 in eess.AS and cs.SD

Abstract: While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calculate a similarity score between the two. Here, we exploit this capability and introduce PAM, a no-reference metric for assessing audio quality for different audio processing tasks. Contrary to other "reference-free" metrics, PAM does not require computing embeddings on a reference dataset nor training a task-specific model on a costly set of human listening scores. We extensively evaluate the reliability of PAM against established metrics and human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS). We perform multiple ablation studies with controlled distortions, in-the-wild setups, and prompt choices. Our evaluation shows that PAM correlates well with existing metrics and human listening scores. These results demonstrate the potential of ALMs for computing a general-purpose audio quality metric.


Summary

  • The paper introduces PAM, a novel no-reference metric that leverages dual prompts and cosine similarity in a joint audio-text space to evaluate audio quality.
  • It empirically demonstrates PAM's strong correlation (PCC > 0.7) with human ratings across text-to-audio, text-to-speech, text-to-music, and deep noise suppression tasks.
  • The study highlights PAM’s scalable evaluation potential and its limitations in speech tasks, suggesting avenues for future enhancement with enriched training data.

Assessment of Audio Quality via Prompting Audio-Language Models

The quest for a reliable method to evaluate audio quality in audio generation tasks such as text-to-audio (TTA), text-to-music (TTM), text-to-speech (TTS), and deep noise suppression (DNS) continues to garner significant interest. This paper presents PAM (Prompting Audio-Language Models), an approach that leverages Audio-Language Models (ALMs) to assess audio quality without a reference signal while aligning closely with human perceptual scores. The paper covers the conceptualization, implementation, and empirical validation of the metric, offering insight into its effectiveness across diverse audio tasks.

Background and Motivation

Traditional audio quality assessment relies heavily on subjective human judgments, which are resource-intensive and hinder scalability. Objective metrics often require a reference signal, making them impractical whenever no clean reference is available. Existing reference-free metrics, in turn, depend on pre-trained models and curated human scores for task-specific evaluations. This paper posits that Audio-Language Models, trained on extensive audio-text datasets, implicitly grasp the nuances of audio quality and can thus serve as an advantageous framework for no-reference audio quality assessment.

PAM: A Novel Metric

PAM leverages an ALM's ability to encode audio and text prompts into a joint multimodal space, where the cosine similarity between text and audio embeddings yields a quality score. Notably, two antonymous prompts, "the sound is clear and clean" versus "the sound is noisy and with artifacts", are compared, so the score reflects how much closer the audio lies to the clean anchor than to the noisy one. This two-prompt strategy removes the contextual ambiguity of single-prompt evaluation, tuning the metric's sensitivity to the artifacts and distortions prevalent across audio types.
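A minimal sketch of this scoring scheme follows, assuming a hypothetical CLAP-style interface (the my_alm module with encode_audio and encode_text is a placeholder, not the paper's actual code): the audio's cosine similarity to each of the two prompts is contrasted with a softmax, and the probability assigned to the "clean" prompt is read off as the quality score.

```python
import torch
import torch.nn.functional as F

# Hypothetical CLAP-style encoders; substitute the ALM used in the paper.
# encode_audio(path) -> (1, d) embedding; encode_text([str, ...]) -> (n, d) embeddings.
from my_alm import encode_audio, encode_text  # assumed interface, not a real package

PROMPTS = [
    "the sound is clear and clean",           # high-quality anchor
    "the sound is noisy and with artifacts",  # low-quality anchor
]

def pam_score(audio_path: str) -> float:
    """Return a no-reference quality score in [0, 1]; higher means cleaner."""
    audio_emb = F.normalize(encode_audio(audio_path), dim=-1)  # (1, d)
    text_emb = F.normalize(encode_text(PROMPTS), dim=-1)       # (2, d)

    # Cosine similarity of the audio to each prompt, contrasted via softmax.
    sims = audio_emb @ text_emb.T                              # (1, 2)
    probs = F.softmax(sims, dim=-1)

    # Probability mass on the "clear and clean" prompt is the quality score.
    return probs[0, 0].item()
```

Because the score is a softmax over exactly two prompt similarities, it lands in [0, 1], with higher values indicating audio closer to the clean anchor.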

Experimental Evaluation

The robustness of PAM is validated against existing metrics and human listening scores across four tasks: TTA, TTM, TTS, and DNS. Experiments encompass controlled distortions and in-the-wild scenarios, showing that PAM's correlation with human ratings is comparable to, and in some cases better than, that of established models. For instance, PAM achieved strong correlation coefficients (PCC > 0.7) when benchmarked against human assessments of naturalness and fidelity in generated audio. Moreover, PAM proved particularly proficient at measuring general audio and music quality, though less well adapted to speech tasks owing to the linguistic limitations of the underlying ALM's training data.
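To illustrate how such correlations are computed (the numbers below are made up for demonstration and are not the paper's data), Pearson's correlation between per-file PAM scores and mean opinion scores can be obtained with SciPy:

```python
from scipy.stats import pearsonr

# Illustrative numbers only; real evaluations use per-file PAM scores
# and crowdsourced mean opinion scores (MOS) from the benchmark datasets.
pam_scores = [0.91, 0.42, 0.77, 0.30, 0.85]
human_mos = [4.5, 2.1, 3.8, 1.9, 4.2]

pcc, p_value = pearsonr(pam_scores, human_mos)
print(f"PCC = {pcc:.3f} (p = {p_value:.3g})")
```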

Practical and Theoretical Implications

The paper anticipates PAM's utility in scalable evaluation of generative audio models, owing to its zero-shot nature. The ability to assess novel audio types or tasks without retraining makes it practically relevant for rapid prototyping and evaluation pipelines. Theoretically, PAM demonstrates the potential of ALMs in non-traditional assessment settings by harnessing nuanced language-audio semantics. As AI-driven audio synthesis becomes ubiquitous, metrics like PAM could reshape auditory content evaluation without reliance on task-specific training data.

Future Directions

Despite promising results, the paper acknowledges PAM's limitations, notably in fine-grained quality discrimination for speech tasks. Future work could enrich ALM training data with speech-text examples or develop task-specific prompts that capture subtler quality attributes. Moreover, incorporating a more diverse set of prompt pairs could refine the metric for specialized audio nuances, paving the way for comprehensive audio evaluation strategies.

In conclusion, PAM represents a significant stride toward holistic, scalable, and flexible audio quality assessment, leveraging the extensive capabilities of Audio-Language Models. Its adaptability across audio domains marks it as a substantial advance in audio processing research, with implications for further integration into both general-purpose and specialized audio applications.