PAM: Prompting Audio-Language Models for Audio Quality Assessment (2402.00282v1)

Published 1 Feb 2024 in eess.AS and cs.SD

Abstract: While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calculate a similarity score between the two. Here, we exploit this capability and introduce PAM, a no-reference metric for assessing audio quality for different audio processing tasks. Contrary to other "reference-free" metrics, PAM does not require computing embeddings on a reference dataset nor training a task-specific model on a costly set of human listening scores. We extensively evaluate the reliability of PAM against established metrics and human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS). We perform multiple ablation studies with controlled distortions, in-the-wild setups, and prompt choices. Our evaluation shows that PAM correlates well with existing metrics and human listening scores. These results demonstrate the potential of ALMs for computing a general-purpose audio quality metric.


Summary

  • The paper introduces PAM, a novel no-reference metric that leverages dual prompts and cosine similarity in a joint audio-text space to evaluate audio quality.
  • It empirically demonstrates PAM's strong correlation (PCC > 0.7) with human ratings across text-to-audio, text-to-speech, text-to-music, and deep noise suppression tasks.
  • The study highlights PAM’s scalable evaluation potential and its limitations in speech tasks, suggesting avenues for future enhancement with enriched training data.

Assessment of Audio Quality via Prompting Audio-Language Models

The quest for a reliable method to evaluate audio quality in audio generation tasks such as text-to-audio (TTA), text-to-music (TTM), text-to-speech (TTS), and deep noise suppression (DNS) continues to garner significant interest. This paper presents PAM (Prompting Audio-Language Models), an approach that leverages Audio-Language Models (ALMs) to assess audio quality without a reference signal while aligning closely with human perceptual scores. The paper covers the conceptualization, implementation, and empirical validation of the metric, offering insight into its effectiveness across diverse audio tasks.

Background and Motivation

Traditional audio quality assessment relies heavily on subjective human judgments, which are resource-intensive and hinder scalability. Objective metrics often require a reference signal, making them impractical whenever no clean reference is available. Existing reference-free metrics, in turn, depend on pre-trained models and curated human scores for task-specific evaluations. This paper posits that Audio-Language Models, trained on extensive audio-text datasets, implicitly grasp the nuances of audio quality and can thus serve as an advantageous framework for no-reference audio quality assessment.

PAM: A Novel Metric

PAM leverages an ALM's ability to encode audio and text prompts into a joint multimodal space, where the cosine similarity between text and audio embeddings yields a quality score. Notably, two antonymous prompts, "the sound is clear and clean" versus "the sound is noisy and with artifacts", are compared, so the score reflects how much closer the audio lies to the clean anchor than to the noisy one. This two-prompt strategy removes the contextual ambiguity of single-prompt evaluation, tuning the metric's sensitivity to the artifacts and distortions prevalent across audio types.
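A minimal sketch of this scoring scheme follows, assuming a hypothetical CLAP-style interface (the my_alm module with encode_audio and encode_text is a placeholder, not the paper's actual code): the audio's cosine similarity to each of the two prompts is contrasted with a softmax, and the probability assigned to the "clean" prompt is read off as the quality score.

```python
import torch
import torch.nn.functional as F

# Hypothetical CLAP-style encoders; substitute the ALM used in the paper.
# encode_audio(path) -> (1, d) embedding; encode_text([str, ...]) -> (n, d) embeddings.
from my_alm import encode_audio, encode_text  # assumed interface, not a real package

PROMPTS = [
    "the sound is clear and clean",           # high-quality anchor
    "the sound is noisy and with artifacts",  # low-quality anchor
]

def pam_score(audio_path: str) -> float:
    """Return a no-reference quality score in [0, 1]; higher means cleaner."""
    audio_emb = F.normalize(encode_audio(audio_path), dim=-1)  # (1, d)
    text_emb = F.normalize(encode_text(PROMPTS), dim=-1)       # (2, d)

    # Cosine similarity of the audio to each prompt, contrasted via softmax.
    sims = audio_emb @ text_emb.T                              # (1, 2)
    probs = F.softmax(sims, dim=-1)

    # Probability mass on the "clear and clean" prompt is the quality score.
    return probs[0, 0].item()
```

Because the score is a softmax over exactly two prompt similarities, it lands in [0, 1], with higher values indicating audio closer to the clean anchor.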

Experimental Evaluation

The robustness of PAM is validated against existing metrics and human listening scores across four tasks: TTA, TTM, TTS, and DNS. Experiments encompass controlled distortions and in-the-wild scenarios, showing that PAM's correlation with human ratings is comparable to, and in some cases better than, that of established models. For instance, PAM achieved strong correlation coefficients (PCC > 0.7) when benchmarked against human assessments of naturalness and fidelity in generated audio. Moreover, PAM proved particularly proficient at measuring general audio and music quality, though less well adapted to speech tasks owing to the linguistic limitations of the underlying ALM's training data.
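To illustrate how such correlations are computed (the numbers below are made up for demonstration and are not the paper's data), Pearson's correlation between per-file PAM scores and mean opinion scores can be obtained with SciPy:

```python
from scipy.stats import pearsonr

# Illustrative numbers only; real evaluations use per-file PAM scores
# and crowdsourced mean opinion scores (MOS) from the benchmark datasets.
pam_scores = [0.91, 0.42, 0.77, 0.30, 0.85]
human_mos = [4.5, 2.1, 3.8, 1.9, 4.2]

pcc, p_value = pearsonr(pam_scores, human_mos)
print(f"PCC = {pcc:.3f} (p = {p_value:.3g})")
```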

Practical and Theoretical Implications

The paper anticipates PAM's utility in scalable evaluation of generative audio models, owing to its zero-shot nature. The ability to assess novel audio types or tasks without retraining makes it practically relevant for rapid prototyping and evaluation pipelines. Theoretically, PAM demonstrates the potential of ALMs in non-traditional assessment settings by harnessing nuanced language-audio semantics. As AI-driven audio synthesis becomes ubiquitous, metrics like PAM could reshape auditory content evaluation without reliance on task-specific training data.

Future Directions

Despite promising results, the paper acknowledges PAM's limitations, notably in fine-grained quality discrimination for speech tasks. Future work could enrich ALM training data with speech-text examples or develop task-specific prompts that capture subtler quality attributes. Moreover, incorporating a more diverse set of prompt pairs could refine the metric for specialized audio nuances, paving the way for comprehensive audio evaluation strategies.

In conclusion, PAM represents a significant stride toward holistic, scalable, and flexible audio quality assessment, leveraging the extensive capabilities of Audio-Language Models. Its adaptability across audio domains marks it as a substantial advance in audio processing research, with implications for further integration into both general-purpose and specialized audio applications.