PAM: Prompting Audio-Language Models for Audio Quality Assessment (2402.00282v1)
Abstract: While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calculate a similarity score between the two. Here, we exploit this capability and introduce PAM, a no-reference metric for assessing audio quality for different audio processing tasks. Contrary to other "reference-free" metrics, PAM does not require computing embeddings on a reference dataset nor training a task-specific model on a costly set of human listening scores. We extensively evaluate the reliability of PAM against established metrics and human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS). We perform multiple ablation studies with controlled distortions, in-the-wild setups, and prompt choices. Our evaluation shows that PAM correlates well with existing metrics and human listening scores. These results demonstrate the potential of ALMs for computing a general-purpose audio quality metric.
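The core recipe described in the abstract, embedding the audio and quality-related text prompts with an ALM and turning their similarities into a score, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the embedding calls stand in for a CLAP-style audio-language model, and the two prompt strings are placeholder examples of "quality-related" prompts, since the abstract does not specify the exact prompt wording or scoring details.

```python
# Minimal sketch of a prompt-based, no-reference audio quality score.
# Assumes a CLAP-style audio-language model that maps audio clips and text
# prompts into a shared embedding space; the prompts below are illustrative.

import numpy as np


def pam_style_score(audio_embedding: np.ndarray,
                    prompt_embeddings: np.ndarray,
                    temperature: float = 1.0) -> float:
    """Contrast a 'good quality' prompt (row 0) against a 'poor quality'
    prompt (row 1) and return the probability assigned to the good one."""
    # Cosine similarity between the audio embedding and each text prompt.
    a = audio_embedding / np.linalg.norm(audio_embedding)
    t = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=1, keepdims=True)
    sims = t @ a  # shape (2,)

    # Softmax over the two prompts; the mass on the "good quality" prompt
    # serves as a quality score in [0, 1].
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[0])


if __name__ == "__main__":
    # In practice, the embeddings would come from an ALM, e.g. (hypothetical API):
    #   audio_emb = alm.embed_audio("sample.wav")
    #   prompt_embs = alm.embed_text(["the sound is clear and of good quality",
    #                                 "the sound is noisy and distorted"])
    rng = np.random.default_rng(0)
    audio_emb = rng.normal(size=512)          # stand-in audio embedding
    prompt_embs = rng.normal(size=(2, 512))   # stand-in prompt embeddings
    print(f"quality score: {pam_style_score(audio_emb, prompt_embs):.3f}")
```

Because the score is derived purely from prompt-audio similarity, no reference recordings, reference embeddings, or listening-test labels are needed, which is what distinguishes this setup from the "reference-free" metrics mentioned in the abstract.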