CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models (2310.08753v4)

Published 12 Oct 2023 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: A fundamental characteristic of audio is its compositional nature. Audio-Language Models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification and audio retrieval. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios contain the same acoustic events but in different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, indicating that they struggle with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
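To make the evaluation protocol concrete: each benchmark instance is a pair of audio-caption pairs, and the model must match each audio to its own caption. The sketch below shows one plausible way to score such an instance from embedding similarities, in the style of Winoground's text/audio/group accuracies. The function name, the cosine-similarity scoring, and the three score variants are illustrative assumptions, not the authors' exact protocol.

```python
import numpy as np

def compa_pair_scores(emb_a1, emb_a2, emb_c1, emb_c2):
    """Score one benchmark instance: two audios and two captions with the
    same acoustic events but different compositions (c1 matches a1, c2
    matches a2). Embeddings are assumed L2-normalized, so dot products
    are cosine similarities."""
    s = np.array([
        [emb_a1 @ emb_c1, emb_a1 @ emb_c2],  # audio 1 vs. both captions
        [emb_a2 @ emb_c1, emb_a2 @ emb_c2],  # audio 2 vs. both captions
    ])
    # Text score: each audio must prefer its own caption.
    text_ok = s[0, 0] > s[0, 1] and s[1, 1] > s[1, 0]
    # Audio score: each caption must prefer its own audio.
    audio_ok = s[0, 0] > s[1, 0] and s[1, 1] > s[0, 1]
    # Group score: both directions must be correct simultaneously.
    return {"text": text_ok, "audio": audio_ok, "group": text_ok and audio_ok}
```

Under this scheme a model that ignores composition (treating captions as bags of words) scores at chance, since both captions mention the same acoustic events.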

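The composition-aware hard negatives fold naturally into contrastive training by enlarging the softmax denominator with recomposed captions for each audio. Below is a minimal sketch assuming a standard audio-to-text InfoNCE objective; the function name, tensor shapes, and single-direction loss are assumptions for illustration, and the paper's modular contrastive loss goes beyond this basic form.

```python
import torch
import torch.nn.functional as F

def infonce_with_hard_negatives(audio_emb, text_emb, hard_neg_emb, tau=0.07):
    """Audio-to-text InfoNCE with composition-aware hard negatives.

    audio_emb:    (B, D) normalized audio embeddings
    text_emb:     (B, D) normalized embeddings of the matching captions
    hard_neg_emb: (B, K, D) normalized embeddings of K recomposed captions
                  per audio (same events, different order / attribute binding)
    """
    B = audio_emb.size(0)
    # Standard in-batch logits: each audio against every caption in the batch.
    logits = audio_emb @ text_emb.t() / tau                                  # (B, B)
    # Hard-negative logits: each audio against its own recomposed captions.
    hard_logits = torch.einsum("bd,bkd->bk", audio_emb, hard_neg_emb) / tau  # (B, K)
    # The positive stays at column i; hard negatives only enlarge the
    # denominator, forcing the model to separate compositions, not just events.
    full_logits = torch.cat([logits, hard_logits], dim=1)                    # (B, B+K)
    targets = torch.arange(B, device=audio_emb.device)
    return F.cross_entropy(full_logits, targets)
```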
Authors (10)
  1. Sreyan Ghosh (46 papers)
  2. Ashish Seth (22 papers)
  3. Sonal Kumar (30 papers)
  4. Utkarsh Tyagi (18 papers)
  5. Chandra Kiran Evuru (2 papers)
  6. Oriol Nieto (22 papers)
  7. Ramani Duraiswami (40 papers)
  8. Dinesh Manocha (366 papers)
  9. S. Ramaneswaran (1 paper)
  10. S. Sakshi (1 paper)
Citations (15)