Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval (2403.10146v1)

Published 15 Mar 2024 in cs.SD, cs.IR, and eess.AS

Abstract: Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to boost more intricate alignment. Extensive experiments validate the effectiveness of our approach, outperforming previous methods by significant margins of at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.
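
The abstract's two core ideas, fine-grained local-to-global matching and cross-modal similarity consistency, can be illustrated with a short sketch. The code below is not the authors' implementation; it is a minimal PyTorch illustration under assumed names, shapes, and hyperparameters: a late-interaction-style token-level similarity (per-frame max over words, then mean over frames) stands in for the multiscale alignment, and a KL-divergence term uses intra-modal (audio-audio and text-text) similarities as soft targets for the cross-modal similarity matrix. The temperature `tau`, the pooling choices, and all function names are assumptions.

```python
# Hedged sketch (not the authors' code): (i) a fine-grained audio-text
# similarity built from frame-/word-level token embeddings, and (ii) a
# cross-modal similarity-consistency loss with intra-modal soft targets.
import torch
import torch.nn.functional as F


def fine_grained_similarity(audio_tokens, text_tokens):
    """Local-to-global matching for every audio-text pair in a batch.

    audio_tokens: (B, Ta, D) frame-level audio embeddings
    text_tokens:  (B, Tt, D) word-level text embeddings
    returns:      (B, B) pairwise similarity matrix
    """
    a = F.normalize(audio_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    # token-level similarities for all pairs: (B, B, Ta, Tt)
    sim = torch.einsum("iad,jtd->ijat", a, t)
    # each audio frame keeps its best-matching word, then frames are averaged
    return sim.max(dim=-1).values.mean(dim=-1)


def similarity_consistency_loss(cross_sim, audio_global, text_global, tau=0.07):
    """Push the cross-modal similarity distribution of each anchor toward the
    corresponding intra-modal (audio-audio / text-text) distributions."""
    a = F.normalize(audio_global, dim=-1)
    t = F.normalize(text_global, dim=-1)
    log_p_a2t = F.log_softmax(cross_sim / tau, dim=-1)       # audio -> text
    log_p_t2a = F.log_softmax(cross_sim.t() / tau, dim=-1)   # text -> audio
    q_tt = F.softmax(t @ t.t() / tau, dim=-1)                # text-text target
    q_aa = F.softmax(a @ a.t() / tau, dim=-1)                # audio-audio target
    return 0.5 * (F.kl_div(log_p_a2t, q_tt, reduction="batchmean")
                  + F.kl_div(log_p_t2a, q_aa, reduction="batchmean"))
```

In training, a sketch like this would typically combine the consistency term with a standard contrastive loss over `fine_grained_similarity(audio_tokens, text_tokens)`; the exact weighting and the global-embedding construction are design choices not specified in the abstract.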

Authors (3)
  1. Qian Wang (453 papers)
  2. Jia-Chen Gu (42 papers)
  3. Zhen-Hua Ling (114 papers)