Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval (2403.10756v1)

Published 16 Mar 2024 in eess.AS and cs.SD

Abstract: The aim of this research is to refine knowledge transfer on audio-image temporal agreement for audio-text cross retrieval. To address the limited availability of paired non-speech audio-text data, learning methods that transfer knowledge acquired from large amounts of paired audio-image data to a shared audio-text representation have been investigated, suggesting the importance of how audio-image co-occurrence is learned. Conventional approaches to audio-image learning assign a single image, randomly selected from the corresponding video stream, to the entire audio clip, assuming their co-occurrence. However, this method may not accurately capture the temporal agreement between the target audio and image, because a single image represents only a snapshot of a scene whereas the target audio changes from moment to moment. To address this problem, we propose two audio-image matching methods that effectively capture temporal information: (i) Nearest Match, in which an image is selected from multiple time frames based on its similarity to the audio, and (ii) Multiframe Match, in which audio-image pairs from multiple time frames are used. Experimental results show that method (i) improves audio-text retrieval performance by selecting the nearest image that aligns with the audio information and transferring the learned knowledge. Conversely, method (ii) improves audio-image retrieval performance but does not show significant improvements in audio-text retrieval. These results indicate that refining audio-image temporal agreement may contribute to better knowledge transfer for audio-text retrieval.
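
To make the two matching strategies concrete, below is a minimal PyTorch sketch operating on precomputed embeddings. The function and variable names (nearest_match, multiframe_match, audio_emb, frame_embs) and the symmetric InfoNCE-style loss are illustrative assumptions based on the abstract, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def nearest_match(audio_emb: torch.Tensor, frame_embs: torch.Tensor) -> torch.Tensor:
    """Method (i), Nearest Match (sketch): for each audio clip, pick the single
    video frame whose embedding is most similar to the audio embedding.

    audio_emb:  (B, D)    one embedding per audio clip
    frame_embs: (B, T, D) embeddings of T candidate frames per clip
    returns:    (B, D)    the selected frame embedding per clip
    """
    a = F.normalize(audio_emb, dim=-1)               # (B, D)
    f = F.normalize(frame_embs, dim=-1)              # (B, T, D)
    sims = torch.einsum("bd,btd->bt", a, f)          # cosine similarity per frame
    idx = sims.argmax(dim=1)                         # index of the nearest frame
    return frame_embs[torch.arange(frame_embs.size(0)), idx]

def multiframe_match(audio_embs: torch.Tensor, frame_embs: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Method (ii), Multiframe Match (sketch): every time frame of a clip
    contributes its own positive audio-image pair to a contrastive loss.

    audio_embs: (B, T, D) audio embeddings per time frame
    frame_embs: (B, T, D) image embeddings per time frame
    """
    a = F.normalize(audio_embs.reshape(-1, audio_embs.size(-1)), dim=-1)  # (B*T, D)
    v = F.normalize(frame_embs.reshape(-1, frame_embs.size(-1)), dim=-1)  # (B*T, D)
    logits = a @ v.t() / temperature            # (B*T, B*T) similarity matrix
    targets = torch.arange(logits.size(0))      # the matching pair is the positive
    # symmetric loss over audio-to-image and image-to-audio directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In the setting described by the abstract, the frame selected by Nearest Match (or the multiframe contrastive loss) would feed the audio-image pretraining stage whose knowledge is then transferred to audio-text retrieval; the sketch above only illustrates the matching logic, not the full training pipeline.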

