Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models (2404.06309v1)

Published 9 Apr 2024 in cs.CV

Abstract: Audio-visual zero-shot learning methods commonly build on features extracted from pre-trained models, e.g., video or audio classification models. However, existing benchmarks predate the popularization of large multi-modal models such as CLIP and CLAP. In this work, we explore such large pre-trained models to obtain features, i.e., CLIP for visual features and CLAP for audio features. Furthermore, the CLIP and CLAP text encoders provide class-label embeddings, which are combined to boost the performance of the system. We propose a simple yet effective model that relies only on feed-forward neural networks, exploiting the strong generalization capabilities of the new audio, visual, and textual features. Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL with our new features. Code and data are available at: https://github.com/dkurzend/ClipClap-GZSL.
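Since the method is described as feed-forward networks over pre-extracted CLIP and CLAP features scored against combined class-label text embeddings, a compact sketch conveys the idea. The PyTorch snippet below is illustrative only: the feature dimensions (512 for CLIP, 1024 for CLAP), the hidden sizes, the additive audio-visual fusion, and the cosine-similarity scoring are assumptions made for the sketch, not the authors' exact design; the real architecture lives in the linked repository.

```python
# Minimal sketch of the high-level recipe, NOT the authors' exact architecture
# (see https://github.com/dkurzend/ClipClap-GZSL for the actual code).
# Assumed: 512-dim CLIP features, 1024-dim CLAP features, a shared 512-dim
# embedding space, additive fusion, and cosine-similarity class scoring.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardBranch(nn.Module):
    """Small MLP mapping a pre-extracted feature into the shared space."""
    def __init__(self, in_dim: int, out_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(0.2),  # assumed dropout rate
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class AudioVisualGZSL(nn.Module):
    def __init__(self, clip_dim: int = 512, clap_dim: int = 1024, emb_dim: int = 512):
        super().__init__()
        self.visual = FeedForwardBranch(clip_dim, emb_dim)
        self.audio = FeedForwardBranch(clap_dim, emb_dim)
        # CLIP and CLAP text embeddings of each class label are concatenated
        # and projected, so both text encoders contribute to the class anchors.
        self.text = FeedForwardBranch(clip_dim + clap_dim, emb_dim)

    def forward(self, clip_feat, clap_feat, clip_text, clap_text):
        av = self.visual(clip_feat) + self.audio(clap_feat)  # fuse modalities
        cls = self.text(torch.cat([clip_text, clap_text], dim=-1))
        # Cosine similarity between the fused audio-visual embedding and every
        # class-label embedding gives the zero-shot classification logits.
        return F.normalize(av, dim=-1) @ F.normalize(cls, dim=-1).T

model = AudioVisualGZSL()
logits = model(
    torch.randn(8, 512),    # CLIP visual features for 8 video clips
    torch.randn(8, 1024),   # CLAP audio features for the same clips
    torch.randn(10, 512),   # CLIP text embeddings for 10 class labels
    torch.randn(10, 1024),  # CLAP text embeddings for the same labels
)
print(logits.shape)  # torch.Size([8, 10])
```

In the generalized zero-shot setting, the class-label embeddings would cover both seen and unseen classes, and the prediction is the argmax over each row of the similarity matrix.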
