
Unified Video-Language Pre-training with Synchronized Audio (2405.07202v1)

Published 12 May 2024 in cs.CV, cs.AI, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: Video-language pre-training is a common yet challenging problem that aims to learn visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either capture the correspondence of image-text pairs or exploit the temporal ordering of frames; however, they do not explicitly model the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed VLSA, which learns tri-modal representations in a unified self-supervised transformer. Specifically, VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we use local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and text. We conduct extensive experiments on retrieval across text, video, and audio. Our simple model, pre-trained on only 0.9M data samples, achieves improved results over state-of-the-art baselines. In addition, qualitative visualizations showcase the strength of VLSA in learning discriminative visual-textual representations.
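To make the abstract's two objectives concrete, here is a minimal sketch of how a unified tri-modal transformer with a global audio-matching loss could be wired up. This is an illustrative assumption in PyTorch, not the authors' released implementation: every module name, dimension, temperature, and loss weight below is hypothetical, and the local-patch masked-modeling branch is omitted for brevity.

```python
# Hypothetical sketch of the VLSA-style scheme described in the abstract.
# Assumptions (not from the paper): dimensions, depth, temperature, and
# the use of a shared nn.TransformerEncoder over concatenated modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLSASketch(nn.Module):
    def __init__(self, dim=512, depth=6, heads=8):
        super().__init__()
        # One shared transformer consumes patch/token embeddings from all
        # three modalities, per the "unified self-supervised transformer" idea.
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # One learnable global token per modality, analogous to [CLS].
        self.global_tokens = nn.Parameter(torch.randn(3, dim))

    def forward(self, video, text, audio):
        # video/text/audio: (B, N_modality, dim) local patch/token embeddings.
        b = video.size(0)
        g = self.global_tokens.unsqueeze(0).expand(b, -1, -1)
        # Jointly aggregate global tokens and local patches of all modalities.
        h = self.encoder(torch.cat([g, video, text, audio], dim=1))
        g_video, g_text, g_audio = h[:, 0], h[:, 1], h[:, 2]
        return h, g_video, g_text, g_audio

def global_audio_matching_loss(g_video, g_text, g_audio, tau=0.07):
    # Contrastive matching between the audio global token and the video/text
    # global tokens; each sample's own pair is the positive within the batch.
    v, t, a = (F.normalize(z, dim=-1) for z in (g_video, g_text, g_audio))
    target = torch.arange(a.size(0), device=a.device)
    loss_av = F.cross_entropy(a @ v.T / tau, target)
    loss_at = F.cross_entropy(a @ t.T / tau, target)
    return 0.5 * (loss_av + loss_at)
```

In this reading, the synchronized audio acts as an anchor: pulling both the video and text global tokens toward their matching audio token is one plausible way to realize the "audio-guided features for video and text" that the abstract describes.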

Authors (4)
  1. Shentong Mo (56 papers)
  2. Haofan Wang (32 papers)
  3. Huaxia Li (17 papers)
  4. Xu Tang (48 papers)
Citations (2)