From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation (2409.19132v1)

Published 27 Sep 2024 in cs.MM, cs.CV, cs.LG, cs.SD, and eess.AS

Abstract: Video encompasses both visual and auditory data, creating a perceptually rich experience in which the two modalities complement each other. As such, videos are a valuable type of media for investigating the interplay between audio and visual elements. Previous studies of audio-visual modalities have primarily focused on either audio-visual representation learning or generative modeling of one modality conditioned on the other, leaving a disconnect between these two branches; a unified framework that both learns representations and generates modalities has yet to be developed. In this work, we introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation. The key idea of VAB is to perform representation learning and generative modeling within latent spaces rather than on raw video frames and audio. Specifically, VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively, and then performs the pre-training task of visual-conditioned masked audio token prediction. This training strategy enables the model to engage in contextual learning and simultaneous video-to-audio generation. After pre-training, VAB employs iterative decoding to rapidly generate audio tokens conditioned on visual features. Since VAB is a unified model, its backbone can be fine-tuned for various audio-visual downstream tasks. Our experiments demonstrate the efficiency of VAB in producing high-quality audio from video and its ability to acquire semantic audio-visual features, leading to competitive results in audio-visual retrieval and classification.
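
The pre-training and generation recipe described in the abstract follows the masked-token paradigm popularized by MaskGIT and SoundStorm: audio is represented as discrete tokens, a transformer predicts masked tokens conditioned on visual features, and generation proceeds by iteratively re-masking low-confidence predictions. The PyTorch sketch below illustrates that recipe only; it is not the authors' implementation. All module names, dimensions, mask ratios, and decoding-step counts are illustrative assumptions, and the pre-trained audio tokenizer and image encoder are stood in by plain tensors.

```python
# Minimal sketch (not the authors' code) of visual-conditioned masked audio
# token prediction with MaskGIT/SoundStorm-style iterative decoding.
# All names, dimensions, and hyperparameters are illustrative assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualConditionedMaskedAudioModel(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512, n_heads=8, n_layers=6,
                 audio_len=256, visual_dim=768):
        super().__init__()
        self.mask_id = vocab_size                    # extra token id used as [MASK]
        self.audio_emb = nn.Embedding(vocab_size + 1, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, audio_len, d_model))
        self.visual_proj = nn.Linear(visual_dim, d_model)   # project frozen image features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_tokens, visual_feats):
        # audio_tokens: (B, T) codes from a pre-trained audio tokenizer
        # visual_feats: (B, N, visual_dim) features from a pre-trained image encoder
        x = self.audio_emb(audio_tokens) + self.pos_emb[:, :audio_tokens.size(1)]
        v = self.visual_proj(visual_feats)
        h = self.backbone(torch.cat([v, x], dim=1))  # joint audio-visual context
        return self.head(h[:, v.size(1):])           # logits over audio positions only


def masked_prediction_loss(model, audio_tokens, visual_feats, mask_ratio=0.6):
    """Pre-training step: mask a random subset of audio tokens and predict them."""
    B, T = audio_tokens.shape
    mask = torch.rand(B, T, device=audio_tokens.device) < mask_ratio
    inputs = audio_tokens.masked_fill(mask, model.mask_id)
    logits = model(inputs, visual_feats)
    return F.cross_entropy(logits[mask], audio_tokens[mask])


@torch.no_grad()
def iterative_decode(model, visual_feats, audio_len=256, steps=8):
    """Generation: start fully masked, keep the most confident tokens each step."""
    B = visual_feats.size(0)
    device = visual_feats.device
    tokens = torch.full((B, audio_len), model.mask_id, device=device, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, visual_feats)
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(tokens != model.mask_id, float("inf"))  # keep fixed tokens
        # cosine schedule: fraction of positions still masked after this step
        keep_masked = int(audio_len * math.cos(math.pi / 2 * (step + 1) / steps))
        tokens = torch.where(tokens == model.mask_id, pred, tokens)
        if keep_masked > 0:
            lowest = conf.topk(keep_masked, largest=False).indices
            tokens.scatter_(1, lowest, model.mask_id)  # re-mask least confident positions
    return tokens
```

The cosine schedule in iterative_decode mirrors confidence-based unmasking in MaskGIT-style decoders: early steps commit only a few high-confidence tokens and later steps fill in the rest, so a full audio token sequence is produced in a small, fixed number of parallel passes rather than one token at a time, which is consistent with the abstract's claim of rapid generation.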
