V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models (2308.09300v4)
Abstract: Building AI systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. The representative and generative abilities these models learn from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs for cross-modal generation remains under-researched when the audio modality is involved. Meanwhile, automatically generating semantically relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution that leverages foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent spaces of the visual CLIP and the auditory CLAP models. We then propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge this gap by translating the visual input from the CLIP space to the CLAP space. Conditioned on the translated CLAP embedding, the pretrained audio-generative FM AudioLDM is adopted to produce high-fidelity and visually aligned sound. Compared with previous approaches, our method requires only a quick training of the V2A-Mapper. We further analyze the choice of V2A-Mapper through extensive experiments and show that a generative mapper is better at fidelity and variability (FD), while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluations on two V2A datasets demonstrate the superiority of our method over current state-of-the-art approaches: it is trained with 86% fewer parameters yet achieves 53% and 19% improvements in FD and CS, respectively.
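The pipeline described in the abstract (frozen CLIP image encoder, trainable V2A-Mapper, CLAP-conditioned AudioLDM) can be illustrated with a minimal sketch of the regression-style mapper: a small network trained to translate CLIP image embeddings into the CLAP embedding space. This is an illustrative assumption, not the paper's implementation; the embedding dimensions, hidden size, and cosine-distance loss are placeholders, and the frozen CLIP/CLAP encoders are replaced here by random stand-in tensors.

```python
# Hypothetical sketch of a regression V2A-Mapper: CLIP image embedding -> CLAP space.
# In the real pipeline, the translated CLAP embedding would then condition a
# pretrained text-to-audio model such as AudioLDM; that step is omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_DIM = 512   # assumed CLIP image-embedding size
CLAP_DIM = 512   # assumed CLAP embedding size


class RegressionMapper(nn.Module):
    """MLP that maps a CLIP image embedding into the CLAP embedding space."""

    def __init__(self, in_dim: int = CLIP_DIM, out_dim: int = CLAP_DIM, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        return self.net(clip_emb)


def mapper_loss(pred_clap: torch.Tensor, target_clap: torch.Tensor) -> torch.Tensor:
    """Cosine-distance regression loss between predicted and target CLAP embeddings."""
    return 1.0 - F.cosine_similarity(pred_clap, target_clap, dim=-1).mean()


if __name__ == "__main__":
    mapper = RegressionMapper()
    optimizer = torch.optim.AdamW(mapper.parameters(), lr=1e-4)

    # Stand-in batch: in practice these come from a frozen CLIP encoder applied to
    # video frames and a frozen CLAP encoder applied to the paired audio.
    clip_emb = F.normalize(torch.randn(8, CLIP_DIM), dim=-1)
    clap_emb = F.normalize(torch.randn(8, CLAP_DIM), dim=-1)

    optimizer.zero_grad()
    loss = mapper_loss(mapper(clip_emb), clap_emb)
    loss.backward()
    optimizer.step()
    print(f"toy training step, loss = {loss.item():.4f}")
```

The abstract's alternative, a generative mapper, would replace this deterministic MLP with a conditional generative model (e.g., a diffusion prior over CLAP embeddings), trading slightly lower relevance (CS) for better fidelity and variability (FD).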
Authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong Cai