
FoleyGen: Visually-Guided Audio Generation (2309.10537v1)

Published 19 Sep 2023 in eess.AS, cs.MM, and cs.SD

Abstract: Recent advances in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation remains challenging, principally because of the intricate relationship between high-dimensional visual and auditory data and the difficulty of temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language-modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. Audio tokens are generated by a single Transformer model conditioned on visual features extracted from a visual encoder. A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video. To address this, we explore three novel visual attention mechanisms. We further undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. Experimental results on the VGGSound dataset show that our proposed FoleyGen outperforms previous systems across all objective metrics and in human evaluations.
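The conditioning described in the abstract — audio-token states attending to per-frame visual features — can be illustrated with a minimal cross-attention sketch. This is a simplified stand-in for the visual attention mechanisms the paper explores, not the authors' implementation; all shapes, names, and sizes are illustrative assumptions.

```python
import numpy as np

def visual_cross_attention(audio_h, visual_f):
    """Scaled dot-product cross-attention (illustrative sketch):
    audio-token hidden states query per-frame visual features, so each
    generated audio token can attend to the visual context.
    audio_h:  (T_a, d) audio-token states
    visual_f: (T_v, d) visual frame features
    returns:  (T_a, d) visually-conditioned states"""
    d = audio_h.shape[-1]
    scores = audio_h @ visual_f.T / np.sqrt(d)      # (T_a, T_v) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ visual_f                       # weighted frame features

rng = np.random.default_rng(0)
audio_h = rng.standard_normal((8, 16))   # 8 audio-token states
visual_f = rng.standard_normal((5, 16))  # 5 visual frame features
out = visual_cross_attention(audio_h, visual_f)
print(out.shape)  # (8, 16)
```

In the full system, a stack of such attention layers would sit inside the Transformer that autoregressively predicts discrete codec tokens, which the neural audio codec then decodes back to a waveform.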

Authors (7)
  1. Xinhao Mei (24 papers)
  2. Varun Nagaraja (9 papers)
  3. Zhaoheng Ni (32 papers)
  4. Ernie Chang (34 papers)
  5. Yangyang Shi (54 papers)
  6. Vikas Chandra (75 papers)
  7. Gael Le Lan (23 papers)
Citations (11)
