
Diff-BGM: A Diffusion Model for Video Background Music Generation (2405.11913v1)

Published 20 May 2024 in cs.CV

Abstract: When editing a video, an attractive piece of background music is indispensable. However, video background music generation faces several challenges, for example, the lack of suitable training datasets and the difficulty of flexibly controlling the generation process while sequentially aligning the video and music. In this work, we first propose BGM909, a high-quality music-video dataset with detailed annotations and shot detection that provides multi-modal information about the video and music. We then present evaluation metrics that assess music quality and diversity, as well as the alignment between music and video, using retrieval-precision metrics. Finally, we propose the Diff-BGM framework to automatically generate background music for a given video; it uses different signals to control different aspects of the music during generation, i.e., dynamic video features to control the rhythm and semantic features to control the melody and atmosphere. We propose to align the video and music sequentially by introducing a segment-aware cross-attention layer. Experiments verify the effectiveness of our proposed method. The code and models are available at https://github.com/sizhelee/Diff-BGM.
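
For intuition, here is a minimal sketch of how a segment-aware cross-attention layer of the kind described in the abstract could be realized, assuming a PyTorch setting: music tokens (queries) are only allowed to attend to video features (keys/values) that fall within the same temporal segment, e.g. the same detected shot. All module and argument names below are illustrative assumptions, not the authors' released implementation; consult the linked repository for the actual design.

```python
import torch
import torch.nn as nn


class SegmentAwareCrossAttention(nn.Module):
    """Cross-attention where music tokens only attend to video features
    from the same segment (illustrative sketch, not the official code)."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, music_tokens, video_feats, music_seg_ids, video_seg_ids):
        # music_tokens : (B, Tm, D) noisy music latents used as queries
        # video_feats  : (B, Tv, D) per-frame video features used as keys/values
        # music_seg_ids: (B, Tm)    segment (shot) index of each music token
        # video_seg_ids: (B, Tv)    segment (shot) index of each video frame

        # True entries mark pairs that are NOT allowed to attend to each other,
        # i.e. music tokens and video frames that belong to different segments.
        mask = music_seg_ids.unsqueeze(-1) != video_seg_ids.unsqueeze(1)  # (B, Tm, Tv)
        # nn.MultiheadAttention expects a (B * n_heads, Tm, Tv) boolean mask.
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(music_tokens, video_feats, video_feats, attn_mask=mask)
        return out


# Hypothetical usage: 4 evenly spaced segments over 256 music tokens and 64 frames.
B, Tm, Tv, D = 2, 256, 64, 512
layer = SegmentAwareCrossAttention(D)
music = torch.randn(B, Tm, D)
video = torch.randn(B, Tv, D)
music_seg = torch.arange(Tm).repeat(B, 1) * 4 // Tm  # segment ids in {0..3}
video_seg = torch.arange(Tv).repeat(B, 1) * 4 // Tv  # segment ids in {0..3}
out = layer(music, video, music_seg, video_seg)       # (B, Tm, D)
```

Masking attention across segment boundaries is one straightforward way to obtain the sequential video-music alignment the abstract refers to, since each generated music span can only draw on the visual content of its own segment.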
