Diff-BGM: A Diffusion Model for Video Background Music Generation (2405.11913v1)
Abstract: When editing a video, an attractive piece of background music is indispensable. However, video background music generation faces several challenges, such as the lack of suitable training datasets and the difficulty of flexibly controlling the generation process while sequentially aligning the music with the video. In this work, we first propose BGM909, a high-quality music-video dataset with detailed annotations and shot detection that provides multi-modal information about the video and music. We then present evaluation metrics that assess music quality and diversity, as well as the alignment between music and video, using retrieval precision metrics. Finally, we propose the Diff-BGM framework to automatically generate background music for a given video, using different signals to control different aspects of the music during generation: dynamic video features control the rhythm, while semantic features control the melody and atmosphere. We align the video and music sequentially by introducing a segment-aware cross-attention layer. Experiments verify the effectiveness of the proposed method. The code and models are available at https://github.com/sizhelee/Diff-BGM.
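The abstract does not spell out how the segment-aware cross-attention layer is implemented, so the snippet below is only a minimal sketch of the general idea, not the authors' code: music tokens inside the denoising network attend only to the video features that belong to the same shot segment, which is one plausible way to realize the sequential video-music alignment described above. The class name, tensor shapes, and masking scheme are assumptions introduced for illustration.

```python
# Sketch of a segment-aware cross-attention layer (illustrative, not the paper's implementation).
import torch
import torch.nn as nn

class SegmentAwareCrossAttention(nn.Module):
    def __init__(self, d_music: int, d_video: int, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(d_music, d_model)   # music latents -> queries
        self.kv_proj = nn.Linear(d_video, d_model)  # video features -> keys/values
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_music)

    def forward(self, music, video, music_seg_id, video_seg_id):
        # music:        (B, Lm, d_music)  noisy music latents at the current diffusion step
        # video:        (B, Lv, d_video)  per-frame video features (semantic or dynamic)
        # music_seg_id: (B, Lm)           segment index of each music token
        # video_seg_id: (B, Lv)           segment index of each frame, e.g. from shot detection
        q = self.q_proj(music)
        kv = self.kv_proj(video)
        # Block attention across segments: True entries are positions that may NOT be attended to.
        # Assumes every music segment has at least one matching video frame (otherwise the
        # softmax row would be fully masked).
        mask = music_seg_id.unsqueeze(-1) != video_seg_id.unsqueeze(1)   # (B, Lm, Lv)
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)        # (B*n_heads, Lm, Lv)
        ctx, _ = self.attn(q, kv, kv, attn_mask=mask)
        return music + self.out(ctx)  # residual connection back into the music stream
```

In this reading, the shot boundaries from BGM909 supply the segment indices, so rhythm- and melody-related conditioning stays local to the shot it describes rather than being averaged over the whole video.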