
Video Background Music Generation with Controllable Music Transformer (2111.08380v1)

Published 16 Nov 2021 in cs.MM, cs.SD, and eess.AS

Abstract: In this work, we address the task of video background music generation. Some previous works achieve effective music generation but are unable to generate melodious music tailored to a particular video, and none of them considers the video-music rhythmic consistency. To generate the background music that matches the given video, we first establish the rhythmic relations between video and background music. In particular, we connect timing, motion speed, and motion saliency from video with beat, simu-note density, and simu-note strength from music, respectively. We then propose CMT, a Controllable Music Transformer that enables local control of the aforementioned rhythmic features and global control of the music genre and instruments. Objective and subjective evaluations show that the generated background music has achieved satisfactory compatibility with the input videos, and at the same time, impressive music quality. Code and models are available at https://github.com/wzk1015/video-bgm-generation.

Authors (8)
  1. Shangzhe Di (8 papers)
  2. Zeren Jiang (9 papers)
  3. Si Liu (130 papers)
  4. Zhaokai Wang (15 papers)
  5. Leyan Zhu (2 papers)
  6. Zexin He (6 papers)
  7. Hongming Liu (2 papers)
  8. Shuicheng Yan (275 papers)
Citations (75)

Summary

An Examination of Video Background Music Generation with Controllable Music Transformer

The paper presents a significant development in the automatic generation of video background music through the introduction of the Controllable Music Transformer (CMT). The research addresses the underexplored problem of generating background music tailored specifically to a given video, with a particular focus on establishing rhythmic harmony between video content and musical accompaniment. The key innovation lies in extracting rhythmic features from video properties and mapping them to corresponding musical features with a transformer model.

Core Contributions and Methodology

  1. Rhythmic Correlations: The authors identify and formalize three core rhythmic relationships between video and music. Timing, motion speed, and motion saliency in the video are mapped to beat timing, simu-note density, and simu-note strength in the music, respectively, so that the auditory response tracks the visual activity on screen (a minimal feature-extraction sketch follows this list).
  2. Controllable Music Transformer: At the heart of this methodology is the CMT, a model enabling local control over rhythmic features and global control of music genre and instrumentation. Unlike previous work requiring vast datasets of annotated video-music pairs, this approach leverages unannotated datasets, focusing on the rhythmic attributes for music generation.
  3. Music Representation and Attributes: The paper introduces a comprehensive framework that organizes musical attributes into compound word representations. This allows for the concise encapsulation of essential musical information, including pitch, duration, and instrumental type, facilitating more efficient music generation (an illustrative token layout also follows this list).
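
The video-to-music rhythm mapping can be illustrated with a small sketch. The snippet below is a hypothetical example, not the authors' implementation: it estimates per-frame motion speed as the mean dense optical-flow magnitude (via OpenCV) and quantizes it into discrete levels that could steer simu-note density; the function names and number of levels are assumptions.

```python
import cv2
import numpy as np

def motion_speed_per_frame(video_path):
    """Estimate per-frame motion speed as the mean optical-flow magnitude."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    speeds = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        speeds.append(float(mag.mean()))
        prev_gray = gray
    cap.release()
    return np.array(speeds)

def quantize_to_levels(values, n_levels=4):
    """Quantize continuous motion speed into discrete levels that can act as
    a local control signal for simu-note density (more motion -> denser notes)."""
    edges = np.quantile(values, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(values, edges)  # integer levels in {0, ..., n_levels - 1}
```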

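A compound-word style representation groups the attributes of one musical event into a single token with several fields, each drawn from its own small vocabulary. The sketch below is purely illustrative; the exact field set and vocabularies in the released code may differ.

```python
from dataclasses import dataclass

@dataclass
class CompoundToken:
    """One musical event as a compound word: each field is a separate sub-token."""
    token_type: str   # e.g. "note", "beat", or "bar"
    beat: int         # position within the bar
    pitch: int        # MIDI pitch number (0-127)
    duration: int     # length in quantized time steps
    instrument: int   # instrument / track id
    density: int      # simu-note density level (local rhythm control)
    strength: int     # simu-note strength level (local rhythm control)

# Example: a moderately strong note on beat 2 of the piano track.
token = CompoundToken(token_type="note", beat=2, pitch=64,
                      duration=4, instrument=0, density=2, strength=1)
```
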
Experimental Evaluation

The authors validate their approach through both objective metrics and a subjective user study. Objective evaluations demonstrate improvements over baseline models in compatibility with the video's rhythm and in overall music quality. The alignment of video traits with musical elements is measured quantitatively, showcasing the model's precision in adhering to the outlined video-music correlations.
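
As one hedged illustration of how such rhythmic alignment can be quantified (the paper's exact metrics are not reproduced here), the snippet below correlates a video-derived motion-speed curve with the note-density curve of the generated music, after resampling both to a common length.

```python
import numpy as np

def rhythm_alignment(video_speed, note_density):
    """Pearson correlation between motion speed and note density over time.
    A purely illustrative proxy for video-music rhythmic compatibility."""
    n = min(len(video_speed), len(note_density))
    grid = np.linspace(0, 1, n)
    v = np.interp(grid, np.linspace(0, 1, len(video_speed)), video_speed)
    m = np.interp(grid, np.linspace(0, 1, len(note_density)), note_density)
    return float(np.corrcoef(v, m)[0, 1])
```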

The subjective evaluation asks users to rate the generated music's richness, correctness, and structuredness, as well as its rhythmicity and correspondence with the video. By assessing both the musical quality itself and how well the music synchronizes with the visual content, the study covers both perceived quality and practical applicability.

Implications and Future Directions

The implications of this work are substantial, particularly for multimedia experiences where synchronized audio-visual integration is pivotal. This research presents a strong case for AI-driven music creation that preserves the emotional and rhythmic coherence of the soundtrack in relation to the video.

In terms of future developments, interesting avenues include exploring more abstract dimensions such as emotion and style in the relationship between visual content and generated audio. Additionally, generating audio directly in the waveform domain and adopting unsupervised learning approaches offer promising potential for refining synchronized music generation.

This research exemplifies the viability and effectiveness of leveraging advanced AI models such as transformers for multimedia applications, marking a considerable step forward in the domain of automated background music generation. By methodologically linking video features to their musical counterparts, this work not only enhances the computational model's understanding of cross-modal relations but also extends the boundaries of AI's role in creative industries.
