VideoScore-2: Automated Film Score Generation

Updated 31 December 2025
  • VideoScore-2 is a film score generation framework that uses latent diffusion modeling to align visual content with musical elements.
  • It integrates a multimodal film encoder combining semantic, aesthetic, and emotional branches to produce unified video representations.
  • The system enables efficient maestro-guided style transfer with local melody and dynamics controls, validated on the FilmScoreDB benchmark.

VideoScore-2 is a film score generation framework designed to harmonize visual content and musical melodies through latent diffusion modeling. The system synthesizes music directly from video and incorporates mechanisms to tailor outputs to particular composition styles, including guidance from a "maestro" reference clip. VideoScore-2 combines a multimodal film encoder, parameter-efficient ControlNet tuning, and local and global conditioning, and introduces novel evaluation metrics and a curated dataset (FilmScoreDB) for benchmarking automated film scoring systems. Together, these advances establish a new standard for generative film scoring and style transfer in audiovisual composition (Qi et al., 2024).

1. Architectural Design and Workflow

VideoScore-2 accepts as input a video clip (10 seconds, 10 fps) and, optionally, style reference signals: melody $C_{mel} \in \mathbb{R}^{T \times 12}$ and dynamics $C_{dyn} \in \mathbb{R}^{T \times 1}$ derived from a "maestro" clip. The pipeline consists of several primary components:

  • Film Encoder: Extracts high-level semantic, aesthetic, and emotional representations from the video.
    • Semantic Branch: Uses a CLIP image encoder, producing a 512-dimensional embedding per frame ($c_s$), temporally averaged.
    • Aesthetic Branch (TAVAR): ResNet-50 backbone with six MLPs, theme networks, a GCN, and a Swin Transformer, yielding a 512-dimensional aesthetic embedding ($c_a$).
    • Emotional Branch: A WECL weakly-supervised detector outputs a one-hot emotion label, embedded to 512 dimensions ($c_e$).
    • Fusion (LAFF): Lightweight Attention Feature Fusion learns convex weights ($\omega_s, \omega_a, \omega_e$) to combine $[c_s, c_a, c_e]$ into a global video control vector $c_{film} \in \mathbb{R}^{1 \times 512}$.
  • Local Style Controls:
    • Melody ($C_{mel}$): Chromagram extraction, collapsing F-bin energies into 12 pitch classes.
    • Dynamics ($C_{dyn}$): Summed linear-spectrogram energy per frame, converted to dB and smoothed.
  • Latent Diffusion Backbone (VM-Unet): Adapted AudioLDM UNet (U-shaped architecture) fine-tuned for video-to-audio correspondence.
  • Film Score ControlNet Branch: Copies weights from VM-Unet's Down and Middle blocks; attaches an "S-Control" branch with zero-initialized $1 \times 1$ convolutions ($Z_{in}$, $Z_{out}$), facilitating local and global conditioning.
    • Local Control Injection: Adds the convolved melody and dynamics signals to the normalized latent $z_t$ at each block.
    • Global Control: Implements cross-attention within each block: $Q = W_q(z_t)$, $K = W_k(c_{film})$, $V = W_v(c_{film})$.
  • Trainable Parameters: Only the S-Control branch and LoRA adapters are updated during fine-tuning; VM-Unet weights are frozen, promoting efficiency (~20M trainable parameters vs. 87M in full fine-tuning).
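The flow of signals through the pipeline can be sketched at the level of shapes. This is a minimal stand-in, not the actual networks: the branch encoders and LAFF weights below are placeholders (`film_encoder` and the fixed weight vector `w` are assumptions for illustration), but the dimensions follow the text — 100 frames (10 s at 10 fps), 512-D branch embeddings, a $T \times 12$ melody control, and a $T \times 1$ dynamics control.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_encoder(frames):
    """Placeholder for the three-branch film encoder plus LAFF fusion."""
    c_s = frames.mean(axis=0)             # semantic branch: per-frame CLIP embeddings, temporally averaged
    c_a = rng.standard_normal(512)        # aesthetic branch (TAVAR) stand-in
    c_e = rng.standard_normal(512)        # emotional branch (WECL) stand-in
    w = np.array([0.5, 0.3, 0.2])         # LAFF convex weights (nonnegative, sum to 1)
    return w[0] * c_s + w[1] * c_a + w[2] * c_e   # c_film

frames = rng.standard_normal((100, 512))  # 10 s at 10 fps, one embedding per frame
c_film = film_encoder(frames)             # global control, shape (512,)
c_mel = rng.standard_normal((100, 12))    # local control: chromagram pitch classes
c_dyn = rng.standard_normal((100, 1))     # local control: smoothed dB energy

print(c_film.shape)  # (512,)
```

In the real system, $c_{film}$ conditions the diffusion backbone globally via cross-attention, while $C_{mel}$ and $C_{dyn}$ are injected locally through the S-Control branch.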

2. Latent Diffusion Formulation

The generative process operates on VAE-compressed mel-spectrogram latents $z_0 \in \mathbb{R}^{C \times T/r \times F/r}$:

  • Forward (Noising) Process:

    $q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\right)$

    • Cumulative form: $z_t = \sqrt{\alpha_t}\, z_0 + \sqrt{1-\alpha_t}\, \epsilon$, where $\alpha_t = \prod_{i=1}^t (1-\beta_i)$ and $\epsilon \sim \mathcal{N}(0, I)$
  • Reverse (Denoising) Process:

    $p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\left(z_{t-1};\ \mu_\theta(z_t, t, c),\ \Sigma_\theta(z_t, t, c)\right)$

  • Training Objective:

    $L(\theta) = \mathbb{E}_{t, z_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(z_t, t, c)\|^2\right]$ (L₂ loss on noise; a reweighted variational bound)

  • Classifier-Free Guidance: Interpolates predictions between conditioned and unconditioned generations to modulate control strength.
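The three pieces above (cumulative noising, the ε-prediction loss, and classifier-free guidance) can be sketched in a few lines of NumPy. The linear β schedule and guidance scale below are assumptions for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas = np.cumprod(1.0 - betas)        # alpha_t = prod_{i<=t} (1 - beta_i)

def noise(z0, t):
    """Cumulative forward process: z_t = sqrt(alpha_t) z0 + sqrt(1 - alpha_t) eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alphas[t]) * z0 + np.sqrt(1.0 - alphas[t]) * eps
    return zt, eps

def loss(eps_pred, eps):
    """The reweighted variational bound reduces to an L2 loss on the noise."""
    return np.mean((eps - eps_pred) ** 2)

def cfg(eps_uncond, eps_cond, w=3.0):
    """Classifier-free guidance: interpolate between unconditioned and conditioned predictions."""
    return eps_uncond + w * (eps_cond - eps_uncond)

z0 = rng.standard_normal((8, 16, 16))   # VAE latent, shape C x T/r x F/r
zt, eps = noise(z0, t=500)
print(loss(eps, eps))                   # 0.0 for a perfect noise predictor
```

At `w = 1` guidance reduces to the conditioned prediction; larger `w` strengthens the conditioning signal at the cost of diversity.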

3. Parameter-Efficient ControlNet Tuning and Conditioning

The ControlNet tuning protocol draws from Uni-ControlNet:

  • S-Control Branch: Each Down/Middle VM-Unet block hosts an identical structure with zero-initialized $1 \times 1$ convolutional layers, ensuring the branch is neutral at initialization.
  • Training: Only S-Control and LoRA parameters in transformer blocks are updated, drastically reducing trainable weights (20M vs. 87M).
  • Control Injection:
    • Global Control via Cross-Attention: Integrates global film features at each block.
    • Local Control via Direct Addition: Injects melody and dynamics signals.
  • This approach enables simultaneous spatial–temporal conditioning, with empirical efficiency gains and stability.
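The role of the zero-initialized $1 \times 1$ convolutions can be shown concretely: because the control branch's output projection starts at zero, adding it to the frozen backbone activation changes nothing at initialization, so training starts from the pretrained model's behavior. A minimal sketch (the class and function names are illustrative, not from the paper's code):

```python
import numpy as np

class ZeroConv1x1:
    """1x1 convolution with zero-initialized weights, as in the Z_in/Z_out layers."""
    def __init__(self, channels):
        self.w = np.zeros((channels, channels))   # zero init -> output is zero before training
        self.b = np.zeros(channels)

    def __call__(self, x):                        # x: (C, H, W)
        C, H, W = x.shape
        return (self.w @ x.reshape(C, -1)).reshape(C, H, W) + self.b[:, None, None]

def block_with_control(backbone_out, control_feat, z_out):
    # Local control injection: the control branch's projected output is added
    # to the frozen backbone activation; at init this addition is exactly zero.
    return backbone_out + z_out(control_feat)

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8, 8))     # frozen VM-Unet block activation
c = rng.standard_normal((4, 8, 8))     # melody/dynamics control feature
z_out = ZeroConv1x1(4)
print(np.allclose(block_with_control(h, c, z_out), h))  # True at initialization
```

As the zero convolutions' weights move away from zero during fine-tuning, the control signal is blended in gradually, which is what gives ControlNet-style tuning its stability.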

4. Film Encoder and Representation Fusion

The multimodal film encoder comprises three branches:

  • Semantic: CLIP ViT or ResNet-based encoders provide frame-level feature extraction and temporal pooling (512-D).
  • Aesthetic:
    • VAAN computes visual attributes via ResNet-50 and per-frame MLP heads.
    • TUN extracts theme features similarly.
    • Bilevel GCN connects theme and attribute nodes, updating general aesthetic embeddings through message passing.
    • Final FC projects to a 512-dimensional aesthetic vector.
  • Emotion: The WECL network outputs a one-hot vector, projected to 512-D using a linear layer.
  • Fusion Mechanism: LAFF block learns convex attention weights over semantic, aesthetic, and emotional vectors to create a unified global representation cfilmc_{film}.
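The LAFF fusion step can be sketched as a softmax over learned branch scores, which guarantees a convex combination (nonnegative weights summing to one). The attention logits below are placeholder values; in the real model they are learned:

```python
import numpy as np

def laff_fuse(c_s, c_a, c_e, scores):
    """Combine branch embeddings with convex (softmax-normalized) weights."""
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w = w / w.sum()                     # w >= 0 and sum(w) == 1: a convex combination
    return w[0] * c_s + w[1] * c_a + w[2] * c_e, w

rng = np.random.default_rng(0)
c_s, c_a, c_e = (rng.standard_normal(512) for _ in range(3))
scores = np.array([1.2, 0.4, -0.3])     # placeholder attention logits
c_film, w = laff_fuse(c_s, c_a, c_e, scores)
print(np.isclose(w.sum(), 1.0))         # True
```

The convexity constraint keeps $c_{film}$ on the same scale as the individual branch embeddings regardless of how the weights shift during training.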

5. Maestro-Guided Style Transfer and Inference Modes

VideoScore-2’s style transfer capability is driven by maestro-referenced local controls:

  • Reference Extraction: Melody ($C_{mel}$) and dynamics ($C_{dyn}$) features are derived from a single reference clip.
  • Vanilla Film Score Generation: Zero local controls for unbiased soundtrack creation.
  • Style Transferred Generation: Inserts maestro’s melody and dynamics controls during inference; no extra loss term is required.
  • Control Mechanism: Style influence is implemented exclusively through classifier-free guidance in the local-control branch.
  • Evaluation: Fidelity to reference is assessed by 12-class pitch accuracy (melody) and Pearson correlation (dynamics).
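The two local controls are simple signal-level features, so their extraction can be sketched directly. This is a schematic NumPy version under stated assumptions: it takes a precomputed magnitude spectrogram and a bin-to-pitch-class mapping as inputs (a real pipeline would compute these from audio, e.g. via an STFT), and the random map below is purely illustrative:

```python
import numpy as np

def melody_control(spec, bin_pitch_class):
    """Fold F frequency-bin energies into 12 pitch classes per frame (a chromagram)."""
    T, F = spec.shape
    c_mel = np.zeros((T, 12))
    for pc in range(12):
        c_mel[:, pc] = spec[:, bin_pitch_class == pc].sum(axis=1)
    return c_mel                                    # shape (T, 12)

def dynamics_control(spec, win=5):
    """Per-frame summed energy converted to dB, smoothed with a moving average."""
    energy = spec.sum(axis=1)
    db = 10.0 * np.log10(np.maximum(energy, 1e-10))
    kernel = np.ones(win) / win
    return np.convolve(db, kernel, mode="same")[:, None]   # shape (T, 1)

rng = np.random.default_rng(0)
spec = rng.random((100, 257))                       # |STFT|^2 magnitudes, T x F
pitch_class = rng.integers(0, 12, size=257)         # assumed bin -> pitch-class map
print(melody_control(spec, pitch_class).shape)      # (100, 12)
print(dynamics_control(spec).shape)                 # (100, 1)
```

Because every bin maps to exactly one pitch class, the chromagram conserves total spectral energy per frame, which is what makes it a faithful melody summary.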

6. Evaluation Metrics: Originality and Recognizability

A novel two-dimensional metric scheme is proposed:

  • Originality ($\sigma_\rho$):
    • $\sigma_\rho = \sum_{j=1}^{n_c}\sqrt{\frac{1}{N-1}\sum_{i=1}^N \left[f(m_i^j) - \frac{1}{N}\sum_{i=1}^N f(m_i^j)\right]^2}$, where $f(m)$ denotes feature extraction via a pre-trained AudioLDM encoder in mel-spectrogram space, $n_c$ the number of style classes, and $N$ the number of samples per class.
  • Recognizability: Defined as one-shot style classifier accuracy over generated samples, averaged across classes.
  • Interpretation: Plotting models in $(\sigma_\rho, \text{accuracy})$ space visualizes the trade-off between innovation and style fidelity.
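The originality metric is a sum of per-class sample standard deviations in feature space. A minimal sketch, using scalar stand-ins for the pooled AudioLDM-encoder features (the real $f(m)$ is a learned embedding, not a scalar):

```python
import numpy as np

def originality(features_by_class):
    """sigma_rho: sum over classes of the sample std-dev (ddof=1, i.e. the
    1/(N-1) normalization) of per-sample pooled features."""
    return sum(np.std(feats, ddof=1) for feats in features_by_class)

rng = np.random.default_rng(0)
classes = [rng.standard_normal(50) for _ in range(3)]   # 3 classes, 50 samples each
print(originality(classes) > 0.0)                       # True: varied samples score > 0
```

A model that copies one template per class would score $\sigma_\rho = 0$; higher values indicate more within-class variation, i.e. more original outputs, which is why it is read jointly with recognizability.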

7. Dataset: FilmScoreDB and Benchmarks

FilmScoreDB underpins VideoScore-2’s training and evaluation:

  • Size and Composition: 32,520 video–music pairs (90.35 hours) sourced from ~300 award-nominated films; clips are 10 seconds with vocals removed (Demucs).
  • Labels and Metadata: Composer (134 top-tier), genre (multi-label), maestro style.
  • Data Splits: Train (26,730), validation (2,895), test (2,895).
  • Preprocessing: video frames at 10 fps; mel-spectrograms encoded via a VAE into the latent representation $z_0$.
  • Additional Benchmark: EmoMV dataset (16,726 pairs, six emotions) for comparative baseline evaluation.

8. Empirical Performance and Ablation Studies

On FilmScoreDB, VideoScore-2 (HPM framework) demonstrates strong quantitative and qualitative performance:

  • Automatic Film Score Generation:
    • BCS: 66.7, BHS: 62.2, F1: 64.4 (outperforming DIFF-Foley: 64.2/59.7/61.6)
    • IS: 4.4, KL: 5.3, FAD: 7.1 vs. best baseline FAD: 8.3
    • Stability: CSD: 18.0, HSD: 16.2 (lower is more stable)
    • Mean Opinion Score (MOS): 3.9 (vs. 3.4 baseline)
  • EmoMV Performance: Similar margins over the baselines were confirmed.
  • Style Transfer:
    • Melody-only: 57.6% accuracy vs. DITTO 51.3%
    • Dynamics correlation: 60.3% vs. 54.3% (micro)
    • Combined: 57.7% accuracy vs. 52.3%; correlation 87.2% vs. 82.4%
  • Originality-Recognizability Plot: HPM achieves upper-right placement, indicating balanced innovation and fidelity.
  • LoRA Ablation: Reduction from 87M to 20M trainable parameters (train time: 48h to 12h) with negligible performance drop.

These results position VideoScore-2 as a reference benchmark for automatic film score generation, style transfer under maestro guidance, and serve as a foundation for subsequent work on video-music generative modeling (Qi et al., 2024).
