Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
Abstract
Video-to-Audio (V2A) generation has advanced rapidly, with a variety of models attempting to produce plausible audio tracks from visual content alone. The central challenge is generating high-quality audio that is temporally aligned with on-screen events. The paper "Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity" proposes MaskVAT, which couples a high-quality, full-band general audio codec with a sequence-to-sequence masked generative model to balance audio quality, semantic matching, and temporal synchronization.
Introduction
Audio-visual cross-modal generation is increasingly pivotal in applications like automated dubbing and foley sound effect generation. Prior V2A models typically focused either on the quality and semantic matching of the generated audio or on its synchronization with visual actions; achieving high standards across all three dimensions simultaneously has remained elusive. MaskVAT addresses this by leveraging a full-band audio codec and a masked generative architecture.
Methodology
Audio Tokenizer
MaskVAT employs the Descript Audio Codec (DAC) to encode audio into a discrete latent space at a reduced frame rate, which keeps sequences short enough for efficient sequence-to-sequence modeling while preserving high audio fidelity. The resulting codegram from DAC serves as the input representation for the MaskVAT generative model.
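As a concrete illustration, the sketch below tokenizes a waveform with the open-source descript-audio-codec package. The calls follow that package's published README at the time of writing and are an assumption about the tooling, not the authors' exact pipeline.

```python
# Sketch: encoding a waveform into DAC codes (the "codegram").
# Assumes the open-source `descript-audio-codec` package and its documented API.
import dac
from audiotools import AudioSignal

# Download and load a pretrained full-band DAC model.
model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path)
model.to("cuda")

# Load the input audio and move it to the model's device.
signal = AudioSignal("input.wav")
signal.to(model.device)

# Encode: `codes` is a (batch, n_codebooks, frames) tensor of token ids,
# i.e. the codegram that the masked generative model operates on.
x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)

# Decoding the continuous latents reconstructs full-band audio.
reconstructed = model.decode(z)
```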
Masked Generative Video-to-Audio Transformer
The architecture comes in three variants, distinguished by how the visual conditioning is injected (an AdaLN conditioning sketch follows this list):
- MaskVAT (AdaLN variant): Conditions the audio tokens on visual features through adaptive layer normalization (AdaLN) blocks, with a regular length adaptation of the visual features to the audio token rate to enforce temporal synchronization.
- MaskVAT (Seq2Seq variant): Employs a full sequence-to-sequence transformer structure with cross-attention over the visual features, integrating semantic matching via BEATs audio features.
- MaskVAT (Hybrid variant): Combines the strengths of the previous two approaches, leveraging both cross-attention and AdaLN blocks.
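To make the AdaLN conditioning concrete, here is a minimal sketch of an adaptive layer-norm block in PyTorch: the visual feature at each time step is projected to a per-channel scale and shift that modulates the normalized audio-token activations. Module names and dimensions are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative AdaLN-style conditioning block (not the paper's exact code).
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, d_model: int, d_video: int):
        super().__init__()
        # LayerNorm without learnable affine params; scale/shift come from video.
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        # Project the video feature to a per-channel (scale, shift) pair.
        self.to_scale_shift = nn.Linear(d_video, 2 * d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, audio_h: torch.Tensor, video_h: torch.Tensor) -> torch.Tensor:
        # audio_h: (batch, T, d_model) hidden states of the audio-token stream.
        # video_h: (batch, T, d_video) visual features length-adapted to T steps.
        scale, shift = self.to_scale_shift(video_h).chunk(2, dim=-1)
        modulated = self.norm(audio_h) * (1 + scale) + shift
        return audio_h + self.ff(modulated)
```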
Training and Sampling
Training follows the masked generative modeling recipe: a random subset of the audio tokens is masked, and the model minimizes a cross-entropy loss on its predictions for the masked positions. The sequence-to-sequence and hybrid variants additionally incorporate MSE and contrastive losses to strengthen the semantic and temporal alignment of the generated audio. At inference, tokens are unmasked iteratively over several steps, with a diversity term and classifier-free guidance improving the quality and alignment of the tokens committed at each step. A sketch of both the training step and the sampling loop follows.
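The sketch below shows the core of a masked-token objective and a simplified confidence-based unmasking loop with classifier-free guidance. It is a generic MaskGIT-style illustration under assumed shapes and hyperparameters, not the authors' implementation; `model` stands for any conditional transformer over the codegram, and zeroed video features are used as a stand-in for the unconditional branch.

```python
# Generic MaskGIT-style training step and sampling loop (illustrative sketch).
import math
import torch
import torch.nn.functional as F

MASK_ID = 1024      # assumed id reserved for the [MASK] token
VOCAB_SIZE = 1025   # codebook entries + mask token


def masked_training_step(model, codes, video_feats):
    """codes: (B, T) token ids; video_feats: (B, T, Dv) conditioning."""
    B, T = codes.shape
    # Sample a masking ratio per example (cosine schedule as in MaskGIT).
    ratio = torch.cos(0.5 * math.pi * torch.rand(B, device=codes.device))
    mask = torch.rand(B, T, device=codes.device) < ratio.unsqueeze(1)
    inputs = codes.masked_fill(mask, MASK_ID)
    logits = model(inputs, video_feats)            # (B, T, VOCAB_SIZE)
    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], codes[mask])


@torch.no_grad()
def sample(model, video_feats, steps=12, guidance=3.0):
    B, T = video_feats.shape[0], video_feats.shape[1]
    tokens = torch.full((B, T), MASK_ID, device=video_feats.device)
    for step in range(steps):
        cond = model(tokens, video_feats)
        # Null conditioning approximated by zeroed video features (assumption).
        uncond = model(tokens, torch.zeros_like(video_feats))
        logits = uncond + guidance * (cond - uncond)
        logits[..., MASK_ID] = -float("inf")       # never predict the mask token
        probs = logits.softmax(dim=-1)
        sampled = torch.distributions.Categorical(probs=probs).sample()
        # Keep already-committed positions unchanged.
        sampled = torch.where(tokens == MASK_ID, sampled, tokens)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        conf = torch.where(tokens == MASK_ID, conf,
                           torch.full_like(conf, float("inf")))
        # Cosine schedule: commit progressively more tokens at each step.
        keep_ratio = math.cos(0.5 * math.pi * (steps - 1 - step) / steps)
        n_keep = max(1, int(keep_ratio * T))
        keep = conf.topk(n_keep, dim=-1).indices
        tokens = tokens.scatter(1, keep, sampled.gather(1, keep))
    return tokens
```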
Beam Selection
A post-sampling selection strategy generates several candidate audio tracks and keeps the one that minimizes the distance between the input video embedding and the candidate's audio embedding, further enhancing semantic and temporal alignment.
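A minimal sketch of this re-ranking step, assuming hypothetical embed_video and embed_audio helpers that map both modalities into a shared embedding space (e.g. a CLIP-like joint space):

```python
# Sketch: pick the generated candidate closest to the video in a shared
# embedding space. `embed_video` / `embed_audio` are hypothetical helpers.
import torch
import torch.nn.functional as F

def select_best_candidate(video, audio_candidates, embed_video, embed_audio):
    v = F.normalize(embed_video(video), dim=-1)       # (D,) video embedding
    best_idx, best_dist = None, float("inf")
    for i, audio in enumerate(audio_candidates):
        a = F.normalize(embed_audio(audio), dim=-1)   # (D,) audio embedding
        dist = 1.0 - torch.dot(v, a).item()           # cosine distance
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return audio_candidates[best_idx], best_dist
```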
Experiments
Dataset and Baselines
MaskVAT was trained and evaluated on the VGGSound dataset, emphasizing diverse and high-quality audio-visual pairs. Comprehensive comparisons were drawn against state-of-the-art models like SpecVQGAN, Im2Wav, V2A-Mapper, and Diff-Foley.
Objective Metrics
Evaluation metrics included:
- Quality: Fréchet Distance (FD) computed on DAC (FDD) and MFCC (FDM) embeddings, plus the Fréchet Audio Distance (FAD) for low-band content; the FD computation is sketched after this list.
- Semantic Matching: WaveCLIP (WC) and CycleCLIP (CC) metrics, leveraging Wav2CLIP projections.
- Temporal Alignment: Novelty Score (NS) and SparseSync (SS) metrics, which assess audio-to-audio and audio-to-video synchronicity, respectively.
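For reference, a Fréchet distance between two embedding sets fits a Gaussian to each set and compares them in closed form. The sketch below is a standard implementation of that formula, not the paper's evaluation code:

```python
# Standard Fréchet distance between two sets of embeddings (illustrative).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """emb_a, emb_b: (N, D) arrays of embeddings from two audio sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # sqrtm may return small imaginary parts due to numerical error.
    cov_sqrt = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))
```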
Results
MaskVAT outperformed the baselines across most metrics:
- Superior quality, indicated by lower FDD and FDM scores.
- Enhanced semantic matching, particularly for the sequence-to-sequence and hybrid variants.
- Notably better temporal alignment, especially for the hybrid variant, as shown by higher NS and lower SS scores.
Subjective Evaluation
Human evaluations corroborated the objective results, with MaskVAT receiving higher ratings for temporal alignment and overall quality. Expert listeners also highlighted its competitive audio fidelity and semantic relevance.
Conclusion
MaskVAT integrates a robust, full-band audio codec with a masked generative model to deliver high-quality, semantically matched, and temporally synchronized audio for V2A tasks. This makes MaskVAT a strong choice for applications requiring precise audio-visual synchrony, with room for further refinement and potential integration into multimedia production pipelines. Future work may explore fine-tuning the model on specialized datasets and extending its capabilities to multi-modal scenarios beyond video-to-audio synthesis.