
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity (2407.10387v1)

Published 15 Jul 2024 in cs.SD, cs.AI, cs.CV, and eess.AS

Abstract: Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization artifacts arise. Recent works have explored the progression of conditioning sound generators on still images and then video features, focusing on quality and semantic matching while ignoring synchronization, or by sacrificing some amount of quality to focus on improving synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling both high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results on one hand, whilst being competitive with the state of the art of non-codec generative audio models. Sample videos and generated audios are available at https://maskvat.github.io .

Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Abstract

The field of Video-to-Audio (V2A) generation has seen significant strides recently, with various models attempting to deliver plausible audio tracks from visual video content alone. The challenge lies in generating high-quality audio that also aligns temporally with the visual events. The paper "Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity" proposes MaskVAT, which interlinks a high-quality general audio codec with a sequence-to-sequence masked generative model, aiming to balance audio quality, semantic matching, and temporal synchronization.

Introduction

Audio-visual cross-modal generation is increasingly pivotal in applications like automated dubbing and Foley sound effect generation. Traditional V2A models have typically focused either on enhancing the quality and semantic matching of the generated audio or on improving synchronization with visual actions, but achieving high standards across all of these dimensions has remained elusive. MaskVAT addresses this by combining a full-band audio codec with a masked generative architecture.

Methodology

Audio Tokenizer

MaskVAT employs the Descript Audio Codec (DAC) to encode audio into a latent space with a reduced framerate, facilitating efficient sequence-to-sequence modeling and maintaining high audio fidelity. The output codegram from DAC serves as the input for the MaskVAT generative model.
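
For illustration, a codegram can be obtained with the open-source Descript Audio Codec package; the snippet below is a minimal sketch, and the codec variant, sample rate, and exact API are assumptions rather than the paper's precise configuration.

```python
# Minimal sketch: encoding a waveform into a DAC codegram.
# Assumes the open-source `descript-audio-codec` package and its `audiotools`
# dependency; the codec variant and exact API may differ from the paper's setup.
import torch
import dac
from audiotools import AudioSignal

model_path = dac.utils.download(model_type="44khz")   # pretrained full-band codec
model = dac.DAC.load(model_path)
model.eval()

signal = AudioSignal("example.wav")                    # hypothetical input clip
x = model.preprocess(signal.audio_data, signal.sample_rate)

with torch.no_grad():
    # `codes` is the discrete codegram: (batch, n_codebooks, n_frames),
    # a token sequence at a much lower framerate than the raw waveform.
    z, codes, latents, _, _ = model.encode(x)
```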

Masked Generative Video-to-Audio Transformer

The architecture comprises three variants:

  1. MaskVAT-AdaLN: Utilizes AdaLN blocks to condition the audio tokens on visual features, ensuring temporal synchronization through regular length adaptation (see the AdaLN sketch after this list).
  2. MaskVAT-Seq2Seq: Employs a sequence-to-sequence transformer structure, integrating semantic matching via BEATs audio features.
  3. MaskVAT-Hybrid: Combines the strengths of the previous two approaches, leveraging both cross-attention and AdaLN blocks.
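
As a rough illustration of the first variant's conditioning mechanism, the block below shows a generic AdaLN layer in PyTorch. It is a sketch of the general technique, not the authors' exact module, and the tensor shapes are assumed.

```python
# Illustrative sketch of AdaLN-style conditioning (not the authors' exact block):
# length-adapted visual features predict a per-frame scale and shift that
# modulate the normalized audio-token activations inside each transformer layer.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, d_model: int, d_cond: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_cond, 2 * d_model)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, frames, d_model) audio-token activations
        # cond: (batch, frames, d_cond)  visual features, resampled to the audio frame rate
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift
```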

Training and Sampling

Training involves masking a subset of the audio tokens and minimizing the prediction error with a cross-entropy loss. Additionally, MaskVAT-Seq2Seq and MaskVAT-Hybrid incorporate MSE and contrastive losses to enhance the semantic and temporal alignment of the generated audio. During sampling, tokens are unmasked iteratively, with a diversity term and classifier-free guidance improving the quality and alignment of the result. A simplified training step is sketched below.
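
The following is a minimal MaskGIT-style sketch of a single masked-token training step. The cosine masking schedule, the single-codebook simplification, and the model interface are assumptions; the paper's full objective also includes the MSE and contrastive terms mentioned above.

```python
# MaskGIT-style sketch of one masked-token training step. The cosine mask
# schedule, single-codebook simplification, and model signature are assumptions.
import math
import torch
import torch.nn.functional as F

def masked_training_step(model, codes, video_feats, mask_id):
    # codes: (batch, frames) discrete DAC token ids for one codebook level
    B, T = codes.shape
    ratio = math.cos(0.5 * math.pi * torch.rand(1).item())      # fraction of tokens to mask
    mask = torch.rand(B, T, device=codes.device) < ratio

    inputs = codes.masked_fill(mask, mask_id)                   # replace with [MASK] id
    logits = model(inputs, video_feats)                         # (B, T, vocab_size)

    # Cross-entropy is computed only on the masked positions.
    return F.cross_entropy(logits[mask], codes[mask])
```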

Beam Selection

A post-sampling selection strategy draws several candidate generations and keeps the one that minimizes the distance between the input video embedding and the generated audio embedding, further enhancing semantic and temporal alignment.
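
A minimal sketch of this selection step is shown below; the number of candidates, the audio/video embedding models, and the cosine distance are stand-ins for the paper's actual criterion.

```python
# Sketch of the post-sampling selection idea: generate several candidates and
# keep the one whose audio embedding is closest to the video embedding. The
# `audio_encoder` and cosine distance are stand-ins for the paper's criterion.
import torch

def select_best_candidate(candidates, video_emb, audio_encoder):
    # candidates: list of generated waveforms; video_emb: (d,) video embedding
    best, best_dist = None, float("inf")
    for audio in candidates:
        audio_emb = audio_encoder(audio)                        # (d,) audio embedding
        dist = 1.0 - torch.cosine_similarity(audio_emb, video_emb, dim=0).item()
        if dist < best_dist:
            best, best_dist = audio, dist
    return best
```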

Experiments

Dataset and Baselines

MaskVAT was trained and evaluated on the VGGSound dataset, emphasizing diverse and high-quality audio-visual pairs. Comprehensive comparisons were drawn against state-of-the-art models like SpecVQGAN, Im2Wav, V2A-Mapper, and Diff-Foley.

Objective Metrics

Evaluation metrics included:

  • Quality: Fréchet Distance (FD) computed over DAC and MFCC embeddings, and FAD for low-band content (a sketch of the Fréchet computation follows this list).
  • Semantic Matching: WaveCLIP (WC) and CycleCLIP (CC) metrics, leveraging Wav2CLIP projections.
  • Temporal Alignment: Novelty Score (NS) and SparseSync (SS) metrics assess audio-to-audio and audio-video synchronicity, respectively.
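
For reference, the Fréchet-style quality metrics above compare the Gaussian statistics (mean and covariance) of embedding sets extracted from real and generated audio. Below is a generic sketch of that computation; the choice of embedding extractor is left abstract and is not the paper's exact evaluation code.

```python
# Generic Fréchet distance between two embedding sets (real vs. generated),
# the computation underlying FD/FAD-style quality metrics. The embedding
# extractor (DAC, MFCC, VGGish, ...) is supplied separately.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_embs: np.ndarray, fake_embs: np.ndarray) -> float:
    mu_r, mu_f = real_embs.mean(axis=0), fake_embs.mean(axis=0)
    cov_r = np.cov(real_embs, rowvar=False)
    cov_f = np.cov(fake_embs, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):                 # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```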

Results

MaskVAT outperformed baselines across most metrics:

  • Superior quality, as indicated by lower FD scores on both DAC and MFCC embeddings.
  • Enhanced semantic matching, particularly in the MaskVAT-Seq2Seq and MaskVAT-Hybrid variants.
  • Notably better temporal alignment, especially with MaskVAT-Hybrid, as shown by higher NS and lower SS scores.

Subjective Evaluation

Human evaluations corroborated the objective results, with MaskVAT receiving higher ratings for temporal alignment and overall quality. Expert listeners also highlighted its competitive audio fidelity and semantic relevance.

Conclusion

MaskVAT integrates a robust audio codec with an innovative masked generative model to deliver high-quality, semantically aligned, and temporally synchronized audio for V2A tasks. This positions MaskVAT as a superior choice for applications requiring precise audio-visual synchrony, with prospects for further refinement and potential integration into multimedia production pipelines. Future work may explore fine-tuning the model on specialized datasets and extending its capabilities to multi-modal scenarios beyond video-to-audio synthesis.

Authors (4)
  1. Santiago Pascual
  2. Chunghsin Yeh
  3. Ioannis Tsiamas
  4. Joan Serrà