Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
Abstract
Video-to-Audio (V2A) generation has advanced rapidly, with a variety of models attempting to produce plausible audio tracks from visual content alone. The central challenge is generating high-quality audio that is temporally aligned with on-screen events. The paper "Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity" proposes MaskVAT, which couples a high-quality, full-band general audio codec with a sequence-to-sequence masked generative model to balance audio quality, semantic matching, and temporal synchronization.
Introduction
Audio-visual cross-modal generation is increasingly pivotal in applications like automated dubbing and foley sound effect generation. Prior V2A models typically focused either on the quality and semantic matching of the generated audio or on its synchronization with visual actions; achieving high standards across all three dimensions simultaneously has remained elusive. MaskVAT addresses this by leveraging a full-band audio codec and a masked generative architecture.
Methodology
Audio Tokenizer
MaskVAT employs the Descript Audio Codec (DAC) to encode audio into a discrete latent space at a reduced frame rate, which keeps sequences short enough for efficient sequence-to-sequence modeling while preserving high audio fidelity. The resulting codegram from DAC serves as the input representation for the MaskVAT generative model.
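As a concrete illustration, the sketch below tokenizes a waveform with the open-source descript-audio-codec package. The calls follow that package's published README at the time of writing and are an assumption about the tooling, not the authors' exact pipeline.

```python
# Sketch: encoding a waveform into DAC codes (the "codegram").
# Assumes the open-source `descript-audio-codec` package and its documented API.
import dac
from audiotools import AudioSignal

# Download and load a pretrained full-band DAC model.
model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path)
model.to("cuda")

# Load the input audio and move it to the model's device.
signal = AudioSignal("input.wav")
signal.to(model.device)

# Encode: `codes` is a (batch, n_codebooks, frames) tensor of token ids,
# i.e. the codegram that the masked generative model operates on.
x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)

# Decoding the continuous latents reconstructs full-band audio.
reconstructed = model.decode(z)
```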
Masked Generative Video-to-Audio Transformer
The architecture comes in three variants, distinguished by how the visual conditioning is injected (an AdaLN conditioning sketch follows this list):
- MaskVAT (AdaLN variant): Conditions the audio tokens on visual features through adaptive layer normalization (AdaLN) blocks, with a regular length adaptation of the visual features to the audio token rate to enforce temporal synchronization.
- MaskVAT (Seq2Seq variant): Employs a full sequence-to-sequence transformer structure with cross-attention over the visual features, integrating semantic matching via BEATs audio features.
- MaskVAT (Hybrid variant): Combines the strengths of the previous two approaches, leveraging both cross-attention and AdaLN blocks.
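To make the AdaLN conditioning concrete, here is a minimal sketch of an adaptive layer-norm block in PyTorch: the visual feature at each time step is projected to a per-channel scale and shift that modulates the normalized audio-token activations. Module names and dimensions are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative AdaLN-style conditioning block (not the paper's exact code).
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, d_model: int, d_video: int):
        super().__init__()
        # LayerNorm without learnable affine params; scale/shift come from video.
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        # Project the video feature to a per-channel (scale, shift) pair.
        self.to_scale_shift = nn.Linear(d_video, 2 * d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, audio_h: torch.Tensor, video_h: torch.Tensor) -> torch.Tensor:
        # audio_h: (batch, T, d_model) hidden states of the audio-token stream.
        # video_h: (batch, T, d_video) visual features length-adapted to T steps.
        scale, shift = self.to_scale_shift(video_h).chunk(2, dim=-1)
        modulated = self.norm(audio_h) * (1 + scale) + shift
        return audio_h + self.ff(modulated)
```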
Training and Sampling
Training follows the masked generative modeling recipe: a random subset of the audio tokens is masked, and the model minimizes a cross-entropy loss on its predictions for the masked positions. The sequence-to-sequence and hybrid variants additionally incorporate MSE and contrastive losses to strengthen the semantic and temporal alignment of the generated audio. At inference, tokens are unmasked iteratively over several steps, with a diversity term and classifier-free guidance improving the quality and alignment of the tokens committed at each step. A sketch of both the training step and the sampling loop follows.
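The sketch below shows the core of a masked-token objective and a simplified confidence-based unmasking loop with classifier-free guidance. It is a generic MaskGIT-style illustration under assumed shapes and hyperparameters, not the authors' implementation; `model` stands for any conditional transformer over the codegram, and zeroed video features are used as a stand-in for the unconditional branch.

```python
# Generic MaskGIT-style training step and sampling loop (illustrative sketch).
import math
import torch
import torch.nn.functional as F

MASK_ID = 1024      # assumed id reserved for the [MASK] token
VOCAB_SIZE = 1025   # codebook entries + mask token


def masked_training_step(model, codes, video_feats):
    """codes: (B, T) token ids; video_feats: (B, T, Dv) conditioning."""
    B, T = codes.shape
    # Sample a masking ratio per example (cosine schedule as in MaskGIT).
    ratio = torch.cos(0.5 * math.pi * torch.rand(B, device=codes.device))
    mask = torch.rand(B, T, device=codes.device) < ratio.unsqueeze(1)
    inputs = codes.masked_fill(mask, MASK_ID)
    logits = model(inputs, video_feats)            # (B, T, VOCAB_SIZE)
    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], codes[mask])


@torch.no_grad()
def sample(model, video_feats, steps=12, guidance=3.0):
    B, T = video_feats.shape[0], video_feats.shape[1]
    tokens = torch.full((B, T), MASK_ID, device=video_feats.device)
    for step in range(steps):
        cond = model(tokens, video_feats)
        # Null conditioning approximated by zeroed video features (assumption).
        uncond = model(tokens, torch.zeros_like(video_feats))
        logits = uncond + guidance * (cond - uncond)
        logits[..., MASK_ID] = -float("inf")       # never predict the mask token
        probs = logits.softmax(dim=-1)
        sampled = torch.distributions.Categorical(probs=probs).sample()
        # Keep already-committed positions unchanged.
        sampled = torch.where(tokens == MASK_ID, sampled, tokens)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        conf = torch.where(tokens == MASK_ID, conf,
                           torch.full_like(conf, float("inf")))
        # Cosine schedule: commit progressively more tokens at each step.
        keep_ratio = math.cos(0.5 * math.pi * (steps - 1 - step) / steps)
        n_keep = max(1, int(keep_ratio * T))
        keep = conf.topk(n_keep, dim=-1).indices
        tokens = tokens.scatter(1, keep, sampled.gather(1, keep))
    return tokens
```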
Beam Selection
A post-sampling selection strategy generates several candidate audio tracks and keeps the one that minimizes the distance between the input video embedding and the candidate's audio embedding, further enhancing semantic and temporal alignment.
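A minimal sketch of this re-ranking step, assuming hypothetical embed_video and embed_audio helpers that map both modalities into a shared embedding space (e.g. a CLIP-like joint space):

```python
# Sketch: pick the generated candidate closest to the video in a shared
# embedding space. `embed_video` / `embed_audio` are hypothetical helpers.
import torch
import torch.nn.functional as F

def select_best_candidate(video, audio_candidates, embed_video, embed_audio):
    v = F.normalize(embed_video(video), dim=-1)       # (D,) video embedding
    best_idx, best_dist = None, float("inf")
    for i, audio in enumerate(audio_candidates):
        a = F.normalize(embed_audio(audio), dim=-1)   # (D,) audio embedding
        dist = 1.0 - torch.dot(v, a).item()           # cosine distance
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return audio_candidates[best_idx], best_dist
```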
Experiments
Dataset and Baselines
MaskVAT was trained and evaluated on the VGGSound dataset, emphasizing diverse and high-quality audio-visual pairs. Comprehensive comparisons were drawn against state-of-the-art models like SpecVQGAN, Im2Wav, V2A-Mapper, and Diff-Foley.
Objective Metrics
Evaluation metrics included:
- Quality: Fréchet Distance (FD) computed on DAC (FDD) and MFCC (FDM) embeddings, plus the Fréchet Audio Distance (FAD) for low-band content; the FD computation is sketched after this list.
- Semantic Matching: WaveCLIP (WC) and CycleCLIP (CC) metrics, leveraging Wav2CLIP projections.
- Temporal Alignment: Novelty Score (NS) and SparseSync (SS) metrics, which assess audio-to-audio and audio-to-video synchronicity, respectively.
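For reference, a Fréchet distance between two embedding sets fits a Gaussian to each set and compares them in closed form. The sketch below is a standard implementation of that formula, not the paper's evaluation code:

```python
# Standard Fréchet distance between two sets of embeddings (illustrative).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """emb_a, emb_b: (N, D) arrays of embeddings from two audio sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # sqrtm may return small imaginary parts due to numerical error.
    cov_sqrt = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))
```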
Results
MaskVAT outperformed the baselines across most metrics:
- Superior quality, indicated by lower FDD and FDM scores.
- Enhanced semantic matching, particularly for the sequence-to-sequence and hybrid variants.
- Notably better temporal alignment, especially for the hybrid variant, as shown by higher NS and lower SS scores.
Subjective Evaluation
Human evaluations corroborated the objective results, with MaskVAT receiving higher ratings for temporal alignment and overall quality. Expert listeners also highlighted its competitive audio fidelity and semantic relevance.
Conclusion
MaskVAT integrates a robust, full-band audio codec with a masked generative model to deliver high-quality, semantically matched, and temporally synchronized audio for V2A tasks. This makes MaskVAT a strong choice for applications requiring precise audio-visual synchrony, with room for further refinement and potential integration into multimedia production pipelines. Future work may explore fine-tuning the model on specialized datasets and extending its capabilities to multi-modal scenarios beyond video-to-audio synthesis.