
VATT: Video-Audio-Text Transformer

Updated 20 January 2026
  • VATT is a multimodal Transformer framework that processes raw video, audio, and text signals using a unified, convolution-free architecture.
  • It employs modality-agnostic parameter sharing and combines contrastive with masked-token objectives to achieve competitive performance across various benchmarks.
  • Recent advances extend VATT to controllable video-to-audio generation, integrating language models for fine-grained semantic guidance and effective synthesis.

The Video-Audio-Text Transformer (VATT) denotes a family of Transformer frameworks designed for multimodal representation learning and conditional generative modeling, operating directly on raw video, audio, and text signals in a convolution-free regime. VATT encompasses both the foundational self-supervised representation learner for downstream tasks (Akbari et al., 2021) and, in subsequent advances, a generative system for controllable video-to-audio transformation through text guidance (Liu et al., 2024). Core innovations include unified Transformer architectures, modality-agnostic parameter sharing, contrastive and masked-token objectives, and the integration of LLMs for fine-grained semantic control.

1. Model Architectures in VATT

The original VATT (Akbari et al., 2021) utilizes a stack of Transformer layers for each modality, eschewing convolutional stages in favor of direct input tokenization:

  • Tokenization:
    • Video frames ($T \times H \times W$) are divided into non-overlapping spatio-temporal patches ($t \times h \times w$), linearly projected via $W_{vp} \in \mathbb{R}^{(t\,h\,w\,3) \times d}$.
    • Audio waveforms (e.g., $1.2\,\mathrm{s}$ at 48 kHz) are chunked into segments ($t'$ samples) and linearly projected by $W_{ap} \in \mathbb{R}^{t' \times d}$.
    • Text is mapped to one-hot vectors in $\mathbb{R}^v$ ($v \approx 2^{16}$) and embedded via $W_{tp} \in \mathbb{R}^{v \times d}$.
  • Positional Encoding: Modality-specific spatial/temporal embeddings for video; absolute or relative encodings for text and audio.
  • Transformer Core: Each backbone adopts the ViT/BERT architecture with $L$ layers, $d$-dimensional tokens, Multi-Head Self-Attention, MLP blocks (hidden size $m$), LayerNorm, and GeLU activations. Four model sizes scale from $L=6$ to $L=24$.
  • Modality-Agnostic Backbone: Parameters for attention and MLP layers are shared across modalities, with only tokenization and positional encodings remaining distinct. VATT-MA (Medium, $12 \times 1024$, $h=16$) demonstrates this unified approach.
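The patch tokenization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the projection matrix is random here (it is learned in VATT), and the function name `tokenize_video` is ours. Shapes follow the paper's video setting (10 fps clip of $32 \times 224 \times 224 \times 3$, patches of $4 \times 16 \times 16$, $d=1024$), which yields the 1568 tokens per clip quoted later in the text.

```python
import numpy as np

# Sketch of VATT-style video patch tokenization: split a T x H x W x 3 clip
# into non-overlapping t x h x w patches, flatten each, and project to d dims.
def tokenize_video(clip, t=4, h=16, w=16, d=1024, rng=np.random.default_rng(0)):
    T, H, W, C = clip.shape
    # Group pixels into (T/t, t, H/h, h, W/w, w, C) blocks, then flatten each
    # block into a (t*h*w*3)-dimensional vector.
    patches = clip.reshape(T // t, t, H // h, h, W // w, w, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * h * w * C)
    # Stand-in for the learned projection W_vp in R^{(t h w 3) x d}.
    W_vp = rng.standard_normal((t * h * w * C, d)).astype(np.float32) * 0.02
    return patches @ W_vp

clip = np.zeros((32, 224, 224, 3), dtype=np.float32)
tokens = tokenize_video(clip)
print(tokens.shape)  # (1568, 1024): 8 * 14 * 14 spatio-temporal patches
```

The audio and text tokenizers follow the same pattern with their own projection matrices; only the backbone behind them is shared.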

Generative VATT frameworks (Liu et al., 2024) incorporate two modules for video-to-audio synthesis:

  • VATT Converter: A vision encoder (EVA-CLIP) extracts per-frame features, which are projected into the LLM embedding space via $W_l \in \mathbb{R}^{d_v \times d_{lm}}$ and processed by a frozen instruction-tuned LLM (Vicuna-7B or Gemma-2B) augmented with LoRA adapters. Outputs include audio caption tokens and last-layer hidden states.
  • VATT Audio: A bidirectional Transformer decoder ($L_{mm}=24$, $d_{mm}=1024$, $h=16$) consumes concatenated condition embeddings and masked audio-token embeddings, producing audio tokens via iterative parallel decoding. A neural codec (Encodec) translates the audio tokens to a waveform.
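Iterative parallel decoding can be sketched as a MaskGIT-style loop: start fully masked, predict all positions at once, commit the most confident ones, and re-mask the rest on a shrinking schedule. The sketch below is ours, not the paper's code; the random `logits` are a placeholder for the bidirectional Transformer, and the cosine masking schedule is an assumption based on common practice for this decoder family.

```python
import numpy as np

# Minimal sketch of iterative parallel decoding over masked audio tokens.
def iterative_decode(seq_len, vocab, steps=8, rng=np.random.default_rng(0)):
    MASK = -1
    tokens = np.full(seq_len, MASK)
    for step in range(1, steps + 1):
        # Placeholder for the Transformer's per-position token distributions.
        logits = rng.standard_normal((seq_len, vocab))
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        conf[tokens != MASK] = np.inf          # committed tokens stay committed
        # Cosine schedule: number of positions left masked after this step.
        n_masked = int(seq_len * np.cos(np.pi / 2 * step / steps))
        keep = np.argsort(conf)[n_masked:]     # most confident positions
        newly = keep[tokens[keep] == MASK]     # only fill still-masked slots
        tokens[newly] = pred[newly]
    return tokens

out = iterative_decode(seq_len=500, vocab=1024)
print((out == -1).sum())  # 0: every position is decoded after the final step
```

With $T_c = 500$ time steps and a handful of decoding iterations, this parallel scheme is what makes the sub-second generation speeds reported in Section 4 plausible, versus hundreds of sequential autoregressive steps.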

2. Input Representation and Preprocessing

Distinct protocols govern raw data handling:

  • Video: Clips sampled at 10 fps, spatially cropped ($8\%$–$100\%$ of the area, aspect ratio $[0.5, 2]$), resized to $224 \times 224$, and augmented by flipping and color jitter. Patch sizes of $4 \times 16 \times 16$ yield 1568 tokens per clip; 50% are randomly discarded (DropToken) for computational efficiency.
  • Audio: Synchronized 48 kHz waveforms, chunked into 128-sample segments for 1200 tokens per clip, with 50% DropToken.
  • Text: Derived from ASR transcripts, clipped/padded to 16 tokens, embedded by one-hot lookup (no DropToken).

Preprocessing in the generative VATT system follows similar principles, using EVA-CLIP embeddings for video and Encodec representations for audio (four codebooks, $T_c = 500$ time steps for 10 s at 16 kHz).
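DropToken itself is simple enough to show directly: sample a random subset of the token sequence and feed only that subset to the Transformer, so self-attention cost falls roughly quadratically in the kept length. A minimal sketch (the function name is ours; shapes follow the video setting above):

```python
import numpy as np

# DropToken sketch: randomly discard a fraction of input tokens before the
# Transformer; positional encodings (already added) preserve location info.
def drop_token(tokens, drop_rate=0.5, rng=np.random.default_rng(0)):
    n = tokens.shape[0]
    keep = rng.permutation(n)[: int(n * (1 - drop_rate))]
    return tokens[np.sort(keep)]  # keep original temporal/spatial order

tokens = np.arange(1568 * 4, dtype=np.float32).reshape(1568, 4)
kept = drop_token(tokens)
print(kept.shape)  # (784, 4): half of the 1568 video tokens survive
```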

3. Training Objectives and Optimization

VATT’s learning paradigm is anchored in multimodal contrastive losses and captioning/generation objectives:

  • Contrastive Losses (Akbari et al., 2021):
    • Video–Audio InfoNCE:

    $$L_{va} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(v_{va}^{(i)} \cdot a_{va}^{(i)} / \tau\right)}{\sum_{j=1}^{N} \exp\left(v_{va}^{(i)} \cdot a_{va}^{(j)} / \tau\right)}$$

    • Video–Text MIL-NCE: Uses the $K=5$ nearest text clips as positives, negatives from other batch instances, and weighting $\lambda = 1$.
    • Total loss: $L = L_{va} + \lambda L_{vt}$.
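The video-audio InfoNCE term can be written out numerically in a few lines. This is an illustrative NumPy sketch, not the training code: `infonce` is our name, the embeddings are assumed to be the projected video/audio vectors $v_{va}, a_{va}$, and $\tau$ is set to a commonly used value rather than one confirmed by the source.

```python
import numpy as np

# InfoNCE over a batch of paired video/audio embeddings: positives sit on the
# diagonal of the similarity matrix, negatives fill the rest of each row.
def infonce(v, a, tau=0.07):
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    logits = (v @ a.T) / tau                   # (N, N) cosine similarities / tau
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))      # -log p(correct pair) averaged

rng = np.random.default_rng(0)
v = rng.standard_normal((8, 512))
loss_aligned = infonce(v, v)                       # perfectly matched pairs
loss_random = infonce(v, rng.standard_normal((8, 512)))  # unrelated pairs
print(loss_aligned < loss_random)  # True: aligned pairs give a lower loss
```

The MIL-NCE video-text term has the same row-softmax structure, except each video's numerator sums over its $K=5$ candidate text positives.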

  • Generative Losses (Liu et al., 2024):

    • Converter Captioning:

    $$\mathcal{L}_{v2t} = -\sum_{l=1}^{N} \log P_\theta\left(t_{al} \mid T_i, V_{lm}\right)$$

    • Audio Token Classification:

    $$\mathcal{L}_{audio} = -\sum_{i:\, A^{M}_{tok,i} = \langle \mathrm{MASK} \rangle} \log P_\phi\left(\hat{a}_i = a^{gt}_i \mid A^{M}_{tok}, H_{lm}\right)$$

    • Optimization: Adam or AdamW, a large-batch regime (2048 in (Akbari et al., 2021); up to 48/36 in (Liu et al., 2024)), DropToken, strong data augmentation, and staged learning-rate schedules.
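The audio-token loss is an ordinary cross-entropy restricted to masked positions. A minimal NumPy sketch under assumed shapes ($T_c = 500$ positions, a 1024-entry codec vocabulary, random placeholder logits; `masked_ce` is our name):

```python
import numpy as np

# Cross-entropy over only the positions that were replaced by <MASK>.
def masked_ce(logits, targets, mask):
    # logits: (T, V) scores; targets: (T,) ground-truth codec token ids;
    # mask: (T,) bool, True where the input token was <MASK>.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked[mask].mean()

rng = np.random.default_rng(0)
T, V = 500, 1024
logits = rng.standard_normal((T, V))          # placeholder decoder outputs
targets = rng.integers(0, V, size=T)
mask = rng.random(T) < 0.5                    # ~half the positions masked
loss = masked_ce(logits, targets, mask)       # near log(V) for random logits
```

Unmasked positions contribute no gradient, which is what lets the decoder be trained with the same masked-infilling interface it uses at inference time.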

Pre-training utilizes HowTo100M and AudioSet (representation learning), and VGGSound and synthetic instruction-caption datasets (video-to-audio generation).

4. Downstream Tasks and Evaluation

VATT models demonstrate competitive or state-of-the-art performance on multiple benchmarks:

  • Video Action Recognition (fine-tuned) (Akbari et al., 2021):

    • Kinetics-400: 82.1% top-1 (VATT-Large, new self-supervised record)
    • Kinetics-600: 83.6%, Kinetics-700: 72.7%, Moments-in-Time: 41.1%
  • Audio Event Classification: VATT-Base achieves mAP 39.4%, AUC 97.1%, $d' = 2.895$ (AudioSet)
  • Image Classification: VATT pre-training yields 78.7% top-1 ImageNet accuracy (vs. 64.7% trained from scratch; ViT-Base/JFT: 79.9% supervised)
  • Zero-Shot Retrieval: YouCook2 R@10 ≈ 45.5, MSR-VTT R@10 ≈ 29.7, MedR ≈ 49
  • Video-to-Audio Generation (Liu et al., 2024) (VGGSound test set):
    • KLD as low as 1.41 (VATT-LLama-T, GT text prompt)
    • FAD down to 2.35; Align Acc 82.8% (VATT-Gemma)
    • Subjective preference: VATT-LLama-T rated highest for relevance.
| Method       | KLD ↓ | FAD ↓ | Align Acc ↑ | Speed (s) ↓ |
|--------------|-------|-------|-------------|-------------|
| SpecVQGAN    | 3.78  | 6.63  | 48.8        | 7.2         |
| IM2WAV       | 2.54  | 6.32  | 74.3        | 289.5       |
| FoleyGen     | 2.89  | 2.59  | 73.8        | 6.9         |
| V2A-Mapper   | 2.78  | 0.99  | 74.4        | 11.5        |
| VATT-Gemma-T | 1.66  | 2.98  | 81.5        | 0.76        |

Text guidance consistently improves KLD and subjective ratings, confirming effective controllability.

5. Modality-Agnostic and Controllable Generation

A key property of VATT (Akbari et al., 2021) is modality-agnostic Transformer parameterization, where shared self-attention and MLP weights process video, audio, and text streams equivalently, with empirical parity between shared and separate backbones across tasks. Performance losses from aggressive DropToken regularization remain minimal (<1%).
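Parameter sharing of this kind is structurally simple: one set of attention/MLP weights is instantiated once and applied to every modality's token stream. The sketch below (our construction, with a single attention head, one layer, and no MLP, far smaller than any VATT variant) shows the shape-level idea: the shared backbone accepts token sequences of any length and modality, since only the upstream tokenizers differ.

```python
import numpy as np

# Modality-agnostic backbone sketch: one weight set serves every modality.
class SharedBackbone:
    def __init__(self, d=64, rng=np.random.default_rng(0)):
        self.Wq = rng.standard_normal((d, d)) * 0.02
        self.Wk = rng.standard_normal((d, d)) * 0.02
        self.Wv = rng.standard_normal((d, d)) * 0.02

    def forward(self, x):
        # Single-head self-attention over an (N, d) token sequence.
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        att = np.exp(q @ k.T / np.sqrt(x.shape[1]))
        att /= att.sum(axis=1, keepdims=True)
        return att @ v

rng = np.random.default_rng(1)
backbone = SharedBackbone()                    # one parameter set ...
video_tokens = rng.standard_normal((16, 64))   # ... applied to each stream,
audio_tokens = rng.standard_normal((12, 64))   # regardless of sequence length
print(backbone.forward(video_tokens).shape, backbone.forward(audio_tokens).shape)
```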

VATT (Liu et al., 2024) establishes controllable video-to-audio generation: the same video input paired with differing text prompts produces distinct plausible audios (e.g., "gentle water fountain" vs. "crowd talking and laughing" vs. "classical music background"). In the absence of a user prompt, the VATT Converter generates a default audio caption, supporting both caption suggestion and automated steering; mean opinion scores for the synthetic captions reach 4.72 ± 0.37 (out of 5).

6. Ablative Studies and Analysis

Systematic ablation reveals architectural and hyperparameter effects:

  • DropToken: Quadratic reduction in pre-training compute; halved tokens induce <1% downstream accuracy loss.
  • Patch Size: Video patching at $4 \times 16 \times 16$ is optimal; an audio patch length of 128 samples performs best.
  • Model Scaling: Scaling (Small → Base → Medium → Large) yields monotonic increases in accuracy across video and image tasks.
  • Spectrograms vs. Waveforms: Spectrogram input for audio yields retrieval accuracy similar to raw waveforms, indicating that direct waveform modeling loses little despite skipping handcrafted features.

Computation remains intensive, though it is mitigated by token-dropping strategies such as DropToken. Reliance on the co-occurrence of modalities (speech/audio with video) limits applicability to silent or poorly transcribed samples.

7. Limitations and Prospective Directions

Limitations include:

  • Dependence on co-occurring modalities: video-only data, or noisy speech transcripts, constrain representation quality.
  • High computational demands for training large-scale Transformers, partly alleviated by token dropout.
  • Text quality (from ASR or instruction synthesis) affects downstream controllability and retrieval.

Suggested directions for future research span improved data augmentation, more reliable textual transcripts, mixture-of-experts for dynamic modality routing, and scaling of model and data sizes. Integration with advanced generative codecs and more flexible condition embeddings is also under exploration.

Recent advances indicate that attention mechanisms enable unified, self-supervised multimodal learning at scale, providing robust representations and semantically controllable generative models for diverse downstream applications (Akbari et al., 2021, Liu et al., 2024).
