
VATT: Video-Audio-Text Transformer

Updated 20 January 2026
  • VATT is a multimodal Transformer framework that processes raw video, audio, and text signals using a unified, convolution-free architecture.
  • It employs modality-agnostic parameter sharing and combines contrastive with masked-token objectives to achieve competitive performance across various benchmarks.
  • Recent advances extend VATT to controllable video-to-audio generation, integrating language models for fine-grained semantic guidance and effective synthesis.

The Video-Audio-Text Transformer (VATT) denotes a family of Transformer frameworks designed for multimodal representation learning and conditional generative modeling, operating directly on raw video, audio, and text signals in a convolution-free regime. VATT encompasses both the foundational self-supervised representation learner for downstream tasks (Akbari et al., 2021) and, in subsequent advances, a generative system for controllable video-to-audio transformation through text guidance (Liu et al., 2024). Core innovations include unified Transformer architectures, modality-agnostic parameter sharing, contrastive and masked-token objectives, and the integration of LLMs for fine-grained semantic control.

1. Model Architectures in VATT

The original VATT (Akbari et al., 2021) utilizes a stack of Transformer layers for each modality, eschewing convolutional stages in favor of direct input tokenization:

  • Tokenization:
    • Video frames ($T \times H \times W$) are divided into non-overlapping spatio-temporal patches ($t \times h \times w$), linearly projected via $W_{vp} \in \mathbb{R}^{(t\,h\,w\,3) \times d}$.
    • Audio waveforms (e.g., $1.2\,\mathrm{s}$ at 48 kHz) are chunked into segments ($t'$ samples) and linearly projected by $W_{ap} \in \mathbb{R}^{t' \times d}$.
    • Text is mapped to one-hot vectors in $\mathbb{R}^v$ ($v \approx 2^{16}$) and embedded via $W_{tp} \in \mathbb{R}^{v \times d}$.
  • Positional Encoding: Modality-specific spatial/temporal embeddings for video; absolute or relative encodings for text and audio.
  • Transformer Core: Each backbone adopts the ViT/BERT architecture with $L$ layers, $d$-dimensional tokens, Multi-Head Self-Attention, MLP blocks (hidden size $m$), LayerNorm, and GeLU activations. Four model sizes scale from $L=6$ to $L=24$.
  • Modality-Agnostic Backbone: Parameters for attention and MLP layers are shared across modalities, with only tokenization and positional encodings remaining distinct. VATT-MA (Medium, $12 \times 1024$, $h=16$) demonstrates this unified approach.
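The patch tokenization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the projection matrix is random here (it is learned in VATT), and the function name `tokenize_video` is ours. Shapes follow the paper's video setting (10 fps clip of $32 \times 224 \times 224 \times 3$, patches of $4 \times 16 \times 16$, $d=1024$), which yields the 1568 tokens per clip quoted later in the text.

```python
import numpy as np

# Sketch of VATT-style video patch tokenization: split a T x H x W x 3 clip
# into non-overlapping t x h x w patches, flatten each, and project to d dims.
def tokenize_video(clip, t=4, h=16, w=16, d=1024, rng=np.random.default_rng(0)):
    T, H, W, C = clip.shape
    # Group pixels into (T/t, t, H/h, h, W/w, w, C) blocks, then flatten each
    # block into a (t*h*w*3)-dimensional vector.
    patches = clip.reshape(T // t, t, H // h, h, W // w, w, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * h * w * C)
    # Stand-in for the learned projection W_vp in R^{(t h w 3) x d}.
    W_vp = rng.standard_normal((t * h * w * C, d)).astype(np.float32) * 0.02
    return patches @ W_vp

clip = np.zeros((32, 224, 224, 3), dtype=np.float32)
tokens = tokenize_video(clip)
print(tokens.shape)  # (1568, 1024): 8 * 14 * 14 spatio-temporal patches
```

The audio and text tokenizers follow the same pattern with their own projection matrices; only the backbone behind them is shared.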

Generative VATT frameworks (Liu et al., 2024) incorporate two modules for video-to-audio synthesis:

  • VATT Converter: A vision encoder (EVA-CLIP) extracts per-frame features, which are projected into the LLM embedding space via $W_l \in \mathbb{R}^{d_v \times d_{lm}}$ and processed by a frozen instruction-tuned LLM (Vicuna-7B or Gemma-2B) augmented with LoRA adapters. Outputs include audio caption tokens and last-layer hidden states.
  • VATT Audio: A bidirectional Transformer decoder ($L_{mm}=24$, $d_{mm}=1024$, $h=16$) consumes concatenated condition embeddings and masked audio-token embeddings, producing audio tokens via iterative parallel decoding. A neural codec (Encodec) translates the audio tokens to a waveform.
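Iterative parallel decoding can be sketched as a MaskGIT-style loop: start fully masked, predict all positions at once, commit the most confident ones, and re-mask the rest on a shrinking schedule. The sketch below is ours, not the paper's code; the random `logits` are a placeholder for the bidirectional Transformer, and the cosine masking schedule is an assumption based on common practice for this decoder family.

```python
import numpy as np

# Minimal sketch of iterative parallel decoding over masked audio tokens.
def iterative_decode(seq_len, vocab, steps=8, rng=np.random.default_rng(0)):
    MASK = -1
    tokens = np.full(seq_len, MASK)
    for step in range(1, steps + 1):
        # Placeholder for the Transformer's per-position token distributions.
        logits = rng.standard_normal((seq_len, vocab))
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        conf[tokens != MASK] = np.inf          # committed tokens stay committed
        # Cosine schedule: number of positions left masked after this step.
        n_masked = int(seq_len * np.cos(np.pi / 2 * step / steps))
        keep = np.argsort(conf)[n_masked:]     # most confident positions
        newly = keep[tokens[keep] == MASK]     # only fill still-masked slots
        tokens[newly] = pred[newly]
    return tokens

out = iterative_decode(seq_len=500, vocab=1024)
print((out == -1).sum())  # 0: every position is decoded after the final step
```

With $T_c = 500$ time steps and a handful of decoding iterations, this parallel scheme is what makes the sub-second generation speeds reported in Section 4 plausible, versus hundreds of sequential autoregressive steps.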

2. Input Representation and Preprocessing

Distinct protocols govern raw data handling:

  • Video: Clips sampled at 10 fps, spatially cropped ($8\%$–$100\%$ of the area, aspect ratio $[0.5, 2]$), resized to $224 \times 224$, and augmented by flipping and color jitter. Patch sizes of $4 \times 16 \times 16$ yield 1568 tokens per clip; 50% are randomly discarded (DropToken) for computational efficiency.
  • Audio: Synchronized 48 kHz waveforms, chunked into 128-sample segments for 1200 tokens per clip, with 50% DropToken.
  • Text: Derived from ASR transcripts, clipped/padded to 16 tokens, embedded by one-hot lookup (no DropToken).

Preprocessing in the generative VATT system follows similar principles, using EVA-CLIP embeddings for video and Encodec representations for audio (four codebooks, $T_c = 500$ time steps for 10 s at 16 kHz).
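DropToken itself is simple enough to show directly: sample a random subset of the token sequence and feed only that subset to the Transformer, so self-attention cost falls roughly quadratically in the kept length. A minimal sketch (the function name is ours; shapes follow the video setting above):

```python
import numpy as np

# DropToken sketch: randomly discard a fraction of input tokens before the
# Transformer; positional encodings (already added) preserve location info.
def drop_token(tokens, drop_rate=0.5, rng=np.random.default_rng(0)):
    n = tokens.shape[0]
    keep = rng.permutation(n)[: int(n * (1 - drop_rate))]
    return tokens[np.sort(keep)]  # keep original temporal/spatial order

tokens = np.arange(1568 * 4, dtype=np.float32).reshape(1568, 4)
kept = drop_token(tokens)
print(kept.shape)  # (784, 4): half of the 1568 video tokens survive
```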

3. Training Objectives and Optimization

VATT’s learning paradigm is anchored in multimodal contrastive losses and captioning/generation objectives:

  • Contrastive Losses (Akbari et al., 2021):
    • Video–Audio InfoNCE:

    $$L_{va} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(v_{va}^{(i)} \cdot a_{va}^{(i)} / \tau\right)}{\sum_{j=1}^{N} \exp\left(v_{va}^{(i)} \cdot a_{va}^{(j)} / \tau\right)}$$

    • Video–Text MIL-NCE: Uses the $K=5$ nearest text clips as positives, negatives from other batch instances, and weighting $\lambda = 1$.
    • Total loss: $L = L_{va} + \lambda L_{vt}$.
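The video-audio InfoNCE term can be written out numerically in a few lines. This is an illustrative NumPy sketch, not the training code: `infonce` is our name, the embeddings are assumed to be the projected video/audio vectors $v_{va}, a_{va}$, and $\tau$ is set to a commonly used value rather than one confirmed by the source.

```python
import numpy as np

# InfoNCE over a batch of paired video/audio embeddings: positives sit on the
# diagonal of the similarity matrix, negatives fill the rest of each row.
def infonce(v, a, tau=0.07):
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    logits = (v @ a.T) / tau                   # (N, N) cosine similarities / tau
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))      # -log p(correct pair) averaged

rng = np.random.default_rng(0)
v = rng.standard_normal((8, 512))
loss_aligned = infonce(v, v)                       # perfectly matched pairs
loss_random = infonce(v, rng.standard_normal((8, 512)))  # unrelated pairs
print(loss_aligned < loss_random)  # True: aligned pairs give a lower loss
```

The MIL-NCE video-text term has the same row-softmax structure, except each video's numerator sums over its $K=5$ candidate text positives.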

  • Generative Losses (Liu et al., 2024):

    • Converter Captioning:

    $$\mathcal{L}_{v2t} = -\sum_{l=1}^{N} \log P_\theta\left(t_{al} \mid T_i, V_{lm}\right)$$

    • Audio Token Classification:

    $$\mathcal{L}_{audio} = -\sum_{i:\, A^{M}_{tok,i} = \langle \mathrm{MASK} \rangle} \log P_\phi\left(\hat{a}_i = a^{gt}_i \mid A^{M}_{tok}, H_{lm}\right)$$

    • Optimization: Adam or AdamW, a large-batch regime (2048 in (Akbari et al., 2021); up to 48/36 in (Liu et al., 2024)), DropToken, strong data augmentation, and staged learning-rate schedules.
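The audio-token loss is an ordinary cross-entropy restricted to masked positions. A minimal NumPy sketch under assumed shapes ($T_c = 500$ positions, a 1024-entry codec vocabulary, random placeholder logits; `masked_ce` is our name):

```python
import numpy as np

# Cross-entropy over only the positions that were replaced by <MASK>.
def masked_ce(logits, targets, mask):
    # logits: (T, V) scores; targets: (T,) ground-truth codec token ids;
    # mask: (T,) bool, True where the input token was <MASK>.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked[mask].mean()

rng = np.random.default_rng(0)
T, V = 500, 1024
logits = rng.standard_normal((T, V))          # placeholder decoder outputs
targets = rng.integers(0, V, size=T)
mask = rng.random(T) < 0.5                    # ~half the positions masked
loss = masked_ce(logits, targets, mask)       # near log(V) for random logits
```

Unmasked positions contribute no gradient, which is what lets the decoder be trained with the same masked-infilling interface it uses at inference time.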

Pre-training utilizes HowTo100M and AudioSet (representation learning), and VGGSound and synthetic instruction-caption datasets (video-to-audio generation).

4. Downstream Tasks and Evaluation

VATT models demonstrate competitive or state-of-the-art performance on multiple benchmarks:

  • Video Action Recognition (fine-tuned) (Akbari et al., 2021):

    • Kinetics-400: 82.1% top-1 (VATT-Large, new self-supervised record)
    • Kinetics-600: 83.6%, Kinetics-700: 72.7%, Moments-in-Time: 41.1%
  • Audio Event Classification: VATT-Base achieves mAP 39.4%, AUC 97.1%, $d' = 2.895$ (AudioSet)
  • Image Classification: VATT pre-training yields 78.7% top-1 ImageNet accuracy (vs. 64.7% trained from scratch; ViT-Base/JFT: 79.9% supervised)
  • Zero-Shot Retrieval: YouCook2 R@10 ≈ 45.5, MSR-VTT R@10 ≈ 29.7, MedR ≈ 49
  • Video-to-Audio Generation (Liu et al., 2024) (VGGSound test set):
    • KLD as low as 1.41 (VATT-LLama-T, GT text prompt)
    • FAD down to 2.35; Align Acc 82.8% (VATT-Gemma)
    • Subjective preference: VATT-LLama-T rated highest for relevance.
| Method       | KLD ↓ | FAD ↓ | Align Acc ↑ | Speed (s) ↓ |
|--------------|-------|-------|-------------|-------------|
| SpecVQGAN    | 3.78  | 6.63  | 48.8        | 7.2         |
| IM2WAV       | 2.54  | 6.32  | 74.3        | 289.5       |
| FoleyGen     | 2.89  | 2.59  | 73.8        | 6.9         |
| V2A-Mapper   | 2.78  | 0.99  | 74.4        | 11.5        |
| VATT-Gemma-T | 1.66  | 2.98  | 81.5        | 0.76        |

Text guidance consistently improves KLD and subjective ratings, confirming effective controllability.

5. Modality-Agnostic and Controllable Generation

A key property of VATT (Akbari et al., 2021) is modality-agnostic Transformer parameterization, where shared self-attention and MLP weights process video, audio, and text streams equivalently, with empirical parity between shared and separate backbones across tasks. Performance losses from aggressive DropToken regularization remain minimal (<1%).
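Parameter sharing of this kind is structurally simple: one set of attention/MLP weights is instantiated once and applied to every modality's token stream. The sketch below (our construction, with a single attention head, one layer, and no MLP, far smaller than any VATT variant) shows the shape-level idea: the shared backbone accepts token sequences of any length and modality, since only the upstream tokenizers differ.

```python
import numpy as np

# Modality-agnostic backbone sketch: one weight set serves every modality.
class SharedBackbone:
    def __init__(self, d=64, rng=np.random.default_rng(0)):
        self.Wq = rng.standard_normal((d, d)) * 0.02
        self.Wk = rng.standard_normal((d, d)) * 0.02
        self.Wv = rng.standard_normal((d, d)) * 0.02

    def forward(self, x):
        # Single-head self-attention over an (N, d) token sequence.
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        att = np.exp(q @ k.T / np.sqrt(x.shape[1]))
        att /= att.sum(axis=1, keepdims=True)
        return att @ v

rng = np.random.default_rng(1)
backbone = SharedBackbone()                    # one parameter set ...
video_tokens = rng.standard_normal((16, 64))   # ... applied to each stream,
audio_tokens = rng.standard_normal((12, 64))   # regardless of sequence length
print(backbone.forward(video_tokens).shape, backbone.forward(audio_tokens).shape)
```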

VATT (Liu et al., 2024) establishes controllable video-to-audio generation: the same video input paired with differing text prompts produces distinct plausible audios (e.g., "gentle water fountain" vs. "crowd talking and laughing" vs. "classical music background"). In the absence of a user prompt, the VATT Converter generates a default audio caption, supporting both caption suggestion and automated steering; mean opinion scores for the synthetic captions reach 4.72 ± 0.37 (out of 5).

6. Ablative Studies and Analysis

Systematic ablation reveals architectural and hyperparameter effects:

  • DropToken: Quadratic reduction in pre-training compute; halved tokens induce <1% downstream accuracy loss.
  • Patch Size: Video patching at $4 \times 16 \times 16$ is optimal; an audio patch length of 128 samples performs best.
  • Model Scaling: Scaling (Small → Base → Medium → Large) yields monotonic increases in accuracy across video and image tasks.
  • Spectrograms vs. Waveforms: Spectrogram input for audio yields retrieval accuracy similar to raw waveforms, indicating that direct waveform modeling loses little despite skipping handcrafted features.

Computation remains intensive, though it is mitigated by token-dropping strategies such as DropToken. Reliance on the co-occurrence of modalities (speech/audio with video) limits applicability to silent or poorly transcribed samples.

7. Limitations and Prospective Directions

Limitations include:

  • Dependence on co-occurring modalities: video-only data, or noisy speech transcripts, constrain representation quality.
  • High computational demands for training large-scale Transformers, partly alleviated by token dropout.
  • Text quality (from ASR or instruction synthesis) affects downstream controllability and retrieval.

Suggested directions for future research span improved data augmentation, more reliable textual transcripts, mixture-of-experts for dynamic modality routing, and scaling of model and data sizes. Integration with advanced generative codecs and more flexible condition embeddings is also under exploration.

Recent advances indicate that attention mechanisms enable unified, self-supervised multimodal learning at scale, providing robust representations and semantically controllable generative models for diverse downstream applications (Akbari et al., 2021, Liu et al., 2024).
