xGen-MM-Vid: Efficient Video Token Compression
- xGen-MM-Vid is a multimodal video model that compresses thousands of patch features into as few as 32 tokens, drastically reducing computational overhead.
- It integrates Perceiver-based frame tokenization with two temporal encoding strategies (TokenLearner and Sequential TTM) to effectively capture spatio-temporal dynamics.
- Empirical results show that, with only 4B parameters, the model achieves competitive video QA and captioning performance compared to larger counterparts.
xGen-MM-Vid (BLIP-3-Video) is a multimodal LLM (VLM) designed for efficient video understanding by compressing temporal and spatial visual information into extremely compact token representations. The architecture introduces a temporal encoder that, together with a visual tokenizer, enables the mapping of entire video clips to as few as 32 visual tokens. This schema allows the model, with only 4B parameters, to achieve video question-answering and captioning performance on par with substantially larger models while using orders of magnitude fewer visual tokens (Ryoo et al., 2024).
1. Visual Tokenization and Frame Encoding
The foundation of xGen-MM-Vid’s video pipeline is its frame-wise visual tokenizer. For each raw frame $x_t \in \mathbb{R}^{H \times W \times 3}$ (where $t = 1, \dots, T$), the image is partitioned into non-overlapping patches, yielding $N_p$ patches per frame. Each patch $p_{t,i}$ is linearly embedded: $e_{t,i} = W_e\,p_{t,i} + b_e \in \mathbb{R}^{d}$, for $i = 1, \dots, N_p$.
A Perceiver-Resampler, functioning as a lightweight cross-attention module, reduces this set of $N_p$ patch embeddings to $n$ tokens per frame (128 in BLIP-3-Video), compactly represented as $z_t \in \mathbb{R}^{n \times d}$. For $T = 8$ uniformly sampled frames, the tokenized video is concatenated as $Z = [z_1; z_2; \dots; z_T] \in \mathbb{R}^{Tn \times d}$.
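The per-frame resampling step can be sketched as a single cross-attention pass in which a small set of learned latent queries attends over the frame's patch embeddings. This is a minimal NumPy sketch; the single-head, single-layer form and all shapes are illustrative simplifications of the actual Perceiver-Resampler, not its implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(patch_tokens, latents):
    """One cross-attention step: n learned latent queries attend over a
    frame's patch tokens, producing n output tokens for that frame.
    patch_tokens: (num_patches, d); latents: (n, d)."""
    d = patch_tokens.shape[-1]
    attn = softmax(latents @ patch_tokens.T / np.sqrt(d))  # (n, num_patches)
    return attn @ patch_tokens                             # (n, d)

# Toy shapes (illustrative, not the model's actual dimensions):
rng = np.random.default_rng(0)
patches = rng.normal(size=(729, 64))    # e.g. a 27x27 patch grid
latents = rng.normal(size=(128, 64))    # n = 128 query latents
frame_tokens = perceiver_resample(patches, latents)
print(frame_tokens.shape)
```

Each frame thus contributes a fixed-size token set regardless of its patch count, which is what makes the later concatenation across frames well defined.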
2. Temporal Encoder Mechanisms
To obtain a highly compressed global representation of the entire video, xGen-MM-Vid employs either:
a) Learnable Spatio-Temporal Pooling (TokenLearner):
A pooling function $f: \mathbb{R}^{N \times d} \to \mathbb{R}^{m \times d}$ (for $N$ input tokens and $m$ output tokens) is implemented as an MLP over the input token matrix $Z \in \mathbb{R}^{N \times d}$. The learned attention matrix is produced as $A = \mathrm{softmax}(\mathrm{MLP}(Z))$, with $A \in \mathbb{R}^{m \times N}$ and the softmax taken over the $N$ input tokens. The final spatio-temporal tokens are obtained by $Y = AZ \in \mathbb{R}^{m \times d}$, i.e., as a weighted sum that “soft-selects” informative tokens across frames and space.
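The pooling above can be sketched in a few lines of NumPy; the tanh MLP, hidden width, and weight shapes here are illustrative stand-ins, not the paper's exact parameterization:

```python
import numpy as np

def token_learner(Z, W1, W2):
    """TokenLearner-style pooling: an MLP scores every input token for
    each of m outputs; a softmax over the N inputs yields the attention
    matrix A (m, N); the outputs are the weighted sums A @ Z."""
    h = np.tanh(Z @ W1)                       # (N, hidden)
    logits = (h @ W2).T                       # (m, N)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)      # each row sums to 1 over the N inputs
    return A @ Z                              # (m, d)

rng = np.random.default_rng(1)
N, d, hidden, m = 1024, 64, 32, 32            # 1024 frame tokens -> 32 video tokens
Z = rng.normal(size=(N, d))
Y = token_learner(Z, rng.normal(size=(d, hidden)), rng.normal(size=(hidden, m)))
print(Y.shape)
```

Because the output count $m$ is fixed by the MLP's output width, the LLM's visual context length is constant no matter how many frame tokens go in.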
b) Sequential Token Turing Machine (TTM):
A memory state $M_t \in \mathbb{R}^{s \times d}$ maintains temporal context by recurrently integrating each frame’s tokens $z_t$. The memory updates are performed by a small 4-layer Transformer processor $P$, flanked by TokenLearner-based read/write modules: $I_t = \mathrm{Read}(M_t, z_t)$, $O_t = P(I_t)$, $M_{t+1} = \mathrm{Write}(M_t, O_t, z_t)$.
After processing all $T$ frames, the final memory $M_T$ is pooled once more by a TokenLearner module to obtain $m$ global tokens, $Y = \mathrm{TokenLearner}(M_T) \in \mathbb{R}^{m \times d}$, with typical $m = 32$.
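The read–process–write recurrence can be illustrated schematically. In this sketch the read/write modules are replaced by a naive mean-pooling stand-in and the Transformer processor by an identity function, so only the data flow and shapes reflect the Sequential TTM, not its learned components:

```python
import numpy as np

def ttm_step(memory, frame_tokens, read, process, write):
    """One Sequential-TTM update (schematic): read a working set from
    memory + current frame tokens, process it, write back to memory."""
    work = read(np.concatenate([memory, frame_tokens]))   # select a small working set
    out = process(work)                                   # a 4-layer Transformer in the paper
    return write(np.concatenate([memory, out]))           # new memory of fixed size

def pool(tokens, m):
    # Stand-in for a TokenLearner read/write: chunked mean pooling to m tokens.
    idx = np.array_split(np.arange(len(tokens)), m)
    return np.stack([tokens[i].mean(axis=0) for i in idx])

rng = np.random.default_rng(2)
d, mem_size, n = 64, 16, 8
memory = np.zeros((mem_size, d))
for t in range(8):                                        # 8 uniformly sampled frames
    frame = rng.normal(size=(n, d))
    memory = ttm_step(memory, frame,
                      read=lambda x: pool(x, 8),
                      process=lambda x: x,                # identity stand-in
                      write=lambda x: pool(x, mem_size))
print(memory.shape)
```

The key property is that the memory stays a fixed $s \times d$ matrix however many frames are streamed through it, which is what allows a final pooled readout of constant size.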
3. Token Compression Pipeline
The token reduction is carried out in two abstraction stages. First, from the $T \times N_p$ raw patch embeddings per video, the Perceiver compresses each frame to $n = 128$ tokens, giving $Tn = 1024$ frame tokens for $T = 8$ frames. Next, either TokenLearner or TTM with pooling produces $m$ compressed visual tokens—commonly $m = 32$—by applying a learned attention mechanism such that $m \ll Tn$. Thus, the architecture achieves an extreme compression ratio, distilling thousands of patch features into a succinct set of global video tokens.
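The two-stage budget can be made concrete with simple arithmetic. The per-frame patch count below is an illustrative assumption (a 27×27 grid); the frame-token and video-token counts follow the configuration described above:

```python
T = 8                        # uniformly sampled frames
patches_per_frame = 729      # assumed 27x27 patch grid (illustrative)
n = 128                      # Perceiver tokens per frame
m = 32                       # final compressed video tokens

raw = T * patches_per_frame  # raw patch features per video
frame_level = T * n          # tokens after per-frame Perceiver resampling
print(raw, frame_level, m, f"overall compression ~{raw // m}x")
```

Under these assumptions, thousands of patch features are reduced to 1024 frame tokens and then to 32 video tokens, an overall compression of more than two orders of magnitude.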
4. Pretraining and Instruction Tuning
BLIP-3-Video’s training regime has three stages, each employing the standard next-token cross-entropy loss over the LLM’s text outputs:
Stage 1: Image-Caption Pretraining
- Model initialized from BLIP-3; SigLIP and Perceiver weights are frozen; temporal encoder initialized randomly.
Stage 2: Video-Caption Pretraining
- 900K video-caption pairs from LLaVA-Hound-DPO, captions rephrased by GPT-4.
- Loss: standard autoregressive next-token cross-entropy over the caption text, $\mathcal{L} = -\sum_i \log p_\theta(y_i \mid y_{<i}, V)$, where $V$ denotes the compressed visual tokens.
- Trained for 1 epoch with the Adam optimizer on 8 H100 GPUs, batch size 16 per GPU, with 500 warmup steps and a cosine learning-rate decay schedule.
Stage 3: Video Instruction Tuning
- Mixed corpus: 99K VideoChatGPT, MSVD-QA (30K), MSRVTT-QA (149K), ActivityNet-QA (32K), TGIF-QA (71K), NExT-QA (34K), Mira (935K captions).
- Cross-entropy on open-ended QA (answers rephrased by GPT-3.5); multiple-choice QA via candidate score cross-entropy.
- Batch size 4 per GPU, trained for 1 epoch (12 hours).
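All three stages optimize the same objective over the LLM's text outputs. A minimal NumPy sketch of that next-token cross-entropy (toy vocabulary and sequence length; the conditioning on visual tokens is implicit in the logits):

```python
import numpy as np

def next_token_xent(logits, targets):
    """Average negative log-probability of the target tokens, i.e. the
    standard autoregressive cross-entropy used in all three stages.
    logits: (seq, vocab) pre-softmax scores; targets: (seq,) int ids."""
    z = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(3)
loss = next_token_xent(rng.normal(size=(10, 50)), rng.integers(0, 50, size=10))
print(float(loss))
```

Only the text positions contribute to the loss; the compressed visual tokens act as a prefix the model conditions on.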
5. Empirical Performance and Benchmark Comparison
The following table summarizes video QA and captioning results reported for BLIP-3-Video (4B, 32 tokens) against published baselines.
| Task / Metric | BLIP-3-Video (4B, 32t) | Tarsier-7B (4608t) | PLLaVA-7B (576t) |
|---|---|---|---|
| MSVD-QA Acc./Score | 77.7 / 4.2 | 77.0 / 4.1 | 76.6 / 4.1 |
| MSRVTT-QA Acc./Score | 60.0 / 3.6 | — | — |
| ActivityNet-QA Acc./Score | 55.7 / 3.5 | — | — |
| TGIF-QA Acc./Score | 76.5 / 4.3 | 79.2 / 4.2 | 77.5 / 4.1 |
| NExT-QA Accuracy (32t) | 76.4% | 71.6% | — |
| NExT-QA Accuracy (128t) | 77.1% | 79.2% (34B) | — |
| MSVD-Cap Acc./Score | 63.6 / 3.38 | 62.3 / 3.37 | — |
| MSRVTT-Cap Acc./Score | 42.1 / 2.82 | 40.3 / 2.77 | — |
| Mira-Cap Acc./Score | 80.7 / 3.96 | 40.6 / 2.87 | — |
Key findings are that BLIP-3-Video, using only 32 tokens, achieves accuracy comparable to much larger competitors (Tarsier-34B, PLLaVA-7B), despite a far smaller footprint and dramatically fewer tokens per video (Ryoo et al., 2024).
6. Computational Efficiency and Trade-Offs
The design targets the quadratic computational cost of self-attention with respect to the total number of tokens in the LLM context. Reducing the visual context from 1024 to 32 tokens yields a roughly 2.5× increase in training and inference throughput per GPU (from 3.3 to 8.2 samples/sec/GPU). Despite this aggressive compression, the learned spatio-temporal abstractions retain sufficient semantic information for competitive QA and captioning performance.
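The gap between the quadratic attention saving and the observed end-to-end speedup can be made explicit with back-of-the-envelope arithmetic (pure arithmetic on the figures above, not a profiler measurement):

```python
# Pairwise attention over the visual context scales with n^2, so shrinking
# 1024 -> 32 tokens cuts that term by (1024/32)^2 = 1024x. The end-to-end
# gain is far smaller because text tokens and the rest of the network are
# unaffected (reported throughput: 3.3 -> 8.2 samples/sec/GPU).
full, compressed = 1024, 32
attn_ratio = (full / compressed) ** 2
throughput_ratio = 8.2 / 3.3
print(attn_ratio, round(throughput_ratio, 2))
```

This illustrates Amdahl's-law behavior: the attention term over visual tokens shrinks by three orders of magnitude, but overall throughput improves only by the fraction of runtime that term occupied.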
However, several trade-offs exist:
- The fixed selection of 8 frames limits modeling of long-range temporal dependencies.
- Excessive token compression may blur fine spatial details.
- Hallucinations occasionally persist in open-ended language outputs.
7. Context, Limitations, and Outlook
xGen-MM-Vid (BLIP-3-Video) demonstrates that careful design of tokenizers and temporal encoders enables extreme reduction of visual context with minimal loss in downstream accuracy. The model achieves this via Perceiver-based frame compression and learnable spatio-temporal pooling or TTM memory modules. While the current instantiation processes only 8 frames and compresses down to 32 tokens, extension to longer-range dynamics and finer granularity may be warranted for applications with high temporal variability. The persistent hallucination phenomenon and trade-off between token count and spatial detail remain open areas for further study (Ryoo et al., 2024).