
xGen-MM-Vid: Efficient Video Token Compression

Updated 26 January 2026
  • xGen-MM-Vid is a multimodal video model that compresses thousands of patch features into as few as 32 tokens, drastically reducing computational overhead.
  • It integrates Perceiver-based frame tokenization with two temporal encoding strategies (TokenLearner and Sequential TTM) to effectively capture spatio-temporal dynamics.
  • Empirical results show that, with only 4B parameters, the model achieves competitive video QA and captioning performance compared to larger counterparts.

xGen-MM-Vid (BLIP-3-Video) is a multimodal large language model (VLM) designed for efficient video understanding: it compresses temporal and spatial visual information into extremely compact token representations. The architecture introduces a temporal encoder that, together with a visual tokenizer, maps entire video clips to as few as 32 visual tokens. This scheme allows the model, with only 4B parameters, to achieve video question-answering and captioning performance on par with substantially larger models while using orders of magnitude fewer visual tokens (Ryoo et al., 2024).

1. Visual Tokenization and Frame Encoding

The foundation of xGen-MM-Vid's video pipeline is its frame-wise visual tokenizer. Each raw frame $x_t \in \mathbb{R}^{H \times W \times 3}$ (where $H = W = 384$) is partitioned by the SigLIP encoder into non-overlapping $14 \times 14$ patches, yielding $\ell = \lfloor H/p \rfloor \cdot \lfloor W/p \rfloor = 27 \cdot 27 = 729$ patches per frame. Each patch $x_{t,i} \in \mathbb{R}^{14 \times 14 \times 3}$ is linearly embedded:

$z_{t,i} = W_e \cdot \mathrm{vec}(x_{t,i}) + b_e, \qquad z_{t,i} \in \mathbb{R}^{1152}$

A Perceiver-Resampler, functioning as a lightweight cross-attention module, reduces the set $\{z_{t,1}, \dots, z_{t,\ell}\}$ to $N = 128$ tokens per frame, compactly represented as $V_t \in \mathbb{R}^{128 \times d}$. For $T = 8$ uniformly sampled frames, the tokenized video is the concatenation $V = [V_1; \dots; V_8] \in \mathbb{R}^{1024 \times d}$.
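The shape bookkeeping of this frame-wise stage can be sketched with random weights. This is a shape-level illustration only: a single cross-attention step with random, untrained latents stands in for the actual trained Perceiver-Resampler.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1152                    # SigLIP embedding width (from the text)
T, ell, N = 8, 729, 128     # frames, patches per frame, Perceiver tokens per frame

def perceiver_pool(z, n_latents):
    """One cross-attention step: learned latent queries attend over patch tokens.
    Weights are random here -- a shape-level sketch, not the trained resampler."""
    latents = rng.standard_normal((n_latents, z.shape[1]))   # (N, d) query tokens
    scores = latents @ z.T / np.sqrt(z.shape[1])             # (N, ell) attention logits
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)                  # row-wise softmax
    return attn @ z                                          # (N, d) frame tokens V_t

frames = [rng.standard_normal((ell, d)) for _ in range(T)]   # patch embeddings z_t
video_tokens = np.concatenate([perceiver_pool(z, N) for z in frames])
print(video_tokens.shape)   # (1024, 1152), i.e. V = [V_1; ...; V_8]
```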

2. Temporal Encoder Mechanisms

To obtain a highly compressed global representation of the entire video, xGen-MM-Vid employs either:

a) Learnable Spatio-Temporal Pooling (TokenLearner):

A pooling function $A: \mathbb{R}^{d \times K} \rightarrow \mathbb{R}^{M \times K}$ (for $K = 1024$ input tokens and $M$ output tokens) is implemented as an MLP over the input token matrix. The learned attention matrix is produced as $A(V)_{ij} = \mathrm{softmax}(\alpha(V^\top))_{ij}$ with $\alpha(V^\top) = \mathrm{MLP}_m(V^\top) \in \mathbb{R}^{M \times K}$, where the softmax is taken row-wise over the $K$ inputs. The final $M$ spatio-temporal tokens are obtained by $X = A(V) \cdot V$, i.e., as a weighted sum that "soft-selects" informative tokens across frames and space.
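A minimal NumPy sketch of this pooling, under simplifying assumptions: a single linear layer stands in for $\mathrm{MLP}_m$, the weights are untrained, and $d$ is reduced for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, M = 1024, 64, 32      # input tokens, token dim (reduced for the demo), outputs

def token_learner(V, W, b):
    """X = A(V) . V with A(V) = softmax(alpha(V^T)) taken row-wise over K tokens.
    A single linear layer stands in for MLP_m; weights are untrained (sketch only)."""
    logits = W @ V.T + b                      # alpha(V^T) in R^{M x K}
    logits -= logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)         # each output is a convex combination
    return A @ V                              # (M, d) pooled spatio-temporal tokens

V = rng.standard_normal((K, d))               # 8 frames x 128 tokens, flattened
W = 0.01 * rng.standard_normal((M, d))
b = np.zeros((M, 1))
X = token_learner(V, W, b)
print(X.shape)   # (32, 64)
```

Because each row of $A(V)$ sums to one, every output token is a convex combination of all $K$ inputs, which is what lets gradients flow to every spatio-temporal position during training.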

b) Sequential Token Turing Machine (TTM):

A memory state $M_{t-1} \in \mathbb{R}^{G \cdot N \times d}$ maintains temporal context by recurrently integrating each frame's tokens $V_t$. The memory update is performed by a small 4-layer Transformer processor $f_\theta$ with TokenLearner-based read/write modules:

$M_t = f_\theta(M_{t-1}, V_t)$

After processing all frames, the final memory $M_T \in \mathbb{R}^{512 \times d}$ is pooled once more by a TokenLearner module to obtain $M$ global tokens, $X = A_{seq}(M_T) \cdot M_T$, with typical $M = 32$.
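A toy sketch of this recurrent loop, assuming a TokenLearner-style soft pooling stands in for both the write step and the final readout. The weights are random and dimensions reduced; the real $f_\theta$ is a 4-layer Transformer processor.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, mem_slots, T = 64, 128, 512, 8   # reduced dim; tokens/frame; memory size; frames

def soft_pool(tokens, W):
    """TokenLearner-style soft selection: W has one row per output token."""
    logits = W @ tokens.T                          # (M, num_tokens)
    logits -= logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)
    return A @ tokens                              # (M, d)

def ttm_step(memory, frame_tokens, W_write):
    """Stand-in for f_theta: fuse memory with the new frame, then write a
    fixed-size memory back. The real model uses a 4-layer Transformer here."""
    joint = np.concatenate([memory, frame_tokens])      # (mem_slots + N, d)
    return soft_pool(joint, W_write)                    # (mem_slots, d)

W_write = 0.01 * rng.standard_normal((mem_slots, d))
W_out = 0.01 * rng.standard_normal((32, d))
memory = np.zeros((mem_slots, d))                       # M_0
for _ in range(T):
    V_t = rng.standard_normal((N, d))
    memory = ttm_step(memory, V_t, W_write)             # M_t = f(M_{t-1}, V_t)
X = soft_pool(memory, W_out)                            # final M = 32 global tokens
print(X.shape)   # (32, 64)
```

The key property illustrated: the memory stays a fixed size regardless of how many frames stream through, so cost grows linearly, not quadratically, with video length.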

3. Token Compression Pipeline

The token reduction is carried out in two abstraction stages. First, the $8 \times 729 = 5832$ raw patch embeddings per video are compressed by the Perceiver to $8 \times 128 = 1024$ frame tokens. Next, either TokenLearner or TTM-with-pooling produces $M$ compressed visual tokens (commonly $M = 32$) by applying a learned attention mechanism $A \in \mathbb{R}^{32 \times 1024}$ such that $X = A \cdot V$. The architecture thus achieves an extreme compression ratio, distilling thousands of patch features into a succinct set of global video tokens.
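The stage-wise token budget is simple arithmetic over the numbers above:

```python
# Token budget at each pipeline stage (numbers from the text).
frames, patches_per_frame = 8, 729
perceiver_tokens_per_frame = 128
final_tokens = 32

raw = frames * patches_per_frame                        # raw patch embeddings
after_perceiver = frames * perceiver_tokens_per_frame   # concatenated frame tokens
print(raw, after_perceiver, final_tokens)               # 5832 1024 32
print(f"overall compression: {raw / final_tokens:.0f}x")  # 182x
```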

4. Pretraining and Instructional Tuning

BLIP-3-Video’s training regime has three stages, each employing the standard next-token cross-entropy loss over the LLM’s text outputs:

Stage 1: Image-Caption Pretraining

  • Model initialized from BLIP-3; SigLIP and Perceiver weights are frozen; temporal encoder initialized randomly.

Stage 2: Video-Caption Pretraining

  • 900K video-caption pairs from LLaVA-Hound-DPO, captions rephrased by GPT-4.
  • Loss: $\mathcal{L}_{cap} = -\sum_{i=1}^{L} \log P(w_i \mid w_{<i}, X)$.
  • Trained for 1 epoch with the Adam optimizer on 8 H100 GPUs, batch size 16 per GPU, learning rate $2 \times 10^{-5}$, 500 warmup steps, cosine decay.
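The captioning objective above is the standard next-token cross-entropy. A minimal NumPy sketch on toy logits (the shapes and vocabulary size are illustrative, not the model's):

```python
import numpy as np

rng = np.random.default_rng(0)

def caption_loss(logits, targets):
    """L_cap = -sum_i log P(w_i | w_<i, X): next-token cross-entropy summed over
    the caption. logits[i] scores the vocabulary at position i; targets[i] = w_i."""
    logits = logits - logits.max(axis=1, keepdims=True)       # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

logits = rng.standard_normal((5, 10))        # toy caption: 5 tokens, vocab of 10
targets = rng.integers(0, 10, size=5)        # ground-truth token ids w_1..w_5
loss = caption_loss(logits, targets)
print(loss >= 0)   # True: summed negative log-likelihood is non-negative
```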

Stage 3: Video Instruction Tuning

  • Mixed corpus: 99K VideoChatGPT, MSVD-QA (30K), MSRVTT-QA (149K), ActivityNet-QA (32K), TGIF-QA (71K), NExT-QA (34K), Mira (935K captions).
  • Cross-entropy on open-ended QA (answers rephrased by GPT-3.5); multiple-choice QA via candidate score cross-entropy.
  • Batch size 4 per GPU, learning rate $10^{-5}$, 1 epoch (~12 hours).

5. Empirical Performance and Benchmark Comparison

The following table summarizes video QA and captioning results reported for BLIP-3-Video (4B, 32 tokens) against published baselines.

| Task / Metric | BLIP-3-Video (4B, 32t) | Tarsier-7B (4608t) | PLLaVA-7B (576t) |
|---|---|---|---|
| MSVD-QA Acc./Score | 77.7 / 4.2 | 77.0 / 4.1 | 76.6 / 4.1 |
| MSRVTT-QA Acc./Score | 60.0 / 3.6 | — | — |
| ActivityNet-QA Acc./Score | 55.7 / 3.5 | — | — |
| TGIF-QA Acc./Score | 76.5 / 4.3 | 79.2 / 4.2 | 77.5 / 4.1 |
| NExT-QA Accuracy (32t) | 76.4% | 71.6% | — |
| NExT-QA Accuracy (128t) | 77.1% | 79.2% (34B) | — |
| MSVD-Cap Acc./Score | 63.6 / 3.38 | 62.3 / 3.37 | — |
| MSRVTT-Cap Acc./Score | 42.1 / 2.82 | 40.3 / 2.77 | — |
| Mira-Cap Acc./Score | 80.7 / 3.96 | 40.6 / 2.87 | — |

The key finding is that BLIP-3-Video, using only 32 visual tokens, matches the accuracy of competitors with far larger parameter counts (e.g., Tarsier-34B) or far larger token budgets (e.g., PLLaVA-7B at 576 tokens), despite its much smaller footprint (Ryoo et al., 2024).

6. Computational Efficiency and Trade-Offs

The design targets the quadratic computational cost of self-attention with respect to total token count in the LLM. Reducing the visual context from 1024 to 32 tokens yields a 2–3× increase in training and inference throughput per GPU (from 3.3 to 8.2 samples/sec/GPU). Despite this aggressive compression, the learned spatio-temporal abstractions retain sufficient semantic information for competitive QA and captioning performance.
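To see where the speedup comes from, compare the quadratic attention cost at the two visual-token budgets. The 256-token text prompt below is a hypothetical value, not a figure from the paper:

```python
# Self-attention cost in the LLM scales as O(n^2) in total context length n.
# The 256-token text prompt is a hypothetical value, not from the paper.
text = 256
costs = {}
for visual in (1024, 32):        # visual-token budget before/after compression
    n = text + visual            # total LLM context length
    costs[visual] = n * n        # pairwise attention interactions
ratio = costs[1024] / costs[32]
print(f"attention-cost ratio: {ratio:.1f}x")   # 19.8x on attention alone
# End-to-end speedup is smaller (2-3x per the text): MLP and embedding
# costs scale only linearly with n.
```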

However, several trade-offs exist:

  • The fixed selection of 8 frames limits modeling of long-range temporal dependencies.
  • Excessive compression (e.g., $M < 16$) may blur fine spatial details.
  • Hallucinations occasionally persist in open-ended language outputs.

7. Context, Limitations, and Outlook

xGen-MM-Vid (BLIP-3-Video) demonstrates that careful design of tokenizers and temporal encoders enables extreme reduction of visual context with minimal loss in downstream accuracy. The model achieves this via Perceiver-based frame compression and learnable spatio-temporal pooling or TTM memory modules. While the current instantiation processes only 8 frames and compresses down to 32 tokens, extension to longer-range dynamics and finer granularity may be warranted for applications with high temporal variability. The persistent hallucination phenomenon and trade-off between token count and spatial detail remain open areas for further study (Ryoo et al., 2024).

References

  • Ryoo, M. S., et al. (2024). xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs.
