xGen-MM-Vid: Efficient Video Token Compression
- xGen-MM-Vid is a multimodal video model that compresses thousands of patch features into as few as 32 tokens, drastically reducing computational overhead.
- It integrates Perceiver-based frame tokenization with two temporal encoding strategies (TokenLearner and Sequential TTM) to effectively capture spatio-temporal dynamics.
- Empirical results show that, with only 4B parameters, the model achieves competitive video QA and captioning performance compared to larger counterparts.
xGen-MM-Vid (BLIP-3-Video) is a multimodal LLM (VLM) designed for efficient video understanding by compressing temporal and spatial visual information into extremely compact token representations. The architecture introduces a temporal encoder that, together with a visual tokenizer, enables the mapping of entire video clips to as few as 32 visual tokens. This schema allows the model, with only 4B parameters, to achieve video question-answering and captioning performance on par with substantially larger models while using orders of magnitude fewer visual tokens (Ryoo et al., 2024).
1. Visual Tokenization and Frame Encoding
The foundation of xGen-MM-Vid’s video pipeline is its frame-wise visual tokenizer. For each raw frame $x_t \in \mathbb{R}^{H \times W \times 3}$ (where $t = 1, \dots, T$), the image is partitioned into non-overlapping patches, yielding $N_p$ patches per frame. Each patch $p_{t,i}$ is linearly embedded: $e_{t,i} = W_e\,p_{t,i} + b_e \in \mathbb{R}^{d}$, for $i = 1, \dots, N_p$.
A Perceiver-Resampler, functioning as a lightweight cross-attention module, reduces this set of $N_p$ patch embeddings to $n$ tokens per frame (128 in BLIP-3-Video), compactly represented as $z_t \in \mathbb{R}^{n \times d}$. For $T = 8$ uniformly sampled frames, the tokenized video is concatenated as $Z = [z_1; z_2; \dots; z_T] \in \mathbb{R}^{Tn \times d}$.
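The per-frame resampling step can be sketched as a single cross-attention pass in which a small set of learned latent queries attends over the frame's patch embeddings. This is a minimal NumPy sketch; the single-head, single-layer form and all shapes are illustrative simplifications of the actual Perceiver-Resampler, not its implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(patch_tokens, latents):
    """One cross-attention step: n learned latent queries attend over a
    frame's patch tokens, producing n output tokens for that frame.
    patch_tokens: (num_patches, d); latents: (n, d)."""
    d = patch_tokens.shape[-1]
    attn = softmax(latents @ patch_tokens.T / np.sqrt(d))  # (n, num_patches)
    return attn @ patch_tokens                             # (n, d)

# Toy shapes (illustrative, not the model's actual dimensions):
rng = np.random.default_rng(0)
patches = rng.normal(size=(729, 64))    # e.g. a 27x27 patch grid
latents = rng.normal(size=(128, 64))    # n = 128 query latents
frame_tokens = perceiver_resample(patches, latents)
print(frame_tokens.shape)
```

Each frame thus contributes a fixed-size token set regardless of its patch count, which is what makes the later concatenation across frames well defined.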
2. Temporal Encoder Mechanisms
To obtain a highly compressed global representation of the entire video, xGen-MM-Vid employs either:
a) Learnable Spatio-Temporal Pooling (TokenLearner):
A pooling function $f: \mathbb{R}^{N \times d} \to \mathbb{R}^{m \times d}$ (for $N$ input tokens and $m$ output tokens) is implemented as an MLP over the input token matrix $Z \in \mathbb{R}^{N \times d}$. The learned attention matrix is produced as $A = \mathrm{softmax}(\mathrm{MLP}(Z))$, with $A \in \mathbb{R}^{m \times N}$ and the softmax taken over the $N$ input tokens. The final spatio-temporal tokens are obtained by $Y = AZ \in \mathbb{R}^{m \times d}$, i.e., as a weighted sum that “soft-selects” informative tokens across frames and space.
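The pooling above can be sketched in a few lines of NumPy; the tanh MLP, hidden width, and weight shapes here are illustrative stand-ins, not the paper's exact parameterization:

```python
import numpy as np

def token_learner(Z, W1, W2):
    """TokenLearner-style pooling: an MLP scores every input token for
    each of m outputs; a softmax over the N inputs yields the attention
    matrix A (m, N); the outputs are the weighted sums A @ Z."""
    h = np.tanh(Z @ W1)                       # (N, hidden)
    logits = (h @ W2).T                       # (m, N)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)      # each row sums to 1 over the N inputs
    return A @ Z                              # (m, d)

rng = np.random.default_rng(1)
N, d, hidden, m = 1024, 64, 32, 32            # 1024 frame tokens -> 32 video tokens
Z = rng.normal(size=(N, d))
Y = token_learner(Z, rng.normal(size=(d, hidden)), rng.normal(size=(hidden, m)))
print(Y.shape)
```

Because the output count $m$ is fixed by the MLP's output width, the LLM's visual context length is constant no matter how many frame tokens go in.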
b) Sequential Token Turing Machine (TTM):
A memory state $M_t \in \mathbb{R}^{s \times d}$ maintains temporal context by recurrently integrating each frame’s tokens $z_t$. The memory updates are performed by a small 4-layer Transformer processor $P$, flanked by TokenLearner-based read/write modules: $I_t = \mathrm{Read}(M_t, z_t)$, $O_t = P(I_t)$, $M_{t+1} = \mathrm{Write}(M_t, O_t, z_t)$.
After processing all $T$ frames, the final memory $M_T$ is pooled once more by a TokenLearner module to obtain $m$ global tokens, $Y = \mathrm{TokenLearner}(M_T) \in \mathbb{R}^{m \times d}$, with typical $m = 32$.
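The read–process–write recurrence can be illustrated schematically. In this sketch the read/write modules are replaced by a naive mean-pooling stand-in and the Transformer processor by an identity function, so only the data flow and shapes reflect the Sequential TTM, not its learned components:

```python
import numpy as np

def ttm_step(memory, frame_tokens, read, process, write):
    """One Sequential-TTM update (schematic): read a working set from
    memory + current frame tokens, process it, write back to memory."""
    work = read(np.concatenate([memory, frame_tokens]))   # select a small working set
    out = process(work)                                   # a 4-layer Transformer in the paper
    return write(np.concatenate([memory, out]))           # new memory of fixed size

def pool(tokens, m):
    # Stand-in for a TokenLearner read/write: chunked mean pooling to m tokens.
    idx = np.array_split(np.arange(len(tokens)), m)
    return np.stack([tokens[i].mean(axis=0) for i in idx])

rng = np.random.default_rng(2)
d, mem_size, n = 64, 16, 8
memory = np.zeros((mem_size, d))
for t in range(8):                                        # 8 uniformly sampled frames
    frame = rng.normal(size=(n, d))
    memory = ttm_step(memory, frame,
                      read=lambda x: pool(x, 8),
                      process=lambda x: x,                # identity stand-in
                      write=lambda x: pool(x, mem_size))
print(memory.shape)
```

The key property is that the memory stays a fixed $s \times d$ matrix however many frames are streamed through it, which is what allows a final pooled readout of constant size.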
3. Token Compression Pipeline
The token reduction is carried out in two abstraction stages. First, from the $T \times N_p$ raw patch embeddings per video, the Perceiver compresses each frame to $n = 128$ tokens, giving $Tn = 1024$ frame tokens for $T = 8$ frames. Next, either TokenLearner or TTM with pooling produces $m$ compressed visual tokens—commonly $m = 32$—by applying a learned attention mechanism such that $m \ll Tn$. Thus, the architecture achieves an extreme compression ratio, distilling thousands of patch features into a succinct set of global video tokens.
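The two-stage budget can be made concrete with simple arithmetic. The per-frame patch count below is an illustrative assumption (a 27×27 grid); the frame-token and video-token counts follow the configuration described above:

```python
T = 8                        # uniformly sampled frames
patches_per_frame = 729      # assumed 27x27 patch grid (illustrative)
n = 128                      # Perceiver tokens per frame
m = 32                       # final compressed video tokens

raw = T * patches_per_frame  # raw patch features per video
frame_level = T * n          # tokens after per-frame Perceiver resampling
print(raw, frame_level, m, f"overall compression ~{raw // m}x")
```

Under these assumptions, thousands of patch features are reduced to 1024 frame tokens and then to 32 video tokens, an overall compression of more than two orders of magnitude.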
4. Pretraining and Instruction Tuning
BLIP-3-Video’s training regime has three stages, each employing the standard next-token cross-entropy loss over the LLM’s text outputs:
Stage 1: Image-Caption Pretraining
- Model initialized from BLIP-3; SigLIP and Perceiver weights are frozen; temporal encoder initialized randomly.
Stage 2: Video-Caption Pretraining
- 900K video-caption pairs from LLaVA-Hound-DPO, captions rephrased by GPT-4.
- Loss: standard autoregressive next-token cross-entropy over the caption text, $\mathcal{L} = -\sum_i \log p_\theta(y_i \mid y_{<i}, V)$, where $V$ denotes the compressed visual tokens.
- Trained for 1 epoch with the Adam optimizer on 8 H100 GPUs, batch size 16 per GPU, with 500 warmup steps and a cosine learning-rate decay schedule.
Stage 3: Video Instruction Tuning
- Mixed corpus: 99K VideoChatGPT, MSVD-QA (30K), MSRVTT-QA (149K), ActivityNet-QA (32K), TGIF-QA (71K), NExT-QA (34K), Mira (935K captions).
- Cross-entropy on open-ended QA (answers rephrased by GPT-3.5); multiple-choice QA via candidate score cross-entropy.
- Batch size 4 per GPU, trained for 1 epoch (12 hours).
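All three stages optimize the same objective over the LLM's text outputs. A minimal NumPy sketch of that next-token cross-entropy (toy vocabulary and sequence length; the conditioning on visual tokens is implicit in the logits):

```python
import numpy as np

def next_token_xent(logits, targets):
    """Average negative log-probability of the target tokens, i.e. the
    standard autoregressive cross-entropy used in all three stages.
    logits: (seq, vocab) pre-softmax scores; targets: (seq,) int ids."""
    z = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(3)
loss = next_token_xent(rng.normal(size=(10, 50)), rng.integers(0, 50, size=10))
print(float(loss))
```

Only the text positions contribute to the loss; the compressed visual tokens act as a prefix the model conditions on.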
5. Empirical Performance and Benchmark Comparison
The following table summarizes video QA and captioning results reported for BLIP-3-Video (4B, 32 tokens) against published baselines.
| Task / Metric | BLIP-3-Video (4B, 32t) | Tarsier-7B (4608t) | PLLaVA-7B (576t) |
|---|---|---|---|
| MSVD-QA Acc./Score | 77.7 / 4.2 | 77.0 / 4.1 | 76.6 / 4.1 |
| MSRVTT-QA Acc./Score | 60.0 / 3.6 | — | — |
| ActivityNet-QA Acc./Score | 55.7 / 3.5 | — | — |
| TGIF-QA Acc./Score | 76.5 / 4.3 | 79.2 / 4.2 | 77.5 / 4.1 |
| NExT-QA Accuracy (32t) | 76.4% | 71.6% | — |
| NExT-QA Accuracy (128t) | 77.1% | 79.2% (34B) | — |
| MSVD-Cap Acc./Score | 63.6 / 3.38 | 62.3 / 3.37 | — |
| MSRVTT-Cap Acc./Score | 42.1 / 2.82 | 40.3 / 2.77 | — |
| Mira-Cap Acc./Score | 80.7 / 3.96 | 40.6 / 2.87 | — |
Key findings are that BLIP-3-Video, using only 32 tokens, achieves accuracy comparable to much larger competitors (Tarsier-34B, PLLaVA-7B), despite a far smaller footprint and dramatically fewer tokens per video (Ryoo et al., 2024).
6. Computational Efficiency and Trade-Offs
The design targets the quadratic computational cost of self-attention with respect to the total number of tokens in the LLM context. Reducing the visual context from 1024 to 32 tokens yields a roughly 2.5× increase in training and inference throughput per GPU (from 3.3 to 8.2 samples/sec/GPU). Despite this aggressive compression, the learned spatio-temporal abstractions retain sufficient semantic information for competitive QA and captioning performance.
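The gap between the quadratic attention saving and the observed end-to-end speedup can be made explicit with back-of-the-envelope arithmetic (pure arithmetic on the figures above, not a profiler measurement):

```python
# Pairwise attention over the visual context scales with n^2, so shrinking
# 1024 -> 32 tokens cuts that term by (1024/32)^2 = 1024x. The end-to-end
# gain is far smaller because text tokens and the rest of the network are
# unaffected (reported throughput: 3.3 -> 8.2 samples/sec/GPU).
full, compressed = 1024, 32
attn_ratio = (full / compressed) ** 2
throughput_ratio = 8.2 / 3.3
print(attn_ratio, round(throughput_ratio, 2))
```

This illustrates Amdahl's-law behavior: the attention term over visual tokens shrinks by three orders of magnitude, but overall throughput improves only by the fraction of runtime that term occupied.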
However, several trade-offs exist:
- The fixed selection of 8 frames limits modeling of long-range temporal dependencies.
- Excessive token compression may blur fine spatial details.
- Hallucinations occasionally persist in open-ended language outputs.
7. Context, Limitations, and Outlook
xGen-MM-Vid (BLIP-3-Video) demonstrates that careful design of tokenizers and temporal encoders enables extreme reduction of visual context with minimal loss in downstream accuracy. The model achieves this via Perceiver-based frame compression and learnable spatio-temporal pooling or TTM memory modules. While the current instantiation processes only 8 frames and compresses down to 32 tokens, extension to longer-range dynamics and finer granularity may be warranted for applications with high temporal variability. The persistent hallucination phenomenon and trade-off between token count and spatial detail remain open areas for further study (Ryoo et al., 2024).