
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs (2410.16267v1)

Published 21 Oct 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: We present xGen-MM-Vid (BLIP-3-Video): a multimodal LLM for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP3-Video to use much fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html

An Analysis of xGen-MM-Vid (BLIP-3-Video): Efficient Multimodal Language Modeling for Video

The paper presents xGen-MM-Vid (BLIP-3-Video), a multimodal LLM tailored for video data. It adds a 'temporal encoder' on top of the conventional visual tokenizer, enabling the model to represent an entire video with significantly fewer tokens (32 rather than thousands). This token efficiency positions BLIP-3-Video as a compact and practical option for video processing in vision-language models (VLMs).

Model Architecture and Methodology

xGen-MM-Vid is built on the image-based BLIP-3 architecture, to which a temporal encoder is added. Key components include (a minimal pipeline sketch follows the list):

  • Vision Encoder (ViT): Processes individual frames.
  • Frame-Level Tokenizer: Reduces token count per frame.
  • Temporal Encoder: Abstracts multiple frame-level tokens into a compact set of video-level tokens.
  • Autoregressive LLM (Phi-3): Generates text based on video and text input.
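
These four stages compose a straightforward pipeline: per-frame patch features are distilled into a fixed number of frame tokens, all frame tokens are pooled into 32 video-level tokens, and those tokens are handed to the LLM. The sketch below is plain PyTorch, not the authors' code; the module choices, dimensions, 8-frame sampling, and 128-token per-frame budget are illustrative assumptions, and only the 32 video tokens reflect the paper's stated budget.

```python
# Minimal sketch of the four-stage pipeline above, NOT the authors' implementation.
import torch
import torch.nn as nn

class VideoTokenPipeline(nn.Module):
    def __init__(self, d_model=768, frame_tokens=128, video_tokens=32):
        super().__init__()
        # Placeholders for the real components (ViT, perceiver-style frame
        # tokenizer, temporal encoder); each is reduced to a single layer.
        self.vision_encoder = nn.Linear(3 * 16 * 16, d_model)      # per-patch embedding
        self.frame_queries = nn.Parameter(torch.randn(frame_tokens, d_model))
        self.frame_tokenizer = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.video_queries = nn.Parameter(torch.randn(video_tokens, d_model))
        self.temporal_encoder = nn.MultiheadAttention(d_model, 8, batch_first=True)

    def forward(self, patches):                 # patches: (T frames, N patches, 3*16*16)
        T = patches.shape[0]
        frame_feats = self.vision_encoder(patches)                            # (T, N, d)
        q = self.frame_queries.unsqueeze(0).expand(T, -1, -1)
        frame_tokens, _ = self.frame_tokenizer(q, frame_feats, frame_feats)   # (T, 128, d)
        # Flatten all frame-level tokens, then pool them into 32 video-level tokens.
        all_tokens = frame_tokens.reshape(1, -1, frame_feats.shape[-1])       # (1, T*128, d)
        video_tokens, _ = self.temporal_encoder(self.video_queries.unsqueeze(0),
                                                all_tokens, all_tokens)       # (1, 32, d)
        return video_tokens  # fed to the autoregressive LLM alongside the text prompt

# Example: 8 frames of 196 patches each collapse to 32 video tokens.
print(VideoTokenPipeline()(torch.randn(8, 196, 3 * 16 * 16)).shape)  # torch.Size([1, 32, 768])
```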

The temporal encoder is pivotal: it transforms the high-dimensional sequence of frame tokens into a condensed video-level representation. The authors explore several encoder designs, including spatio-temporal attentional pooling (based on TokenLearner) and sequential models (Token Turing Machines).
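
As a loose illustration of the attentional-pooling variant, the sketch below implements a TokenLearner-style pooler that learns one attention map per output token and uses it to weight-pool the flattened spatio-temporal tokens. Layer sizes are illustrative assumptions; the paper's actual module (and its Token Turing Machine alternative) may differ in detail.

```python
# TokenLearner-style spatio-temporal attentional pooling (hedged sketch, not the paper's code).
import torch
import torch.nn as nn

class SpatioTemporalTokenLearner(nn.Module):
    """Pools T*N frame-level tokens into K video-level tokens."""
    def __init__(self, d_model=768, num_out_tokens=32):
        super().__init__()
        self.to_weights = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, num_out_tokens),    # one score per output token
        )

    def forward(self, x):                          # x: (B, T*N, d)
        w = self.to_weights(x).softmax(dim=1)      # normalize over input tokens
        return torch.einsum('bnk,bnd->bkd', w, x)  # (B, K, d) pooled video tokens

pooled = SpatioTemporalTokenLearner()(torch.randn(2, 8 * 128, 768))
print(pooled.shape)  # torch.Size([2, 32, 768])
```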

Key Findings and Contributions

BLIP-3-Video achieves competitive accuracy on video question-answering benchmarks using only 16-32 visual tokens, whereas models such as Tarsier and Video-LLaVA require thousands.

  • Performance: BLIP-3-Video matches state-of-the-art results on benchmarks such as MSVD-QA and NExT-QA despite its 4B parameter count, smaller than many competing models.
  • Efficiency: The reduced visual-token count translates into higher computational efficiency, reflected in measurably higher throughput (samples per second); a rough token-budget comparison follows below.
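
For a rough sense of scale, the abstract's 4608-token baseline versus BLIP-3-Video's 32 tokens amounts to roughly a 144x reduction. Only the two totals come from the paper; the per-frame decomposition below (8 frames at 576 patch tokens each) is an assumption for illustration.

```python
# Back-of-envelope visual-token budget; only the totals 4608 and 32 are from the abstract.
frames, tokens_per_frame = 8, 576
baseline_tokens = frames * tokens_per_frame      # 4608 visual tokens for a typical baseline
blip3_video_tokens = 32                          # BLIP-3-Video's fixed video-level budget
print(baseline_tokens // blip3_video_tokens)     # 144 -> ~144x fewer visual tokens
```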

Implications and Future Directions

Beyond the immediate efficiency gains, this work points toward compact video representations in VLMs, which could yield models that are faster without sacrificing accuracy and are therefore better suited to real-world applications.

From a theoretical perspective, the exploration of different temporal encoders brings valuable insights into effective video abstraction mechanisms, potentially informing future innovations in this space.

Advances in token reduction also align naturally with emerging areas such as edge AI and real-time processing, where computational resources are limited.

Conclusion

The introduction of xGen-MM-Vid (BLIP-3-Video) marks a noteworthy advancement in efficient video representation within multimodal models. While challenges remain in optimizing and applying such methods across diverse datasets and domains, the work lays a robust foundation for future exploration in efficient VLM designs.

Authors (10)
  1. Michael S. Ryoo (75 papers)
  2. Honglu Zhou (21 papers)
  3. Shrikant Kendre (3 papers)
  4. Can Qin (37 papers)
  5. Le Xue (23 papers)
  6. Manli Shu (23 papers)
  7. Silvio Savarese (200 papers)
  8. Ran Xu (89 papers)
  9. Caiming Xiong (337 papers)
  10. Juan Carlos Niebles (95 papers)
Citations (4)