VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
The paper introduces VideoLLM-MoD, an efficient method for handling the computational and memory demands of large vision-language models when processing continuous streaming video. Existing large vision-language models such as GPT-4 and LLaVA typically improve visual understanding by increasing the number of vision tokens, which substantially raises computational cost in dense video-frame scenarios. The paper also identifies the limitations of learnable token-reduction approaches such as Q-Former and Perceiver Resampler, which mitigate these costs but fail to preserve the full context required by the LLM, potentially discarding vital visual cues.
Key Contributions
- Efficiency through Skipping Layers:
- The primary innovation of VideoLLM-MoD is to reduce vision compute by letting redundant vision tokens "skip layers", rather than by decreasing the number of vision tokens.
- The approach is inspired by mixture-of-depths (MoD) LLMs, which dynamically allocate computation by letting individual tokens bypass some layers instead of processing every token at every layer.
- The proposed approach learns to skip computation for a high proportion (e.g., 80%) of vision tokens at each transformer layer, passing them directly to the next layer. This yields significant efficiency gains during training: roughly 42% less time and about 30% less memory.
- Maintaining Context and Performance:
- Unlike direct token dropping or merging, the skip-within-context mechanism keeps all vision tokens in the sequence, preserving contextual integrity and maintaining or even improving model performance (see the sketch after this list).
- Extensive experiments demonstrate the effectiveness of VideoLLM-MoD, showing state-of-the-art results across multiple benchmarks, including narration, forecasting, and summarization tasks on the COIN, Ego4D, and EgoExo4D datasets.
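To make the skip-within-context idea concrete, below is a minimal, illustrative PyTorch sketch, not the authors' code: `SkipWithinContextLayer`, `router`, and `keep_ratio` are hypothetical names, the wrapped block is assumed to map `(B, T, D) -> (B, T, D)`, and details such as positional handling, attention masks, and any auxiliary routing loss are omitted.

```python
import torch
import torch.nn as nn

class SkipWithinContextLayer(nn.Module):
    """Wraps one decoder block: text tokens are always computed, while only the
    top-scoring fraction of vision tokens goes through the block; the rest are
    carried forward unchanged but stay in the sequence for later layers."""

    def __init__(self, block: nn.Module, dim: int, keep_ratio: float = 0.2):
        super().__init__()
        self.block = block                    # assumed: (B, T, D) -> (B, T, D)
        self.router = nn.Linear(dim, 1)       # learned per-token importance score
        self.keep_ratio = keep_ratio

    def forward(self, hidden: torch.Tensor, num_vision: int) -> torch.Tensor:
        # hidden: (batch, num_vision + num_text, dim); vision tokens come first here.
        vision, text = hidden[:, :num_vision], hidden[:, num_vision:]
        b, n, d = vision.shape
        k = max(1, int(n * self.keep_ratio))

        scores = self.router(vision).squeeze(-1)              # (b, n)
        keep = scores.topk(k, dim=-1).indices                 # indices of kept vision tokens
        idx = keep.unsqueeze(-1).expand(-1, -1, d)
        kept_vision = vision.gather(1, idx)                    # (b, k, d)

        # Process the shortened sequence (kept vision + all text) through the block.
        out = self.block(torch.cat([kept_vision, text], dim=1))
        new_kept, new_text = out[:, :k], out[:, k:]

        # Gate by the router score so routing stays differentiable, then scatter
        # the updated tokens back; unselected vision tokens skip this layer.
        gate = torch.sigmoid(scores.gather(1, keep)).unsqueeze(-1)
        new_vision = vision.scatter(1, idx, gate * new_kept + (1 - gate) * kept_vision)
        return torch.cat([new_vision, new_text], dim=1)
```

Because skipped tokens remain in the hidden-state sequence, later layers can still attend to them, which is what distinguishes this from token dropping or merging.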
Methodology
The method integrates three essential components: an image encoder, an MLP projector, and an LLM. Vision tokens are processed selectively through a LayerExpert module, which evaluates the importance of each token within every frame and processes only those deemed essential. The selection employs a top-k strategy that dynamically routes vision tokens according to their learned importance scores, improving computational efficiency without discarding crucial visual information.
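As a rough illustration of the per-frame top-k selection described above, the sketch below scores projected vision tokens with a small linear head and keeps only the highest-scoring fraction of tokens in each frame. The names `projector`, `score_head`, and `keep_ratio` are assumptions for illustration, not the paper's exact modules.

```python
import torch

def select_vision_tokens(frame_patches: torch.Tensor, projector, score_head,
                         keep_ratio: float = 0.2):
    """frame_patches: (num_frames, patches_per_frame, vision_dim) from the image encoder."""
    tokens = projector(frame_patches)               # MLP projector -> (F, P, llm_dim)
    scores = score_head(tokens).squeeze(-1)         # per-token importance, (F, P)
    k = max(1, int(tokens.size(1) * keep_ratio))    # e.g. keep 20% of tokens per frame
    keep_idx = scores.topk(k, dim=-1).indices       # top-k within each frame
    keep_mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(1, keep_idx, True)
    return tokens, keep_mask                        # mask marks tokens that get full compute
```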
Experimental Framework
Datasets and Evaluation Metrics
- Ego4D Narration Stream: Used to generate timely narrations aligned with those of human annotators.
- Ego4D Long-term Action Anticipation (LTA): Predicts future actions based on previous steps.
- EgoExo4D Fine-grained Keystep Recognition: Recognizes key steps in procedural videos using multiple viewpoints.
- COIN Benchmarks: Includes step recognition, step forecasting, task summarization, and procedure forecasting tasks.
The evaluation metrics include language modeling perplexity (LM-PPL), Time Difference (TimeDiff), and Fluency, which measure language modeling quality and temporal alignment in the online setting.
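For concreteness, here are minimal reference implementations of two of these metrics under simple assumptions (mean reductions, timestamps in seconds); the paper's exact token masking and aggregation may differ.

```python
import math

def lm_ppl(token_nlls):
    """Language modeling perplexity: exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def time_diff(predicted_times, reference_times):
    """Mean absolute gap (seconds) between when the model responds and the annotated time."""
    return sum(abs(p - r) for p, r in zip(predicted_times, reference_times)) / len(predicted_times)
```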
Comparative Analysis
Experiments compare VideoLLM-MoD against several baselines:
- VideoLLM-online: Represents each frame with a single CLS token, which proves less effective for detailed visual understanding.
- Full-computation: Processes all vision tokens through every layer, achieving higher performance but at prohibitive computational cost.
- EarlyExit and LayerSkip: Process vision tokens in only a subset of layers, but lose critical visual information.
VideoLLM-MoD demonstrates superior performance by striking a robust balance between computational efficiency and accuracy, especially on complex video tasks; a rough comparison of how these strategies allocate vision compute is sketched below.
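The following back-of-the-envelope sketch, an illustrative assumption rather than the paper's accounting, counts how many vision tokens each strategy pushes through each transformer layer. Only three strategies are modeled, and the layer count, token count, and keep ratio are placeholder values.

```python
def vision_tokens_processed(strategy, num_layers=32, vision_tokens=1000,
                            keep_ratio=0.2, early_exit_layer=8):
    """Total vision-token/layer computations for a given compute-allocation strategy."""
    per_layer = []
    for layer in range(num_layers):
        if strategy == "full":            # every vision token in every layer
            per_layer.append(vision_tokens)
        elif strategy == "early_exit":    # vision tokens only in the first few layers
            per_layer.append(vision_tokens if layer < early_exit_layer else 0)
        elif strategy == "mod":           # a fixed fraction routed through each layer
            per_layer.append(int(vision_tokens * keep_ratio))
    return sum(per_layer)

# e.g. vision_tokens_processed("mod") is ~20% of vision_tokens_processed("full"),
# while "early_exit" concentrates all vision compute in the earliest layers.
```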
Future Implications and Broader Impacts
The methodology and results of this research have significant implications:
- Scalability: VideoLLM-MoD demonstrates an efficient way to scale continuous video processing, which is crucial for real-world applications such as augmented reality and autonomous systems.
- Theoretical Contributions: Integrating mixture-of-depths principles into vision-language models opens new avenues for optimizing deep learning architectures for other tasks involving large-scale temporal data.
- Potential for Further Research: Future research could explore the granularity of token selection and the adaptive mechanisms for different video types and content complexity.
In summary, VideoLLM-MoD presents a substantial advance in efficiently processing large-scale, continuous video with LLMs, balancing computational resource management against high performance. This work paves the way for more resource-efficient, real-time video-language applications, promising significant impact across domains that require sophisticated video understanding.