Apollo: An Exploration of Video Understanding in Large Multimodal Models

Published 13 Dec 2024 in cs.CV and cs.AI | (2412.10360v1)

Abstract: Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and which vision encoders are the best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on Video-MME.

Authors (12)

Summary

The paper explores video understanding challenges in large multimodal models (LMMs), introduces the efficient Apollo family of LMMs, and provides empirical validation for effective design choices.
The study establishes "Scaling Consistency," demonstrating that video-LMM design decisions transfer reliably from small to large models, and proposes ApolloBench for significantly faster video understanding evaluation.
Apollo models achieve state-of-the-art results among similar-sized models, with Apollo-7B scoring 70.9 on MLVU and 63.3 on Video-MME, showcasing efficient performance through optimized architecture and training.

Overview

The paper presents a comprehensive investigation into video understanding within large multimodal models (LMMs). It identifies two key obstacles: the high computational cost of training and evaluating video-LMMs, and the fact that many design choices in this area are made without justification or analysis. The authors systematically explore the video modeling design space and introduce Apollo, a family of LMMs that applies the resulting empirical findings to achieve efficient scaling and strong video understanding.

Core Contributions

Systematic Exploration of the Design Space:

The study examines several dimensions of video-LMM design, including video sampling strategies, vision encoding, token resampling, and token integration. It finds that design choices validated on smaller models and datasets transfer reliably to larger models, a phenomenon the authors term Scaling Consistency. This observation lowers experimental overhead: design decisions can be explored with fewer parameters and fewer training samples, and the conclusions still hold at larger scale.

ApolloBench and Evaluation Efficiency:

The paper proposes ApolloBench, a curated benchmark subset that reduces evaluation time by 41× relative to standard procedures. This benchmark is particularly tailored to assess temporal reasoning and video perception tasks, discarding queries that do not require video understanding. The benchmark serves as a robust and efficient tool for comparative evaluation across different video-LMM architectures.
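The filtering idea can be pictured with the minimal Python sketch below. It assumes each question records the fraction of models that answer it correctly with no video (text only) and from a single frame. The `Question` fields, the `requires_video` helper, and the 0.5 threshold are illustrative assumptions, not the paper's published procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    qid: str
    text_only_acc: float     # assumed: share of models correct without any video
    single_frame_acc: float  # assumed: share of models correct from one frame


def requires_video(q: Question, threshold: float = 0.5) -> bool:
    """Keep a question only if it is hard to answer without watching the video."""
    return q.text_only_acc <= threshold and q.single_frame_acc <= threshold


def filter_benchmark(questions: List[Question], threshold: float = 0.5) -> List[Question]:
    """Drop questions that can be solved from text or a single frame."""
    return [q for q in questions if requires_video(q, threshold)]
```

A smaller question set that still requires temporal information is what would make evaluation cheaper without losing the signal of interest.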

Introduction of the Apollo Family of LMMs:

The authors introduce two specific models: Apollo-3B and Apollo-7B. Compared with existing systems, Apollo models demonstrate superior performance on several benchmarks:

Apollo-3B outperforms most contemporary 7B-scale models—achieving a LongVideoBench score of 55.1.
Apollo-7B scores 70.9 on MLVU and 63.3 on Video-MME, which makes it state-of-the-art among 7B LMMs.

Key Technical Approaches

Scaling Consistency:

The investigation shows that once a model reaches a critical size (approximately 2–4B parameters), its ranking of design decisions carries over to larger models. This is quantified by an R² > 0.9 when correlating the performance of design variants on the smaller model with their performance on larger models. The paper also finds that performance plateaus at around 500K training samples, so design experiments can use datasets of roughly that size, which reduces compute expenditure.
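The correlation check behind Scaling Consistency can be sketched in Python as follows. The sketch assumes that each design variant has one benchmark score on a small model and one on a large model, and it computes R² as the squared Pearson correlation, which equals the R² of a simple linear fit. The function name and the example scores are hypothetical placeholders, not values from the paper.

```python
from math import fsum
from typing import Sequence


def r_squared(small_scores: Sequence[float], large_scores: Sequence[float]) -> float:
    """Squared Pearson correlation between paired small-model and large-model scores."""
    if len(small_scores) != len(large_scores) or len(small_scores) < 2:
        raise ValueError("need two equally long lists with at least two entries")
    n = len(small_scores)
    mean_x = fsum(small_scores) / n
    mean_y = fsum(large_scores) / n
    sxy = fsum((x - mean_x) * (y - mean_y) for x, y in zip(small_scores, large_scores))
    sxx = fsum((x - mean_x) ** 2 for x in small_scores)
    syy = fsum((y - mean_y) ** 2 for y in large_scores)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary across design variants")
    return (sxy * sxy) / (sxx * syy)


# Made-up placeholder scores for four design variants (not results from the paper).
small = [50.0, 52.0, 55.0, 53.0]
large = [58.0, 60.5, 63.0, 61.0]
consistency = r_squared(small, large)
```

A value close to 1 would indicate that the small model ranks and separates the design variants in the same way as the large one.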

Video Sampling Strategies:

The study contrasts fps sampling with uniform frame sampling and finds that fps sampling during training leads to markedly better performance. There is also a trade-off between tokens per second (tps) and frames per second (fps), with the best configurations found at 8–32 tokens per frame. These results inform how videos should be temporally down-sampled for video understanding tasks.
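The difference between the two sampling schemes can be illustrated with a short Python sketch. It assumes a video is described only by its frame count and native frame rate; the function names are hypothetical, and the relation tps = fps × tokens per frame is used as a simplifying assumption.

```python
from typing import List


def uniform_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick a fixed number of frames spread evenly over the whole video."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def fps_sample_indices(num_frames: int, native_fps: float, target_fps: float) -> List[int]:
    """Pick frames at a fixed rate, so the time gap does not depend on video length."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    duration_s = num_frames / native_fps
    num_samples = int(duration_s * target_fps)
    return [min(num_frames - 1, round(i * native_fps / target_fps))
            for i in range(num_samples)]


def tokens_per_second(fps: float, tokens_per_frame: int) -> float:
    """Token budget per second of video (assumed relation: fps * tokens per frame)."""
    return fps * tokens_per_frame
```

With uniform sampling, a 10-second clip and a 100-second clip that are both reduced to 8 frames end up with gaps of 1.25 s and 12.5 s between frames. With fps sampling the gap stays at 1 / `target_fps` for both, so the model sees motion at a consistent speed.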

Architecture and Encoding Choices:

Vision Encoders:

Empirical results indicate that language-supervised vision encoders, notably SigLIP-SO400M, outperform their self-supervised counterparts. Combining SigLIP-SO400M with a video encoder such as InternVideo2 yields more robust video representations than either encoder alone.

Token Resampling and Integration:

The authors highlight the effectiveness of Perceiver-based resampling for reducing the number of tokens per frame. Additionally, inserting auxiliary tokens (e.g., textual or learned tokens) between the video tokens of different frames or clips facilitates better temporal alignment without incurring significant computational overhead.
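Token integration with textual separators can be sketched in Python as below. The sketch assumes the video has already been encoded into per-clip token lists of equal duration and that a plain-text timestamp is placed before each clip. The function names and the exact timestamp format are illustrative assumptions, not necessarily the format used by Apollo.

```python
from typing import List, Sequence


def format_mmss(seconds: int) -> str:
    """Render a second count as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def interleave_with_timestamps(clip_tokens: Sequence[Sequence[str]],
                               clip_seconds: int) -> List[str]:
    """Place a textual timestamp marker before the tokens of every clip."""
    sequence: List[str] = []
    for i, tokens in enumerate(clip_tokens):
        start = i * clip_seconds
        end = start + clip_seconds
        # Hypothetical separator text; a learned token could be used instead.
        sequence.append(f"clip from {format_mmss(start)}-{format_mmss(end)}:")
        sequence.extend(tokens)
    return sequence
```

For two 4-second clips this produces one marker per clip followed by that clip's tokens, so the added sequence length grows with the number of clips rather than the number of video tokens.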

Training Schedules and Data Composition:

A multi-stage training regimen is presented, in which model components are progressively unfrozen. Fine-tuning the video encoders exclusively on video data, rather than on a mixed dataset, improves performance on specialized reasoning and domain-specific tasks. The best data mixture slightly favors video data while retaining a modest proportion of text data to support multimodal integration.
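A progressive-unfreezing schedule of this kind can be written down as a small Python configuration. The stage names, the component names (`vision_encoder`, `connector`, `llm`), and the data assigned to each stage are hypothetical placeholders chosen for illustration; this is not the paper's exact recipe.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

COMPONENTS: Tuple[str, ...] = ("vision_encoder", "connector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: FrozenSet[str]  # components that receive gradient updates
    data: Tuple[str, ...]      # modalities used in this stage


# Hypothetical three-stage schedule; the middle stage uses video data only.
SCHEDULE: List[Stage] = [
    Stage("connector_alignment", frozenset({"connector"}), ("image", "video")),
    Stage("encoder_finetune", frozenset({"connector", "vision_encoder"}), ("video",)),
    Stage("full_finetune", frozenset(COMPONENTS), ("text", "image", "video")),
]


def frozen_components(stage: Stage) -> List[str]:
    """Components whose weights stay fixed during the given stage."""
    return [c for c in COMPONENTS if c not in stage.trainable]


def is_progressive(schedule: List[Stage]) -> bool:
    """True if every stage trains at least what the previous stage trained."""
    return all(a.trainable <= b.trainable for a, b in zip(schedule, schedule[1:]))
```

Keeping the schedule as data makes it straightforward to check the unfreezing order before launching a run, for example with `is_progressive(SCHEDULE)`.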

Experimental Findings and Performance Metrics

The empirical evaluation establishes the efficiency and scalability of the Apollo models:

Apollo-3B:

Achieves a score of 55.1 on LongVideoBench and performs competitively on additional benchmarks such as Video-MME and ApolloBench.

Apollo-7B:

Records state-of-the-art performance among 7B models with a score of 70.9 on MLVU and 63.3 on Video-MME, demonstrating that careful architectural decisions can yield performance benefits comparable to much larger models.

The study underscores that the unified architecture performs on par with or slightly better than split architectural designs, providing both parameter and computational efficiency. Moreover, the results validate that training decisions derived from smaller scale experiments can be robustly transferred to larger models without significant loss in predictive capability.

Implications for Future Work

The findings from this paper have several practical implications:

Efficient Experimentation:

The establishment of scaling consistency allows researchers to minimize computational costs by conducting initial explorations on smaller models before scaling up.

Benchmarking:

ApolloBench represents a significant step forward in creating evaluation protocols that reduce compute time while rigorously testing temporal reasoning in video-LMM settings.

Design Guidelines:

Practitioners can adopt the fps-based sampling, optimal token management, and unified architectures to build more efficient and effective video understanding systems.

Overall, the paper provides a detailed road map for the design and training of state-of-the-art video-LMMs, balancing rigorous empirical analysis with practical engineering trade-offs. These insights help advance video understanding in multimodal settings while keeping resource requirements manageable.
