- The paper explores video understanding challenges in large multimodal models (LMMs), introduces the efficient Apollo family of LMMs, and provides empirical validation for effective design choices.
- The study establishes "Scaling Consistency," demonstrating that video-LMM design decisions transfer reliably from small to large models, and proposes ApolloBench for significantly faster video understanding evaluation.
- Apollo models achieve state-of-the-art results among similar-sized models, with Apollo-7B scoring 70.9 on MLVU and 63.3 on Video-MME, showcasing efficient performance through optimized architecture and training.
Overview
The paper presents a comprehensive investigation into video understanding within large multimodal models (LMMs). It identifies two key challenges: the high computational cost of working with video-LMMs and the limited understanding of which design choices are actually effective. The authors systematically explore the video modeling design space and introduce Apollo, a family of LMMs that applies these empirical findings to scale efficiently and deliver strong video understanding.
Core Contributions
Systematic Exploration of the Design Space:
The study rigorously examines several dimensions of video-LMM design, including video sampling strategies, vision encoding, token resampling, and token integration. The work confirms that design choices validated on smaller-scale models and datasets transfer reliably to larger models, a phenomenon the authors term Scaling Consistency. This observation significantly lowers experimental overhead: design choices can be compared using fewer parameters and fewer samples, and the conclusions still extrapolate to larger models.
ApolloBench and Evaluation Efficiency:
The paper proposes ApolloBench, a curated benchmark subset that reduces evaluation time by 41× relative to standard procedures. This benchmark is particularly tailored to assess temporal reasoning and video perception tasks, discarding queries that do not require video understanding. The benchmark serves as a robust and efficient tool for comparative evaluation across different video-LMM architectures.
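The filtering idea can be pictured with a short sketch. Everything specific here is an assumption made for illustration: the `Question` record, the `text_only_accuracy` field, and the 0.5 threshold are hypothetical and are not taken from the paper, whose exact selection criterion may differ.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    # Hypothetical field: fraction of evaluated models that answered
    # correctly WITHOUT access to the video.
    text_only_accuracy: float


def requires_video(question, threshold=0.5):
    """Keep a question only if it is hard to answer from text alone."""
    return question.text_only_accuracy < threshold


def filter_benchmark(questions, threshold=0.5):
    """Drop questions that do not appear to need video understanding."""
    return [q for q in questions if requires_video(q, threshold)]


# Illustrative data only, not real benchmark statistics.
questions = [
    Question("q1", 0.90),  # answerable from the text prompt alone
    Question("q2", 0.20),
    Question("q3", 0.55),
    Question("q4", 0.10),
]
kept = filter_benchmark(questions)
print([q.qid for q in kept])
```

A smaller question set that still depends on the video is what makes faster evaluation possible without giving up the temporal reasoning signal.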
Introduction of the Apollo Family of LMMs:
The authors introduce two specific models: Apollo-3B and Apollo-7B. Compared with existing systems, Apollo models demonstrate superior performance on several benchmarks:
- Apollo-3B outperforms most contemporary 7B-scale models, achieving a LongVideoBench score of 55.1.
- Apollo-7B scores 70.9 on MLVU and 63.3 on Video-MME, which is state-of-the-art among 7B-scale models and comparable to models with significantly higher parameter counts.
Key Technical Approaches
Scaling Consistency:
The investigation demonstrates that once a model reaches a critical size (approximately 2–4B parameters), design decisions tend to be robust across scales. This is quantified by an R² > 0.9 when correlating performance between small and large models. The paper also finds that performance plateaus around 500K samples, a figure that helps size experiments efficiently and reduce compute expenditure.
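As a rough illustration of how such a correlation could be checked, the sketch below computes R² (the squared Pearson correlation) between the scores that the same set of design variants obtains on a small and on a large model. The scores are invented for the example and the function name `r_squared` is hypothetical; this is not the authors' evaluation code.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must not be constant")
    return (cov * cov) / (var_x * var_y)


# Hypothetical benchmark scores for four design variants (A, B, C, D),
# each trained once at a small scale and once at a large scale.
small_model_scores = [50.0, 52.0, 54.0, 56.0]
large_model_scores = [60.0, 61.5, 64.5, 66.0]

r2 = r_squared(small_model_scores, large_model_scores)
print(f"R^2 = {r2:.2f}")
```

A high R² means the ranking and spacing of variants on the small model predict those on the large model, which is the property that makes small-scale ablations informative.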
Video Sampling Strategies:
The study contrasts fps sampling with uniform frame sampling and finds that fps-based approaches lead to markedly improved performance. Additionally, there is a significant trade-off between tokens per second (tps) and frames per second (fps), with optimal configurations found at 8–32 tokens per frame. The nuanced exploration of sampling methods provides insights for dynamic temporal down-sampling techniques in video understanding tasks.
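The difference between the two sampling schemes can be sketched in a few lines. This is a minimal illustration under assumed names (`uniform_sample`, `fps_sample`, `tokens_per_second` are hypothetical helpers, and the frame rates and frame counts are example values), not the paper's data loader.

```python
def uniform_sample(num_frames, n):
    """Pick n frame indices spread evenly over the clip, whatever its length."""
    if n >= num_frames:
        return list(range(num_frames))
    step = num_frames / n
    return [int(step * i + step / 2) for i in range(n)]


def fps_sample(num_frames, native_fps, target_fps):
    """Pick frame indices at a fixed temporal rate; count grows with duration."""
    stride = max(native_fps / target_fps, 1.0)  # never sample faster than native
    n = int(num_frames / stride)
    return [min(int(i * stride), num_frames - 1) for i in range(n)]


def tokens_per_second(tokens_per_frame, fps):
    """Token budget per second of video implied by a sampling configuration."""
    return tokens_per_frame * fps


# A 10 s clip and a 30 s clip, both recorded at 30 fps.
short_clip, long_clip = 300, 900

# Uniform sampling: always 8 frames, so the time gap between frames
# stretches as the video gets longer.
print(len(uniform_sample(short_clip, 8)), len(uniform_sample(long_clip, 8)))

# fps sampling at 2 fps: the gap between frames stays at 0.5 s and the
# number of frames scales with duration instead.
print(len(fps_sample(short_clip, 30, 2)), len(fps_sample(long_clip, 30, 2)))

# Example budget: 16 tokens per frame at 2 fps.
print(tokens_per_second(16, 2))
```

With uniform sampling the effective frame rate depends on video length, whereas fps sampling keeps the temporal spacing fixed; the tokens-per-frame setting then determines how much of the token budget each sampled frame consumes.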
Architecture and Encoding Choices:
Empirical results indicate that language-supervised vision encoders, notably SigLIP-SO400M, outperform their self-supervised counterparts. Combining SigLIP-SO400M with a video encoder such as InternVideo2 yields robust video representations.
Token Resampling and Integration:
The authors highlight the effectiveness of Perceiver-based resampling mechanisms when reducing frame-specific tokens. Additionally, the strategic incorporation of auxiliary tokens (e.g., textual or learned tokens) between video tokens facilitates better temporal alignment without incurring significant computational overhead.
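The token integration step can be sketched as follows, assuming the auxiliary tokens are textual timestamp markers. The marker format `<t=0s-2s>`, the function name, and the placeholder token strings are all hypothetical; the sketch covers only the interleaving, not the Perceiver-based resampler that would produce the per-clip tokens.

```python
def interleave_with_timestamps(clip_tokens, clip_seconds):
    """Insert one textual timestamp marker ahead of each clip's video tokens.

    clip_tokens: list of per-clip token lists, in temporal order.
    clip_seconds: duration covered by each clip.
    """
    sequence = []
    for i, tokens in enumerate(clip_tokens):
        start = i * clip_seconds
        end = start + clip_seconds
        # Hypothetical marker format; a real model would tokenize this text.
        sequence.append(f"<t={start:.0f}s-{end:.0f}s>")
        sequence.extend(tokens)
    return sequence


# Three 2-second clips, each already reduced to two placeholder tokens.
clips = [["v0", "v1"], ["v2", "v3"], ["v4", "v5"]]
sequence = interleave_with_timestamps(clips, clip_seconds=2)
print(sequence)
```

The overhead is one marker per clip, so it stays small relative to the video tokens whenever each clip contributes more than a handful of tokens.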
Training Schedules and Data Composition:
A multi-stage training regimen is presented, in which model components are progressively unfrozen. Fine-tuning the video encoders exclusively on video data, rather than on a mixed dataset, enhances performance on specialized reasoning and domain-specific tasks. The optimal data mixture was found to slightly favor video data while retaining a modest proportion of text data to support multimodal integration.
The empirical evaluation establishes the efficiency and scalability of the Apollo models:
- Apollo-3B achieves a score of 55.1 on LongVideoBench and performs competitively on additional benchmarks such as Video-MME and ApolloBench.
- Apollo-7B records state-of-the-art performance among 7B models with a score of 70.9 on MLVU and 63.3 on Video-MME, demonstrating that careful architectural decisions can yield performance comparable to much larger models.
The study underscores that the unified architecture performs on par with or slightly better than split architectural designs, providing both parameter and computational efficiency. Moreover, the results validate that training decisions derived from smaller scale experiments can be robustly transferred to larger models without significant loss in predictive capability.
Implications for Future Work
The findings from this paper have several practical implications:
- Efficient Experimentation: The establishment of scaling consistency allows researchers to minimize computational costs by conducting initial explorations on smaller models before scaling up.
- ApolloBench represents a significant step forward in creating evaluation protocols that reduce compute time while rigorously testing temporal reasoning in video-LMM settings.
- Practitioners can adopt fps-based sampling, optimal token management, and unified architectures to build more efficient and effective video understanding systems.
Overall, the paper provides a detailed road map for the design and training of state-of-the-art video-LMMs, balancing rigorous empirical analysis with practical engineering trade-offs. This level of insight is essential for advancing video understanding in complex multimodal settings while managing resource constraints.