- The paper presents MC-ViT, which repurposes pretrained video transformers using non-parametric memory consolidation to process long video sequences.
- It achieves a tenfold reduction in memory and computation by compressing past activations, enabling extended temporal comprehension.
- MC-ViT sets a new state of the art in fine-grained action recognition and video question answering, outperforming much larger models.
Overview
The paper "Memory Consolidation Enables Long-Context Video Understanding" introduces a novel approach to video processing, addressing the limitations of current transformer-based video encoders. These encoders typically struggle with efficiently managing long temporal contexts due to their quadratic complexity in relation to sequence length. The proposed memory-consolidated vision transformer (MC-ViT) leverages non-parametric memory consolidation to effectively extend the temporal comprehension of pretrained video transformers without necessitating architectural modifications.
Key Contributions
Simplifying Long-context Video Understanding
A highlight of MC-ViT is that it repurposes existing pretrained video transformers for long-context understanding simply by fine-tuning them to attend to memories derived from past activations. This approach is considerably simpler, both conceptually and computationally, than previous attempts to extend the temporal capabilities of video transformers.
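The mechanism can be pictured as ordinary self-attention over the current segment whose keys and values are extended with consolidated memory tokens. The following is a minimal PyTorch-style sketch of that idea; the class name, shapes, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    """Self-attention over the current segment whose keys and values are
    extended with consolidated memories of past activations.
    Illustrative sketch only; names and shapes are assumptions."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x:      (batch, segment_tokens, dim) activations of the current segment
        # memory: (batch, memory_tokens, dim)  consolidated past activations
        kv = torch.cat([memory, x], dim=1)  # queries come only from the current segment
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out
```

Because the memory only extends the keys and values, the pretrained attention weights can be reused as-is, which is what makes fine-tuning rather than retraining sufficient.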
Improving Efficiency and Expressivity
MC-ViT strikes an appealing trade-off between computational complexity and model expressivity: it outperforms both standard video transformers and their efficient approximations while requiring roughly 10× less memory and computation to learn from longer videos.
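As a rough illustration of where the savings come from (the notation below is an assumption made for this summary, not the paper's): split a video into $S$ segments of $k$ tokens each, and let each segment attend to at most $m$ consolidated memory tokens, with $m \ll Sk$. Then

$$
\underbrace{\mathcal{O}\big((S k)^2\big)}_{\text{joint attention over the full video}}
\quad \text{vs.} \quad
\underbrace{\mathcal{O}\big(S\,k\,(k + m)\big)}_{\text{segment-wise attention with consolidated memory}}
$$

so the attention cost grows roughly linearly rather than quadratically with video length.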
MC-ViT establishes a new state of the art across several long-context video understanding tasks. Notably, it achieves state-of-the-art performance in fine-grained action recognition (Diving48) and video question answering (EgoSchema and Perception Test), even surpassing methods that use considerably more parameters.
Competitive with Large-scale Models
Despite its small, standard, and open architecture, MC-ViT compares favorably with large-scale proprietary systems such as GPT-4V and Bard. This is impressive given that those systems rely on far more parameters and extensive pretraining on diverse internet data.
Methodology
Efficient Handling of Long Sequences
MC-ViT's core innovation lies in its memory consolidation strategy, which lets it process long video sequences efficiently: the video is handled segment by segment, and each segment attends to a compact memory of past activations. Through non-parametric mechanisms inspired by insights from psychology and neuroscience, MC-ViT compresses these memories by an order of magnitude, enabling extended context comprehension while keeping complexity bounded.
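In a streaming setting, this can be pictured as the loop below. The function names (`process_long_video`, `encode_segment`, `consolidate`) and the exact memory update are illustrative assumptions rather than the paper's procedure.

```python
import torch

def process_long_video(segments, encode_segment, consolidate, dim=768):
    """Stream a long video through a short-context encoder with a compact,
    non-parametric memory. Illustrative sketch: `encode_segment` stands in
    for the pretrained transformer applied to one segment (attending to the
    memory), and `consolidate` for the chosen consolidation rule."""
    memory = torch.zeros(0, dim)              # consolidated memory, starts empty
    outputs = []
    for segment in segments:                  # segment: (tokens, dim) patch embeddings
        activations = encode_segment(segment, memory)
        outputs.append(activations)
        # Compress the segment's activations (e.g. to a few representatives)
        # before appending, so the memory grows far more slowly than the raw
        # token count of the video.
        memory = torch.cat([memory, consolidate(activations)], dim=0)
    return torch.cat(outputs, dim=0)
```

The key design point is that consolidation is non-parametric: no new weights are introduced, so the pretrained encoder can be fine-tuned to use the memory directly.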
Non-Parametric Memory Consolidation
The paper explores three instantiations of memory consolidation, based on random selection, coreset selection, and k-means clustering, each of which compresses past activations into a compact set of memories. This consolidation is integral to MC-ViT's ability to generalize from short training sequences to much longer ones while substantially reducing the required computational resources.
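For concreteness, here is a hedged sketch of two such consolidation rules in plain PyTorch. The function names and the simple Lloyd-iteration k-means are illustrative, not the authors' code.

```python
import torch

def consolidate_random(activations: torch.Tensor, k: int) -> torch.Tensor:
    """Keep k randomly chosen activations as memories (random variant)."""
    idx = torch.randperm(activations.shape[0])[:k]
    return activations[idx]

def consolidate_kmeans(activations: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Summarize activations with k centroids obtained from a few Lloyd
    iterations (in the spirit of the k-means variant)."""
    centroids = activations[torch.randperm(activations.shape[0])[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(activations, centroids).argmin(dim=1)  # nearest centroid per token
        for j in range(k):
            members = activations[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(dim=0)
    return centroids
```

Either rule reduces a segment's activations to a handful of memory tokens, which is where the order-of-magnitude compression comes from.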
Conclusion
"Memory Consolidation Enables Long-Context Video Understanding" presents a compelling advancement in video processing, addressing the critical challenge of efficiently managing long temporal sequences without the need for complex architectural enhancements or extensive computational resources. MC-ViT not only sets new performance standards across various benchmarks but also highlights the potential of memory consolidation as a powerful tool for future research in video understanding and beyond.