AuroraLong: Scalable Long Video Understanding
- AuroraLong is a large multimodal model for long video understanding that combines a linear RNN backbone with visual token merging and reordering techniques.
- It replaces transformer-based approaches with a constant-memory RWKV-style architecture, achieving linear time and constant memory complexity with respect to sequence length.
- AuroraLong matches or exceeds transformer models on short and long video tasks while significantly reducing computational resources.
AuroraLong is a large multimodal model designed specifically for long video understanding by innovatively combining a scalable, linear recurrent neural network (RNN) backbone with visual token merging and reordering techniques. By departing from transformer-based approaches—which face prohibitive quadratic memory and computational costs on extended video sequences—AuroraLong enables efficient, open-ended reasoning over arbitrarily long video content while retaining or exceeding the accuracy of contemporary transformer models of similar or even vastly larger sizes (2507.02591).
1. Motivation and Architectural Overview
The core challenge in long video understanding stems from the rapid growth in sequence length: standard transformer-based LLM architectures incur $O(n^2)$ time and memory complexity in the sequence length $n$, making processing of thousands of video frames infeasible on common hardware. AuroraLong addresses this by substituting the transformer LLM with a linear RNN LLM (an RWKV-style model, specifically the RWKV-6 "Finch" variant), which maintains a constant-size hidden state, yielding $O(1)$ memory and $O(n)$ computation per sequence.
AuroraLong follows the LLaVA paradigm: a strong visual encoder is paired with an LLM via a multimodal connector. The visual encoder is SigLIP, a Vision Transformer variant that uses large image patches to extract spatial features from frames efficiently. The visual tokens from all video frames are merged, reordered, and then fed into the RNN-based LLM for multimodal understanding and generation.
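A minimal sketch of this data flow is shown below. All modules, names, and dimensions are placeholders chosen for illustration, not the released AuroraLong code:

```python
import torch
import torch.nn as nn

class AuroraLongPipelineSketch(nn.Module):
    """Illustrative LLaVA-style flow: vision encoder -> token merge/reorder -> projector -> linear-RNN LLM.
    Every submodule here is a simple stand-in, not the actual AuroraLong implementation."""

    def __init__(self, patch_dim: int = 3 * 14 * 14, d_vis: int = 64, d_llm: int = 128):
        super().__init__()
        self.vision_encoder = nn.Linear(patch_dim, d_vis)  # stand-in for the SigLIP patch embedder
        self.projector = nn.Linear(d_vis, d_llm)           # multimodal connector into the LLM space
        self.rnn_llm = nn.RNN(d_llm, d_llm)                # stand-in for the RWKV-style linear RNN LLM

    def forward(self, patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # patches: (num_frames * patches_per_frame, patch_dim); text_embeds: (T_text, d_llm)
        vis = self.vision_encoder(patches)                 # one token per image patch
        # Token merging and ascending-size reordering (Section 3) would shrink `vis` here.
        seq = torch.cat([self.projector(vis), text_embeds], dim=0)  # visual tokens first, then text
        out, _ = self.rnn_llm(seq.unsqueeze(1))            # recurrent pass; hidden state size is fixed
        return out.squeeze(1)

# Usage: 8 frames x 4 patches each, plus a 5-token text prompt.
model = AuroraLongPipelineSketch()
print(model(torch.randn(32, 3 * 14 * 14), torch.randn(5, 128)).shape)  # torch.Size([37, 128])
```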
2. Linear RNN Model Design and Recurrence
AuroraLong utilizes the RWKV architecture, which fundamentally "reinvents" the RNN for language modeling. An RWKV block maintains a matrix-valued state that is updated linearly at each step, schematically

$$S_t = \operatorname{diag}(w_t)\,S_{t-1} + k_t^{\top} v_t,$$

where $k_t$ and $v_t$ are the per-token key and value vectors, $w_t$ is a learnable decay controlling memory persistence, and the output at token $t$ is read out from the current state and the token's own (receptance) projection, e.g. $o_t = r_t S_t$.
The essential property is that per-token computation and the recurrent state size are constant with respect to input length: each token's output depends solely on a fixed-size recurrent summary of the past. This allows AuroraLong to process sequences orders of magnitude longer than transformer-based models within the same GPU memory budget, eliminating the need for truncation, sliding-window inference, or memory-augmented mechanisms.
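The constant-state property can be illustrated with a minimal, non-optimized sketch of an RWKV-style linear recurrence. The real RWKV-6 kernel adds gating, a "bonus" term, and per-head structure, so this is a simplification:

```python
import torch

def linear_rnn_scan(keys, values, receptance, decay):
    """Simplified RWKV-style linear recurrence: the running state S is a fixed (d, d) matrix,
    so memory is constant and compute is linear in the sequence length T."""
    T, d = keys.shape
    state = torch.zeros(d, d)
    outputs = []
    for t in range(T):
        # Decay the previous state, then accumulate the outer product of the current key and value.
        state = decay[t].unsqueeze(1) * state + torch.outer(keys[t], values[t])
        # Read out with the receptance vector (real RWKV adds gating and normalization).
        outputs.append(receptance[t] @ state)
    return torch.stack(outputs)  # (T, d)

# The state stays (d, d) whether T is 16 or 16,000 -- no KV cache growth.
T, d = 16, 8
out = linear_rnn_scan(torch.randn(T, d), torch.randn(T, d),
                      torch.randn(T, d), torch.sigmoid(torch.randn(T, d)))
print(out.shape)  # torch.Size([16, 8])
```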
3. Visual Token Merging and Reordering
A critical innovation in AuroraLong is the large-scale merging and reordering of visual tokens, which are generated by SigLIP for every spatial patch of every frame:
- Token Merging: Inspired by ToMe, after extracting visual embeddings from each frame, the tokens are partitioned into two interleaved groups, $\mathcal{A}$ and $\mathcal{B}$. Cosine similarity is computed between each token in $\mathcal{A}$ and all tokens in $\mathcal{B}$, and the most similar pairs are merged by a weighted average of their representations. Each resulting token tracks the number of input patches it now covers (its "token size" $s$).
- Sorted Reordering: All merged tokens are sorted in ascending order by their token size $s$. The rationale is that tokens representing "finer" spatial details (unmerged or lightly merged, small $s$) are placed early in the sequence. This ordering leverages the unidirectional, position-implicit encoding property of RNNs, ensuring salient spatial information is accessible at the start of processing; both steps are illustrated in the sketch after this list.
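The sketch below shows one way to implement the bipartite merging and ascending-size reordering described above. The interleaved partition, the merge count `r`, and the helper name are assumptions made for illustration rather than the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def merge_and_reorder(tokens: torch.Tensor, sizes: torch.Tensor, r: int):
    """ToMe-style bipartite merging followed by ascending token-size sorting (illustrative sketch).

    tokens: (N, d) visual tokens; sizes: (N,) float patch counts per token; r: number of pairs to merge.
    """
    # Partition into two interleaved groups A (even indices) and B (odd indices).
    a, b = tokens[0::2].clone(), tokens[1::2].clone()
    sa, sb = sizes[0::2].clone(), sizes[1::2].clone()

    # Cosine similarity between every token in A and every token in B; each A token picks its best match.
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T        # (|A|, |B|)
    best_sim, best_idx = sim.max(dim=-1)
    merged_a = best_sim.topk(min(r, a.shape[0])).indices.tolist()  # A tokens to merge away

    # Size-weighted average of each selected A token into its matched B token.
    for i in merged_a:
        j = best_idx[i].item()
        total = sa[i] + sb[j]
        b[j] = (a[i] * sa[i] + b[j] * sb[j]) / total
        sb[j] = total

    keep = [i for i in range(a.shape[0]) if i not in set(merged_a)]
    out = torch.cat([a[keep], b], dim=0)
    out_sizes = torch.cat([sa[keep], sb], dim=0)

    # Ascending sort by token size: fine-grained (lightly merged) tokens come first for the RNN.
    order = torch.argsort(out_sizes)
    return out[order], out_sizes[order]

# Usage: 16 tokens of width 8, all starting with size 1; merge 4 pairs.
toks, szs = merge_and_reorder(torch.randn(16, 8), torch.ones(16), r=4)
print(toks.shape, szs[:6])  # torch.Size([12, 8]); smallest token sizes come first
```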
The vision encoder's attention is then modified to account for token coarseness, following ToMe's proportional attention:

$$A = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + \log s\right),$$

where $d$ is the model dimension and $s$ is the per-key token size. This adjustment lets attention account for the number of patches each merged token represents.
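A minimal sketch of this size-aware attention, under the assumption that it takes ToMe's proportional-attention form:

```python
import torch

def proportional_attention(q, k, v, sizes):
    """Single-head attention with ToMe-style proportional weighting (illustrative sketch).

    q, k, v: (N, d) projections of the merged tokens; sizes: (N,) patch counts per token.
    Adding log(sizes) to the logits makes each key count in proportion to the patches it absorbed."""
    d = q.shape[-1]
    logits = (q @ k.T) / d**0.5 + torch.log(sizes.float())  # size term broadcast over the key axis
    return torch.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(12, 8)
print(proportional_attention(q, k, v, torch.ones(12)).shape)  # torch.Size([12, 8])
```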
4. Performance and Benchmarking
With a modest 2B parameters, trained on public datasets only, AuroraLong matches or exceeds the performance of similarly sized and even much larger transformer-based video LLMs on both short and long video tasks:
- Short Video Tasks: On benchmarks such as VDC, ANet, and VATEX, AuroraLong matches or outperforms models like Gemini-1.5-Pro.
- Long Video/Context Tasks: On MLVU and MovieChat-1K—requiring understanding of long, multi-part videos—AuroraLong achieves higher mean and detail scores while using 48 frames and a 4k token context. Competing transformers cannot practically scale to such context lengths due to memory explosion.
- Resource Efficiency: AuroraLong can process 10,000+ video frames on a single 24 GB GPU without auxiliary memory tricks, a scenario infeasible for transformer architectures due to quadratic scaling; see the back-of-envelope comparison after this list.
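To make the resource contrast concrete, the sketch below compares a transformer KV cache against a fixed-size recurrent state. The layer count, heads, head size, and tokens-per-frame are purely illustrative assumptions, not AuroraLong's or any competitor's actual configuration:

```python
def kv_cache_bytes(n_tokens, n_layers=24, n_heads=16, head_dim=128, bytes_per=2):
    """A transformer's KV cache grows linearly with context length (and attention compute quadratically)."""
    return n_tokens * n_layers * 2 * n_heads * head_dim * bytes_per  # keys + values, fp16

def rnn_state_bytes(n_layers=24, n_heads=16, head_dim=128, bytes_per=2):
    """An RWKV-style recurrent state is a fixed-size matrix per head and layer, independent of length."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per

for frames in (100, 1_000, 10_000):
    tokens = frames * 64  # assume ~64 merged visual tokens per frame (illustrative)
    print(f"{frames:>6} frames: KV cache ~ {kv_cache_bytes(tokens) / 2**30:6.1f} GiB | "
          f"RNN state ~ {rnn_state_bytes() / 2**30:.2f} GiB (constant)")
```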
A summary table:
| Model | Architecture | Max Context (frames) | GPU Memory | Short Video | Long Video |
|---|---|---|---|---|---|
| AuroraLong | Linear RNN | 10,000+ | Low (single 24 GB GPU) | SOTA / practical | SOTA / practical |
| Gemini 1.5 Pro | Transformer | ~4,000 | High | SOTA | Not scalable |
5. Architectural Implications and Design Considerations
AuroraLong’s efficiency arises from:
- Constant memory and linear inference time per token due to the linear-RNN backbone.
- Ability to handle arbitrarily long sequences without windowing, maintaining global temporal coherence throughout the video.
- No requirement for explicit positional encodings; sequence order is inherently modeled by the recurrence.
- Strategic visual token merging with ascending-sorted reordering ensures that merged video tokens are compatible with the RNN's sequence processing, minimizing the information loss that can result from aggressive token reduction.
Each step, from vision encoding to token merging, reordering, and linear RNN inference, is modular—enabling future improvements or swaps (e.g., stronger visual backbones or enhanced token merging techniques).
6. Broader Impacts and Future Directions
AuroraLong demonstrates that bringing RNNs—especially variants like RWKV with linear attention—into the large-scale multimodal modeling landscape can dramatically widen accessibility for open-ended video understanding. Its modest resource requirements lower the barrier for research, application, and deployment across domains where long video analysis is essential, including surveillance, movie analysis, scientific video archives, and more.
Prospective advances include scaling the model, integrating stronger vision encoders, exploring adaptive or learned token aggregation, and extending the approach to other modalities where sequence length and memory bottlenecks are persistent concerns.
By maintaining high-quality reasoning over extremely long video contexts, AuroraLong signposts a paradigm shift away from transformer-dominated long-sequence modeling, offering an attractive blueprint for the next generation of efficient, scalable multimodal AI systems (2507.02591).