LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

Published 17 May 2026 in cs.CV | (2605.17260v1)

Abstract: The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further LLM Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier -- compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8x more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.


Authors (8)

Summary

The paper demonstrates that token compression through Compressed Token Distillation (CTD) and LLM Adaptation (LMA) can reduce end-to-end inference latency by up to 35% while processing 8x more frames.
It introduces a novel architecture that interleaves spatial attention with depth-wise temporal convolutions for efficient spatio-temporal downsampling.
The study validates its approach on multiple benchmarks, showing improved accuracy and efficiency over traditional post-hoc token reduction methods.

LiteFrame: Internalizing Efficient Vision Encoding for Scalable Video LLMs

Motivation and Identified Bottleneck

Recent Video Large Language Models (Video LLMs) have been limited primarily by the quadratic cost of processing visual tokens in the LLM for long-form video inputs. Existing strategies emphasize post-hoc token reduction, which compresses tokens after dense feature extraction by the vision encoder, to alleviate LLM overhead. However, this paradigm overlooks a fundamental scaling bottleneck: once the visual tokens passed to the LLM have been reduced, the dominant latency shifts from the LLM to the vision encoder's per-frame processing, which still runs in full on every frame. The standard pipeline thus becomes increasingly inefficient as frame counts grow, especially for models that must handle long temporal context. LiteFrame targets this inefficiency in visual token extraction directly, so that frame counts can scale up while latency goes down.
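The bottleneck shift can be illustrated with a toy latency model. This is only a sketch: the per-frame encoder cost, the tokens per frame, the per-token LLM cost, and the linear LLM cost model are all hypothetical values chosen for illustration, not measurements from the paper.

```python
def latency_breakdown(num_frames, tokens_per_frame, keep_ratio,
                      enc_ms_per_frame=10.0, llm_ms_per_token=0.05):
    """Toy end-to-end latency model (all constants are hypothetical).

    The encoder cost scales with the number of frames, because post-hoc
    reduction runs only after every frame has been fully encoded. The LLM
    cost is modeled as linear in the number of kept visual tokens, which
    ignores the quadratic attention term for simplicity.
    """
    encoder_ms = num_frames * enc_ms_per_frame
    kept_tokens = int(num_frames * tokens_per_frame * keep_ratio)
    llm_ms = kept_tokens * llm_ms_per_token
    total_ms = encoder_ms + llm_ms
    return {
        "encoder_ms": encoder_ms,
        "llm_ms": llm_ms,
        "encoder_share": encoder_ms / total_ms,
    }


# Same video, with and without an 8x post-hoc token reduction.
dense = latency_breakdown(num_frames=64, tokens_per_frame=256, keep_ratio=1.0)
reduced = latency_breakdown(num_frames=64, tokens_per_frame=256, keep_ratio=0.125)

print(f"dense:   encoder share = {dense['encoder_share']:.2f}")
print(f"reduced: encoder share = {reduced['encoder_share']:.2f}")
```

Under these assumed constants the encoder time is identical in both runs, so shrinking the LLM's share of the work makes the encoder the dominant term. That is the regime a cheaper encoder is meant to address.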

Methodology: LiteFrame Architecture and Training

The LiteFrame encoder is architected as a compact, token-compressive vision transformer, embedding explicit spatio-temporal compression into its structure. The backbone leverages ViT-Base (12-layer, 768D), interleaving spatial attention with low-latency depth-wise 1D temporal convolutions, and staged strided convolution for spatio-temporal downsampling. This design minimizes redundancy across frames and allows for aggressive reduction in visual tokens—yielding high throughput without the inefficiency of standard frame-wise encoding.
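A depth-wise temporal convolution with an optional stride can be sketched in NumPy as follows. This is a minimal illustration of the operation type named above, not the paper's implementation: the kernel size, tensor shapes, zero padding, and the function name `depthwise_temporal_conv` are assumptions, and the spatial attention layers and spatial downsampling are omitted.

```python
import numpy as np


def depthwise_temporal_conv(x, kernel, stride=1):
    """Depth-wise 1D convolution along the time axis.

    x:      (T, N, D) tokens for T frames, N spatial tokens, D channels.
    kernel: (K, D), one length-K temporal filter per channel (K odd).

    Each channel is filtered independently, so the cost is O(T*N*D*K)
    rather than the O(T*N*D*D*K) of a full convolution. As in deep
    learning frameworks, the kernel is applied without flipping
    (cross-correlation).
    """
    T, N, D = x.shape
    K = kernel.shape[0]
    pad = K // 2
    # Zero-pad in time so that stride=1 preserves the number of frames.
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)))
    out_T = (T - 1) // stride + 1
    out = np.zeros((out_T, N, D), dtype=x.dtype)
    for i in range(out_T):
        window = xp[i * stride : i * stride + K]          # (K, N, D)
        out[i] = np.einsum("knd,kd->nd", window, kernel)  # per-channel sum over time
    return out


rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16, 4))  # 8 frames, 16 tokens, 4 channels (hypothetical)
kernel = rng.standard_normal((3, 4))      # hypothetical 3-tap filter per channel

mixed = depthwise_temporal_conv(tokens, kernel, stride=1)   # temporal mixing only
halved = depthwise_temporal_conv(tokens, kernel, stride=2)  # mixing + 2x temporal downsampling
print(mixed.shape, halved.shape)  # (8, 16, 4) (4, 16, 4)
```

With `stride=2` the same operation both mixes neighboring frames and halves the number of time steps, which is how a strided convolution folds downsampling into the encoder rather than applying it afterwards.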

Compressed Token Distillation (CTD)

To transfer information-rich features from a larger teacher (ViT-Large, frame-wise encoding), LiteFrame introduces Compressed Token Distillation (CTD). CTD utilizes Weighted Average Pooling (WAP) to generate compressed, salient supervision targets from the teacher’s dense feature maps. The student encoder is trained to predict these pooled representations, allowing it to bypass redundant computations and directly output highly informative tokens. The distillation objective is a mean-squared error loss between the student’s output and the teacher’s pooled projections, aligning the student’s latent space with spatio-temporal content relevance.
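The distillation target and loss described above can be sketched as follows. The summary does not say how the WAP weights are computed, so the scoring rule here (a softmax over each token's L2 norm within its window) is purely an assumption, as are the window size, the tensor shapes, and the function names. The sketch also assumes student and teacher features already share a channel dimension, which in practice would require a projection.

```python
import numpy as np


def weighted_average_pool(feats, window, temperature=1.0):
    """Pool a (T, H, W, D) teacher feature map over non-overlapping windows.

    ASSUMPTION: each token's weight is a softmax over its L2 norm within
    the window. The source does not specify the WAP weighting; this
    scoring rule is a stand-in.
    """
    T, H, W, D = feats.shape
    t, h, w = window
    # Group tokens into windows: (T/t, H/h, W/w, t*h*w, D).
    g = feats.reshape(T // t, t, H // h, h, W // w, w, D)
    g = g.transpose(0, 2, 4, 1, 3, 5, 6)
    g = g.reshape(T // t, H // h, W // w, t * h * w, D)
    scores = np.linalg.norm(g, axis=-1) / temperature
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights[..., None] * g).sum(axis=-2)           # (T/t, H/h, W/w, D)


def ctd_loss(student_tokens, teacher_targets):
    """Mean-squared error between student outputs and pooled teacher targets."""
    return float(np.mean((student_tokens - teacher_targets) ** 2))


rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 8, 8, 16))  # hypothetical dense teacher features
targets = weighted_average_pool(teacher, window=(2, 2, 2))
student = rng.standard_normal(targets.shape)  # stand-in for the student's output

print(targets.shape)               # (2, 4, 4, 16)
print(ctd_loss(targets, targets))  # 0.0
```

The point of the design is that the student is supervised directly on the compressed grid, here 8x fewer tokens than the teacher's dense map, so it never has to produce the dense features at all.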

LLM Adaptation (LMA)

Despite achieving strong feature compression and efficiency via CTD, aligning the token space with the LLM’s reasoning requirements demands further adaptation. LMA employs LoRA-based fine-tuning on video-text pairs, optimizing the joint latent space for compatibility with both compressed vision features and extended temporal contexts. This two-stage adaptation is critical for preserving downstream accuracy and enabling inference across up to 512 input frames.
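For readers unfamiliar with LoRA, the core idea is a frozen weight matrix plus a trainable low-rank update. The sketch below shows that idea for a single linear layer in NumPy; the rank, scaling factor, and layer size are hypothetical, and the summary does not state which LLM layers LMA adapts.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                 # frozen during adaptation
        self.A = rng.standard_normal((rank, in_dim)) * 0.01  # trainable, small random init
        self.B = np.zeros((out_dim, rank))                   # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, in_dim). Because B starts at zero, the adapted layer
        # initially reproduces the frozen layer exactly.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))  # hypothetical frozen projection
layer = LoRALinear(W, rank=4)
x = rng.standard_normal((2, 64))

print(np.allclose(layer(x), x @ W.T))      # True
print(layer.trainable_params(), W.size)    # 512 4096
```

Only `A` and `B` would receive gradient updates, so the number of trained parameters grows with the rank rather than with the full size of the weight matrix.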

Experimental Results and Numerical Analysis

LiteFrame establishes a new latency-accuracy Pareto frontier across key benchmarks (Video-MME, MLVU, LongVideoBench, HLVid). With only 87M encoder parameters (vs. 304M in teacher models), LiteFrame processes up to 8x more frames under fixed latency budgets and achieves up to a 35% reduction in end-to-end inference time, outperforming baselines including InternVL3-8B and state-of-the-art post-hoc reduction methods (ToMe, FastVID, PruMerge). The model also improves accuracy by up to 2.1 percentage points under latency constraints and maintains superior performance when scaled to high spatial resolutions, demonstrating robust generalization.
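To make the term concrete, a latency-accuracy Pareto frontier is the set of models that no other model beats on both axes at once. The sketch below computes it for a handful of points; the model names and numbers are hypothetical placeholders, not results from the paper.

```python
def pareto_frontier(points):
    """Return the points not dominated in (lower latency, higher accuracy).

    points: list of (name, latency, accuracy) tuples.
    A point is dominated if another point is at least as fast and at
    least as accurate, and strictly better on one of the two. Exact
    duplicates are reported once.
    """
    frontier = []
    best_acc = float("-inf")
    # Sweep from fastest to slowest; among equal latencies, best accuracy first.
    for name, lat, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:
            frontier.append((name, lat, acc))
            best_acc = acc
    return frontier


# Hypothetical (latency in seconds, accuracy in %) values, not paper results.
models = [("A", 1.0, 60.0), ("B", 1.5, 58.0), ("C", 2.0, 65.0), ("D", 2.0, 63.0)]
print([name for name, _, _ in pareto_frontier(models)])  # ['A', 'C']
```

A method that establishes a new frontier adds points that dominate some of the previously non-dominated ones, for example a model that is both faster and more accurate than "C".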

Ablation studies underline the value of:

Token-compressive student design for latency reduction
Depth-wise temporal convolutions for efficient temporal modeling
WAP-based distillation for optimal compression and semantic transfer
LMA for bridging modality gaps and adapting to extended video context

Aggressive spatio-temporal compression—distributing reduction across both spatial and temporal dimensions—consistently outperforms spatial-only compression, preserving critical spatial fidelity for rigorous benchmarks.
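A short token-count calculation shows why splitting the reduction across both axes can preserve spatial detail. The frame count, token grid, and compression factors below are hypothetical and chosen only so that both configurations reach the same total budget.

```python
def tokens_after_compression(num_frames, grid, spatial_factor, temporal_factor):
    """Count visual tokens left after compression.

    grid:            spatial tokens per side before compression (grid x grid per frame).
    spatial_factor:  reduction in tokens per frame (4 means 2x per side).
    temporal_factor: number of consecutive frames merged into one time step.
    """
    per_time_step = (grid * grid) // spatial_factor
    time_steps = num_frames // temporal_factor
    return {"total": per_time_step * time_steps,
            "tokens_per_time_step": per_time_step}


# Hypothetical setting: 64 frames, 16x16 tokens each, 16x total compression.
spatial_only = tokens_after_compression(64, 16, spatial_factor=16, temporal_factor=1)
spatio_temporal = tokens_after_compression(64, 16, spatial_factor=4, temporal_factor=4)

print(spatial_only)     # {'total': 1024, 'tokens_per_time_step': 16}
print(spatio_temporal)  # {'total': 1024, 'tokens_per_time_step': 64}
```

Both configurations hand the LLM 1024 tokens, but the spatio-temporal split keeps four times as many tokens per time step, which is one way to read the claim that it preserves spatial fidelity.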

Comparison and Implications

LiteFrame’s approach proves more efficient than prior image-centric lightweight vision encoders (FastVLM, VideoPanda) and contemporary dual-stage reduction methods (AutoGaze), as it internalizes compression within the encoder, thereby reducing both vision and LLM latency. The architectural synergy between CTD and LMA unlocks scalability for long-form video processing, enabling Video LLMs to leverage richer temporal context within fixed compute budgets.

Practically, LiteFrame enables real-time or near-real-time inference in multimodal LLMs for complex video understanding. Theoretically, its methodology demonstrates the advantages of architectural internalization of compression via distillation, rather than relying on incremental post-hoc token reduction. This suggests a shift in future research focus—towards designing vision-language encoders that are jointly optimizable and scalable in both spatial and temporal domains.

Limitations and Prospects

While LiteFrame advances efficiency without sacrificing accuracy, open questions remain about scaling the model down further, stabilizing CTD for ultra-lightweight encoders, and zero-shot transfer to static image-only tasks. New, longer-form video datasets could further exploit its expanded temporal window.

Conclusion

LiteFrame internalizes spatio-temporal compression in a lightweight vision encoder backbone through Compressed Token Distillation and LLM Adaptation, replacing redundant full-resolution computation with direct prediction of pooled, information-rich features. This paradigm fundamentally redefines the efficiency-accuracy trade-off in Video LLMs, allowing scalable long-form video understanding by balancing frame counts, spatial detail, and inference latency. Its approach provides both practical gains in multimodal video processing and a new direction for efficient vision-language modeling (2605.17260).
