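- The paper demonstrates that internalizing token compression in the vision encoder, trained with Compressed Token Distillation (CTD) and LLM Adaptation (LMA), can reduce end-to-end inference time by up to 35% and process up to 8x more frames under a fixed latency budget.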
- It introduces a novel architecture that interleaves spatial attention with depth-wise temporal convolutions for efficient spatio-temporal downsampling.
- The study validates its approach on multiple benchmarks, showing improved accuracy and efficiency over traditional post-hoc token reduction methods.
LiteFrame: Internalizing Efficient Vision Encoding for Scalable Video LLMs
Motivation and Identified Bottleneck
Video Large Language Models (Video LLMs) have been limited primarily by the quadratic complexity of processing visual tokens in the LLM for long-form video inputs. Existing strategies emphasize post-hoc token reduction, which compresses tokens after dense vision encoders have extracted features, to alleviate LLM overhead. However, this paradigm ignores a fundamental scaling bottleneck: as LLM context length increases, the cumulative latency shifts to the vision encoder's per-frame processing. The dominant architectural flow thus becomes increasingly inefficient, especially for models that must handle long temporal context. LiteFrame directly targets the inefficiency in visual token extraction, reworking the paradigm so that models can scale to higher frame counts while reducing latency.
Methodology: LiteFrame Architecture and Training
The LiteFrame encoder is a compact, token-compressive vision transformer that embeds explicit spatio-temporal compression into its structure. The backbone builds on ViT-Base (12 layers, 768-dimensional embeddings), interleaving spatial attention with low-latency depth-wise 1D temporal convolutions and using staged strided convolutions for spatio-temporal downsampling. This design minimizes redundancy across frames and allows for aggressive reduction in visual tokens, yielding high throughput without the inefficiency of standard frame-wise encoding.
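As a rough illustration of the temporal component, the following minimal Python sketch applies a depth-wise 1D convolution along the frame axis with a stride, so each channel is filtered independently and the number of frames shrinks. The function name, kernel values, and toy input are assumptions made for illustration; the source does not specify kernel sizes, strides, or where these layers sit relative to the spatial attention blocks.

```python
def depthwise_temporal_conv(tokens, kernels, stride=1):
    """Depth-wise 1D convolution along the time axis.

    tokens:  list of T frames, each a list of C channel values
             (one spatial position, tracked across time).
    kernels: list of C kernels, each a list of K weights; channel c is
             filtered only by kernels[c] (no cross-channel mixing).
    stride:  temporal step; stride > 1 downsamples the frame axis.
    """
    num_frames = len(tokens)
    num_channels = len(kernels)
    kernel_size = len(kernels[0])
    pad = kernel_size // 2  # zero padding so the window is centred on each frame

    output = []
    for t in range(0, num_frames, stride):
        frame_out = []
        for c in range(num_channels):
            acc = 0.0
            for k in range(kernel_size):
                src = t + k - pad
                if 0 <= src < num_frames:  # positions outside the clip count as zero
                    acc += kernels[c][k] * tokens[src][c]
            frame_out.append(acc)
        output.append(frame_out)
    return output


# Toy input (assumed): 8 frames, 2 channels, for a single spatial position.
tokens = [[float(t), 1.0] for t in range(8)]
# Hypothetical kernels: a 3-tap average for channel 0, identity for channel 1.
kernels = [[1 / 3, 1 / 3, 1 / 3], [0.0, 1.0, 0.0]]

halved = depthwise_temporal_conv(tokens, kernels, stride=2)  # 8 frames -> 4 frames
```

Because each channel has its own small kernel, the cost grows with the channel count times the kernel size, rather than with the square of the channel count as in a full convolution, which is consistent with the low-latency motivation given above.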
Compressed Token Distillation (CTD)
To transfer information-rich features from a larger teacher (ViT-Large, frame-wise encoding), LiteFrame introduces Compressed Token Distillation (CTD). CTD utilizes Weighted Average Pooling (WAP) to generate compressed, salient supervision targets from the teacher’s dense feature maps. The student encoder is trained to predict these pooled representations, allowing it to bypass redundant computations and directly output highly informative tokens. The distillation objective is a mean-squared error loss between the student’s output and the teacher’s pooled projections, aligning the student’s latent space with spatio-temporal content relevance.
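The distillation target and loss can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the source does not say how the WAP weights are obtained, so the example assumes a softmax over hypothetical per-token saliency scores, and the helper names and toy vectors are illustrative only.

```python
import math


def weighted_average_pool(tokens, scores):
    """Pool a group of teacher tokens into one target vector.

    tokens: list of N vectors (each a list of D floats).
    scores: list of N saliency scores; softmax turns them into weights
            (an assumption; the source does not define the weighting).
    """
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


def mse_loss(student, target):
    """Mean-squared error between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, target)) / len(target)


# Four teacher tokens (D=2) that one compressed student token must summarize.
teacher_group = [[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
uniform_scores = [0.0, 0.0, 0.0, 0.0]  # equal scores reduce WAP to a plain mean

target = weighted_average_pool(teacher_group, uniform_scores)
student_token = [2.0, 0.0]  # hypothetical student output
loss = mse_loss(student_token, target)
```

With non-uniform scores, the pooled target leans toward the tokens judged more salient, which is the property that distinguishes weighted pooling from plain average pooling.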
LLM Adaptation (LMA)
Despite achieving strong feature compression and efficiency via CTD, aligning the token space with the LLM’s reasoning requirements demands further adaptation. LMA employs LoRA-based fine-tuning on video-text pairs, optimizing the joint latent space for compatibility with both compressed vision features and extended temporal contexts. This two-stage adaptation is critical for preserving downstream accuracy and enabling inference across up to 512 input frames.
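For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update on a single linear layer: the pretrained weight stays frozen and only two small matrices are trained. It illustrates the general technique only; which layers LMA adapts, the rank, and the scaling factor are not given in the source, so all values here are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, frozen_w, lora_a, lora_b, alpha):
    """Compute y = W x + (alpha / r) * B (A x) with W kept frozen.

    frozen_w: d_out x d_in pretrained weight (not updated).
    lora_a:   r x d_in trainable down-projection.
    lora_b:   d_out x r trainable up-projection (zero at initialisation).
    """
    rank = len(lora_a)
    base = matvec(frozen_w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


x = [1.0, 2.0, 3.0]
frozen_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # d_out=2, d_in=3
lora_a = [[0.5, 0.5, 0.5]]                      # rank r=1
lora_b_init = [[0.0], [0.0]]                    # zero init: output equals the base model
lora_b_trained = [[1.0], [-1.0]]                # hypothetical values after fine-tuning

y_init = lora_forward(x, frozen_w, lora_a, lora_b_init, alpha=2.0)
y_trained = lora_forward(x, frozen_w, lora_a, lora_b_trained, alpha=2.0)
```

Initialising the up-projection to zero means fine-tuning starts from the unmodified model, and the number of trainable parameters per layer is r * (d_in + d_out) instead of d_in * d_out.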
Experimental Results and Numerical Analysis
LiteFrame establishes a new latency-accuracy Pareto frontier across key benchmarks (Video-MME, MLVU, LongVideoBench, HLVid). With only 87M encoder parameters (vs. 304M in teacher models), LiteFrame processes up to 8x more frames under fixed latency budgets and achieves up to a 35% reduction in end-to-end inference time, outperforming baselines including InternVL3-8B and state-of-the-art post-hoc reduction methods (ToMe, FastVID, PruMerge). The model also improves accuracy by up to 2.1 percentage points under latency constraints and maintains superior performance when scaled to high spatial resolutions, demonstrating robust generalization.
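To make the term concrete, a configuration lies on the latency-accuracy Pareto frontier if no other configuration is both at least as fast and at least as accurate, with a strict improvement on one axis. The sketch below computes that set; the configuration names and numbers are placeholders chosen for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the points not dominated in (lower latency, higher accuracy).

    points: list of (name, latency, accuracy) tuples.
    A point is dominated if another point is at least as good on both
    axes and strictly better on at least one.
    """
    frontier = []
    for name, lat, acc in points:
        dominated = any(
            o_lat <= lat and o_acc >= acc and (o_lat < lat or o_acc > acc)
            for _, o_lat, o_acc in points
        )
        if not dominated:
            frontier.append((name, lat, acc))
    return sorted(frontier, key=lambda p: p[1])  # order by latency


# Placeholder values for illustration only; NOT results from the paper.
configs = [
    ("config_a", 1.0, 60.0),
    ("config_b", 2.0, 64.0),
    ("config_c", 2.5, 63.0),  # slower and less accurate than config_b
    ("config_d", 4.0, 66.0),
]
frontier_names = [name for name, _, _ in pareto_frontier(configs)]
```

Claiming a new frontier therefore means that, at each latency budget considered, the reported configurations are not dominated by any baseline.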
Ablation studies underline the value of:
- Token-compressive student design for latency reduction
- Depth-wise temporal convolutions as superior for efficient temporal modeling
- WAP-based distillation for optimal compression and semantic transfer
- LMA for bridging modality gaps and adapting to extended video context
Distributing aggressive compression across both spatial and temporal dimensions consistently outperforms spatial-only compression, because it preserves the spatial fidelity that demanding benchmarks require.
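A small token-count calculation shows why splitting the reduction across axes helps. The frame count, patch grid, and strides below are hypothetical, since the source does not state the actual compression ratios; the point is only that the same overall reduction can leave very different spatial grids.

```python
def visual_tokens(frames, grid_h, grid_w, spatial_stride=1, temporal_stride=1):
    """Count tokens after strided downsampling on each axis.

    Each spatial axis shrinks by spatial_stride and the frame axis by
    temporal_stride (ceiling division keeps partial windows).
    """
    def ceil_div(n, d):
        return -(-n // d)

    return (
        ceil_div(frames, temporal_stride)
        * ceil_div(grid_h, spatial_stride)
        * ceil_div(grid_w, spatial_stride)
    )


# Hypothetical setting: 64 frames, 16x16 patch grid per frame.
dense = visual_tokens(64, 16, 16)
# Spatial-only: stride 4 on height and width gives a 16x reduction, leaving a 4x4 grid.
spatial_only = visual_tokens(64, 16, 16, spatial_stride=4)
# Spatio-temporal: stride 2 spatially (4x) and 4 temporally also gives 16x,
# but keeps an 8x8 grid for each retained time step.
spatio_temporal = visual_tokens(64, 16, 16, spatial_stride=2, temporal_stride=4)
```

Both compressed settings hand the LLM the same number of tokens, yet the spatio-temporal one retains four times as many spatial positions per time step, which is one plausible reading of why it preserves spatial detail better.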
Comparison and Implications
LiteFrame’s approach proves more efficient than prior image-centric lightweight vision encoders (FastVLM, VideoPanda) and contemporary dual-stage reduction methods (AutoGaze), as it internalizes compression within the encoder, thereby reducing both vision and LLM latency. The architectural synergy between CTD and LMA unlocks scalability for long-form video processing, enabling Video LLMs to leverage richer temporal context within fixed compute budgets.
Practically, LiteFrame enables real-time or near-real-time inference in multimodal LLMs for complex video understanding. Theoretically, its methodology demonstrates the advantages of architectural internalization of compression via distillation, rather than relying on incremental post-hoc token reduction. This suggests a shift in future research focus—towards designing vision-language encoders that are jointly optimizable and scalable in both spatial and temporal domains.
Limitations and Prospects
While LiteFrame advances efficiency without sacrificing accuracy, open questions remain: how far model size can be scaled down, how to stabilize CTD for ultra-lightweight encoders, and how well the encoder transfers zero-shot to static image-only tasks. New, longer-form video datasets could further exploit its expanded temporal window.
Conclusion
LiteFrame internalizes spatio-temporal compression in a lightweight vision encoder backbone through Compressed Token Distillation and LLM Adaptation, replacing redundant full-resolution computation with direct prediction of pooled, information-rich features. This paradigm fundamentally redefines the efficiency-accuracy trade-off in Video LLMs, allowing scalable long-form video understanding by balancing frame counts, spatial detail, and inference latency. Its approach provides both practical gains in multimodal video processing and a new direction for efficient vision-language modeling (2605.17260).