Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 77 tok/s

Gemini 2.5 Pro 32 tok/s Pro

GPT-5 Medium 32 tok/s Pro

GPT-5 High 36 tok/s Pro

GPT-4o 100 tok/s Pro

Kimi K2 197 tok/s Pro

GPT OSS 120B 450 tok/s Pro

Claude Sonnet 4.5 35 tok/s Pro

2000 character limit reached

FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding (2504.20384v1)

Published 29 Apr 2025 in cs.CV

Abstract: Recent advancements in video understanding within visual LLMs (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to LLMs, yet many methods either fail to prioritize essential features, leading to redundant inter-frame information, or introduce computationally expensive modules.To address these issues, we propose FiLA(Fine-grained Vision LLM)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy, effectively identifying informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy, boosting model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.