
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding (2303.16341v3)

Published 28 Mar 2023 in cs.CV

Abstract: Existing video-language pre-training methods primarily focus on instance-level alignment between video clips and captions via global contrastive learning, but neglect the rich fine-grained local information in both videos and text, which is important for downstream tasks requiring temporal localization and semantic reasoning. A powerful model is expected to capture region-object correspondences and recognize scene changes within a video clip, reflecting spatial and temporal granularity, respectively. To strengthen the model's understanding of such fine-grained details, we propose a simple yet effective video-language modeling framework, S-ViLM, that exploits the intrinsic structures of these two modalities. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to simultaneously promote the learning of region-object alignment and temporal-aware features. Comprehensive evaluations demonstrate that S-ViLM performs favorably against existing approaches in learning more expressive representations. Specifically, S-ViLM substantially surpasses state-of-the-art methods on four representative downstream tasks: text-video retrieval, video question answering, video action recognition, and temporal action localization.
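The instance-level alignment that the abstract contrasts with S-ViLM's fine-grained designs is typically implemented as a symmetric InfoNCE contrastive loss between paired clip and caption embeddings. The sketch below illustrates that generic objective; it is a minimal, assumed implementation for clarity, not the exact S-ViLM loss, and the function name and temperature value are illustrative choices.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    video_emb, text_emb: (B, D) arrays; row i of each is a matched
    clip/caption pair, and all other rows serve as negatives.
    This is a generic sketch of instance-level contrastive alignment.
    """
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # diagonal entries are the positives

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(len(y)), y].mean()

    # Symmetric: video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Because the loss only sees one global embedding per clip and per caption, it cannot localize which region matches which object or where a scene change occurs; this is the gap the paper's spatial grounding and temporal grouping objectives are designed to fill.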

Authors (8)
  1. Yuanhao Xiong (12 papers)
  2. Long Zhao (64 papers)
  3. Boqing Gong (100 papers)
  4. Ming-Hsuan Yang (376 papers)
  5. Florian Schroff (21 papers)
  6. Ting Liu (329 papers)
  7. Cho-Jui Hsieh (211 papers)
  8. Liangzhe Yuan (19 papers)