
Relation-aware Video Reading Comprehension for Temporal Language Grounding (2110.05717v3)

Published 12 Oct 2021 in cs.CV and cs.CL

Abstract: Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or as a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor matches the visual and textual information at both the sentence-moment and token-moment levels, yielding a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor leverages graph convolution to capture the dependencies among video moment choices for the best choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of the solution. Code is available.
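The multiple-choice framing in the abstract can be illustrated with a toy sketch. This is not RaNet: it assumes that clip features and a single query vector are already given, uses cosine similarity as a stand-in for the learned choice-query interactor, and uses one round of score averaging over temporally overlapping candidates as a stand-in for the graph-convolution relation constructor. All function names and the `iou_threshold` parameter are hypothetical.

```python
import math

def enumerate_moments(num_clips):
    """Predefined answer set: every inclusive (start, end) clip span."""
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]

def temporal_iou(a, b):
    """Intersection over union of two inclusive clip spans."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return inter / union

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_pool(clip_feats, span):
    """Average the clip features inside a span to get one moment feature."""
    rows = clip_feats[span[0]:span[1] + 1]
    return [sum(col) / len(rows) for col in zip(*rows)]

def select_moment(clip_feats, query_vec, iou_threshold=0.5):
    """Score every candidate moment against the query, smooth the scores
    over overlapping candidates, and return the best (moment, score)."""
    moments = enumerate_moments(len(clip_feats))
    # Assumed stand-in for choice-query interaction: cosine similarity.
    scores = [cosine(mean_pool(clip_feats, m), query_vec) for m in moments]
    # Assumed stand-in for choice-choice relations: average each score with
    # those of candidates whose temporal IoU reaches the threshold.
    refined = []
    for m in moments:
        nbrs = [scores[j] for j, n in enumerate(moments)
                if temporal_iou(m, n) >= iou_threshold]
        refined.append(sum(nbrs) / len(nbrs))  # never empty: IoU(m, m) == 1
    best = max(range(len(moments)), key=lambda i: refined[i])
    return moments[best], refined[best]

if __name__ == "__main__":
    # Four clips with 2-d features; the query resembles the last two clips.
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(select_moment(feats, [0.0, 1.0]))
```

In this toy setup the answer set grows quadratically with the number of clips, which is one reason practical systems typically sparsify or downsample the candidate set; how RaNet does so is not stated in the abstract.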

Authors (5)
  1. Jialin Gao (18 papers)
  2. Xin Sun (151 papers)
  3. Mengmeng Xu (27 papers)
  4. Xi Zhou (43 papers)
  5. Bernard Ghanem (256 papers)
Citations (41)
