
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos (1901.06829v1)

Published 21 Jan 2019 in cs.CV

Abstract: The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, both of which inevitably suffer from the cost of exhaustively enumerated candidates. To alleviate this problem, we formulate the task as a sequential decision-making problem by learning an agent that progressively regulates the temporal grounding boundaries according to its policy. Specifically, we propose a reinforcement learning-based framework improved by multi-task learning, which shows steady performance gains when additional supervised boundary information is considered during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset and the Charades-STA dataset while observing only 10 or fewer clips per video.
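
The following is a minimal sketch (not the authors' code) of the sequential decision-making loop the abstract describes: an agent observes the currently selected clip together with the sentence, then shifts the temporal boundaries step by step under its policy, stopping within a small step budget. The action set, feature dimensions, step size, and all module names below are illustrative assumptions.

```python
# Hypothetical sketch of RL-based temporal boundary regulation for video grounding.
import torch
import torch.nn as nn

ACTIONS = [  # assumed discrete action set for regulating the boundaries
    "shift_start_left", "shift_start_right",
    "shift_end_left", "shift_end_right",
    "stop",
]

class GroundingPolicy(nn.Module):
    """Scores actions from (current clip feature, sentence feature, boundaries).

    The second head is a stand-in for the supervised boundary branch that the
    abstract's multi-task learning refers to; its exact form is an assumption.
    """
    def __init__(self, clip_dim=512, sent_dim=512, hidden=256):
        super().__init__()
        self.encoder = nn.Linear(clip_dim + sent_dim + 2, hidden)
        self.policy_head = nn.Linear(hidden, len(ACTIONS))  # RL branch
        self.boundary_head = nn.Linear(hidden, 2)           # supervised branch

    def forward(self, clip_feat, sent_feat, bounds):
        x = torch.cat([clip_feat, sent_feat, bounds], dim=-1)
        h = torch.relu(self.encoder(x))
        return self.policy_head(h), self.boundary_head(h)

def ground(video_feats, sent_feat, policy, max_steps=10):
    """Adjust boundaries for at most `max_steps` observations per video,
    matching the paper's reported budget of 10 or fewer clips."""
    t_len = video_feats.size(0)
    start, end = 0.0, 1.0  # normalized boundaries over the whole video
    step = 0.1             # assumed fixed boundary step size
    for _ in range(max_steps):
        # Mean-pool the frame features of the currently selected clip.
        lo = int(start * (t_len - 1))
        hi = max(int(end * (t_len - 1)), lo + 1)
        clip_feat = video_feats[lo:hi].mean(dim=0)
        bounds = torch.tensor([start, end])
        logits, _ = policy(clip_feat, sent_feat, bounds)
        action = ACTIONS[torch.distributions.Categorical(logits=logits).sample().item()]
        if action == "stop":
            break
        if action == "shift_start_left":
            start = max(0.0, start - step)
        elif action == "shift_start_right":
            start = min(end - step, start + step)
        elif action == "shift_end_left":
            end = max(start + step, end - step)
        elif action == "shift_end_right":
            end = min(1.0, end + step)
    return start, end
```

In a training setup along these lines, the policy head would be optimized with a policy-gradient reward such as temporal IoU against the ground-truth segment, while the boundary head adds a supervised regression loss, consistent with the multi-task gains the abstract reports; the specific reward and loss weighting here are assumptions.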

Authors (6)
  1. Dongliang He (46 papers)
  2. Xiang Zhao (60 papers)
  3. Jizhou Huang (26 papers)
  4. Fu Li (86 papers)
  5. Xiao Liu (402 papers)
  6. Shilei Wen (42 papers)
Citations (148)
