Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions (1904.03885v1)

Published 8 Apr 2019 in cs.CV, cs.CL, and cs.LG

Abstract: This paper presents a new task, the grounding of spatio-temporal identifying descriptions in videos. Previous work suggests potential bias in existing datasets and emphasizes the need for a new data creation schema to better model linguistic structure. We introduce a new data collection scheme based on grammatical constraints for surface realization, which enables us to investigate the problem of grounding spatio-temporal identifying descriptions in videos. We then propose a two-stream modular attention network that learns to ground spatio-temporal identifying descriptions based on appearance and motion. We show that motion modules help to ground motion-related words and also help appearance modules learn, because modular neural networks resolve task interference between modules. Finally, we identify a future challenge: the need for a robust system when ground-truth visual annotations are replaced with an automatic video object detector and temporal event localization.
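
To make the two-stream modular design concrete, here is a minimal PyTorch sketch of the idea described in the abstract: separate appearance and motion modules each attend over the words of the description and score candidate object tracks, with a language-driven gate combining the two streams. All names, feature dimensions, and the gating scheme are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of a two-stream modular attention grounder.
# Dimensions, module names, and the gating mechanism are assumptions
# for illustration; they are not taken from the paper.
import torch
import torch.nn as nn

class ModularAttentionStream(nn.Module):
    """Scores candidate visual features against a description,
    using word-level attention to form a module-specific phrase embedding."""
    def __init__(self, vis_dim, lang_dim, hid_dim=256):
        super().__init__()
        self.word_attn = nn.Linear(lang_dim, 1)   # attention logits over words
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.lang_proj = nn.Linear(lang_dim, hid_dim)

    def forward(self, vis_feats, word_feats):
        # vis_feats: (num_candidates, vis_dim); word_feats: (num_words, lang_dim)
        attn = torch.softmax(self.word_attn(word_feats), dim=0)  # (num_words, 1)
        phrase = (attn * word_feats).sum(dim=0)                  # (lang_dim,)
        v = torch.tanh(self.vis_proj(vis_feats))                 # (N, hid_dim)
        l = torch.tanh(self.lang_proj(phrase))                   # (hid_dim,)
        return (v * l).sum(dim=-1)                               # (N,) match scores

class TwoStreamGrounder(nn.Module):
    """Combines an appearance stream and a motion stream; the description
    itself decides how much each stream contributes (task separation keeps
    the modules from interfering with one another)."""
    def __init__(self, app_dim, mot_dim, lang_dim):
        super().__init__()
        self.appearance = ModularAttentionStream(app_dim, lang_dim)
        self.motion = ModularAttentionStream(mot_dim, lang_dim)
        self.gate = nn.Linear(lang_dim, 2)  # language-conditioned stream weights

    def forward(self, app_feats, mot_feats, word_feats):
        w = torch.softmax(self.gate(word_feats.mean(dim=0)), dim=-1)
        return (w[0] * self.appearance(app_feats, word_feats)
                + w[1] * self.motion(mot_feats, word_feats))

# Usage: score 5 candidate object tracks against a 7-word description.
model = TwoStreamGrounder(app_dim=2048, mot_dim=1024, lang_dim=300)
scores = model(torch.randn(5, 2048), torch.randn(5, 1024), torch.randn(7, 300))
best = scores.argmax()  # index of the grounded candidate
```

Under this sketch, a motion-heavy description ("the person walking left") would shift the gate toward the motion stream, which is one plausible reading of why motion modules help ground motion-related words.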

Authors (4)
  1. Peratham Wiriyathammabhum (9 papers)
  2. Abhinav Shrivastava (120 papers)
  3. Vlad I. Morariu (31 papers)
  4. Larry S. Davis (98 papers)
Citations (4)