Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions (1904.03885v1)

Published 8 Apr 2019 in cs.CV, cs.CL, and cs.LG

Abstract: This paper presents a new task, the grounding of spatio-temporal identifying descriptions in videos. Previous work suggests potential bias in existing datasets and emphasizes the need for a new data creation schema to better model linguistic structure. We introduce a new data collection scheme based on grammatical constraints for surface realization, enabling us to investigate the problem of grounding spatio-temporal identifying descriptions in videos. We then propose a two-stream modular attention network that learns and grounds spatio-temporal identifying descriptions based on appearance and motion. We show that motion modules help to ground motion-related words and also improve learning in the appearance modules, because modular neural networks resolve task interference between modules. Finally, we pose a future challenge and the need for a robust system arising from replacing ground-truth visual annotations with automatic video object detection and temporal event localization.
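The two-stream modular design described in the abstract can be illustrated with a minimal sketch: each stream (appearance, motion) scores candidate objects against the query, and a language-derived module weight mixes the two score vectors before a softmax produces attention over candidates. All names, dimensions, and the linear scoring form below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def ground_description(app_feats, mot_feats, query_emb,
                       W_app, W_mot, module_weights):
    """Hypothetical two-stream modular scoring.

    app_feats : (N, d_app) appearance features for N candidate objects
    mot_feats : (N, d_mot) motion features for the same candidates
    query_emb : (d_q,) embedding of the identifying description
    W_app, W_mot : per-module projections of the query into each space
    module_weights : (2,) language-derived mix of the two modules
    """
    app_scores = app_feats @ (W_app @ query_emb)   # (N,) appearance match
    mot_scores = mot_feats @ (W_mot @ query_emb)   # (N,) motion match
    combined = module_weights[0] * app_scores + module_weights[1] * mot_scores
    return softmax(combined)  # attention distribution over N candidates

# Toy usage with random features (shapes are arbitrary choices).
rng = np.random.default_rng(0)
N, d_app, d_mot, d_q = 5, 8, 6, 4
probs = ground_description(
    rng.standard_normal((N, d_app)),
    rng.standard_normal((N, d_mot)),
    rng.standard_normal(d_q),
    rng.standard_normal((d_app, d_q)),
    rng.standard_normal((d_mot, d_q)),
    np.array([0.6, 0.4]),  # e.g. an appearance-heavy phrase
)
```

Separating the streams this way mirrors the abstract's claim that modular networks reduce task interference: motion-related words can be routed mostly to the motion module without degrading the appearance module's training signal.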

Authors (4)
  1. Peratham Wiriyathammabhum (9 papers)
  2. Abhinav Shrivastava (120 papers)
  3. Vlad I. Morariu (31 papers)
  4. Larry S. Davis (98 papers)
Citations (4)
