Context-Aware Integration of Language and Visual References for Natural Language Tracking (2403.19975v1)
Abstract: Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methods perform language-based and template-based matching for target reasoning separately and then merge the results from the two sources; this causes tracking drift when the language or visual template misaligns with the dynamic target state, and introduces ambiguity in the later merging stage. To tackle these issues, we propose a joint multi-modal tracking framework with 1) a prompt modulation module that leverages the complementarity between temporal visual templates and language expressions, yielding precise, context-aware appearance and linguistic cues, and 2) a unified target decoding module that integrates the multi-modal reference cues and executes the integrated queries on the search image to predict the target location directly, in an end-to-end manner. This design ensures spatio-temporal consistency by leveraging historical visual information and provides an integrated solution that generates predictions in a single step. Extensive experiments on TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of the proposed approach, demonstrating competitive performance against state-of-the-art methods for both tracking and grounding.
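The two modules described in the abstract can be sketched, very schematically, as follows. This is a toy illustration only: the function names (`modulate_prompts`, `decode_target`), the convex blend of cues, and the dot-product scoring are illustrative assumptions, not the paper's actual transformer-based design.

```python
# Toy sketch of the two-stage design: (1) prompt modulation fuses the
# temporal visual template cue with the language cue into one reference,
# (2) unified decoding scores the search region against that reference
# and predicts the target location in a single step.
# All names and the simple linear fusion are hypothetical stand-ins.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def modulate_prompts(template_feat, lang_feat, alpha=0.5):
    """Blend the visual template and linguistic cues into one
    context-aware reference vector (illustrative convex combination)."""
    return [alpha * t + (1 - alpha) * l
            for t, l in zip(template_feat, lang_feat)]

def decode_target(reference, search_feats):
    """Score every candidate position in the search region against the
    integrated reference and return the best-matching position index."""
    scores = [dot(reference, f) for f in search_feats]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: three candidate positions with 2-D features.
template = [1.0, 0.0]   # visual template cue
language = [0.0, 1.0]   # linguistic cue
search = [[0.9, 0.1], [0.6, 0.7], [0.1, 0.2]]

ref = modulate_prompts(template, language)   # -> [0.5, 0.5]
best = decode_target(ref, search)            # -> 1 (highest joint score)
```

The point of the sketch is the single-pass structure: one fused reference drives one decoding step, rather than two separate matchers whose outputs must be reconciled afterwards.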
- Yanyan Shao
- Shuting He
- Qi Ye
- Yuchao Feng
- Wenhan Luo
- Jiming Chen