Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
153 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Contrastive Video-Language Segmentation (2109.14131v1)

Published 29 Sep 2021 in cs.CV and cs.CL

Abstract: We focus on the problem of segmenting a certain object referred by a natural language sentence in video content, at the core of formulating a pinpoint vision-language relation. While existing attempts mainly construct such relation in an implicit way, i.e., grid-level multi-modal feature fusion, it has been proven problematic to distinguish semantically similar objects under this paradigm. In this work, we propose to interwind the visual and linguistic modalities in an explicit way via the contrastive learning objective, which directly aligns the referred object and the language description and separates the unreferred content apart across frames. Moreover, to remedy for the degradation problem, we present two complementary hard instance mining strategies, i.e., Language-relevant Channel Filter and Relative Hard Instance Construction. They encourage the network to exclude visual-distinguishable feature and to focus on easy-confused objects during the contrastive training. Extensive experiments on two benchmarks, i.e., A2D Sentences and J-HMDB Sentences, quantitatively demonstrate the state-of-the-arts performance of our method and qualitatively show the more accurate distinguishment between semantically similar objects over baselines.

Summary

We haven't generated a summary for this paper yet.