EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? (2405.17719v2)
Abstract: Egocentric video-language pretraining is a crucial paradigm for advancing the learning of egocentric hand-object interactions (EgoHOI). Despite great success on existing testbeds, these benchmarks focus on closed-set visual concepts or limited scenarios. Because diverse EgoHOIs occur in the real world, we propose an open-vocabulary benchmark, EgoHOIBench, which reveals the diminished performance of current egocentric video-language models (EgoVLMs) on fine-grained concepts, indicating that these models still lack a full spectrum of egocentric understanding. We attribute this performance gap to insufficient fine-grained supervision and to a strong bias in current methods towards understanding objects rather than temporal dynamics. To tackle these issues, we introduce EgoNCE++, a novel asymmetric contrastive objective for EgoHOI. For the video-to-text loss, we enhance text supervision by generating negative captions, leveraging the in-context learning of LLMs to perform HOI-related word substitution. For the text-to-video loss, we propose an object-centric positive video sampling strategy that aggregates video representations sharing the same nouns. Our extensive experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition across various egocentric models, with improvements of up to +26.55%. Our code is available at https://github.com/xuboshen/EgoNCEpp.
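Below is a minimal PyTorch sketch of what such an asymmetric contrastive objective could look like. It is an illustrative assumption, not the authors' released implementation (see https://github.com/xuboshen/EgoNCEpp for the official code): the function name, tensor shapes, temperature, and the way same-noun videos are treated as extra text-to-video positives are all hypothetical choices made here for clarity.

```python
# Sketch of an asymmetric contrastive objective in the spirit of EgoNCE++.
# Assumptions: batch-level features are precomputed; HOI-negative captions
# (LLM word substitutions) are already encoded; noun_ids marks the interacted
# object noun of each caption for object-centric positive sampling.
import torch
import torch.nn.functional as F


def egonce_pp_sketch(video_emb, text_emb, neg_text_emb, noun_ids, temperature=0.07):
    """
    video_emb:    (B, D) video features
    text_emb:     (B, D) features of the paired (positive) captions
    neg_text_emb: (B, K, D) features of K HOI-negative captions per video
    noun_ids:     (B,) integer id of the object noun in each caption
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    neg_text_emb = F.normalize(neg_text_emb, dim=-1)

    # Video-to-text: contrast each video against in-batch captions plus
    # its own hard negative captions (verb/noun substitutions).
    sim_v2t = video_emb @ text_emb.t() / temperature                                # (B, B)
    sim_v2neg = torch.einsum('bd,bkd->bk', video_emb, neg_text_emb) / temperature   # (B, K)
    logits_v2t = torch.cat([sim_v2t, sim_v2neg], dim=1)                             # (B, B+K)
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2t = F.cross_entropy(logits_v2t, targets)

    # Text-to-video: videos whose captions share the same noun are treated
    # as additional positives (object-centric positive sampling).
    sim_t2v = text_emb @ video_emb.t() / temperature                                # (B, B)
    pos_mask = (noun_ids.unsqueeze(0) == noun_ids.unsqueeze(1)).float()             # (B, B)
    log_prob = sim_t2v.log_softmax(dim=1)
    loss_t2v = (-(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1)).mean()

    return loss_v2t + loss_t2v
```

Note that the paper describes aggregating the representations of same-noun videos into the positive target; the multi-positive softmax above is a simpler stand-in for that idea, chosen here only to keep the sketch self-contained.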
Authors: Boshen Xu, Ziheng Wang, Yang Du, Sipeng Zheng, Zhinan Song, Qin Jin