Verbs in Action: Improving verb understanding in video-language models (2304.06708v1)
Abstract: Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it.
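To make component (1) concrete, below is a minimal sketch, not the authors' released code, of a cross-modal contrastive loss whose denominator includes LLM-generated hard-negative captions (e.g., the original caption with its verb swapped). All names, tensor shapes, and the temperature value are assumptions for illustration; the paper's calibration strategy for balancing concept occurrence is not shown here.

```python
# Sketch of cross-modal contrastive learning with hard-negative captions,
# in the spirit of the VFC framework's first component. Assumed interface,
# not the authors' implementation.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(video_emb, text_emb, hard_neg_emb,
                                          temperature=0.07):
    """
    video_emb:    (B, D)    video embeddings from the video encoder
    text_emb:     (B, D)    embeddings of the matching (positive) captions
    hard_neg_emb: (B, K, D) embeddings of K hard-negative captions per video,
                  e.g. the caption with its verb replaced by a pretrained LLM
                  ("opening a door" -> "closing a door")
    """
    B, K, D = hard_neg_emb.shape
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    hard_neg_emb = F.normalize(hard_neg_emb, dim=-1)

    # Similarities to the positive caption and to in-batch negatives.
    logits_batch = video_emb @ text_emb.t() / temperature                # (B, B)
    # Similarities to each video's own LLM-generated hard negatives.
    logits_hard = torch.einsum('bd,bkd->bk',
                               video_emb, hard_neg_emb) / temperature    # (B, K)

    # Positive pair sits on the diagonal of logits_batch; the softmax
    # denominator covers both in-batch negatives and hard negatives.
    logits = torch.cat([logits_batch, logits_hard], dim=1)               # (B, B+K)
    targets = torch.arange(B, device=video_emb.device)
    return F.cross_entropy(logits, targets)
```

The design point the sketch illustrates is that hard negatives are appended per video rather than shared across the batch, so the model is penalised specifically for confusing a video with a near-identical caption that differs only in its verb.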
- Liliane Momeni
- Mathilde Caron
- Arsha Nagrani
- Andrew Zisserman
- Cordelia Schmid