Contrastive Learning for Weakly Supervised Phrase Grounding (2006.09920v3)

Published 17 Jun 2020 in cs.CV, cs.CL, cs.LG, and stat.ML

Abstract: Phrase grounding, the problem of associating image regions with caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language-model-guided word substitutions. Training with our negatives yields a $\sim10\%$ absolute gain in accuracy over randomly sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$ accuracy on the Flickr30K Entities benchmark.
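To make the objective in the abstract concrete, here is a minimal NumPy sketch of an InfoNCE-style contrastive loss over attention-weighted regions. It is an illustration under stated assumptions, not the authors' implementation: the scaled dot-product attention, the dot-product compatibility score, and the names `attend`, `compatibility`, and `infonce_loss` are all hypothetical choices made for this example. Negative scores could come either from other images or from captions in which the word has been substituted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(word, regions):
    # word: (d,) word embedding; regions: (R, d) region features.
    # Assumed scaled dot-product attention of the word over the regions.
    weights = softmax(regions @ word / np.sqrt(word.shape[0]))
    return weights @ regions, weights

def compatibility(word, regions):
    # Assumed score: dot product of the word with its attended region feature.
    attended, _ = attend(word, regions)
    return float(attended @ word)

def infonce_loss(pos_score, neg_scores):
    # -log( exp(pos) / (exp(pos) + sum_k exp(neg_k)) ), computed stably.
    logits = np.concatenate([[pos_score], np.asarray(neg_scores, dtype=float)])
    logits = logits - logits.max()
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(5, 8))        # 5 region features of dimension 8
    word = rng.normal(size=8)                # embedding of a caption word
    substitutes = rng.normal(size=(3, 8))    # stand-ins for substituted words
    pos = compatibility(word, regions)
    negs = [compatibility(w, regions) for w in substitutes]
    print("loss:", infonce_loss(pos, negs))
    # At test time, the highest-attention region can serve as the grounding.
    print("grounded region:", int(np.argmax(attend(word, regions)[1])))
```

Minimizing this loss pushes the compatibility of the true word-image pair above that of the negatives. With $K$ negatives, $\log(K+1)$ minus the expected loss is the standard InfoNCE lower bound on mutual information, which is the sense in which such an objective relates to the bound mentioned in the abstract.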


Authors (6)

Tanmay Gupta (23 papers)
Arash Vahdat (69 papers)
Gal Chechik (110 papers)
Xiaodong Yang (101 papers)
Jan Kautz (215 papers)
Derek Hoiem (50 papers)

Citations (133)


Tweets

https://twitter.com/antoine_chaffin/status/1761903368379809889
