Finding beans in burgers: Deep semantic-visual embedding with localization (1804.01720v2)

Published 5 Apr 2018 in cs.CV, cs.CL, and cs.LG

Abstract: Several works have proposed to learn a two-path neural network that maps images and texts, respectively, into the same shared Euclidean space, where geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path that is jointly trained from scratch, our semantic-visual embedding offers a versatile model. Once trained under the supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the localization of new concepts from the embedding space into any input image, delivering state-of-the-art results on the visual grounding of phrases.
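The cross-modal retrieval setup described above is typically trained with a hinge-based triplet ranking loss over cosine similarities in the shared space. Below is a minimal sketch of that objective, assuming L2-normalized image and caption embeddings and a sum-over-violations formulation; function names, the margin value, and the loss variant are illustrative, not the paper's exact recipe.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional ranking loss for a batch of N matched (image, caption) pairs.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matching pair.
    Penalizes any non-matching pair whose similarity comes within `margin`
    of the matching pair's similarity, in both retrieval directions.
    """
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T  # (N, N) cosine matrix
    pos = np.diag(sim)                                     # matched-pair similarities
    # Image as query, captions ranked (rows) / caption as query, images ranked (cols)
    cost_s = np.maximum(0.0, margin + sim - pos[:, None])
    cost_im = np.maximum(0.0, margin + sim - pos[None, :])
    np.fill_diagonal(cost_s, 0.0)   # matched pairs incur no cost
    np.fill_diagonal(cost_im, 0.0)
    return cost_s.sum() + cost_im.sum()
```

With perfectly aligned, mutually orthogonal embeddings the loss is zero; any misalignment between the two modalities produces a positive penalty that gradient descent can reduce.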

Authors (4)
  1. Martin Engilberge (10 papers)
  2. Louis Chevallier (8 papers)
  3. Matthieu Cord (129 papers)
  4. Patrick Pérez (90 papers)
Citations (93)