Finding beans in burgers: Deep semantic-visual embedding with localization (1804.01720v2)

Published 5 Apr 2018 in cs.CV, cs.CL, and cs.LG

Abstract: Several works have proposed to learn a two-path neural network that maps images and texts, respectively, into the same shared Euclidean space, where geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path that is jointly trained from scratch, our semantic-visual embedding offers a versatile model. Once trained under the supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the localization of new concepts from the embedding space into any input image, delivering state-of-the-art results on the visual grounding of phrases.
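The cross-modal retrieval setup described above is typically trained with a hinge-based triplet ranking loss over cosine similarities in the shared space. Below is a minimal sketch of that objective, assuming L2-normalized image and caption embeddings and a sum-over-violations formulation; function names, the margin value, and the loss variant are illustrative, not the paper's exact recipe.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional ranking loss for a batch of N matched (image, caption) pairs.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matching pair.
    Penalizes any non-matching pair whose similarity comes within `margin`
    of the matching pair's similarity, in both retrieval directions.
    """
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T  # (N, N) cosine matrix
    pos = np.diag(sim)                                     # matched-pair similarities
    # Image as query, captions ranked (rows) / caption as query, images ranked (cols)
    cost_s = np.maximum(0.0, margin + sim - pos[:, None])
    cost_im = np.maximum(0.0, margin + sim - pos[None, :])
    np.fill_diagonal(cost_s, 0.0)   # matched pairs incur no cost
    np.fill_diagonal(cost_im, 0.0)
    return cost_s.sum() + cost_im.sum()
```

With perfectly aligned, mutually orthogonal embeddings the loss is zero; any misalignment between the two modalities produces a positive penalty that gradient descent can reduce.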

Authors (4)
  1. Martin Engilberge (10 papers)
  2. Louis Chevallier (8 papers)
  3. Matthieu Cord (129 papers)
  4. Patrick Pérez (90 papers)
Citations (93)