
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval (2203.05465v1)

Published 10 Mar 2022 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently followed by a dot product, while the cross encoder jointly feeds image and text as the input and performs dense multi-modal fusion. These two architectures are typically modeled separately without interaction. In this work, we propose LoopITR, which combines them in the same network for joint learning. Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder. Both steps are efficiently performed together in the same model. Our work centers on empirical analyses of this combined architecture, putting the main focus on the design of the distillation objective. Our experimental results highlight the benefits of training the two encoders in the same network, and demonstrate that distillation can be quite effective with just a few hard negative examples. Experiments on two standard datasets (Flickr30K and COCO) show our approach achieves state-of-the-art dual encoder performance when compared with approaches using a similar amount of data.
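The abstract's training loop has two couplings: the dual encoder mines hard negatives for the cross encoder, and the cross encoder distills its scores back to the dual encoder. A minimal NumPy sketch of those two steps is below; it is an illustration under assumptions, not the paper's implementation (function names, the dot-product scoring, and the temperature-softened KL objective are choices made here for clarity).

```python
import numpy as np

def dual_encoder_scores(img_emb, txt_emb):
    # Dual encoder: images and texts are embedded independently,
    # then compared with a dot product.
    return img_emb @ txt_emb.T

def mine_hard_negatives(scores, k):
    # For each image, pick the k highest-scoring NON-matching texts
    # (the diagonal holds the positive pairs and is excluded).
    masked = scores.copy().astype(float)
    np.fill_diagonal(masked, -np.inf)
    return np.argsort(-masked, axis=1)[:, :k]

def distill_loss(dual_logits, cross_logits, temperature=2.0):
    # Distillation: KL divergence from the softened cross-encoder
    # predictions (teacher) to the dual-encoder predictions (student).
    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    teacher = softmax(cross_logits / temperature)
    student = softmax(dual_logits / temperature)
    kl = np.sum(teacher * (np.log(teacher) - np.log(student)), axis=1)
    return float(np.mean(kl))
```

In a full training step, the mined hard negatives would be paired with their images and scored by the cross encoder, whose logits then serve as the teacher signal in `distill_loss`; the sketch omits the encoders themselves.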

Authors (7)
  1. Jie Lei (52 papers)
  2. Xinlei Chen (106 papers)
  3. Ning Zhang (278 papers)
  4. Mengjiao Wang (15 papers)
  5. Mohit Bansal (304 papers)
  6. Tamara L. Berg (26 papers)
  7. Licheng Yu (47 papers)
Citations (11)