RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition (2403.13805v1)

Published 20 Mar 2024 in cs.CV, cs.AI, and cs.LG

Abstract: CLIP (Contrastive Language-Image Pre-training) uses contrastive learning on noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal LLMs (MLLMs) excel at classifying fine-grained categories, thanks to the substantial knowledge gained from pre-training on web-scale corpora. However, the performance of MLLMs declines as the number of categories increases, primarily due to growing complexity and the constraints of a limited context window. To synergize the strengths of both approaches and enhance few-shot/zero-shot recognition on datasets with extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k most similar results from the memory and uses MLLMs to rank them and make the final prediction. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and 2 object detection datasets under the zero-shot recognition setting.
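The retrieve-then-rank pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding vectors stand in for CLIP features, and `mllm_rank_fn` is a hypothetical placeholder for the MLLM call that reorders the retrieved candidates.

```python
import numpy as np

def build_memory(category_embeddings):
    """Store L2-normalized category embeddings as explicit memory.

    category_embeddings: dict mapping category name -> feature vector.
    In the paper these come from a CLIP-based multi-modal retriever;
    here they are plain arrays for illustration.
    """
    names = list(category_embeddings)
    mat = np.stack([category_embeddings[n] for n in names])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return names, mat

def retrieve_top_k(query, names, memory, k=3):
    """Return the k categories most similar to the query by cosine similarity."""
    q = query / np.linalg.norm(query)
    scores = memory @ q                        # cosine similarity per category
    top = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in top]

def rank_with_mllm(image, candidates, mllm_rank_fn):
    """Ask an MLLM to reorder the retrieved candidates and pick the best.

    mllm_rank_fn(image, candidate_names) -> names in preference order
    is a hypothetical stand-in for the actual MLLM ranking call.
    """
    ranked = mllm_rank_fn(image, [name for name, _ in candidates])
    return ranked[0]
```

The key design point is that the memory can hold far more categories than fit in the MLLM's context window: the cheap CLIP retrieval narrows the vocabulary to top-k candidates, and only that short list is passed to the MLLM for fine-grained ranking.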

Authors (9)
  1. Ziyu Liu (47 papers)
  2. Zeyi Sun (16 papers)
  3. Yuhang Zang (54 papers)
  4. Wei Li (1121 papers)
  5. Pan Zhang (153 papers)
  6. Xiaoyi Dong (73 papers)
  7. Yuanjun Xiong (52 papers)
  8. Dahua Lin (336 papers)
  9. Jiaqi Wang (218 papers)
Citations (8)