Retrieval-Augmented Open-Vocabulary Object Detection (2404.05687v1)
Abstract: Open-vocabulary object detection (OVD) has been studied with vision-language models (VLMs) to detect novel objects beyond the pre-trained categories. Previous approaches improve generalization by expanding the detector's knowledge, using 'positive' pseudo-labels with additional 'class' names, e.g., sock, iPod, and alligator. To extend the previous methods in two aspects, we propose Retrieval-Augmented Losses and visual Features (RALF). Our method retrieves related 'negative' classes and augments loss functions. Also, visual features are augmented with 'verbalized concepts' of classes, e.g., worn on the feet, handheld music player, and sharp teeth. Specifically, RALF consists of two modules: Retrieval-Augmented Losses (RAL) and Retrieval-Augmented visual Features (RAF). RAL comprises two losses reflecting the semantic similarity with negative vocabularies. In addition, RAF augments visual features with the verbalized concepts from a large language model (LLM). Our experiments demonstrate the effectiveness of RALF on the COCO and LVIS benchmark datasets. We achieve improvements of up to 3.4 box AP$_{50}^{\text{N}}$ on novel categories of the COCO dataset and gains of 3.6 mask AP$_{\text{r}}$ on the LVIS dataset. Code is available at https://github.com/mlvlab/RALF .
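The idea of retrieving related 'negative' classes and folding them into a loss can be illustrated with a small sketch. This is not the RALF implementation: the toy three-dimensional embeddings, the function names `retrieve_negatives` and `negative_margin_loss`, and the generic hinge-style margin loss are all assumptions chosen for illustration. In practice the embeddings would come from a VLM text encoder, and the actual losses are defined in the paper and the linked repository.

```python
import math

# Hypothetical hand-made embeddings; a real system would use a VLM text encoder.
TOY_VOCAB = {
    "sock": [0.9, 0.1, 0.0],
    "shoe": [0.8, 0.2, 0.1],
    "glove": [0.7, 0.3, 0.0],
    "alligator": [0.0, 0.1, 0.9],
    "crocodile": [0.1, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_negatives(query, vocab, k=2):
    """Return the k vocabulary names most similar to `query`, excluding `query`."""
    scored = [
        (cosine(vocab[query], emb), name)
        for name, emb in vocab.items()
        if name != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


def negative_margin_loss(region, positive, negatives, margin=0.2):
    """Generic hinge-style loss (an assumption, not the paper's loss):
    penalize each negative whose similarity to the region feature comes
    within `margin` of the positive similarity, averaged over negatives."""
    pos_sim = cosine(region, positive)
    penalties = [max(0.0, margin + cosine(region, n) - pos_sim) for n in negatives]
    return sum(penalties) / len(penalties)


hard_negatives = retrieve_negatives("sock", TOY_VOCAB, k=2)
loss = negative_margin_loss(
    TOY_VOCAB["sock"],  # stand-in for a region feature
    TOY_VOCAB["sock"],  # positive class embedding
    [TOY_VOCAB[name] for name in hard_negatives],
)
```

With these toy vectors, semantically close names such as "shoe" are retrieved for "sock" and contribute a non-zero penalty, while a distant name such as "alligator" would contribute none.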
Authors:
- Jooyeon Kim
- Eulrang Cho
- Sehyung Kim
- Hyunwoo J. Kim