Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization (2403.09433v1)
Abstract: Classical object detectors are incapable of detecting novel class objects that are not encountered before. Regarding this issue, Open-Vocabulary Object Detection (OVOD) is proposed, which aims to detect the objects in the candidate class list. However, current OVOD models are suffering from overfitting on the base classes, heavily relying on the large-scale extra data, and complex training process. To overcome these issues, we propose a novel framework with Meta prompt and Instance Contrastive learning (MIC) schemes. Firstly, we simulate a novel-class-emerging scenario to help the prompt learner that learns class and background prompts generalize to novel classes. Secondly, we design an instance-level contrastive strategy to promote intra-class compactness and inter-class separation, which benefits generalization of the detector to novel class objects. Without using knowledge distillation, ensemble model or extra training data during detector training, our proposed MIC outperforms previous SOTA methods trained with these complex techniques on LVIS. Most importantly, MIC shows great generalization ability on novel classes, e.g., with $+4.3\%$ and $+1.9\% \ \mathrm{AP}$ improvement compared with previous SOTA on COCO and Objects365, respectively.
- Bridging the gap between object and image-level representations for open-vocabulary detection. Advances in Neural Information Processing Systems, 35:33781–33794, 2022.
- Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
- End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.
- A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
- Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- Learning to prompt for open-vocabulary object detection with vision-language model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
- Promptdet: Towards open-vocabulary detection using uncurated images. In Proceedings of the European Conference on Computer Vision, 2022.
- Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2918–2928, 2021.
- Semi-supervised learning by entropy minimization. Advances in neural information processing systems, 17, 2004.
- Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
- Open-vocabulary object detection via vision and language knowledge distillation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=lL3lnMbR4WU.
- Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.
- Ow-detr: Open-world detection transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9235–9244, 2022.
- Expanding low-density latent regions for open-set object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9591–9600, 2022.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
- Robust region feature synthesizer for zero-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7622–7631, 2022.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- Towards open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5830–5840, 2021.
- Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.
- Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
- Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017a.
- Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017b.
- Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3651–3660, 2021.
- Simple open-vocabulary object detection with vision transformers. Proceedings of the European Conference on Computer Vision, 2022.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
- Imagenet-21k pretraining for the masses. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
- Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- Self-supervised object localization with joint graph partition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2289–2297, 2022.
- Fsce: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7352–7362, 2021.
- Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10781–10790, 2020.
- Proposal learning for semi-supervised object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2291–2301, 2021.
- Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9627–9636, 2019.
- Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
- C2am loss: Chasing a better decision boundary for long-tail object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6980–6989, 2022.
- Detectron2. https://github.com/facebookresearch/detectron2, 2019.
- Open-vocabulary detr with conditional matching. In European Conference on Computer Vision, 2022.
- Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022.
- Learning with local and global consistency. Advances in neural information processing systems, 16, 2003.
- Learning to prompt for vision-language models. International Journal of Computer Vision, 2022a.
- Conditional prompt learning for vision-language models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022b.
- Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461, 2021.
- Detecting twenty-thousand classes using image-level supervision. European Conference on Computer Vision, 2022c.
- Deformable {detr}: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=gZ9hCDWe6ke.
- Zhao Wang (154 papers)
- Aoxue Li (22 papers)
- Fengwei Zhou (21 papers)
- Zhenguo Li (195 papers)
- Qi Dou (163 papers)