A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future (2307.09220v2)
Abstract: As two of the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Because manual labeling is expensive, the annotated categories in existing datasets are typically small in number and pre-defined, so state-of-the-art fully-supervised detectors and segmenters fail to generalize beyond their closed vocabulary. To address this limitation, the community has in recent years paid increasing attention to Open-Vocabulary Detection (OVD) and Segmentation (OVS). By "open-vocabulary", we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review of recent developments in OVD and OVS. A taxonomy is first developed to organize the different tasks and methodologies. We find that whether and how weak supervision signals are used effectively discriminates the methodologies, which include visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across tasks, covering object detection, semantic/instance/panoptic segmentation, and 3D and video understanding. The main design principles, key challenges, development routes, and methodological strengths and weaknesses are thoroughly analyzed. In addition, we benchmark each task, along with the vital components of each method, in the appendix, with an up-to-date version maintained online at https://github.com/seanzhuh/awesome-open-vocabulary-detection-and-segmentation. Finally, several promising directions are discussed to stimulate future research.
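To make the core idea behind these methods concrete, the sketch below illustrates the visual-semantic space mapping that most OVD/OVS approaches build on: instead of a fixed classifier head, region (or pixel/mask) features are scored against text embeddings of an arbitrary category list, so the vocabulary can be extended at inference time. This is a minimal, hypothetical illustration rather than code from any surveyed method; `classify_regions`, the temperature value, and the random stand-in features are assumptions, and in practice the embeddings would come from a pre-trained vision-language model such as CLIP.

```python
# Minimal sketch of open-vocabulary classification via visual-semantic
# space mapping: region features are compared against text embeddings of
# an arbitrary (open) category list instead of a fixed classifier head.
import torch
import torch.nn.functional as F


def classify_regions(region_feats: torch.Tensor,
                     text_embeds: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """Score each region feature against every category text embedding.

    region_feats: (num_regions, dim) visual embeddings of detected regions.
    text_embeds:  (num_classes, dim) embeddings of category prompts,
                  e.g. "a photo of a {class}".
    Returns:      (num_regions, num_classes) class probabilities.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Cosine similarity between every region and every category prompt.
    logits = region_feats @ text_embeds.t() / temperature
    return logits.softmax(dim=-1)


if __name__ == "__main__":
    # Random stand-in features; a real pipeline would use RoI-pooled proposal
    # features and a text encoder's prompt embeddings (both hypothetical here).
    dim = 512
    regions = torch.randn(5, dim)
    vocabulary = ["cat", "dog", "zebra"]  # any names, including novel classes
    prompts = torch.randn(len(vocabulary), dim)
    probs = classify_regions(regions, prompts)
    print(probs.shape)  # torch.Size([5, 3])
```

Because the category list is just a set of strings, novel classes can be added at inference time without retraining; the surveyed methods differ mainly in which weak supervision signals they use to align the visual and textual embedding spaces.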
Chaoyang Zhu
Long Chen