
DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection (2310.01393v3)

Published 2 Oct 2023 in cs.CV

Abstract: Open-vocabulary object detection (OVOD) aims to detect objects beyond the set of classes observed during training. This work introduces a straightforward and efficient strategy that uses pre-trained vision-language models (VLMs), such as CLIP, to identify potential novel classes through zero-shot classification. Previous methods use a class-agnostic region proposal network to generate object proposals and treat every proposal that does not match the ground truth as background. In contrast, our method selects a subset of the proposals that would otherwise be considered background and treats them as novel classes during training. We refer to this approach as the self-training strategy; it improves recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training. Compared to previous pseudo-labeling methods, our approach requires neither re-training nor offline labeling, which makes it more efficient and effective in one-shot training. Empirical evaluations on three datasets, LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline without additional parameters or computational cost during inference. We also apply our method to various baselines. In particular, compared with the previous method F-VLM, our method achieves a 1.7% improvement on the LVIS dataset. Combined with the recent method CLIPSelf, it achieves 46.7 novel class AP on COCO without introducing extra data for pretraining. We also achieve an improvement of over 6.5% over the F-VLM baseline on the recent, challenging V3Det dataset. We release our code and models at https://github.com/xushilin1/dst-det.
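The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pseudo_label_background`, the temperature, the score threshold, and the toy 2-D embeddings are all assumptions made for illustration. It only shows the general idea of scoring background proposals against candidate class text embeddings (as a CLIP-style model would) and keeping confident matches as pseudo-labels during training.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label_background(region_embs, is_background, class_text_embs,
                            temperature=0.1, score_thresh=0.8):
    """Assign candidate-class pseudo-labels to confident background proposals.

    region_embs:     one embedding per proposal (hypothetical VLM region features)
    is_background:   True where the proposal matched no ground-truth box
    class_text_embs: one text embedding per candidate novel class
    temperature, score_thresh: illustrative values, not taken from the paper
    Returns {proposal_index: class_index} for the selected proposals.
    """
    labels = {}
    for idx, (emb, bg) in enumerate(zip(region_embs, is_background)):
        if not bg:
            continue  # proposals matched to ground truth keep their labels
        logits = [cosine(emb, t) / temperature for t in class_text_embs]
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= score_thresh:
            labels[idx] = best  # treated as a novel-class sample, not background
    return labels

# Toy example: four proposals, two candidate classes.
region_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [1.0, 0.05]]
is_background = [False, True, True, True]
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
labels = pseudo_label_background(region_embs, is_background, class_text_embs)
```

In this toy run, the ambiguous proposal at index 2 scores equally on both classes and stays background, while the proposals at indices 1 and 3 receive pseudo-labels. Because the selection happens inside the training loop, a scheme of this kind needs no separate offline labeling pass.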

Authors (6)
  1. Shilin Xu (17 papers)
  2. Xiangtai Li (128 papers)
  3. Size Wu (12 papers)
  4. Wenwei Zhang (77 papers)
  5. Yunhai Tong (69 papers)
  6. Chen Change Loy (288 papers)
Citations (12)
