
Taming Self-Training for Open-Vocabulary Object Detection (2308.06412v3)

Published 11 Aug 2023 in cs.CV

Abstract: Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det that tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection head into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions, which stabilizes the training process. Extensive experiments demonstrate SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively. Code is available at \url{https://github.com/xiaofeng94/SAS-Det}.
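The periodic update strategy described in the abstract can be illustrated with a minimal sketch: instead of synchronizing the teacher with the student at every iteration (as common EMA-style self-training does), the teacher is refreshed only once every fixed number of steps, so the pseudo-label distribution stays constant in between. The class below is an assumption-laden illustration of that idea, not the paper's actual implementation; the `period` value and the plain weight-copy update are hypothetical choices.

```python
import copy

import torch
import torch.nn as nn


class PeriodicTeacher:
    """Illustrative sketch of periodic teacher updates in self-training.

    The teacher is a frozen copy of the student that only absorbs the
    student's weights every `period` iterations, so the distribution of
    pseudo labels it produces changes infrequently.
    """

    def __init__(self, student: nn.Module, period: int = 1000):
        self.student = student
        self.teacher = copy.deepcopy(student).eval()
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.period = period
        self.step_count = 0

    def step(self) -> None:
        """Call once per training iteration, after the student update."""
        self.step_count += 1
        if self.step_count % self.period == 0:
            # Pseudo-label distributions only change at these sync points.
            self.teacher.load_state_dict(self.student.state_dict())

    @torch.no_grad()
    def pseudo_labels(self, images: torch.Tensor) -> torch.Tensor:
        # The frozen teacher generates pseudo labels between updates.
        return self.teacher(images)
```

A typical training loop would call `pseudo_labels` to supervise the student on unlabeled images, take an optimizer step on the student, then call `step()`; with a large `period`, most iterations see an unchanged teacher, which is the stabilizing effect the paper attributes to this strategy.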

Authors (8)
  1. Shiyu Zhao (55 papers)
  2. Samuel Schulter (32 papers)
  3. Long Zhao (64 papers)
  4. Zhixing Zhang (14 papers)
  5. Yumin Suh (16 papers)
  6. Manmohan Chandraker (108 papers)
  7. Dimitris N. Metaxas (84 papers)
  8. Vijay Kumar B. G (4 papers)
Citations (6)
