Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models (2410.19635v1)

Published 25 Oct 2024 in cs.CV

Abstract: Recent vision foundation models can extract universal representations and show impressive abilities across various tasks. However, their application to object detection is largely overlooked, especially without fine-tuning them. In this work, we show that frozen foundation models can serve as versatile feature enhancers, even though they are not pre-trained for object detection. Specifically, we explore directly transferring the high-level image understanding of foundation models to detectors in the following two ways. First, the class token in foundation models provides an in-depth understanding of the complex scene, which facilitates decoding object queries in the detector's decoder by providing a compact context. Second, the patch tokens in foundation models can enrich the features in the detector's encoder by providing semantic details. Utilizing frozen foundation models as plug-and-play modules, rather than as the commonly used backbone, can significantly enhance the detector's performance while avoiding the problems caused by the architecture discrepancy between the detector's backbone and the foundation model. With this novel paradigm, we boost the SOTA query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP) and further to 53.8% AP (+4.8% AP) by integrating one or two foundation models, respectively, on the COCO validation set after training for 12 epochs with R50 as the detector's backbone.
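
The two injection paths described in the abstract (class token as a compact scene context for the decoder's object queries, patch tokens as semantic enrichment for the encoder's features) can be pictured with a short PyTorch sketch. This is an illustration under assumptions, not the paper's implementation: the FrozenFoundationEnhancer module, its projection layers, and the simple cross-attention/additive fusion below are hypothetical stand-ins for the mechanisms the abstract describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenFoundationEnhancer(nn.Module):
    """Illustrative adapter that injects a frozen ViT's class token into the
    decoder queries and its patch tokens into the encoder feature map.
    All module and parameter names here are hypothetical stand-ins."""

    def __init__(self, vit_dim: int = 768, det_dim: int = 256):
        super().__init__()
        # Lightweight projections from the foundation model's width to the detector's.
        self.cls_proj = nn.Linear(vit_dim, det_dim)
        self.patch_proj = nn.Linear(vit_dim, det_dim)
        # Cross-attention that lets object queries read the compact scene context.
        self.context_attn = nn.MultiheadAttention(det_dim, num_heads=8, batch_first=True)

    def enrich_encoder_features(self, feat, patch_tokens, grid_hw):
        # feat:         (B, C, H, W) feature map from the detector's backbone/encoder
        # patch_tokens: (B, N, vit_dim) frozen ViT patch tokens, N = gh * gw
        # grid_hw:      (gh, gw) patch grid of the frozen ViT
        B, C, H, W = feat.shape
        gh, gw = grid_hw
        sem = self.patch_proj(patch_tokens)                 # (B, N, C)
        sem = sem.transpose(1, 2).reshape(B, C, gh, gw)     # (B, C, gh, gw)
        sem = F.interpolate(sem, size=(H, W), mode="bilinear", align_corners=False)
        return feat + sem                                   # add semantic details

    def contextualize_queries(self, queries, cls_token):
        # queries:   (B, Q, C) object queries entering a decoder layer
        # cls_token: (B, 1, vit_dim) frozen ViT class token summarizing the scene
        ctx = self.cls_proj(cls_token)                      # (B, 1, C)
        attended, _ = self.context_attn(queries, ctx, ctx)  # queries attend to scene context
        return queries + attended


# Usage sketch with random tensors standing in for real features and tokens.
if __name__ == "__main__":
    enhancer = FrozenFoundationEnhancer()
    feat = torch.randn(2, 256, 50, 67)           # detector encoder feature map
    patch_tokens = torch.randn(2, 16 * 16, 768)  # frozen ViT patch tokens (16x16 grid)
    cls_token = torch.randn(2, 1, 768)           # frozen ViT class token
    queries = torch.randn(2, 900, 256)           # DINO-style object queries

    feat = enhancer.enrich_encoder_features(feat, patch_tokens, (16, 16))
    queries = enhancer.contextualize_queries(queries, cls_token)
    print(feat.shape, queries.shape)             # (2, 256, 50, 67) and (2, 900, 256)
```

Because the foundation model stays frozen, only the small projections and the fusion attention in a sketch like this would be trained alongside the detector, which is what makes the plug-and-play usage cheap and avoids tying the detector's backbone architecture to the foundation model's.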
