InstaGen: Enhancing Object Detection by Training on Synthetic Dataset (2402.05937v3)
Abstract: In this paper, we present a novel paradigm to enhance the ability of an object detector, e.g., expanding its categories or improving its detection performance, by training on a synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained generative diffusion model to augment it with the ability to localise instances in the generated images. The grounding head is trained to align the text embeddings of category names with the regional visual features of the diffusion model, using supervision from an off-the-shelf object detector and a novel self-training scheme on (novel) categories not covered by the detector. We conduct thorough experiments to show that this enhanced version of the diffusion model, termed InstaGen, can serve as a data synthesizer that enhances object detectors trained on its generated samples, demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios. Project page with code: https://fcjian.github.io/InstaGen.
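The alignment between category text embeddings and regional visual features described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`alignment_scores`, `assign_categories`), the cosine-similarity-plus-sigmoid scoring, and the `temperature` and `threshold` values are all assumptions chosen for illustration, and the toy vectors stand in for real diffusion-model features and text-encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_scores(region_features, text_embeddings, temperature=0.1):
    """Score every region against every category name.

    Hypothetical scoring rule: cosine similarity divided by a temperature,
    squashed by a sigmoid. Returns a [num_regions][num_categories] list.
    """
    regions = [l2_normalize(r) for r in region_features]
    texts = [l2_normalize(t) for t in text_embeddings]
    scores = []
    for r in regions:
        row = []
        for t in texts:
            cosine = sum(a * b for a, b in zip(r, t))
            row.append(1.0 / (1.0 + math.exp(-cosine / temperature)))
        scores.append(row)
    return scores


def assign_categories(scores, category_names, threshold=0.5):
    """Pick the best-scoring category per region, or None if it does not exceed the threshold."""
    labels = []
    for row in scores:
        best = max(range(len(row)), key=lambda i: row[i])
        labels.append(category_names[best] if row[best] > threshold else None)
    return labels


# Toy example: two regions, two categories (all values are made up).
names = ["dog", "cat"]
text_emb = [[1.0, 0.0], [0.0, 1.0]]
region_feat = [[2.0, 0.0], [0.0, 0.0]]
scores = alignment_scores(region_feat, text_emb)
labels = assign_categories(scores, names)
```

In this toy setup the first region points in the same direction as the "dog" embedding and is labelled accordingly, while the all-zero second region scores exactly 0.5 for every category and is left unlabelled. In the paper's setting, supervision for such scores would come from an off-the-shelf detector on the categories it covers and from self-training on the novel ones.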