
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset (2402.05937v3)

Published 8 Feb 2024 in cs.CV

Abstract: In this paper, we present a novel paradigm to enhance the ability of object detector, e.g., expanding categories or improving detection performance, by training on synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained, generative diffusion model, to augment it with the ability of localising instances in the generated images. The grounding head is trained to align the text embedding of category names with the regional visual feature of the diffusion model, using supervision from an off-the-shelf object detector, and a novel self-training scheme on (novel) categories not covered by the detector. We conduct thorough experiments to show that, this enhanced version of diffusion model, termed as InstaGen, can serve as a data synthesizer, to enhance object detectors by training on its generated samples, demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios. Project page with code: https://fcjian.github.io/InstaGen.


Summary

  • The paper introduces InstaGen, a framework that generates synthetic images with precise instance-level annotations for object detection.
  • It employs a two-step training strategy that combines detector supervision with self-training to enhance performance in challenging scenarios.
  • Empirical results demonstrate notable improvements, including a +4.5 AP boost in open-vocabulary detection and gains in cross-dataset transfer.

Introduction to InstaGen: A Paradigm Shift in Object Detection Training

Object detection has traditionally depended on large-scale, meticulously annotated datasets. Building these datasets, with their exhaustive bounding-box annotations and category labels, is costly, which limits both scalability and adaptability. InstaGen is a framework designed to overcome these limitations by training object detectors on synthetic datasets generated from diffusion models. This approach both broadens the set of detectable categories and improves detection performance, especially in scenarios where real data is scarce.

The Genesis of InstaGen

InstaGen starts from the observation that contemporary text-to-image diffusion models, despite their success at generating photorealistic images, do not provide the instance-level annotations that object detection training requires. InstaGen closes this gap by augmenting a pre-trained diffusion model to produce not just images but also precise instance-level bounding boxes.

The core innovation of InstaGen is its instance grounding module, which localises arbitrary objects in the generated images by aligning the text embeddings of category names with the regional visual features produced by the diffusion model. Training follows a two-step strategy: supervision from an existing object detector on the categories it covers, followed by a self-training scheme for categories beyond the detector's scope.
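The alignment step described above can be sketched as a cosine-similarity match between category text embeddings and regional visual features. The function and dimensions below are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

def ground_instances(region_feats, text_embeds, region_boxes, threshold=0.5):
    """Assign category labels to candidate regions by aligning regional
    visual features with category text embeddings (illustrative sketch).

    region_feats: (R, D) visual features for candidate regions
    text_embeds:  (C, D) text embeddings of category names
    region_boxes: (R, 4) candidate boxes, one per region
    Returns a list of (box, category_index, score) for confident matches.
    """
    # L2-normalise both sides so the dot product is a cosine similarity
    rf = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    te = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = rf @ te.T                       # (R, C) region-category similarity
    best = sim.argmax(axis=1)             # best-matching category per region
    scores = sim[np.arange(len(rf)), best]
    keep = scores >= threshold            # keep only confident groundings
    return [(region_boxes[i], int(best[i]), float(scores[i]))
            for i in np.where(keep)[0]]

# Toy example: two regions, two categories, in a shared 4-d embedding space
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
texts = np.array([[1.0, 0.0, 0.0, 0.0],   # category 0 matches region 0
                  [0.0, 0.0, 1.0, 0.0]])  # category 1 matches neither region
boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20]])
matches = ground_instances(feats, texts, boxes)
print(matches)  # one confident grounding: region 0 -> category 0
```

In the full system the similarity scores would be supervised against boxes from an off-the-shelf detector (for base categories) or filtered pseudo-labels (for novel ones); the thresholding here stands in for that machinery.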

The Impact of InstaGen

Experiments show that object detectors trained on InstaGen's synthetic data improve markedly across several benchmarks:

  • Open-Vocabulary Detection: +4.5 Average Precision (AP), expanding the set of detectable categories.
  • Data-Sparse Scenarios: +1.2 to +5.2 AP when real-world training data is limited.
  • Cross-Dataset Transfer: +0.5 to +1.1 AP, indicating adaptability across datasets.

These results, particularly in open-vocabulary and data-sparse detection, suggest that synthetic data can substantially mitigate the data-scarcity and annotation bottlenecks that constrain object detection.

The Technical Design of InstaGen

At the heart of InstaGen is the instance grounding head, which predicts bounding boxes by combining visual features from the image synthesizer with text embeddings of category names. Training is driven by triplets of visual features, bounding boxes, and text prompts, enabling the generation of synthetic datasets that are both diverse and complex.
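The triplet-building step, including the self-training filter for novel categories, can be sketched as follows. The function name, dictionary fields, and the 0.8 score threshold are all illustrative assumptions; InstaGen's actual self-training scheme involves more machinery than this:

```python
import numpy as np

def build_training_triplets(visual_feats, pseudo_boxes, pseudo_scores,
                            prompts, score_thresh=0.8):
    """Form (visual features, box, prompt) training triplets.

    For base categories the boxes come from an off-the-shelf detector;
    for novel categories they are self-training pseudo-labels, so only
    those whose confidence clears score_thresh are retained.
    """
    triplets = []
    for feat, box, score, prompt in zip(visual_feats, pseudo_boxes,
                                        pseudo_scores, prompts):
        if score >= score_thresh:   # discard low-confidence pseudo-labels
            triplets.append({"features": feat, "box": box, "prompt": prompt})
    return triplets

# Toy example: two candidate pseudo-labels, only one confident enough
feats = [np.zeros(4), np.ones(4)]
boxes = [[0, 0, 8, 8], [2, 2, 6, 6]]
scores = [0.95, 0.40]                      # second pseudo-box is unreliable
prompts = ["a photo of a zebra", "a photo of a giraffe"]
triplets = build_training_triplets(feats, boxes, scores, prompts)
print(len(triplets))  # 1 triplet survives the confidence filter
```

Filtering by confidence is the standard guard in self-training pipelines: it trades recall on novel categories for label quality, which matters when the resulting triplets are used as detection supervision.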

The Road Ahead

InstaGen offers a practical answer to long-standing challenges in object detection and opens opportunities for future research. Its ability to generate diverse, high-quality synthetic datasets on demand removes one of the traditional constraints on advancing detection systems.

Beyond the immediate technical gains, InstaGen points toward more scalable, efficient, and sustainable training methodologies for object detection, and more broadly toward synthetic data as a resource for training vision models.

In summary, InstaGen is more than an incremental step for object detection: it is a concrete demonstration that synthetic data can overcome the data-related hurdles that have long constrained progress in this domain.