Generating Images with 3D Annotations Using Diffusion Models (2306.08103v4)
Abstract: Diffusion models have emerged as a powerful generative method, capable of producing stunning photo-realistic images from natural language descriptions. However, these models lack explicit control over the 3D structure in the generated images. Consequently, this hinders our ability to obtain detailed 3D annotations for the generated images or to craft instances with specific poses and distances. In this paper, we propose 3D Diffusion Style Transfer (3D-DST), which incorporates 3D geometry control into diffusion models. Our method exploits ControlNet, which extends diffusion models by using visual prompts in addition to text prompts. We take 3D objects from 3D shape repositories (e.g., ShapeNet and Objaverse), render them from a variety of poses and viewing directions, compute the edge maps of the rendered images, and use these edge maps as visual prompts to generate realistic images. With explicit 3D geometry control, we can easily change the 3D structures of the objects in the generated images and obtain ground-truth 3D annotations automatically. This allows us to improve a wide range of vision tasks, e.g., classification and 3D pose estimation, in both in-distribution (ID) and out-of-distribution (OOD) settings. We demonstrate the effectiveness of our method through extensive experiments on ImageNet-100/200, ImageNet-R, PASCAL3D+, ObjectNet3D, and OOD-CV. The results show that our method significantly outperforms existing methods, e.g., by 3.8 percentage points on ImageNet-100 using DeiT-B.
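The data flow described in the abstract (sample a pose, render, compute an edge map, keep the pose as the annotation) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in and not the authors' code: `toy_render` replaces a real renderer and ignores the viewpoint, `edge_map` is a plain gradient threshold used in place of a proper edge detector, the sampling ranges are arbitrary, and the ControlNet generation step is omitted entirely.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Viewpoint:
    azimuth: float    # radians, rotation around the vertical axis
    elevation: float  # radians, angle above the horizontal plane
    distance: float   # camera-to-object distance, arbitrary units


def sample_viewpoint(rng):
    """Draw a random camera pose; the ranges are illustrative assumptions."""
    return Viewpoint(
        azimuth=rng.uniform(0.0, 2.0 * math.pi),
        elevation=rng.uniform(-math.pi / 6, math.pi / 3),
        distance=rng.uniform(2.0, 6.0),
    )


def camera_position(view):
    """Camera location on a sphere around an object at the origin (z up)."""
    horizontal = view.distance * math.cos(view.elevation)
    return (
        horizontal * math.cos(view.azimuth),
        horizontal * math.sin(view.azimuth),
        view.distance * math.sin(view.elevation),
    )


def toy_render(size=8):
    """Placeholder for a real renderer: a bright square on a dark background."""
    lo, hi = size // 4, size - size // 4
    return [
        [1.0 if lo <= y < hi and lo <= x < hi else 0.0 for x in range(size)]
        for y in range(size)
    ]


def edge_map(image, threshold):
    """Binary edge map from central-difference gradient magnitude.

    A simplified stand-in for a real edge detector; border pixels stay 0.
    """
    h, w = len(image), len(image[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            if math.hypot(gx, gy) >= threshold:
                edges[y][x] = 1
    return edges


def make_sample(rng):
    """Pair a visual prompt (edge map) with the pose that produced it."""
    view = sample_viewpoint(rng)
    image = toy_render()  # a real renderer would take `view` as input
    return {"visual_prompt": edge_map(image, 0.5), "annotation": view}
```

Because the pose is chosen before rendering, the annotation comes for free: whatever image is later generated from the edge map inherits the sampled azimuth, elevation, and distance as its label.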