DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models (2303.11681v4)
Abstract: Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be generated freely with a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks for synthetic images generated by the off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the cross-attention map between text and image, which makes it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which are combined with practical techniques to create a novel high-resolution and class-discriminative pixel-wise mask. The method substantially reduces data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on DiffuMask's synthetic data can achieve performance competitive with their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask shows promising performance, close to the state-of-the-art result obtained with real data (within a 3% mIoU gap). Moreover, in the open-vocabulary (zero-shot) segmentation setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC 2012. The project website can be found at https://weijiawu.github.io/DiffusionMask/.
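To make the core idea concrete, the following is a minimal sketch of how per-word cross-attention maps might be turned into a coarse binary mask: maps from several resolutions are upsampled to a common size, averaged, normalized, and thresholded. It is an illustration only, not the DiffuMask implementation. The function names (`upsample_nearest`, `attention_to_mask`, `toy_map`), the fixed threshold of 0.5, and the toy inputs are all assumptions; the abstract states that additional "practical techniques" are used to obtain the final high-resolution mask, and those are not reproduced here.

```python
import numpy as np


def upsample_nearest(attn: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour upsample of a square (h, h) map to (size, size)."""
    h = attn.shape[0]
    assert attn.shape == (h, h) and size % h == 0, "expects square maps dividing `size`"
    factor = size // h
    # The Kronecker product with a block of ones repeats each cell factor x factor times.
    return np.kron(attn, np.ones((factor, factor)))


def attention_to_mask(attn_maps, size: int = 64, threshold: float = 0.5) -> np.ndarray:
    """Average multi-resolution attention maps for one word and binarize them.

    `threshold` is a hypothetical fixed value chosen for illustration.
    """
    resized = [upsample_nearest(a, size) for a in attn_maps]
    mean_map = np.mean(resized, axis=0)
    # Min-max normalize to [0, 1] so the threshold has a consistent meaning.
    lo, hi = mean_map.min(), mean_map.max()
    norm = (mean_map - lo) / (hi - lo + 1e-8)
    return (norm >= threshold).astype(np.uint8)


def toy_map(res: int) -> np.ndarray:
    """Synthetic attention map: high response in the central half of the image."""
    m = np.full((res, res), 0.1)
    q = res // 4
    m[q:3 * q, q:3 * q] = 0.9
    return m


# Toy stand-ins for the attention maps of a single class word at three resolutions.
maps = [toy_map(r) for r in (8, 16, 32)]
mask = attention_to_mask(maps, size=64)
```

In this toy setup the resulting `mask` marks the central square as foreground. With real attention maps, a single fixed threshold is unlikely to suit every image or class, which is one reason further refinement would be needed.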
Authors: Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen