UniGS: Unified Representation for Image Generation and Segmentation (2312.01985v1)
Abstract: This paper introduces a novel unified representation of diffusion models for image generation and segmentation. Specifically, we use a colormap to represent entity-level masks, which handles a varying number of entities while keeping the representation close to the image RGB domain. Two novel modules, the location-aware color palette and the progressive dichotomy module, are proposed to support our mask representation. On the one hand, the location-aware palette guarantees that colors are consistent with the entities' locations. On the other hand, the progressive dichotomy module efficiently decodes the synthesized colormap into high-quality entity-level masks through a depth-first binary search, without knowing the number of clusters in advance. To tackle the lack of large-scale segmentation training data, we employ an inpainting pipeline, which in turn improves the flexibility of diffusion models across various tasks, including inpainting, image synthesis, referring segmentation, and entity segmentation. Comprehensive experiments validate the efficiency of our approach, demonstrating segmentation mask quality comparable to the state of the art and adaptability to multiple tasks. The code will be released at https://github.com/qqlu/Entity.
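To make the decoding idea concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes the colormap is an H×W×3 float array, that a pixel group counts as a single entity once all of its colors lie within a hypothetical tolerance `tol` of the group mean, and that each binary split is done with a small 2-means step (a swapped-in technique chosen for simplicity). The function names `split_in_two` and `decode_colormap` are invented for this example.

```python
import numpy as np


def split_in_two(colors, iters=10):
    """Split a set of colors into two groups with a small 2-means loop.

    Assumption: farthest-point initialisation, so the result is deterministic
    and both groups are non-empty whenever the colors are not all identical.
    """
    dist_to_mean = np.linalg.norm(colors - colors.mean(axis=0), axis=1)
    a = colors[np.argmax(dist_to_mean)]
    b = colors[np.argmax(np.linalg.norm(colors - a, axis=1))]
    to_a = np.linalg.norm(colors - a, axis=1) <= np.linalg.norm(colors - b, axis=1)
    for _ in range(iters):
        a, b = colors[to_a].mean(axis=0), colors[~to_a].mean(axis=0)
        new = np.linalg.norm(colors - a, axis=1) <= np.linalg.norm(colors - b, axis=1)
        # Stop on convergence, or if an update would empty one of the groups.
        if new.all() or not new.any() or np.array_equal(new, to_a):
            break
        to_a = new
    return to_a


def decode_colormap(colormap, tol=0.1):
    """Decode an (H, W, 3) colormap into boolean masks by repeated bisection.

    The number of entities is never given: a group is split again only while
    its colors spread further than `tol` (a hypothetical threshold).
    """
    h, w, _ = colormap.shape
    flat = colormap.reshape(-1, 3).astype(np.float64)
    masks = []
    stack = [np.arange(h * w)]  # explicit stack gives depth-first traversal
    while stack:
        idx = stack.pop()
        colors = flat[idx]
        spread = np.linalg.norm(colors - colors.mean(axis=0), axis=1).max()
        if spread <= tol:
            # Homogeneous group: emit it as one entity mask.
            mask = np.zeros(h * w, dtype=bool)
            mask[idx] = True
            masks.append(mask.reshape(h, w))
            continue
        to_a = split_in_two(colors)
        stack.append(idx[~to_a])
        stack.append(idx[to_a])
    return masks


# Toy colormap: left half red, top-right blue, bottom-right green.
toy = np.zeros((4, 4, 3))
toy[:, :2] = (1.0, 0.0, 0.0)
toy[:2, 2:] = (0.0, 0.0, 1.0)
toy[2:, 2:] = (0.0, 1.0, 0.0)
toy_masks = decode_colormap(toy)
```

On the toy input the loop first separates red from the other two colors and then splits the remaining group, so the stopping rule rather than a preset cluster count determines how many masks are produced. A real synthesized colormap would be noisy, and the tolerance and splitting rule would need tuning accordingly.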