UniGS: Unified Representation for Image Generation and Segmentation (2312.01985v1)

Published 4 Dec 2023 in cs.CV

Abstract: This paper introduces a novel unified representation of diffusion models for image generation and segmentation. Specifically, we use a colormap to represent entity-level masks, addressing the challenge of varying entity numbers while aligning the representation closely with the image RGB domain. Two novel modules, including the location-aware color palette and progressive dichotomy module, are proposed to support our mask representation. On the one hand, a location-aware palette guarantees the colors' consistency to entities' locations. On the other hand, the progressive dichotomy module can efficiently decode the synthesized colormap to high-quality entity-level masks in a depth-first binary search without knowing the cluster numbers. To tackle the issue of lacking large-scale segmentation training data, we employ an inpainting pipeline and then improve the flexibility of diffusion models across various tasks, including inpainting, image synthesis, referring segmentation, and entity segmentation. Comprehensive experiments validate the efficiency of our approach, demonstrating comparable segmentation mask quality to state-of-the-art and adaptability to multiple tasks. The code will be released at https://github.com/qqlu/Entity.

Summary

  • The paper introduces a unified framework that integrates image generation and segmentation using diffusion models and novel mask representation modules.
  • It employs a location-aware color palette and a progressive dichotomy module to ensure consistent and precise segmentation masks.
  • Experimental results demonstrate strong image fidelity and segmentation quality comparable to state-of-the-art methods, validated by metrics such as FID, CLIP score, IoU, and recall.

Overview of UniGS: Unified Representation for Image Generation and Segmentation

The paper introduces UniGS, a framework that integrates image generation and segmentation into a unified representation via diffusion models. This is achieved by representing segmentation masks as colormaps, bringing them into the same RGB domain as the images themselves. The goal is to handle a varying number of entities per image while keeping image and mask generation coherent.

Technical Contributions

UniGS introduces two novel modules to support its colormap-based mask representation:

  1. Location-aware Color Palette: This module assigns each entity a fixed color determined by the grid cell containing its center of mass, keeping colors consistent with entity locations and making similar or adjacent entities easier to tell apart (see the first sketch after this list).
  2. Progressive Dichotomy Module (PDM): This module decodes the synthesized colormap into high-quality entity-level masks via a depth-first binary search, without prior knowledge of the number of clusters. PDM clusters pixels in a feature space combining RGB and LAB values, which mitigates boundary noise and helps separate entities with similar colors (see the second sketch after this list).
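
As an illustration of the first module, here is a minimal, hypothetical sketch of how a grid-based, location-aware palette could assign each entity a deterministic color from its center of mass. The grid size, the golden-angle hue spacing, and all function names are assumptions made for illustration, not the paper's exact construction.

```python
import numpy as np

def hue_to_rgb(hue: float, s: float = 0.9, v: float = 0.9) -> tuple:
    """Convert a hue in degrees (fixed saturation/value) to an 8-bit RGB triple."""
    c = v * s
    x = c * (1 - abs((hue / 60.0) % 2 - 1))
    m = v - c
    sector = [(c, x, 0), (x, c, 0), (0, c, x), (0, x, c), (x, 0, c), (c, 0, x)]
    r, g, b = sector[int(hue // 60) % 6]
    return tuple(int(255 * (ch + m)) for ch in (r, g, b))

def location_aware_color(mask: np.ndarray, grid: int = 16) -> tuple:
    """Pick a deterministic color for one entity from the grid cell containing
    its center of mass (illustrative scheme)."""
    ys, xs = np.nonzero(mask)
    cy = ys.mean() / mask.shape[0]                 # normalized centroid row
    cx = xs.mean() / mask.shape[1]                 # normalized centroid column
    cell = int(cy * grid) * grid + int(cx * grid)  # flatten grid cell to an index
    hue = (cell * 137.508) % 360.0                 # golden-angle spacing keeps neighbors distinct
    return hue_to_rgb(hue)

def masks_to_colormap(masks, height: int, width: int) -> np.ndarray:
    """Render a list of binary entity masks into a single RGB colormap target."""
    colormap = np.zeros((height, width, 3), dtype=np.uint8)
    for m in masks:
        colormap[m.astype(bool)] = location_aware_color(m)
    return colormap
```

Because each color depends only on where an entity sits, entities in different grid cells always receive different colors, which is the location-consistency property the module is meant to provide.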
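The decoding side can likewise be pictured as a recursive two-way split of the colormap's pixels. The sketch below is a simplified, assumed version of such a progressive dichotomy: it clusters RGB values only and uses ad hoc stopping thresholds, whereas the paper's module also incorporates LAB features.

```python
import numpy as np
from sklearn.cluster import KMeans

def progressive_dichotomy(colormap: np.ndarray, tol: float = 8.0, min_pixels: int = 64):
    """Decode a colormap into entity masks by recursively splitting pixels with
    2-means, i.e. a depth-first binary search over color clusters. The thresholds
    `tol` and `min_pixels` are illustrative, not the paper's settings."""
    h, w, _ = colormap.shape
    feats = colormap.reshape(-1, 3).astype(np.float32)   # per-pixel RGB features
    masks = []

    def split(indices: np.ndarray) -> None:
        if indices.size < min_pixels:                     # too few pixels to form an entity
            return
        cluster = feats[indices]
        if cluster.std(axis=0).max() < tol:               # tight color cluster -> emit one mask
            mask = np.zeros(h * w, dtype=bool)
            mask[indices] = True
            masks.append(mask.reshape(h, w))
            return
        labels = KMeans(n_clusters=2, n_init=3).fit_predict(cluster)
        split(indices[labels == 0])                       # depth-first recursion into each half
        split(indices[labels == 1])

    split(np.arange(h * w))
    return masks
```

The key property matches the description above: the recursion keeps splitting until each cluster is color-homogeneous, so the number of entities never has to be specified in advance.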

Methodology and Pipeline

UniGS utilizes an inpainting pipeline to counter the lack of large-scale segmentation datasets, allowing flexibility across various tasks, such as inpainting, image synthesis, referring segmentation, and entity segmentation. This methodology enables the model to focus on specific regions, leveraging diverse segmentation datasets more efficiently.

A unified architecture based on latent diffusion models handles both image and mask generation, and operating on latent codes keeps computational demands modest. These design choices give the framework multitask capability while improving both image fidelity and mask clarity.
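
The summary does not detail the network layout. One plausible reading, sketched below under explicit assumptions, is that the image latent and the colormap latent are concatenated along the channel axis and denoised jointly by a single noise-prediction network; the `unet` callable, argument order, and channel split are illustrative, not the paper's verified design.

```python
import torch

def joint_denoise_step(unet, img_latent: torch.Tensor, mask_latent: torch.Tensor,
                       t: torch.Tensor, cond: torch.Tensor):
    """One hypothetical denoising step over concatenated image and colormap latents.
    `unet` is any noise-prediction network taking (latents, timestep, conditioning)."""
    z = torch.cat([img_latent, mask_latent], dim=1)   # (B, 2C, H/8, W/8) joint latent
    noise_pred = unet(z, t, cond)                     # one forward pass serves both streams
    eps_img, eps_mask = noise_pred.chunk(2, dim=1)    # split predictions back into image / mask
    return eps_img, eps_mask
```

Whatever the exact layout, sharing one backbone over both latents is one way a single set of weights can serve generation and segmentation at once.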

Experimental Results

Extensive experiments demonstrate UniGS's efficiency, showcasing comparable segmentation quality to state-of-the-art models. The framework exhibits robust performance in both image quality (as measured by FID and CLIP scores) and segmentation accuracy (assessed through IoU and recall).

  • Inpainting and Image Synthesis: The results indicate significant improvements in integrating objects into scenes accurately, even capturing subtle features like shadows, showcasing UniGS's advanced understanding of spatial and textural contexts.
  • Referring and Entity Segmentation: Without explicit segmentation losses, UniGS achieved notable levels of segmentation accuracy, emphasizing its ability to align generated content with intended designs.
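
For reference, the segmentation metrics mentioned above can be computed on binary masks as in the sketch below; the matching rule (any prediction exceeding a fixed IoU threshold counts a ground-truth mask as recalled) is a common convention and may differ from the paper's exact protocol.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum()) / union if union else 0.0

def recall_at_iou(preds, gts, thresh: float = 0.5) -> float:
    """Fraction of ground-truth masks matched by at least one prediction at IoU >= thresh."""
    if not gts:
        return 0.0
    matched = sum(any(mask_iou(p, g) >= thresh for p in preds) for g in gts)
    return matched / len(gts)
```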

Implications and Future Directions

The development of UniGS presents significant implications for AI research, particularly in enhancing the coherence and realism of synthesized images. By unifying generation and segmentation within a single framework, UniGS has the potential to inspire new approaches in foundational models for dense prediction tasks.

Future directions involve exploring further integration of multiple tasks into singular models, improving efficiency and practicality within real-world applications. Additionally, extending the UniGS framework to other domains, such as video and 3D generation, may provide further opportunities for innovation.

Overall, UniGS represents a pivotal step towards more cohesive AI-generated content, bridging the gap between image creation and understanding through its unified design.
