A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting (2401.10227v2)

Published 18 Jan 2024 in cs.CV and cs.LG

Abstract: Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to manage the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture that omits these complexities. Our training consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. This generative approach unlocks the exploration of mask completion or inpainting. The experimental validation on COCO and ADE20k yields strong segmentation results. Finally, we demonstrate our model's adaptability to multi-tasking by introducing learnable task embeddings.
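The two training steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `toy_encode` is a hypothetical stand-in (plain average pooling) for the learned shallow autoencoder, and the linear beta schedule and the noise-prediction target are the standard DDPM formulation, assumed here for illustration. The sketch covers only the forward noising of a mask latent and the regression target a denoiser would be trained on; the image-conditioned denoising network itself is omitted.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (DDPM-style assumption)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def toy_encode(mask, factor=4):
    """Hypothetical stand-in for the shallow autoencoder's encoder:
    average-pools an (H, W) mask down to (H/factor, W/factor)."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def add_noise(z0, t, alpha_bar, rng):
    """Sample z_t from q(z_t | z_0) and return it with the noise that was added.
    In noise-prediction training, eps is the regression target for the denoiser."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

rng = np.random.default_rng(0)

# Toy binary mask: one square "instance" on an empty background.
mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0

z0 = toy_encode(mask)                # step 1: project the mask to a latent
alpha_bar = linear_alpha_bar()
zt, eps = add_noise(z0, t=500, alpha_bar=alpha_bar, rng=rng)  # step 2: noised latent and target

# Given the true noise, the clean latent is recovered exactly, which is
# why predicting eps is sufficient for a denoiser to reverse the process.
z0_rec = (zt - np.sqrt(1.0 - alpha_bar[500]) * eps) / np.sqrt(alpha_bar[500])
```

In an actual training loop, a network conditioned on the image and the timestep would predict `eps` from `zt`, and the loss would compare that prediction with the true `eps`.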
