A Generalist Framework for Panoptic Segmentation of Images and Videos (2210.06366v4)

Published 12 Oct 2022 in cs.CV, cs.AI, cs.LG, and cs.MM

Abstract: Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on the inductive biases of the task. A diffusion model is proposed to model panoptic masks, with a simple architecture and a generic loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach can perform competitively with state-of-the-art specialist methods in similar settings.
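
The one-to-many nature of the task mentioned in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper; the 3x3 mask and the helper names `canonicalize` and `equivalent` are hypothetical and only meant to show that every permutation of instance IDs describes the same segmentation.

```python
from itertools import permutations

def canonicalize(id_map):
    """Relabel instance IDs by order of first appearance in a row-major scan."""
    relabel = {}
    out = []
    for row in id_map:
        new_row = []
        for v in row:
            if v not in relabel:
                relabel[v] = len(relabel)
            new_row.append(relabel[v])
        out.append(new_row)
    return out

def equivalent(a, b):
    """Two ID maps are equivalent if they agree up to a renaming of IDs."""
    return canonicalize(a) == canonicalize(b)

# Hypothetical toy instance-ID map with three instances.
mask = [[1, 1, 2],
        [1, 3, 2],
        [3, 3, 2]]

# Enumerate every relabeling of the instance IDs.
ids = sorted({v for row in mask for v in row})
valid_targets = []
for perm in permutations(ids):
    mapping = dict(zip(ids, perm))
    valid_targets.append([[mapping[v] for v in row] for row in mask])

print(len(valid_targets))                              # number of relabelings
print(all(equivalent(mask, t) for t in valid_targets)) # all describe one partition
```

With three instances there are 3! relabelings, and the count grows factorially with the number of instances, which is one way to see why a model that must pick a single target per image faces a one-to-many mapping.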
