A Generalist Framework for Panoptic Segmentation of Images and Videos (2210.06366v4)
Abstract: Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning of high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on inductive bias of the task. A diffusion model is proposed to model panoptic masks, with a simple architecture and generic loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach can perform competitively to state-of-the-art specialist methods in similar settings.
- Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390, 2021.
- Decoder denoising pretraining for semantic segmentation. In Computer Vision and Pattern Recognition workshop, 2022.
- Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. In European Conference on Computer Vision, pages 158–177. Springer, 2020.
- Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126, 2021.
- Classifying, segmenting, and tracking object instances in video with mask propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9739–9748, 2020.
- Sipmask: Spatial information preservation for fast image and video instance segmentation. In European Conference on Computer Vision, pages 1–18. Springer, 2020.
- End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
- Scaling wide residual networks for panoptic segmentation. arXiv:2011.11675, 2020.
- Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021.
- A unified sequence interface for vision tasks. arXiv preprint arXiv:2206.07669, 2022.
- Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022.
- Mask2former for video instance segmentation. arXiv preprint arXiv:2112.10764, 2021.
- Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12475–12485, 2020.
- Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation. In CVPR, 2020.
- Masked-attention mask transformer for universal image segmentation. CVPR, 2022.
- Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
- François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
- The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2918–2928, 2021.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- Jeremy Howard. Training imagenet in 3 hours for 25 minutes. https://www.fast.ai/2018/04/30/dawnbench-fastai/, 2018.
- Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–4664. PMLR, 2021.
- Transformers in vision: A survey. ACM computing surveys (CSUR), 54(10s):1–41, 2022.
- Diffusion adversarial representation learning for self-supervised vessel segmentation. arXiv preprint arXiv:2209.14566, 2022.
- Video panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9859–9868, 2020.
- Tubeformer-deeplab: Video mask transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13914–13924, 2022.
- Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019.
- Uvim: A unified modeling approach for vision with learned guiding codes. arXiv preprint arXiv:2205.10337, 2022.
- Mast: A memory-augmented self-supervised tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- Mask dino: Towards a unified transformer-based framework for object detection and segmentation. arXiv preprint arXiv:2206.02777, 2022.
- Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7026–7035, 2019.
- Panoptic segformer: Delving deeper into panoptic segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1280–1289, 2022.
- Video instance segmentation tracking with a modified vae architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13147–13157, 2020.
- Video instance segmentation with a propose-reduce paradigm. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1739–1748, October 2021.
- Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
- Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- An end-to-end network for panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6172–6181, 2019.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- A convnet for the 2020s. CVPR, 2022.
- Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916v2, 2022.
- Unovost: Unsupervised offline video object segmentation and tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.
- Improved denoising diffusion probabilistic models. arXiv 2102.09672, 2021.
- A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732, 2016.
- U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- Palette: Image-to-image diffusion models. SIGGRAPH, 2022.
- Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
- Efficientnetv2: Smaller models and faster training. In International Conference on Machine Learning, pages 10096–10106. PMLR, 2021.
- Fixing the train-test resolution discrepancy. Advances in neural information processing systems, 32, 2019.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Rvos: End-to-end recurrent network for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5277–5286, 2019.
- Mots: Multi-object tracking and segmentation. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 7942–7951, 2019.
- Max-deeplab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5463–5474, 2021.
- Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. In ECCV, 2020.
- Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022.
- A survey on deep learning technique for video segmentation. arXiv preprint arXiv:2107.01153, 2021.
- End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8741–8750, 2021.
- Step: Segmenting and tracking every pixel. arXiv preprint arXiv:2102.11859, 2021.
- Diffusion models for implicit image segmentation ensembles. arXiv preprint arXiv:2112.03145, 2021.
- Detectron2. https://github.com/facebookresearch/detectron2, 2019.
- Upsnet: A unified panoptic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8818–8826, 2019.
- Video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5188–5197, 2019.
- Cmt-deeplab: Clustering mask transformers for panoptic segmentation. In CVPR, 2022.
- k-means mask transformer, 2022.
- K-net: Towards unified image segmentation. Advances in Neural Information Processing Systems, 34:10326–10338, 2021.
- Nested hierarchical transformer: Towards accurate, data-efficient and interpretable visual understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3417–3425, 2022.