Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation (2206.02777v3)

Published 6 Jun 2022 in cs.CV

Abstract: In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at \url{https://github.com/IDEACVR/MaskDINO}.

Authors (7)
  1. Feng Li (286 papers)
  2. Hao Zhang (948 papers)
  3. Shilong Liu (60 papers)
  4. Lei Zhang (1689 papers)
  5. Lionel M. Ni (20 papers)
  6. Heung-Yeung Shum (32 papers)
  7. Huaizhe Xu (6 papers)
Citations (308)

Summary

Overview of "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation"

The paper presents Mask DINO, an extended framework of DINO (DETR with Improved Denoising Anchor Boxes) designed to unify the tasks of object detection and segmentation within a single Transformer-based model. Mask DINO addresses the challenges of combining instance, panoptic, and semantic segmentation tasks alongside object detection, creating a versatile approach capable of leveraging joint large-scale datasets effectively.

Core Contributions

Mask DINO builds upon the successful DINO architecture by integrating a mask prediction branch that extends its functionality to segmentation tasks. This integration is achieved by utilizing the query embeddings from DINO, facilitating the prediction of binary masks through a dot-product operation with a high-resolution pixel embedding map.
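The dot-product mask prediction described above can be sketched in a few lines of NumPy. The shapes, variable names, and the 0.5 threshold below are illustrative assumptions, not taken from the released code:

```python
import numpy as np

# Hypothetical dimensions: each decoder query embedding is dot-producted
# with a high-resolution per-pixel embedding map, yielding one mask logit
# map per query.
num_queries, embed_dim, H, W = 3, 16, 8, 8
rng = np.random.default_rng(0)

queries = rng.normal(size=(num_queries, embed_dim))  # decoder query embeddings
pixel_map = rng.normal(size=(embed_dim, H, W))       # high-res pixel embedding map

# Dot product over the embedding dimension -> (num_queries, H, W) mask logits
mask_logits = np.einsum("qc,chw->qhw", queries, pixel_map)

# Binary masks via sigmoid + thresholding (0.5 is an illustrative choice)
masks = 1.0 / (1.0 + np.exp(-mask_logits)) > 0.5
```

Because the mask head is just this dot product, the same query embeddings that drive box prediction also drive mask prediction, which is what lets detection and segmentation share one architecture.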

Key advancements in Mask DINO include:

  1. Unified Query Selection and Enhanced Initialization: By utilizing encoder outputs to initialize both content queries and anchor boxes, Mask DINO improves segmentation accuracy. It further introduces mask-enhanced anchor box initialization, which derives tighter boxes from predicted masks to improve detection performance.
  2. Unified Denoising for Masks: Extending DINO's denoising training, Mask DINO employs a denoising task specifically for mask predictions, accelerating convergence and improving the efficiency of segmentation training.
  3. Hybrid Matching: Incorporating mask prediction loss into the matching process improves the alignment and consistency between predicted boxes and masks, enhancing the overall robustness of the model.
  4. Decoupled Box Prediction: For panoptic segmentation, Mask DINO introduces a decoupling strategy for "stuff" categories, improving both training efficiency and the quality of segmentation outcomes.
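The hybrid matching in point 3 can be illustrated as a single assignment cost that sums classification, box, and mask terms, so that each query is matched to a ground truth that it fits on all three criteria. The weights and loss forms below are simplified assumptions (the paper's exact coefficients differ), and a brute-force search over permutations stands in for the Hungarian algorithm used in practice:

```python
import itertools
import numpy as np

def dice_cost(pred_mask, gt_mask, eps=1e-6):
    """Dice-style mask mismatch in [0, 1]; 0 means identical masks."""
    inter = (pred_mask * gt_mask).sum()
    return 1.0 - (2.0 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)

def matching_cost(cls_prob, pred_box, pred_mask, gt_cls, gt_box, gt_mask,
                  w_cls=1.0, w_box=1.0, w_mask=1.0):
    """Hybrid cost: negative class probability + L1 box distance + dice mask cost.
    Weights are illustrative, not the paper's values."""
    return (-w_cls * cls_prob[gt_cls]
            + w_box * np.abs(pred_box - gt_box).sum()
            + w_mask * dice_cost(pred_mask, gt_mask))

def match(preds, gts):
    """Optimal one-to-one assignment by brute force (tiny examples only)."""
    n = len(gts)
    cost = np.array([[matching_cost(*p, *g) for g in gts] for p in preds])
    best = min(itertools.permutations(range(len(preds)), n),
               key=lambda perm: sum(cost[perm[j], j] for j in range(n)))
    return list(best)  # best[j] = index of the prediction matched to gt j
```

Because the mask term enters the matching cost itself, a query cannot be matched on its box alone while predicting an inconsistent mask, which is the consistency benefit the paper attributes to hybrid matching.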

Numerical Results and Impact

The paper reports significant improvements over existing specialized segmentation methods with both ResNet-50 and SwinL backbones. With a SwinL backbone, Mask DINO achieves 54.5 AP on COCO instance segmentation, 59.4 PQ on COCO panoptic segmentation, and 60.8 mIoU on ADE20K semantic segmentation, the best results to date among models under one billion parameters. These results affirm the efficacy of Mask DINO in handling these tasks within a unified model architecture, offering a substantial increase in performance across segmentation benchmarks.

Implications and Future Directions

Mask DINO's framework demonstrates the potential for unified models to simplify visual task processing while enhancing accuracy and performance. The integration of detection and segmentation into a single architecture opens pathways for leveraging large-scale datasets across multiple tasks, providing a more streamlined approach to model development and deployment.

From a practical perspective, Mask DINO's robust performance suggests significant applicability in real-world systems where efficiency and accuracy in both detection and segmentation are paramount. Theoretical advancements may focus on exploring further unification strategies, including more complex visual tasks beyond segmentation and detection.

In future research directions, investigating methods to reduce computational overhead in large models without compromising performance could prove beneficial. Additionally, exploring the scalability of Mask DINO to other domains and datasets may broaden the applicability of this unified framework.

Overall, Mask DINO marks an important stride towards creating versatile models capable of serving a wide array of visual tasks effectively, highlighting an exciting direction for transformer-based vision systems.