DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation (2011.09876v3)

Published 19 Nov 2020 in cs.CV

Abstract: Binary grid mask representation is broadly used in instance segmentation. A representative instantiation is Mask R-CNN which predicts masks on a $28\times 28$ binary grid. Generally, a low-resolution grid is not sufficient to capture the details, while a high-resolution grid dramatically increases the training complexity. In this paper, we propose a new mask representation by applying the discrete cosine transform (DCT) to encode the high-resolution binary grid mask into a compact vector. Our method, termed DCT-Mask, could be easily integrated into most pixel-based instance segmentation methods. Without any bells and whistles, DCT-Mask yields significant gains on different frameworks, backbones, datasets, and training schedules. It does not require any pre-processing or pre-training, and does almost no harm to the running speed. The improvement is especially large for higher-quality annotations and more complex backbones. Moreover, we analyze the performance of our method from the perspective of the quality of mask representation. The main reason why DCT-Mask works well is that it obtains a high-quality mask representation with low complexity. Code is available at https://github.com/aliyun/DCT-Mask.git.

Summary

An Overview of DCT-Mask for Instance Segmentation

The paper "DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation" presents a mask representation for instance segmentation that uses the discrete cosine transform (DCT) to raise mask quality while keeping the prediction target low-dimensional. The approach, termed DCT-Mask, addresses the main limitation of binary grid representations such as the $28 \times 28$ grid in Mask R-CNN: a low-resolution grid loses detail, while a high-resolution grid sharply increases training complexity.

Key Contributions:

DCT-Based Mask Representation: The primary innovation of this work is the use of DCT to encode high-resolution binary masks into compact vectors. This transformation focuses on preserving low-frequency components, which encapsulate the majority of the mask information, thereby achieving a high-quality representation with reduced dimensionality. For example, experiments demonstrated that a $128 \times 128$ mask could be effectively compressed into a 300-dimensional vector, maintaining a mask quality of 97% intersection over union (IoU).
Integration into Existing Frameworks: DCT-Mask is designed to integrate with existing pixel-based instance segmentation frameworks, such as Mask R-CNN. The integration involves minimal architectural changes, specifically adjusting the mask prediction branch to output the DCT vector instead of a binary grid. The computational complexity of the DCT, $O(n \log n)$, is negligible within the overall framework, so running speed is almost unaffected.
Performance Improvements: Across multiple datasets, including COCO, LVIS*, and Cityscapes, DCT-Mask consistently outperformed standard Mask R-CNN in mask average precision (AP). Specifically, it gained 1.3 AP points on COCO and 2.1 on LVIS*, with larger improvements for higher-quality annotations and more complex backbones.
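
The encode and decode steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`dct_matrix`, `low_freq_order`, `encode_mask`, `decode_mask`, `iou`) are hypothetical, the low-frequency coefficients are selected by sorting on the frequency sum $u + v$ (a zig-zag-like ordering assumed here), and the reconstruction is binarized at 0.5. For the official code, see the repository linked in the abstract.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # spatial index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # rescale the DC row so c is orthonormal
    return c


def low_freq_order(n):
    """Flat indices of an n x n grid, lowest frequency sum u + v first.

    Assumption: this zig-zag-like ordering stands in for whatever
    selection rule the paper actually uses.
    """
    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    # lexsort uses the LAST key as the primary key: sort by u + v, then u.
    return np.lexsort((u.ravel(), (u + v).ravel()))


def encode_mask(mask, dim=300):
    """Encode a square binary mask as its `dim` lowest-frequency DCT coefficients."""
    n = mask.shape[0]
    c = dct_matrix(n)
    coeffs = c @ mask.astype(np.float64) @ c.T  # separable 2D DCT
    return coeffs.ravel()[low_freq_order(n)[:dim]]


def decode_mask(vec, n=128):
    """Zero-fill the dropped coefficients, invert the DCT, threshold at 0.5."""
    coeffs = np.zeros(n * n)
    coeffs[low_freq_order(n)[: len(vec)]] = vec
    c = dct_matrix(n)
    recon = c.T @ coeffs.reshape(n, n) @ c  # inverse of the orthonormal DCT
    return recon >= 0.5


def iou(a, b):
    """Intersection over union of two boolean masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()


if __name__ == "__main__":
    # Synthetic disk mask, used only to exercise the round trip.
    yy, xx = np.mgrid[:128, :128]
    disk = (yy - 64) ** 2 + (xx - 64) ** 2 <= 40 ** 2
    vec = encode_mask(disk, dim=300)
    print(vec.shape, iou(disk, decode_mask(vec, n=128)))
```

In this sketch the prediction target shrinks from $128 \times 128 = 16384$ grid values to 300 coefficients. The matrix form of the transform is chosen for readability; it costs more than the fast $O(n \log n)$ algorithm mentioned above, which a library routine would normally provide.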

Implications for Instance Segmentation:

Efficiency and Scalability: Because the mask branch predicts a compact DCT vector rather than a high-resolution binary grid, DCT-Mask gains the detail of a high-resolution mask without the training complexity such a grid would add, and with almost no cost in running speed.
Potential for High-Quality Annotations: The approach is particularly advantageous when dealing with datasets that feature high-quality annotations, allowing for more precise segmentation results that accurately capture object boundaries and finer details.

Future Directions:

The promising results of DCT-Mask suggest several avenues for future research. An exploration into adaptive DCT configurations or combining DCT with other frequency domain techniques could further refine mask quality and representation efficiency. Additionally, extending the DCT-Mask concept to other domains within computer vision, such as semantic segmentation or image-to-image translation, may yield interesting insights and advancements.

Overall, encoding masks with the DCT offers a practical way to improve instance segmentation, balancing mask resolution against computational cost. According to the authors, the main reason it works is that it obtains a high-quality mask representation with low complexity.
