- The paper introduces prior-tokens within a transformer framework to encode global trimap information for precise image matting.
- It employs Prior-Attentive Swin Transformer blocks with a prior-memory mechanism to integrate multi-scale contextual features effectively.
- Evaluated on benchmark datasets such as Composition-1k and Distinctions-646, MatteFormer outperforms prior CNN-based approaches, achieving a state-of-the-art SAD of 23.8 on Composition-1k.
The paper "MatteFormer: Transformer-Based Image Matting via Prior-Tokens" presents a transformer-based model designed for image matting, the task of estimating a per-pixel alpha matte that separates foreground objects from complex backgrounds. The authors introduce MatteFormer, which leverages a novel concept of prior-tokens within a transformer framework to effectively exploit trimap information, an auxiliary input that partitions the image into three regions: foreground, background, and unknown.
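To make the trimap input concrete, the following sketch converts an 8-bit trimap into the three binary region masks the rest of the model reasons about. The 0/128/255 encoding and the function name are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def trimap_to_onehot(trimap):
    """Split an 8-bit trimap into three binary region masks.

    Assumes the common encoding 0 = background, 128 = unknown, 255 = foreground.
    Returns an (H, W, 3) array ordered (foreground, background, unknown).
    """
    foreground = (trimap == 255).astype(np.float32)
    background = (trimap == 0).astype(np.float32)
    unknown = (trimap == 128).astype(np.float32)
    return np.stack([foreground, background, unknown], axis=-1)
```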
Methodological Innovations
MatteFormer builds on the Swin Transformer architecture, an efficient vision transformer known for its hierarchical structure and local window self-attention. To address the limited receptive field inherent in local self-attention, MatteFormer introduces Prior-Attentive Swin Transformer (PAST) blocks. Each PAST block uses prior-tokens: global representations of the three trimap regions, obtained by average-pooling the features belonging to each region. These tokens summarize global context and participate in the self-attention of each block, so every local window can attend to region-level information from the entire image.
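The averaging step that produces the prior-tokens can be written as a masked average pooling over the flattened feature map. This is a minimal sketch assuming the trimap has been downsampled to the feature resolution and one-hot encoded; the released implementation may handle resizing and normalization differently.

```python
import torch

def compute_prior_tokens(features, trimap_onehot):
    """Masked average pooling of features over the three trimap regions.

    features:      (B, N, C) flattened spatial features.
    trimap_onehot: (B, N, 3) binary masks for foreground, background, unknown,
                   resized to the feature resolution.
    Returns:       (B, 3, C) one prior-token per region.
    """
    # Sum the features belonging to each region, then divide by the region size.
    region_sum = torch.einsum('bnc,bnr->brc', features, trimap_onehot)
    region_size = trimap_onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)  # (B, 3, 1)
    return region_sum / region_size
```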
Significantly, PAST blocks also include a prior-memory that accumulates the prior-tokens generated in earlier blocks, so each block attends not only to its own global summaries but also to those produced at previous depths of the network. This design helps MatteFormer propagate meaningful global context through subsequent layers and stages, a critical requirement for precise image matting.
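A simplified sketch of how the accumulated prior-memory can take part in window self-attention is shown below: queries come from the window tokens, while keys and values additionally include the prior-tokens gathered so far. Details such as relative position bias, per-region handling, and Swin's shifted windows are omitted, so this illustrates the idea rather than the paper's exact block.

```python
import torch
import torch.nn as nn

class PriorAttentiveWindowAttention(nn.Module):
    """Window self-attention whose keys/values also contain prior-tokens
    accumulated from earlier blocks (simplified sketch)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, prior_memory):
        # x:            (B_, W, C) tokens of one local window (B_ = batch * windows)
        # prior_memory: (B_, M, C) prior-tokens from this block and earlier ones
        B_, W, C = x.shape
        M = prior_memory.shape[1]
        h, d = self.num_heads, C // self.num_heads

        # Queries come from window tokens only.
        q = self.qkv(x)[..., :C].reshape(B_, W, h, d).transpose(1, 2)
        # Keys and values come from window tokens plus the prior-memory.
        kv = self.qkv(torch.cat([x, prior_memory], dim=1))[..., C:]
        k, v = kv.reshape(B_, W + M, 2, h, d).permute(2, 0, 3, 1, 4)

        attn = ((q * self.scale) @ k.transpose(-2, -1)).softmax(dim=-1)  # (B_, h, W, W+M)
        out = (attn @ v).transpose(1, 2).reshape(B_, W, C)
        return self.proj(out)
```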
Strong Numerical Results
The model was evaluated on standard image matting benchmarks, including Composition-1k and Distinctions-646, achieving state-of-the-art results across the usual metrics: SAD (sum of absolute differences), MSE (mean squared error), Grad (gradient error), and Conn (connectivity error). Notably, MatteFormer reached a SAD of 23.8 on Composition-1k, surpassing previous CNN-based models. Experimental results confirm that the prior-tokens are a pivotal ingredient, allowing the model to combine precise local detail with global, trimap-aware context.
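For reference, these metrics are typically computed only over the unknown region of the trimap. The sketch below shows SAD and MSE under that protocol; the /1000 scaling for SAD follows common practice on Composition-1k, but exact conventions vary between papers.

```python
import numpy as np

def matting_errors(pred_alpha, gt_alpha, trimap):
    """SAD and MSE over the unknown trimap region (alphas in [0, 1]).

    Assumes an 8-bit trimap where 128 marks the unknown region.
    """
    unknown = (trimap == 128)
    diff = pred_alpha[unknown] - gt_alpha[unknown]
    sad = np.abs(diff).sum() / 1000.0   # SAD is commonly reported divided by 1000
    mse = np.mean(diff ** 2)
    return sad, mse
```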
Implications and Future Directions
By integrating a transformer backbone with the concept of prior-tokens, MatteFormer marks an advancement in applying transformers to image matting and offers a promising alternative to traditional CNN-based approaches. The architecture shows how transformers can be adapted beyond NLP to complex visual tasks, exploiting their ability to encode contextual relationships across large spatial areas. The prior-memory mechanism, in turn, is a notable step toward retaining pertinent context across encoding stages and highlights how flexibly transformers can represent multi-scale information.
Looking ahead, one intriguing direction is extending MatteFormer to trimap-free settings, which would further streamline image matting workflows. Another is moving the decoder fully to a transformer-based architecture, so that attention mechanisms improve feature decoding alongside the strengthened PAST-based encoding in an end-to-end fashion.
Overall, MatteFormer demonstrates the potential of transformers in vision tasks, providing a robust framework for future research aimed at improving image matting or adapting the approach to other applications that require precise object extraction and segmentation.