Introduction
In the ever-evolving landscape of artificial intelligence, Vision Transformers (ViTs) have emerged as a prominent architecture for image-related tasks. Despite their state-of-the-art performance, the paper highlights a critical issue with these models: persistent noise artifacts in their outputs. These artifacts not only degrade the visual quality of the feature maps but also harm performance on downstream tasks by disrupting feature interpretability and semantic coherence.
Uncovering the Issue
A deep dive into the cause of these artifacts traces them to the positional embeddings added at the input stage of the ViT architecture. These embeddings are meant to give the model spatial cues but, as a side effect, introduce the artifacts. Through an analytical approach, the researchers show that ViTs carry these artifacts in their outputs regardless of the input, and that the issue appears consistently across numerous pre-trained models.
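One common way to make such artifacts visible is to project a ViT's patch features onto a few principal components and view them as an image; position-dependent noise then shows up as fixed patterns that recur across unrelated inputs. The sketch below illustrates this kind of inspection; it is not the paper's code, and `get_patch_features` is a hypothetical helper standing in for whichever feature extractor is available.

```python
# Illustrative sketch: reveal positional artifacts by projecting ViT patch
# features onto their top principal components and viewing them as channels.
import numpy as np

def pca_feature_map(features: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project (H, W, C) patch features to n_components channels via PCA."""
    H, W, C = features.shape
    flat = features.reshape(-1, C)
    flat = flat - flat.mean(axis=0, keepdims=True)
    # Principal directions from the SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:n_components].T
    # Normalize each channel to [0, 1] for display.
    proj = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-8)
    return proj.reshape(H, W, n_components)

# Usage (assuming some extractor returns per-patch features for one image):
# feats = get_patch_features(model, image)   # (H, W, C), hypothetical helper
# rgb = pca_feature_map(feats)               # artifacts appear as fixed,
#                                            # content-independent patterns
```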
A Novel Denoising Approach
To address this challenge, the researchers propose a two-stage solution called Denoising Vision Transformers (DVT). The first stage formulates a universal noise model for ViT outputs, which decomposes the output into three terms: a noise-free semantic term and two terms associated with position-based artifacts. The decomposition is optimized per image using neural fields, enforcing cross-view consistency of the semantic features. Offline applications can use the artifact-free features produced by this optimization directly.
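A minimal sketch of this first stage is shown below. It is not the authors' implementation: for brevity it keeps only the semantic field and a shared position-dependent artifact term, dropping the third (residual) term, and it assumes each crop of the image yields the full patch grid. Names such as `SemanticField`, `crops`, and `fit_decomposition` are placeholders for illustration.

```python
# Per-image sketch of the stage-one idea: fit a coordinate-conditioned
# semantic field plus a position-dependent artifact shared across crops.
# `crops` is a list of (features, coords) pairs: features has shape (N, C)
# for N patch tokens, coords holds their normalized (x, y) positions in the
# original image, shape (N, 2). Assumes N == grid_hw[0] * grid_hw[1].
import torch
import torch.nn as nn

class SemanticField(nn.Module):
    """Tiny coordinate MLP: image-plane (x, y) -> artifact-free feature."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(coords)

def fit_decomposition(crops, feat_dim, grid_hw=(16, 16), steps=500, lr=1e-3):
    field = SemanticField(feat_dim)
    # Artifact term indexed by patch position in the ViT grid; sharing it
    # across crops is what ties the noise to position rather than content.
    artifact = nn.Parameter(torch.zeros(grid_hw[0] * grid_hw[1], feat_dim))
    opt = torch.optim.Adam([*field.parameters(), artifact], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for feats, coords in crops:            # feats: (N, C), coords: (N, 2)
            pred = field(coords) + artifact    # same artifact for every crop
            loss = loss + ((pred - feats) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return field, artifact.detach()
```

The key design choice is that the semantic field is queried by image-plane coordinates, so the same world content seen in different crops must map to the same feature, while the artifact term is tied to the patch position inside the ViT grid.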
In the second stage, aimed at applications that require online inference, a lightweight denoiser is trained to predict these clean features directly from raw ViT outputs. The denoiser is a single Transformer block and can be integrated into existing ViTs without retraining them. This stage enables real-time use and allows the denoiser to generalize to new, unseen data.
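The following is a hedged sketch of what such a denoiser can look like: a single Transformer encoder block that maps raw patch tokens to denoised ones, regressed against the clean features produced by the first stage. The class name, the output head, and the training-loop variables (`raw_tokens`, `clean_targets`, `loader`) are illustrative assumptions, not the paper's API.

```python
# Sketch of a stage-two denoiser: one Transformer block over raw ViT tokens,
# trained to reproduce the artifact-free features from stage one.
import torch
import torch.nn as nn

class LightweightDenoiser(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.proj = nn.Linear(dim, dim)  # output head for the denoised tokens

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) raw ViT outputs; the backbone
        # itself stays frozen, so no retraining of the ViT is needed.
        return self.proj(self.block(tokens))

# Training-loop sketch: regress stage-one clean features from raw outputs.
# denoiser = LightweightDenoiser(dim=768)
# opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
# for raw_tokens, clean_targets in loader:   # hypothetical data loader
#     loss = nn.functional.mse_loss(denoiser(raw_tokens), clean_targets)
#     opt.zero_grad(); loss.backward(); opt.step()
```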
Efficacy and Applications
The proposed DVT approach has been evaluated extensively on a variety of ViTs and consistently improves their performance on semantic and geometric tasks without retraining. The gains in metrics such as mean Intersection over Union (mIoU) are substantial. In practice, DVT can be readily applied to existing Transformer-based architectures, which makes it broadly useful across image-processing and computer-vision applications.
Conclusion
This paper motivates a re-assessment of design choices in ViT architectures, especially the naïve use of positional embeddings. It provides a robust framework for extracting artifact-free features from ViT outputs, improving the quality and reliability of features used in downstream vision tasks. Because DVT can be applied to pre-trained models as-is, it offers immediate gains today and points toward artifact-free Vision Transformers in the future.