Introduction
In the ever-evolving landscape of artificial intelligence, Vision Transformers (ViTs) have emerged as a prominent architecture for image-related tasks. Despite their state-of-the-art performance, the paper highlights a critical issue with these models: persistent noise artifacts in their outputs. These artifacts not only degrade the visual quality of the feature maps but also harm performance on downstream tasks by disrupting feature interpretability and semantic coherence.
Uncovering the Issue
A deep dive into the cause of these artifacts traces them to the positional embeddings added at the input stage of the ViT architecture. These embeddings are meant to give the model spatial cues but, as a side effect, introduce the artifacts. Through an analytical approach, the researchers show that ViTs carry these artifacts in their outputs regardless of the input, and that the issue appears consistently across numerous pre-trained models.
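One common way to make such artifacts visible is to project a ViT's patch features onto a few principal components and view them as an image; position-dependent noise then shows up as fixed patterns that recur across unrelated inputs. The sketch below illustrates this kind of inspection; it is not the paper's code, and `get_patch_features` is a hypothetical helper standing in for whichever feature extractor is available.

```python
# Illustrative sketch: reveal positional artifacts by projecting ViT patch
# features onto their top principal components and viewing them as channels.
import numpy as np

def pca_feature_map(features: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project (H, W, C) patch features to n_components channels via PCA."""
    H, W, C = features.shape
    flat = features.reshape(-1, C)
    flat = flat - flat.mean(axis=0, keepdims=True)
    # Principal directions from the SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:n_components].T
    # Normalize each channel to [0, 1] for display.
    proj = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-8)
    return proj.reshape(H, W, n_components)

# Usage (assuming some extractor returns per-patch features for one image):
# feats = get_patch_features(model, image)   # (H, W, C), hypothetical helper
# rgb = pca_feature_map(feats)               # artifacts appear as fixed,
#                                            # content-independent patterns
```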
A Novel Denoising Approach
To address this challenge, the researchers propose a two-stage solution called Denoising Vision Transformers (DVT). The first stage formulates a universal noise model for ViT outputs, which decomposes the output into three terms: a noise-free semantic term and two terms associated with position-based artifacts. The decomposition is optimized per image using neural fields, enforcing cross-view consistency of the semantic features. Offline applications can use the artifact-free features produced by this optimization directly.
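A minimal sketch of this first stage is shown below. It is not the authors' implementation: for brevity it keeps only the semantic field and a shared position-dependent artifact term, dropping the third (residual) term, and it assumes each crop of the image yields the full patch grid. Names such as `SemanticField`, `crops`, and `fit_decomposition` are placeholders for illustration.

```python
# Per-image sketch of the stage-one idea: fit a coordinate-conditioned
# semantic field plus a position-dependent artifact shared across crops.
# `crops` is a list of (features, coords) pairs: features has shape (N, C)
# for N patch tokens, coords holds their normalized (x, y) positions in the
# original image, shape (N, 2). Assumes N == grid_hw[0] * grid_hw[1].
import torch
import torch.nn as nn

class SemanticField(nn.Module):
    """Tiny coordinate MLP: image-plane (x, y) -> artifact-free feature."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(coords)

def fit_decomposition(crops, feat_dim, grid_hw=(16, 16), steps=500, lr=1e-3):
    field = SemanticField(feat_dim)
    # Artifact term indexed by patch position in the ViT grid; sharing it
    # across crops is what ties the noise to position rather than content.
    artifact = nn.Parameter(torch.zeros(grid_hw[0] * grid_hw[1], feat_dim))
    opt = torch.optim.Adam([*field.parameters(), artifact], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for feats, coords in crops:            # feats: (N, C), coords: (N, 2)
            pred = field(coords) + artifact    # same artifact for every crop
            loss = loss + ((pred - feats) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return field, artifact.detach()
```

The key design choice is that the semantic field is queried by image-plane coordinates, so the same world content seen in different crops must map to the same feature, while the artifact term is tied to the patch position inside the ViT grid.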
In the second stage, aimed at applications that require online inference, a lightweight denoiser is trained to predict these clean features directly from raw ViT outputs. The denoiser is a single Transformer block and can be integrated into existing ViTs without retraining them. This stage enables real-time use and allows the denoiser to generalize to new, unseen data.
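The following is a hedged sketch of what such a denoiser can look like: a single Transformer encoder block that maps raw patch tokens to denoised ones, regressed against the clean features produced by the first stage. The class name, the output head, and the training-loop variables (`raw_tokens`, `clean_targets`, `loader`) are illustrative assumptions, not the paper's API.

```python
# Sketch of a stage-two denoiser: one Transformer block over raw ViT tokens,
# trained to reproduce the artifact-free features from stage one.
import torch
import torch.nn as nn

class LightweightDenoiser(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.proj = nn.Linear(dim, dim)  # output head for the denoised tokens

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) raw ViT outputs; the backbone
        # itself stays frozen, so no retraining of the ViT is needed.
        return self.proj(self.block(tokens))

# Training-loop sketch: regress stage-one clean features from raw outputs.
# denoiser = LightweightDenoiser(dim=768)
# opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
# for raw_tokens, clean_targets in loader:   # hypothetical data loader
#     loss = nn.functional.mse_loss(denoiser(raw_tokens), clean_targets)
#     opt.zero_grad(); loss.backward(); opt.step()
```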
Efficacy and Applications
The proposed DVT approach has been evaluated extensively on a variety of ViTs and consistently improves their performance on semantic and geometric tasks without retraining. The gains in metrics such as mean Intersection over Union (mIoU) are substantial. In practice, DVT can be readily applied to existing Transformer-based architectures, which makes it broadly useful across image-processing and computer-vision applications.
Conclusion
This paper motivates a re-assessment of design choices in ViT architectures, especially the naïve use of positional embeddings. It provides a robust framework for extracting artifact-free features from ViT outputs, improving the quality and reliability of features used in downstream vision tasks. Because DVT can be applied to pre-trained models as-is, it offers immediate gains today and points toward artifact-free Vision Transformers in the future.