Denoising Vision Transformers (2401.02957v2)

Published 5 Jan 2024 in cs.CV

Abstract: We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue down to the positional embeddings at the input stage. To mitigate this, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.

Introduction

In the ever-evolving landscape of artificial intelligence, Vision Transformers (ViTs) have risen as a prominent architecture for image-related tasks. Despite their state-of-the-art performance, this paper highlights a critical issue with these models: persistent, grid-like noise artifacts in their feature maps. These artifacts not only degrade the visual quality of the feature maps but also hurt performance on downstream dense prediction tasks by disrupting feature interpretability and semantic coherence.

Uncovering the Issue

Tracing the cause of these artifacts points to the positional embeddings added at the input stage of the ViT architecture. These embeddings are meant to give the model spatial cues, but they also imprint a position-dependent pattern onto the features. Through an analytical study, the researchers show that ViTs carry these artifacts in their outputs regardless of the input content, and that the issue appears consistently across numerous pre-trained models.
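
A quick way to see the phenomenon is to visualize per-patch feature statistics of a frozen backbone. The sketch below is illustrative rather than code from the paper; it assumes the DINOv2 torch.hub entry point and its get_intermediate_layers API, whose exact names and return formats may differ between releases.

```python
# Illustrative sketch (not from the paper): expose grid-like artifacts in ViT
# patch features by plotting per-patch feature norms of a frozen DINOv2 model.
import torch
import matplotlib.pyplot as plt
from torchvision import transforms
from PIL import Image

# Assumption: the "facebookresearch/dinov2" hub repo and "dinov2_vits14" entry exist.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = transforms.Compose([
    transforms.Resize((518, 518)),          # 37 x 37 patches at patch size 14
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    # get_intermediate_layers with reshape=True returns (B, C, H, W) patch features
    feats = model.get_intermediate_layers(img, n=1, reshape=True)[0]

# Per-patch L2 norm: positional artifacts tend to show up as a repeating grid pattern
norm_map = feats.squeeze(0).norm(dim=0)
plt.imshow(norm_map.cpu(), cmap="viridis")
plt.title("Per-patch feature norm")
plt.axis("off")
plt.savefig("artifact_map.png", bbox_inches="tight")
```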

A Novel Denoising Approach

To address this challenge, the researchers propose a two-stage solution, termed Denoising Vision Transformers (DVT). The first stage builds on a universal noise model for ViT outputs, which decomposes each output into three components: a noise-free semantics term and two terms tied to the position-based artifacts. The decomposition is fit with neural fields that enforce cross-view feature consistency on a per-image basis, and the resulting artifact-free features can be used directly in offline applications.
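
To make the idea concrete, here is a heavily simplified, hypothetical sketch of such a per-image decomposition: raw patch features from several crops of one image are explained as a coordinate-conditioned semantics field plus a shared, position-indexed artifact term. It omits the paper's residual term, its specific neural-field architecture, and its regularizers; all names and hyperparameters are placeholders.

```python
# Minimal sketch of the per-image decomposition idea (assumptions: precomputed raw
# ViT features per crop, no residual term, a plain coordinate MLP with Fourier
# features instead of the paper's exact neural-field choice).
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Map 2D coordinates in [0, 1] to sinusoidal features so the MLP can fit high frequencies."""
    def __init__(self, num_freqs=8):
        super().__init__()
        self.register_buffer("freqs", (2.0 ** torch.arange(num_freqs)) * torch.pi)

    def forward(self, xy):                        # xy: (N, 2)
        proj = xy[..., None] * self.freqs         # (N, 2, F)
        return torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(-2)  # (N, 4F)

class SemanticsField(nn.Module):
    """Neural field: image-plane coordinate -> artifact-free feature."""
    def __init__(self, feat_dim, num_freqs=8, hidden=256):
        super().__init__()
        self.enc = FourierFeatures(num_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(4 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, xy):
        return self.mlp(self.enc(xy))

def decompose(crop_feats, crop_coords, grid_hw, feat_dim, steps=2000, lr=1e-3):
    """crop_feats: list of (N, C) raw ViT patch features, one per crop.
    crop_coords: list of (N, 2) patch-center coordinates in the *original* image.
    The artifact term is shared across crops because it depends only on the patch
    index inside the ViT input grid, not on the image content."""
    field = SemanticsField(feat_dim)
    artifact = nn.Parameter(torch.zeros(grid_hw[0] * grid_hw[1], feat_dim))
    opt = torch.optim.Adam(list(field.parameters()) + [artifact], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for feats, coords in zip(crop_feats, crop_coords):
            pred = field(coords) + artifact       # semantics + shared positional artifact
            loss = loss + (pred - feats).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return field, artifact.detach()
```

After optimization, evaluating the field on a dense coordinate grid yields a clean feature map for that image, while the shared `artifact` tensor captures the position-dependent pattern.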

The second stage targets applications that require online inference: a lightweight denoiser is trained to predict the clean features directly from raw ViT outputs, using the per-image estimates from the first stage as supervision. The denoiser is a single Transformer block that can be attached to existing ViTs without retraining them, enabling real-time use and generalizing to new, unseen images.
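
A minimal sketch of such a denoiser is shown below, assuming a standard PyTorch Transformer encoder layer and a plain MSE objective against the stage-one targets; the paper's exact block design and training loss may differ.

```python
# Sketch of a lightweight feed-forward denoiser: a single Transformer encoder block
# mapping raw ViT patch tokens to clean features, trained against stage-one estimates.
# Hyperparameters are placeholders, not the paper's settings.
import torch
import torch.nn as nn

class DenoiserBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, raw_tokens):               # raw_tokens: (B, N, C) ViT patch tokens
        return self.proj(self.block(raw_tokens))

def train_step(denoiser, opt, raw_tokens, clean_targets):
    """clean_targets come from the per-image stage-one optimization."""
    pred = denoiser(raw_tokens)
    loss = nn.functional.mse_loss(pred, clean_targets)   # the paper's exact loss may differ
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage sketch: the ViT backbone stays frozen; only the denoiser is trained.
denoiser = DenoiserBlock(dim=768)
opt = torch.optim.AdamW(denoiser.parameters(), lr=2e-4)
```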

Efficacy and Applications

The proposed DVT approach has been evaluated extensively on a variety of ViTs, and it notably improves their performance on semantic and geometric tasks without re-training the backbones. Gains in metrics such as mean Intersection over Union (mIoU) for semantic segmentation are significant. In practice, DVT can be readily applied to any existing Transformer-based architecture, which extends its promise to a broad range of image-processing and computer-vision applications.
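
For reference, the segmentation comparisons rely on the standard mIoU metric; a generic confusion-matrix implementation (not code from the paper) looks like this:

```python
# Standard mean-IoU computation over integer class maps.
import numpy as np

def mean_iou(preds, labels, num_classes, ignore_index=255):
    """preds, labels: integer class maps of the same shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    valid = labels != ignore_index
    idx = num_classes * labels[valid].astype(np.int64) + preds[valid].astype(np.int64)
    conf += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)                       # correctly classified pixels per class
    union = conf.sum(0) + conf.sum(1) - inter   # predicted + ground-truth - intersection
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()                # average over classes present in the data
```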

Conclusion

This paper motivates a re-assessment of design choices in ViT architectures, especially the naive use of positional embeddings. It provides a robust framework for extracting artifact-free features from ViT outputs and improves the quality and reliability of features used in downstream vision tasks. The plug-and-play nature of DVT enables immediate enhancements to pre-trained models and points toward artifact-free Vision Transformers.

Authors (9)
  1. Jiawei Yang (75 papers)
  2. Katie Z Luo (11 papers)
  3. Jiefeng Li (22 papers)
  4. Yonglong Tian (32 papers)
  5. Yue Wang (676 papers)
  6. Congyue Deng (23 papers)
  7. Leonidas Guibas (177 papers)
  8. Dilip Krishnan (36 papers)
  9. Kilian Q Weinberger (6 papers)