Denoising Vision Transformers (2401.02957v2)
Abstract: We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue down to the positional embeddings at the input stage. To mitigate this, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.
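The second stage described in the abstract (a lightweight denoiser trained to map raw ViT features to clean ones) can be illustrated with a small sketch. The snippet below is a hypothetical, minimal PyTorch version, assuming 768-dimensional patch features from a frozen ViT and a single standard transformer encoder layer standing in for the "lightweight transformer block"; the paper's actual architecture, feature dimensions, and training loss may differ.

```python
# Minimal sketch of DVT's second stage (assumptions: PyTorch, feature dim 768,
# one standard transformer encoder layer as the denoiser, MSE training loss).
import torch
import torch.nn as nn

class FeatureDenoiser(nn.Module):
    """Predicts artifact-free patch features from raw ViT outputs."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )

    def forward(self, raw_feats: torch.Tensor) -> torch.Tensor:
        # raw_feats: (B, N_patches, dim) features from a frozen, pre-trained ViT.
        return self.block(raw_feats)

# Training against stage-1 "clean" feature estimates (placeholder tensors here;
# in the paper these come from the per-image neural-field optimization).
denoiser = FeatureDenoiser()
raw = torch.randn(2, 196, 768)           # raw ViT features for a 14x14 patch grid
clean_target = torch.randn(2, 196, 768)  # denoised estimates from stage 1
loss = nn.functional.mse_loss(denoiser(raw), clean_target)
loss.backward()
```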
- Language models are few-shot learners, 2020.
- End-to-End Object Detection with Transformers, pages 213–229. Springer International Publishing, 2020.
- Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021.
- When vision transformers outperform ResNets without pre-training or strong data augmentations. arXiv preprint arXiv:2106.01548, 2021.
- PaLM: Scaling language modeling with Pathways, 2022.
- The concise Oxford dictionary of mathematics. Oxford University Press, USA, 2014.
- Vision transformers need registers, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
- An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
- The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
- EVA: Exploring the limits of masked visual representation learning at scale, 2022.
- EVA-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.
- Think before you speak: Training language models with pause tokens, 2023.
- Masked autoencoders are scalable vision learners, 2021.
- LERF: Language embedded radiance fields, 2023.
- Segment anything, 2023.
- Decomposing NeRF for editing via feature field distillation, 2022.
- Close to human quality TTS with transformer. arXiv preprint arXiv:1809.08895, 2018.
- Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
- Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
- DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- Train short, test long: Attention with linear biases enables input length extrapolation, 2022.
- Improving language understanding by generative pre-training. 2018.
- Language models are unsupervised multitask learners. 2019.
- Learning transferable visual models from natural language supervision, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
- Distilled feature fields enable few-shot language-guided manipulation. arXiv preprint arXiv:2308.07931, 2023.
- How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
- RoFormer: Enhanced transformer with rotary position embedding, 2023.
- Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020.
- DeiT III: Revenge of the ViT, 2022.
- LLaMA: Open and efficient foundation language models, 2023.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Tacotron: Towards end-to-end speech synthesis. In Interspeech 2017. ISCA, 2017.
- Ross Wightman. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
- Efficient streaming language models with attention sinks, 2023.
- EmerNeRF: Emergent spatial-temporal scene decomposition via self-supervision, 2023.
- Semantic understanding of scenes through the ADE20K dataset, 2018.
- iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.