Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words (2309.16108v4)

Published 28 Sep 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a learnable channel embedding that is added to the patch tokens, similar to positional embeddings. We evaluate the performance of ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat (satellite imaging). Our results show that ChannelViT outperforms ViT on classification tasks and generalizes well, even when a subset of input channels is used during testing. Across our experiments, HCS proves to be a powerful regularizer, independent of the architecture employed, suggesting itself as a straightforward technique for robust ViT training. Lastly, we find that ChannelViT generalizes effectively even when there is limited access to all channels during training, highlighting its potential for multi-channel imaging under real-world conditions with sparse sensors. Our code is available at https://github.com/insitro/ChannelViT.
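The two ideas in the abstract — per-channel patch tokenization with a learnable channel embedding, and Hierarchical Channel Sampling (HCS) — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (that is in the linked repository); the function names, the shared patch projection `proj`, and the uniform two-stage sampling in `hcs_sample_channels` are illustrative assumptions consistent with the description above.

```python
import numpy as np

def channelvit_tokens(image, patch_size, proj, channel_emb, pos_emb):
    """Build ChannelViT-style tokens: one patch token per (channel, patch) pair.

    image: (C, H, W) multi-channel input
    proj: (patch_size * patch_size, d) patch projection shared across channels
    channel_emb: (C, d) learnable channel embeddings
    pos_emb: (num_patches, d) positional embeddings
    """
    C, H, W = image.shape
    n_h, n_w = H // patch_size, W // patch_size
    tokens = []
    for c in range(C):  # each channel yields its own token sequence
        for i in range(n_h):
            for j in range(n_w):
                patch = image[c,
                              i * patch_size:(i + 1) * patch_size,
                              j * patch_size:(j + 1) * patch_size].reshape(-1)
                # shared projection + channel embedding + positional embedding,
                # analogous to how positional embeddings are added in a ViT
                tokens.append(patch @ proj + channel_emb[c] + pos_emb[i * n_w + j])
    return np.stack(tokens)  # (C * n_h * n_w, d)

def hcs_sample_channels(num_channels, rng):
    """Hierarchical Channel Sampling (sketch): first draw how many channels
    to keep, then draw which ones -- both uniformly at random."""
    m = int(rng.integers(1, num_channels + 1))
    return np.sort(rng.choice(num_channels, size=m, replace=False))
```

A 5-channel 32x32 image with 16x16 patches would yield 5 x 2 x 2 = 20 tokens, so the sequence length grows linearly with the number of channels; at test time, tokens for absent channels are simply omitted, which HCS-style training makes the model robust to.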

Authors (3)
  1. Yujia Bao
  2. Srinivasan Sivanandan
  3. Theofanis Karaletsos
Citations (9)
