xT: Nested Tokenization for Larger Context in Large Images (2403.01915v2)
Abstract: Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or cropping. Both methods discard a significant amount of the information and context present in an image. In many downstream applications, global context matters as much as high-frequency detail — real-world satellite imagery is one example — so researchers are forced into the uncomfortable choice of which information to throw away. We introduce xT, a simple framework for vision transformers that effectively aggregates global context with local details and can model large images end-to-end on contemporary GPUs. We select a set of benchmark datasets across classic vision tasks that accurately reflect a vision model's ability to understand truly large images and to incorporate fine details over large scales, and we assess our method's improvement on them. xT is a streaming, two-stage architecture that adapts existing vision backbones and long-sequence language models to model large images effectively without quadratic memory growth. We are able to increase accuracy by up to 8.6% on challenging classification tasks and $F_1$ score by 11.6 points on context-dependent segmentation on images as large as 29,000 x 29,000 pixels.
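The core idea of nested tokenization can be illustrated with a minimal sketch: a large image is first split into regions (the outer, streamed level), and each region is then split into ViT-style patches (the inner level). The function name `nested_tokenize` and the region/patch sizes below are illustrative assumptions, not the paper's actual implementation; in xT, each region would be encoded by a vision backbone and the resulting region features passed to a long-sequence model.

```python
import numpy as np

def nested_tokenize(image: np.ndarray, region: int = 256, patch: int = 16) -> np.ndarray:
    """Two-level tokenization of a large image (hypothetical sketch).

    Returns an array of shape (num_regions, patches_per_region, patch*patch*C).
    The outer axis can be streamed region-by-region, so peak memory is bounded
    by one region rather than the full image.
    """
    H, W, C = image.shape
    assert H % region == 0 and W % region == 0, "image must tile evenly into regions"
    assert region % patch == 0, "region must tile evenly into patches"

    regions = []
    for i in range(0, H, region):
        for j in range(0, W, region):
            r = image[i:i + region, j:j + region]
            # Inner level: flatten each patch within the region, as in a ViT.
            patches = [
                r[pi:pi + patch, pj:pj + patch].reshape(-1)
                for pi in range(0, region, patch)
                for pj in range(0, region, patch)
            ]
            regions.append(np.stack(patches))
    return np.stack(regions)


# A 512x512 RGB image yields 4 regions of 256 patches, each 16*16*3 = 768 values.
tokens = nested_tokenize(np.zeros((512, 512, 3)))
print(tokens.shape)  # (4, 256, 768)
```

Independently encoding each region keeps memory linear in image size; the second-stage long-sequence model then attends across region features to recover the global context that cropping would lose.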