Caterpillar: A Pure-MLP Architecture with Shifted-Pillars-Concatenation (2305.17644v3)
Abstract: Modeling in computer vision has evolved to MLPs. Vision MLPs naturally lack local modeling capability, and the simplest remedy is to combine them with convolutional layers. Convolution, however, relies on a sliding-window scheme that incurs redundancy and limits parallel computation. In this paper, we seek to dispense with the windowing scheme and introduce a more elaborate and parallelizable method to exploit locality. To this end, we propose a new MLP module, namely Shifted-Pillars-Concatenation (SPC), that consists of two steps: (1) Pillars-Shift, which generates four neighboring maps by shifting the input image along four directions, and (2) Pillars-Concatenation, which applies linear transformations to the maps and concatenates them to aggregate local features. The SPC module offers superior local modeling power and performance gains, making it a promising alternative to the convolutional layer. We then build a pure-MLP architecture, called Caterpillar, by replacing the convolutional layer in the hybrid sMLPNet model with the SPC module. Extensive experiments show Caterpillar's excellent performance on both small-scale and ImageNet-1k classification benchmarks, along with remarkable scalability and transfer capability. The code is available at https://github.com/sunjin19126/Caterpillar.
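To make the two-step SPC description concrete, below is a minimal PyTorch sketch based only on the abstract: one-pixel zero-padded shifts along four directions (Pillars-Shift), followed by per-direction linear projections and channel-wise concatenation with a fusion layer (Pillars-Concatenation). The per-direction projection layout and the final fusion layer are illustrative assumptions, not the authors' exact implementation; see the linked repository for the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPC(nn.Module):
    """Illustrative sketch of Shifted-Pillars-Concatenation (SPC).

    Pillars-Shift: build four neighbor maps by shifting the feature map one
    pixel up/down/left/right with zero padding at the borders.
    Pillars-Concatenation: project each shifted map with a linear layer,
    concatenate along channels, and fuse back to the original width.
    """

    def __init__(self, dim: int):
        super().__init__()
        # One linear projection per shifted map -- an assumption for illustration.
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
        self.fuse = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), channels-last so nn.Linear acts on the channel dim.
        pad = F.pad(x, (0, 0, 1, 1, 1, 1))      # zero-pad H and W by 1 pixel
        up    = pad[:, 2:,   1:-1, :]            # map shifted upward
        down  = pad[:, :-2,  1:-1, :]            # map shifted downward
        left  = pad[:, 1:-1, 2:,   :]            # map shifted leftward
        right = pad[:, 1:-1, :-2,  :]            # map shifted rightward
        # Pillars-Concatenation: per-direction linear maps, concat, then fuse.
        pillars = [p(m) for p, m in zip(self.proj, (up, down, left, right))]
        return self.fuse(torch.cat(pillars, dim=-1))


# Usage: a 56x56 feature map with 96 channels keeps its shape after SPC.
spc = SPC(dim=96)
y = spc(torch.randn(2, 56, 56, 96))  # -> (2, 56, 56, 96)
```

Because the shifts are plain tensor slices and the mixing is done by linear layers, every spatial position is processed in parallel, which is the property the abstract contrasts with the sliding-window scheme of convolution.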