- The paper demonstrates that replacing self-attention and convolutions with simple linear cross-patch operations can achieve competitive image classification results.
- It relies on a residual structure with learnable affine transformations, extensive data augmentation, and optional teacher-guided distillation, which together keep training stable and effective.
- Experimental results on ImageNet and self-supervised tasks highlight ResMLP’s efficiency and potential for deployment in resource-constrained environments.
ResMLP: Feedforward Networks for Image Classification with Data-Efficient Training
This paper presents ResMLP, an image classification architecture built entirely from multi-layer perceptrons (MLPs), with no convolutional layers or self-attention mechanisms. The design emphasizes simplicity and the reduction of architectural priors: a residual structure alternates between a linear layer through which image patches interact and a two-layer feed-forward network applied independently to each patch for channel-wise interaction.
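The block structure can be made concrete with a short sketch. The PyTorch code below is a minimal, illustrative rendering of one such residual block under the description above; the class names, the expansion factor of 4, and the omission of details such as per-branch scaling are assumptions of this sketch, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Element-wise affine transform used in place of normalization (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # x: (batch, num_patches, dim); rescale and shift each channel.
        return self.alpha * x + self.beta

class ResMLPBlock(nn.Module):
    """One residual block: linear cross-patch mixing, then a per-patch MLP."""
    def __init__(self, dim, num_patches, expansion=4):
        super().__init__()
        self.aff1 = Affine(dim)
        self.cross_patch = nn.Linear(num_patches, num_patches)  # mixes information across patches
        self.aff2 = Affine(dim)
        self.cross_channel = nn.Sequential(                     # two-layer MLP applied per patch
            nn.Linear(dim, expansion * dim),
            nn.GELU(),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x):
        # Cross-patch sublayer: transpose so the linear layer acts over the patch dimension.
        x = x + self.cross_patch(self.aff1(x).transpose(1, 2)).transpose(1, 2)
        # Cross-channel sublayer: standard feed-forward network on each patch independently.
        x = x + self.cross_channel(self.aff2(x))
        return x
```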
Architecture and Methodology
ResMLP is influenced by the Vision Transformer (ViT) but simplifies it by removing the self-attention layers, using a plain linear layer for communication between patches while keeping MLPs for communication within each patch (across channels). Notably, the model needs no positional embeddings, since the cross-patch linear layer is itself position-dependent, and it replaces Layer Normalization with a simple learnable affine transformation, which is sufficient to keep training stable.
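Building on the block sketch above, the following sketch shows how a full model could be assembled with no positional embeddings anywhere. The dimensions chosen here (16-pixel patches, width 384, depth 12) and the use of a strided convolution as an equivalent per-patch linear projection are illustrative assumptions, not the paper's reference code.

```python
import torch.nn as nn
# Reuses Affine and ResMLPBlock from the previous sketch.

class ResMLP(nn.Module):
    """Full-model sketch: patch projection, stacked blocks, average pooling, classifier.

    No positional embeddings are added anywhere; the cross-patch linear layer
    in each block is position-dependent, so it can encode patch order on its own.
    """
    def __init__(self, image_size=224, patch_size=16, dim=384, depth=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Strided convolution here is equivalent to a linear projection of each
        # non-overlapping patch (flatten patch, apply one linear layer).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.blocks = nn.ModuleList(
            [ResMLPBlock(dim, num_patches) for _ in range(depth)]
        )
        self.norm = Affine(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)            # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        for blk in self.blocks:
            x = blk(x)
        x = self.norm(x).mean(dim=1)       # average over patches; no class token
        return self.head(x)
```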
The architecture achieves competitive performance on ImageNet by relying on modern training strategies, including extensive data augmentation and optional distillation. The absence of self-attention is compensated for by the linear interactions between image patches, which suffice to mix spatial information across the image.
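As an illustration of what "extensive data augmentation" can look like in practice, the snippet below sketches a typical torchvision pipeline. The specific operations and magnitudes are assumptions for illustration, not the paper's published training recipe.

```python
from torchvision import transforms

# Heavier-than-usual augmentation of the kind data-efficient recipes rely on.
# The exact operations and magnitudes below are illustrative assumptions.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(),              # automated augmentation policy
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),      # random erasing as extra regularization
])
```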
Experimental Results
ResMLP undergoes evaluation under multiple paradigms:
- Supervised Learning: Trained on ImageNet-1k alone, ResMLP reaches competitive accuracy, approaching that of comparable convolutional neural networks (CNNs) and vision transformers while imposing fewer architectural priors.
- Self-Supervised Learning: Pretrained with the DINO framework, ResMLP learns useful representations without labels, demonstrating its potential as a flexible feature extractor across different contexts.
- Knowledge Distillation: Distilling from a stronger teacher model significantly improves accuracy, suggesting that teacher guidance helps counteract the overfitting tendency of purely MLP-based architectures (a loss sketch follows this list).
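The sketch below shows one common way to combine a teacher's predictions with ground-truth labels (soft distillation with a temperature). It is a generic illustration; the paper's actual distillation recipe may differ, and `alpha` and `tau` here are made-up hyperparameters.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=3.0):
    """Generic soft-distillation objective (a sketch, not the paper's exact recipe).

    Blends cross-entropy on ground-truth labels with a KL term that pulls the
    student's temperature-softened predictions toward the teacher's.
    """
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    return (1.0 - alpha) * ce + alpha * kl
```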
Visualizing the weights of the learned cross-patch linear layers reveals convolution-like, locality-sensitive patterns in the early layers and more abstract, longer-range interaction patterns in deeper layers, offering insight into the emergent structure within a seemingly simple architecture.
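This kind of inspection can be reproduced, in spirit, by reshaping rows of the cross-patch weight matrix into the patch grid. The helper below assumes the `ResMLPBlock` sketch from earlier and a 14×14 patch grid (224-pixel images with 16-pixel patches); it is an illustrative sketch, not the authors' plotting code.

```python
import matplotlib.pyplot as plt

def show_cross_patch_filters(block, grid=14, rows=4, cols=4):
    """Plot rows of a block's cross-patch linear layer as patch-grid images.

    Each row of the (num_patches x num_patches) weight matrix describes how one
    output patch mixes every input patch; reshaping it to the patch grid makes
    convolution-like, local patterns visible in early layers.
    Assumes the `ResMLPBlock` sketch above with a `cross_patch` linear layer.
    """
    weights = block.cross_patch.weight.detach().cpu()   # (num_patches, num_patches)
    fig, axes = plt.subplots(rows, cols, figsize=(8, 8))
    for i, ax in enumerate(axes.flat):
        ax.imshow(weights[i].reshape(grid, grid).numpy(), cmap="viridis")
        ax.axis("off")
    plt.show()
```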
Implications and Future Directions
ResMLP carries notable implications for architectural simplicity and the reduction of priors in model design. The results indicate that mechanisms as elaborate as self-attention can be replaced with plain linear operations without a large loss in performance, provided the training strategy is strong enough.
In practical terms, ResMLP's design opens avenues for computational efficiency and ease of deployment, particularly in settings where convolutions or self-attention are too costly. Furthermore, its adaptation to other domains such as machine translation speaks to its versatility beyond visual tasks.
Future research can investigate further simplifications, alternative layer mechanisms, and the effect of large-scale pre-training on unlabeled data. Such exploration could help tailor MLP-based architectures to a range of applications, potentially challenging the dominance of CNNs and transformer models in certain scenarios. Deploying ResMLP-like architectures under tight computational budgets also looks promising.
ResMLP stands as an example of the ongoing exploration into neural network architectures that balance complexity, performance, and training efficiency, contributing to the broader discourse on neural design paradigms and their effectiveness across multiple domains.