MAXIM: Multi-Axis MLP for Image Processing (2201.02973v2)

Published 9 Jan 2022 in eess.IV and cs.CV

Abstract: Recent progress on Transformers and multi-layer perceptron (MLP) models provide new network architectural designs for computer vision tasks. Although these models proved to be effective in many vision tasks such as image recognition, there remain challenges in adapting them for low-level vision. The inflexibility to support high-resolution images and limitations of local attention are perhaps the main bottlenecks. In this work, we present a multi-axis MLP based architecture called MAXIM, that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature conditioning. Both these modules are exclusively based on MLPs, but also benefit from being both global and 'fully-convolutional', two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement while requiring fewer or comparable numbers of parameters and FLOPs than competitive models. The source code and trained models will be available at https://github.com/google-research/maxim.



The paper "MAXIM: Multi-Axis MLP for Image Processing" addresses the challenges of applying Transformer and MLP models to low-level vision tasks, where inflexible support for high-resolution images and the limitations of local attention have been the main obstacles. The researchers introduce MAXIM, a multi-axis MLP-based architecture that integrates spatially-gated MLPs within a UNet-like hierarchical structure. This design enables efficient and scalable spatial mixing of local and global visual cues while remaining both global and fully convolutional, two properties that are desirable for image processing.

Architectural Innovations

MAXIM employs a multi-axis approach featuring two primary modules:

Multi-Axis Gated MLP Block (MAB): This module splits the input into heads so that local and global information are processed in parallel. Because each branch mixes along only one spatial axis at a time, the cost scales linearly with image size.
Cross-Gating Block (CGB): Serving as an alternative to cross-attention, CGB enables cross-feature conditioning, enhancing the information flow between feature maps without incurring high computational costs.
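
The idea of mixing along one axis at a time can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`block_partition`, `grid_partition`, `spatial_gate`) are hypothetical, and normalization, channel projections, the split into local and global heads, and the inverse partition are all omitted. It assumes the local branch mixes pixels inside small non-overlapping windows and the global branch mixes pixels sampled on a fixed-size grid, so the mixing weights keep the same shape at any resolution.

```python
import numpy as np

def block_partition(x, b):
    """(H, W, C) -> (num_windows, b*b, C): non-overlapping b x b windows."""
    h, w, c = x.shape
    x = x.reshape(h // b, b, w // b, b, c).transpose(0, 2, 1, 3, 4)
    return x.reshape((h // b) * (w // b), b * b, c)

def grid_partition(x, d):
    """(H, W, C) -> (d*d, cells, C): a fixed d x d grid of strided pixels."""
    h, w, c = x.shape
    x = x.reshape(d, h // d, d, w // d, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(d * d, (h // d) * (w // d), c)

def spatial_gate(tokens, weight, bias, axis):
    """Gate one channel half with a linear mix of the other half along `axis`."""
    u, v = np.split(tokens, 2, axis=-1)
    v = np.moveaxis(v, axis, -1)   # bring the mixing axis last
    v = v @ weight.T + bias        # dense mixing over that single axis
    v = np.moveaxis(v, -1, axis)   # restore the original layout
    return u * v

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
b = d = 2
w_local = rng.normal(size=(b * b, b * b))    # shape independent of H and W
w_global = rng.normal(size=(d * d, d * d))   # shape independent of H and W
for size in (8, 16):
    x = rng.normal(size=(size, size, 4))
    local = spatial_gate(block_partition(x, b), w_local, np.ones(b * b), axis=1)
    glob = spatial_gate(grid_partition(x, d), w_global, np.ones(d * d), axis=0)
    print(size, local.shape, glob.shape)
```

The same two weight matrices are applied to the 8x8 and the 16x16 input. Only the number of windows or grid cells grows with the image, which is why the work grows linearly with the pixel count in this sketch.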

These innovations allow MAXIM to achieve state-of-the-art results on more than ten benchmarks spanning multiple image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement. The model achieves these results while using fewer or comparable parameters and FLOPs relative to competitive models.

Experimental Insights

MAXIM demonstrates superior performance in denoising, improving on both PSNR and SSIM over models like VDN and MIRNet, with a PSNR gain of 0.24 dB on SIDD. In deblurring, MAXIM surpasses HINet, achieving a PSNR of 32.86 dB on GoPro, and it generalizes well to the HIDE and RealBlur datasets. Similarly, in deraining and dehazing tasks, the architecture continues to outperform its contemporaries by significant margins.
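
To put such figures in context, PSNR is defined as 10·log10(MAX² / MSE), so a fixed gain in dB corresponds to a fixed ratio of mean squared errors. The sketch below uses hypothetical helper names (`psnr`, `mse_ratio_for_gain`) and assumes images scaled to [0, 1]; it is not the benchmarks' evaluation code, which may differ in color space or cropping.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((np.asarray(reference) - np.asarray(estimate)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mse_ratio_for_gain(gain_db):
    """MSE ratio (new / old) implied by a PSNR improvement of gain_db."""
    return float(10.0 ** (-gain_db / 10.0))

clean = np.zeros((4, 4, 3))
noisy = clean + 0.1            # constant error of 0.1, so MSE = 0.01
print(psnr(clean, noisy))      # close to 20 dB
print(mse_ratio_for_gain(0.24))
```

Under this definition, a 0.24 dB gain corresponds to roughly a 5% reduction in mean squared error.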

A multi-stage, multi-scale training framework underpins these achievements. By leveraging deep supervision through multi-scale inputs and outputs, along with attention modules like supervised attention and cross-gating for feature refinement, MAXIM efficiently learns from diverse visual cues.
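
Deep supervision of this kind can be sketched as a loss summed over every stage and every output scale. Everything specific here is an assumption made for illustration: the Charbonnier penalty, average-pool downsampling of the target, equal weighting of all terms, and the function names are not taken from the summary above.

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Smooth L1-like penalty (assumed here; the actual loss may differ)."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))

def downsample(img, factor):
    """Average-pool an (H, W, C) image by an integer factor (an assumption)."""
    h, w, c = img.shape
    pooled = img.reshape(h // factor, factor, w // factor, factor, c)
    return pooled.mean(axis=(1, 3))

def deep_supervision_loss(outputs, target):
    """outputs[s][n] is stage s's prediction at scale n (downsampled 2**n times)."""
    total = 0.0
    for stage_outputs in outputs:
        for n, pred in enumerate(stage_outputs):
            total += charbonnier(pred, downsample(target, 2 ** n))
    return total

# Two hypothetical stages, three scales each, with random stand-in predictions.
rng = np.random.default_rng(0)
target = rng.uniform(size=(8, 8, 3))
outputs = [[rng.uniform(size=(8 // 2 ** n, 8 // 2 ** n, 3)) for n in range(3)]
           for _ in range(2)]
print(deep_supervision_loss(outputs, target))
```

Each stage receives a training signal at every resolution it emits, rather than only at the final full-resolution output.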

Broader Implications and Future Directions

The MAXIM architecture exemplifies the potential to redefine processing capabilities in low-level vision tasks by embracing MLP-based structures. This approach bridges local and global feature interactions while managing computational demands, laying a solid foundation for further exploration of MLP variants such as FFT or spatial MLP. The concept of applying MLPs to achieve fully-convolutional, scalable vision models is a promising route for enhancing dense image processing across resolutions.

Future research could generalize MAXIM's multi-axis method to other 1D operators and explore efficient solutions for ultra-high-resolution tasks. As such adaptations expand, they are likely to build on the paper's core contributions and advance image processing and computer vision more broadly.


Authors (7)

Zhengzhong Tu (71 papers)
Hossein Talebi (24 papers)
Han Zhang (338 papers)
Feng Yang (147 papers)
Peyman Milanfar (64 papers)
Alan Bovik (10 papers)
Yinxiao Li (20 papers)

Citations (403)


GitHub

GitHub - google-research/maxim: [CVPR 2022 Oral] Official repository for "MAXIM: Multi-Axis MLP for Image Processing". SOTA for denoising, deblurring, deraining, dehazing, and enhancement. (962 stars)