
MAXIM: Multi-Axis MLP for Image Processing

Published 9 Jan 2022 in eess.IV and cs.CV | arXiv:2201.02973v2

Abstract: Recent progress on Transformers and multi-layer perceptron (MLP) models provide new network architectural designs for computer vision tasks. Although these models proved to be effective in many vision tasks such as image recognition, there remain challenges in adapting them for low-level vision. The inflexibility to support high-resolution images and limitations of local attention are perhaps the main bottlenecks. In this work, we present a multi-axis MLP based architecture called MAXIM, that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature conditioning. Both these modules are exclusively based on MLPs, but also benefit from being both global and `fully-convolutional', two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement while requiring fewer or comparable numbers of parameters and FLOPs than competitive models. The source code and trained models will be available at \url{https://github.com/google-research/maxim}.

Citations (403)

Summary

  • The paper demonstrates that MAXIM integrates spatially-gated MLPs within a UNet-like structure to effectively combine local and global image features.
  • It employs a multi-axis approach with dedicated modules for parallel processing, enabling efficient scaling and superior performance across denoising, deblurring, and more.
  • Experimental results show significant improvements in PSNR and SSIM on benchmarks like SIDD and GoPro, highlighting the method’s practical impact on low-level vision tasks.


The paper "MAXIM: Multi-Axis MLP for Image Processing" addresses the challenges of applying transformer and MLP models to low-level vision tasks such as image processing, where high-resolution support and local attention restrictions have been significant obstacles. The researchers introduce MAXIM, a multi-axis MLP-based architecture that integrates spatially-gated MLPs within a UNet-like hierarchical structure. This design enables efficient and scalable spatial mixing of visual data on both local and global scales, while maintaining global and fully-convolutional properties essential for image processing.
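The spatial gating at the heart of these MLP blocks follows the gMLP recipe: split the channels in half, mix one half across the spatial (token) axis with a learned dense projection, and use the result to gate the other half elementwise. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation; the function name and the use of a single flattened token axis are assumptions for clarity.

```python
import numpy as np

def spatial_gating_unit(x, w, b):
    """Minimal sketch of a gMLP-style spatially-gated MLP.

    x: (n_tokens, channels) features, spatial dims flattened into tokens.
    w: (n_tokens, n_tokens) learned spatial projection (hypothetical weights).
    b: (n_tokens,) bias for the spatial projection.
    """
    # Split channels: one half carries content, the other half forms the gate.
    u, v = np.split(x, 2, axis=-1)
    # Mix the gate half across the spatial axis (long-range interaction).
    v = w @ v + b[:, None]
    # Elementwise gating combines content with spatially-mixed context.
    return u * v
```

Initializing `w` near the identity and `b` near zero, as the gMLP authors suggest, makes the unit start close to a pass-through, which stabilizes early training.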

Architectural Innovations

MAXIM employs a multi-axis approach featuring two primary modules:

  1. Multi-Axis Gated MLP Block (MAB): This module processes local and global information in parallel by splitting the input channels into two heads, one applying gated MLPs within local blocks and the other across a dilated global grid. Processing one spatial axis at a time keeps the cost linear in image size.
  2. Cross-Gating Block (CGB): Serving as an alternative to cross-attention, CGB enables cross-feature conditioning, enhancing the information flow between feature maps without incurring high computational costs.
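The "multi-axis" split in the MAB can be made concrete with two complementary partitions of the feature map: non-overlapping blocks for local mixing, and a strided grid for global mixing. The sketch below uses a mean over each partition as a stand-in for the gated MLP that MAXIM actually applies per block/grid; the function names and the toy mixing operation are assumptions, not the paper's code.

```python
import numpy as np

def block_partition(x, b):
    """Partition (H, W, C) into non-overlapping b x b blocks -> (n_blocks, b*b, C)."""
    H, W, C = x.shape
    x = x.reshape(H // b, b, W // b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, b * b, C)

def grid_partition(x, g):
    """Partition (H, W, C) into a g x g grid of dilated windows -> (n_windows, g*g, C).

    Each window gathers pixels strided H//g and W//g apart, so mixing
    inside a window spans the whole image (the 'global' axis)."""
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C).transpose(1, 3, 0, 2, 4)
    return x.reshape(-1, g * g, C)

def multi_axis_mixing(x, b=2, g=2):
    """Split channels into a local head and a global head, mix each in parallel.

    A mean over the partitioned axis stands in for the per-window gated MLP."""
    local, glob = np.split(x, 2, axis=-1)
    local_mixed = block_partition(local, b).mean(axis=1)   # local cues
    global_mixed = grid_partition(glob, g).mean(axis=1)    # global cues
    return local_mixed, global_mixed
```

Because both partitions produce fixed-size windows regardless of the image size, each head's mixing cost grows linearly with the number of pixels, which is what makes the block usable at the high resolutions common in restoration.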

These innovations allow MAXIM to achieve state-of-the-art results on more than ten benchmark datasets spanning multiple image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement. The model achieves these results while using fewer or comparable parameters and FLOPs relative to existing top-performing models.

Experimental Insights

MAXIM demonstrates superior performance, notably improving denoising PSNR on SIDD by 0.24 dB over the previous best, MIRNet, and outperforming earlier models such as VDN. In deblurring, MAXIM surpasses HINet with a PSNR of 32.86 dB on GoPro, while showcasing its generalization capability on the HIDE and RealBlur datasets. Similarly, in deraining and dehazing, the architecture outperforms its contemporaries by significant margins.

A multi-stage, multi-scale training framework underpins these achievements. By leveraging deep supervision through multi-scale inputs and outputs, along with attention modules like supervised attention and cross-gating for feature refinement, MAXIM efficiently learns from diverse visual cues.
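The deep-supervision idea in that training framework can be sketched as a loss summed over every stage's prediction at every scale. The snippet below uses the Charbonnier loss common in restoration work; note that this is a simplified sketch, and MAXIM's full objective also includes a frequency-domain reconstruction term. The function names and list-of-lists layout are assumptions for illustration.

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier (smooth L1) loss, a standard choice in image restoration."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

def multi_scale_loss(preds_per_stage, targets_per_scale):
    """Deep supervision: accumulate the loss over every stage and every scale.

    preds_per_stage: list over stages; each entry is a list of predictions,
                     one per scale (full resolution, 1/2, 1/4, ...).
    targets_per_scale: ground-truth images downsampled to matching scales.
    """
    total = 0.0
    for stage_preds in preds_per_stage:
        for pred, target in zip(stage_preds, targets_per_scale):
            total += charbonnier(pred, target)
    return total
```

Supervising every stage and scale gives intermediate blocks a direct training signal, so the supervised-attention and cross-gating modules refine features that are already loosely aligned with the target.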

Broader Implications and Future Directions

The MAXIM architecture exemplifies the potential of MLP-based structures to redefine processing capabilities in low-level vision tasks. This approach bridges local and global feature interactions while managing computational demands, laying a solid foundation for exploring other mixing operators, such as FFTs, in place of the spatial MLP. Applying MLPs to build fully-convolutional, scalable vision models is a promising route for dense image processing across resolutions.

Future research could extend the universality of MAXIM's multi-axis method to other 1D operators and develop efficient solutions for ultra-high-resolution tasks, further advancing image processing and computer vision.
