MAXIM: Multi-Axis MLP for Image Processing
The paper "MAXIM: Multi-Axis MLP for Image Processing" addresses the challenges of applying Transformer and MLP models to low-level vision tasks such as image processing. Two obstacles stand out: these models have difficulty supporting high-resolution images, and restricting attention to local windows limits their receptive field. The authors introduce MAXIM, a multi-axis MLP-based architecture that integrates spatially-gated MLPs within a UNet-like hierarchical structure. This design enables efficient and scalable spatial mixing of visual features at both local and global scales, while remaining both global and fully-convolutional, two properties that are desirable for image processing.
Architectural Innovations
MAXIM employs a multi-axis approach featuring two primary modules:
- Multi-Axis Gated MLP Block (MAB): This module processes local and global information in parallel by splitting the feature channels into two heads: one is mixed within small local blocks, the other across a coarse global grid. Because each head mixes along a single fixed-size axis, the cost scales linearly with image size.
- Cross-Gating Block (CGB): Serving as an alternative to cross-attention, CGB enables cross-feature conditioning, enhancing the information flow between feature maps without incurring high computational costs.
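As a rough illustration of the multi-axis idea described above, the NumPy sketch below splits the channels into a local head and a global head, partitions each into fixed-size token groups, and applies a gated spatial mixing step along one axis per head. It is not the authors' implementation: the function names, the `params` dictionary, the block and grid sizes, and the omission of the dense input and output projections are simplifying assumptions made for this example.

```python
import numpy as np

def block_partition(x, bh, bw):
    """(H, W, C) -> (num_blocks, bh*bw, C) using non-overlapping bh x bw windows."""
    h, w, c = x.shape
    x = x.reshape(h // bh, bh, w // bw, bw, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, bh * bw, c)

def block_unpartition(t, h, w, bh, bw):
    """Inverse of block_partition: (num_blocks, bh*bw, C) -> (H, W, C)."""
    c = t.shape[-1]
    t = t.reshape(h // bh, w // bw, bh, bw, c).transpose(0, 2, 1, 3, 4)
    return t.reshape(h, w, c)

def spatial_gate(tokens, weight, bias):
    """Gated mixing along the token axis. tokens: (groups, n, C), C even."""
    u, v = np.split(tokens, 2, axis=-1)
    # Normalize the gating half over channels (no learnable scale in this sketch).
    v = (v - v.mean(-1, keepdims=True)) / np.sqrt(v.var(-1, keepdims=True) + 1e-6)
    # Mix along the token axis with a fixed-size (n, n) matrix.
    v = np.einsum("mn,gnc->gmc", weight, v) + bias[None, :, None]
    return u * (v + 1.0)

def multi_axis_gate(x, params, block=4, grid=4):
    """x: (H, W, C) with H, W divisible by block and grid, C divisible by 4."""
    h, w, _ = x.shape
    local_head, global_head = np.split(x, 2, axis=-1)

    # Local head: tokens inside each block x block window are mixed together.
    t = block_partition(local_head, block, block)
    t = spatial_gate(t, params["w_local"], params["b_local"])
    local_out = block_unpartition(t, h, w, block, block)

    # Global head: tokens at the same offset across a grid x grid layout are mixed.
    fh, fw = h // grid, w // grid
    t = block_partition(global_head, fh, fw).transpose(1, 0, 2)
    t = spatial_gate(t, params["w_global"], params["b_global"]).transpose(1, 0, 2)
    global_out = block_unpartition(t, h, w, fh, fw)

    # Gating halves the channels of each head, so the result has C/2 channels.
    return np.concatenate([local_out, global_out], axis=-1)

rng = np.random.default_rng(0)
params = {
    "w_local": rng.normal(scale=0.02, size=(16, 16)), "b_local": np.zeros(16),
    "w_global": rng.normal(scale=0.02, size=(16, 16)), "b_global": np.zeros(16),
}
# The same fixed-size weights are reused for inputs of different resolution.
for h, w in [(16, 16), (32, 48)]:
    print(multi_axis_gate(rng.normal(size=(h, w, 8)), params).shape)
```

The mixing matrices depend only on the block and grid sizes, not on the image size, which is what lets the same parameters run on any resolution divisible by those sizes and keeps the cost linear in the number of pixels.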
These innovations allow MAXIM to achieve state-of-the-art results on more than ten benchmarks spanning multiple image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement. The model achieves these results while using fewer or comparable parameters and FLOPs relative to existing top-performing models.
Experimental Insights
MAXIM demonstrates superior performance, improving both PSNR and SSIM in denoising, with a PSNR gain of 0.24 dB on SIDD over models like VDN and MIRNet. In deblurring, MAXIM surpasses HINet with a PSNR of 32.86 dB on GoPro, and it generalizes well to the HIDE and RealBlur datasets. Similarly, in deraining and dehazing, the architecture outperforms its contemporaries by significant margins.
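To put the decibel figures in context, the sketch below computes PSNR from the mean squared error using the standard definition, PSNR = 10 log10(MAX^2 / MSE), and converts a PSNR gain into the implied relative reduction in MSE. The function names and the assumption that images are scaled to the range [0, 1] are choices made for this example, not details taken from the paper.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_reduction_for_gain(gain_db):
    """Fraction by which MSE drops when PSNR rises by gain_db."""
    return 1.0 - 10.0 ** (-gain_db / 10.0)

print(psnr(np.zeros((8, 8)), np.full((8, 8), 0.1)))
print(mse_reduction_for_gain(0.24))
```

Under this definition, a 0.24 dB gain corresponds to roughly 5% lower mean squared error, which is why fractions of a decibel are treated as meaningful on saturated benchmarks.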
A multi-stage, multi-scale training framework underpins these achievements. By leveraging deep supervision through multi-scale inputs and outputs, along with attention modules like supervised attention and cross-gating for feature refinement, MAXIM efficiently learns from diverse visual cues.
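Deep supervision over stages and scales can be sketched as a sum of per-output losses, each compared against the target at the matching resolution. The example below is a minimal illustration under stated assumptions: the Charbonnier penalty, the 2x average-pool downsampling, the nested-list layout of `outputs`, and all function names are chosen for this sketch and are not confirmed details of the paper's training setup.

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Smooth L1-like penalty: mean of sqrt(diff^2 + eps^2)."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))

def downsample2x(img):
    """Halve the resolution of an (H, W, C) image by 2x2 average pooling."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def target_pyramid(target, num_scales):
    """Targets at scale 0 (full resolution), 1 (half), 2 (quarter), ..."""
    pyramid = [target]
    for _ in range(num_scales - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

def deep_supervision_loss(outputs, target):
    """outputs[s][n] is the prediction of stage s at scale n."""
    pyramid = target_pyramid(target, num_scales=len(outputs[0]))
    return sum(
        charbonnier(pred, pyramid[n])
        for stage in outputs
        for n, pred in enumerate(stage)
    )

rng = np.random.default_rng(0)
target = rng.uniform(size=(16, 16, 3))
pyramid = target_pyramid(target, num_scales=3)
# Two hypothetical stages, each emitting a noisy prediction at three scales.
outputs = [[t + 0.05 * rng.normal(size=t.shape) for t in pyramid] for _ in range(2)]
print(deep_supervision_loss(outputs, target))
```

Supervising every stage at every scale gives intermediate features a direct training signal, rather than relying only on the gradient from the final full-resolution output.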
Broader Implications and Future Directions
The MAXIM architecture shows that MLP-based designs can be competitive for low-level vision tasks. The approach combines local and global feature interactions at manageable computational cost, laying a foundation for exploring alternative mixing operators such as FFT-based or spatial MLP layers. Applying MLPs to build fully-convolutional, scalable vision models is a promising route for dense image processing across resolutions.
Future research could extend MAXIM's multi-axis method to other 1D operators and explore efficient solutions for ultra-high-resolution tasks. Such extensions would build on the paper's core contributions to image processing and computer vision.