Vision-LSTM: xLSTM as Generic Vision Backbone (2406.04303v3)

Published 6 Jun 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Transformers are widely used as generic backbones in computer vision, despite initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaption of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as new generic backbone for computer vision architectures.

Citations (17)

Summary

  • The paper introduces Vision-LSTM (ViL), an architecture leveraging xLSTM with an alternating processing mechanism to act as a generic vision backbone for computer vision tasks.
  • Empirical studies show Vision-LSTM achieves strong performance on ImageNet-1K classification (e.g., ViL-Tiny at 78.3%), ADE20K semantic segmentation (e.g., ViL-B at 48.6% mIoU), and VTAB-1K transfer learning.
  • Vision-LSTM offers potential advantages for high-resolution imaging tasks due to its reduced computational complexity compared to transformer models, suggesting a promising direction for future research and hardware optimization.

Vision-LSTM: xLSTM as a Generic Vision Backbone

The paper "Vision-LSTM: xLSTM as Generic Vision Backbone" presents the Vision Long Short-Term Memory (ViL) architecture, leveraging the advances from the Extended Long Short-Term Memory (xLSTM) model initially developed for language processing. The paper explores the adaptation of xLSTM to computer vision tasks, presenting a novel approach by integrating it as a backbone for vision models, contrasting and comparing it with existing architectures such as Vision Transformer (ViT), State Space Models (SSMs), and other isotropic architectures.

Overview of Vision-LSTM Architecture

Vision-LSTM (ViL) employs a modular design built from xLSTM blocks arranged in an alternating pattern to handle the non-sequential nature of visual data. Input images are divided into patch tokens, as in ViT, and processed by a stack of xLSTM blocks that alternate the direction of traversal: odd blocks process tokens row-wise from the top left to the bottom right, while even blocks reverse this order. This design exploits xLSTM's scalable matrix memory and exponential gating to process the token sequence efficiently, while avoiding the quadratic cost of the self-attention used in ViT. A minimal sketch of this alternating traversal follows below.
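
The sketch below illustrates the alternating-direction idea in PyTorch: an image is patchified into a row-major token sequence, and every second block sees that sequence reversed. This is a hypothetical, minimal sketch rather than the authors' implementation; the xLSTM block internals are replaced by a plain LSTM placeholder, and the positional embedding, average pooling, and classification head are assumptions for illustration only.

```python
# Minimal, hypothetical sketch of ViL's alternating-direction traversal.
# The xLSTM/mLSTM block internals are NOT implemented here; a plain LSTM
# acts as a placeholder so the flipping logic stays in focus.
import torch
import torch.nn as nn


class PlaceholderXLSTMBlock(nn.Module):
    """Stand-in for an xLSTM block (real block uses matrix memory + exponential gating)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)  # placeholder recurrence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.norm(x))
        return x + h  # residual connection


class ViLSketch(nn.Module):
    """Patchify, then run blocks that alternate the token-traversal direction."""

    def __init__(self, dim=192, depth=12, patch=16, img=224, in_ch=3, classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        num_patches = (img // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))  # assumed positional embedding
        self.blocks = nn.ModuleList([PlaceholderXLSTMBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, dim, H/p, W/p) -> (B, N, dim), tokens in row-major (top-left to bottom-right) order
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        x = x + self.pos
        for i, block in enumerate(self.blocks):
            if i % 2 == 1:  # "even" block in the paper's 1-based counting
                x = torch.flip(x, dims=[1])   # reverse: bottom-right to top-left
                x = block(x)
                x = torch.flip(x, dims=[1])   # restore order for the next block
            else:
                x = block(x)                  # forward: top-left to bottom-right
        return self.head(x.mean(dim=1))       # mean pooling is an assumption, not the paper's head


if __name__ == "__main__":
    logits = ViLSketch()(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1000])
```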

Experimental Findings and Results

The empirical studies use ImageNet-1K, ADE20K, and VTAB-1K to assess ViL against competitive baselines. ViL shows clear performance gains:

  • ImageNet-1K Classification: ViL outperforms ViT training protocols and other isotropic architectures in several scale settings; ViL-Tiny reaches 78.3% accuracy, compared to 76.2% for the heavily optimized DeiT-III-T.
  • ADE20K Semantic Segmentation: ViL-B achieves a competitive 48.6 mIoU in multi-scale evaluation, surpassing models such as DeiT-III-B despite ViL-B's lower ImageNet classification accuracy.
  • VTAB-1K Transfer Learning: ViL shows strong generalization, performing especially well on the structured task category, which points to robust feature learning.

However, at base scale on ImageNet-1K classification, ViL does not outperform all transformer models, leaving room for further hyperparameter tuning and improved optimization techniques.

Implications and Future Research

ViL's introduction marks a notable development in vision architectures, showing particular promise for tasks that rely on high-resolution inputs, such as semantic segmentation and medical imaging. This advantage stems from its lower computational complexity and the more favorable memory usage enabled by chunked processing modes, making large-scale, high-resolution image processing more feasible than with transformer-based models.
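
To make the scaling argument concrete, here is a rough back-of-the-envelope comparison. It is an illustration, not a calculation from the paper, and it assumes self-attention costs on the order of N^2 d for N tokens of dimension d, while a recurrent mLSTM pass is linear in N with a per-token cost dominated by updating a d x d matrix memory.

```latex
% Illustrative scaling comparison (assumptions stated in the text above).
% N = number of patch tokens, d = embedding dimension.
\[
\text{self-attention: } \mathcal{O}(N^{2} d)
\qquad \text{vs.} \qquad
\text{recurrent mLSTM pass: } \mathcal{O}(N d^{2})
\]
% Worked example with 16x16 patches:
%   224 x 224 image   -> N = (224/16)^2  = 196
%   1024 x 1024 image -> N = (1024/16)^2 = 4096
% Going from 224^2 to 1024^2, the quadratic-in-N term grows by
% (4096/196)^2 ~ 437x, while the linear-in-N term grows by only ~21x.
```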

The authors suggest future directions such as hierarchical variants of the architecture, continued hyperparameter optimization, and improved pre-training methodologies (e.g., self-supervised learning) that could further increase ViL's efficiency. Moreover, as hardware support matures, especially dedicated parallel kernels for matrix-memory models, ViL's runtime could improve substantially, further strengthening its practicality in competitive settings.

Conclusion

This work contributes meaningfully to vision model architectures by integrating advances from language modeling, pointing to cross-disciplinary potential within AI research. As it stands, ViL is a promising candidate for a general-purpose computer vision backbone, aligning with the ongoing trend of adopting language-modeling innovations for visual data processing. Further exploration and methodical enhancements will consolidate its standing and could spur a shift toward more diverse backbone choices in vision tasks.
