Efficient Mixed Transformer for Single Image Super-Resolution (2305.11403v5)

Published 19 May 2023 in cs.CV

Abstract: Recently, Transformer-based methods have achieved impressive results in single image super-resolution (SISR). However, the lack of a locality mechanism and high complexity limit their application in the field of super-resolution (SR). To solve these problems, we propose a new method, the Efficient Mixed Transformer (EMT). Specifically, we propose the Mixed Transformer Block (MTB), consisting of multiple consecutive transformer layers, in some of which the Pixel Mixer (PM) replaces Self-Attention (SA). PM enhances local knowledge aggregation through pixel-shifting operations and introduces no additional complexity, as it has no parameters and no floating-point operations. Moreover, we employ a striped window for SA (SWSA) to achieve efficient global dependency modelling by exploiting image anisotropy. Experimental results show that EMT outperforms existing methods on benchmark datasets and achieves state-of-the-art performance. The code is available at https://github.com/Fried-Rice-Lab/FriedRiceLab.

Authors (4)
  1. Ling Zheng (8 papers)
  2. Jinchen Zhu (5 papers)
  3. Jinpeng Shi (2 papers)
  4. Shizhuang Weng (4 papers)
Citations (15)

Summary

Efficient Mixed Transformer for Single Image Super-Resolution: An Expert Analysis

The paper "Efficient Mixed Transformer for Single Image Super-Resolution" introduces a novel approach to address the challenges in Single Image Super-Resolution (SISR) using Transformer models. The increasing popularity of Transformers in Computer Vision tasks, including SISR, is primarily due to their excellent capability to model global dependencies through Self-Attention (SA). However, these models often grapple with the inefficiencies arising from the lack of a locality mechanism and high computational complexity, which limit their deployment on resource-constrained devices. The researchers present an Efficient Mixed Transformer (EMT) that leverages a Mixed Transformer Block (MTB) to mitigate these issues and integrate novel components like the Pixel Mixer (PM) and Striped Window Self-Attention (SWSA).

Methodological Contributions

The EMT architecture is systematically divided into three components: the Shallow Feature Extraction Unit (SFEU), the Deep Feature Extraction Unit (DFEU), and the Reconstruction Unit (RECU).
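
To make this decomposition concrete, here is a minimal PyTorch sketch of the three-stage pipeline. The class names, default dimensions, and the MixedTransformerBlockStub placeholder are illustrative assumptions, not the authors' implementation (which is available at the linked repository).

```python
import torch
import torch.nn as nn

class MixedTransformerBlockStub(nn.Module):
    """Stand-in for the MTB described in the list below: the real block
    alternates Global Transformer Layers (self-attention) with Local
    Transformer Layers (Pixel Mixer); plain convolutions stand in here."""
    def __init__(self, dim: int):
        super().__init__()
        self.global_layer = nn.Conv2d(dim, dim, 3, padding=1)  # placeholder for SWSA
        self.local_layer = nn.Conv2d(dim, dim, 1)              # placeholder for PM + MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.global_layer(x)  # residual global mixing
        x = x + self.local_layer(x)   # residual local mixing
        return x

class EMTSketch(nn.Module):
    """Three-stage SISR pipeline: SFEU -> DFEU -> RECU."""
    def __init__(self, dim: int = 60, n_blocks: int = 6, scale: int = 4):
        super().__init__()
        self.sfeu = nn.Conv2d(3, dim, 3, padding=1)  # shallow feature extraction
        self.dfeu = nn.Sequential(                   # deep feature extraction
            *[MixedTransformerBlockStub(dim) for _ in range(n_blocks)])
        self.recu = nn.Sequential(                   # reconstruction via sub-pixel conv
            nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        shallow = self.sfeu(lr)
        deep = self.dfeu(shallow) + shallow  # global residual, standard in SISR
        return self.recu(deep)
```

Feeding a (1, 3, 48, 48) low-resolution patch through EMTSketch() yields a (1, 3, 192, 192) output at the default 4x scale.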

  1. Mixed Transformer Block (MTB): This is the core innovation, which alternates between Global Transformer Layers (GTLs) and Local Transformer Layers (LTLs). GTLs retain the self-attention mechanism for modeling long-range dependencies, while LTLs use local perceptrons to foster locality. The Pixel Mixer introduced in LTLs further enhances local knowledge aggregation by employing pixel shifting for feature mixing across channels, operating without parameter overhead or additional FLOPs.
  2. Pixel Mixer (PM): PM addresses the Transformer's deficiency in encoding spatial locality. By segmenting channels and applying a sequence of systematic pixel shifts, PM extends the receptive field and captures localized spatial interactions within features. Because this adds no computational complexity, it is well suited to constrained environments (a minimal sketch follows this list).
  3. Striped Window Self-Attention (SWSA): To increase computational efficiency, SWSA computes self-attention within anisotropic striped windows that align with the repetitive, directional patterns common in image data. This exploits image anisotropy to model global dependencies efficiently (see the stripe-partition sketch after this list).
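
The following is a minimal sketch of a PM-style operation, assuming four channel groups cyclically shifted one pixel in the four cardinal directions with torch.roll; the paper's exact grouping and border handling may differ. It illustrates why PM is free: the operation is pure memory movement, with no learnable parameters and no floating-point arithmetic.

```python
import torch

def pixel_mixer(x: torch.Tensor, shift: int = 1) -> torch.Tensor:
    """Parameter-free local mixing: split channels into four groups and
    shift each group one pixel in a different cardinal direction.
    Cyclic shifts via torch.roll are an assumption; the paper's border
    handling may differ. x: (B, C, H, W) with C divisible by 4."""
    g = x.size(1) // 4
    out = x.clone()
    out[:, 0 * g:1 * g] = torch.roll(x[:, 0 * g:1 * g], shift,  dims=3)  # right
    out[:, 1 * g:2 * g] = torch.roll(x[:, 1 * g:2 * g], -shift, dims=3)  # left
    out[:, 2 * g:3 * g] = torch.roll(x[:, 2 * g:3 * g], shift,  dims=2)  # down
    out[:, 3 * g:4 * g] = torch.roll(x[:, 3 * g:4 * g], -shift, dims=2)  # up
    return out  # pure memory movement: zero parameters, zero FLOPs
```

After the shift, each spatial position sees pixels from its four neighbours in different channel groups, so subsequent point-wise operations mix local context at no extra cost.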

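What distinguishes SWSA is the window geometry rather than the attention computation itself. Below is a hedged sketch of striped-window partitioning (function name and shapes are assumptions); standard multi-head self-attention would then run within each window.

```python
import torch

def stripe_partition(x: torch.Tensor, stripe_h: int, stripe_w: int) -> torch.Tensor:
    """Partition a feature map into anisotropic striped windows for SA.
    x: (B, C, H, W) with H divisible by stripe_h and W by stripe_w.
    Returns (B * num_windows, stripe_h * stripe_w, C) token sequences."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // stripe_h, stripe_h, W // stripe_w, stripe_w)
    x = x.permute(0, 2, 4, 3, 5, 1)               # (B, nH, nW, sh, sw, C)
    return x.reshape(-1, stripe_h * stripe_w, C)  # one token sequence per stripe

# A tall stripe such as stripe_partition(x, 16, 4) favours vertical structures;
# transposing the stripe shape in alternating layers covers horizontal ones.
```
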
Experimental Results

The paper's claims are substantiated through rigorous experiments showing that EMT delivers superior performance on standard benchmark datasets, including Set5, Set14, BSD100, Urban100, and Manga109. EMT achieves state-of-the-art results in PSNR and SSIM while using relatively few network parameters compared to existing methods. Also noteworthy are the ablation studies on the number and type of transformer layers, which confirm that a mixed configuration improves performance while maintaining computational efficiency.
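
For reference, PSNR, the headline metric here, is a simple function of mean squared error. The sketch below uses the generic definition; SR benchmarks conventionally evaluate on the luminance (Y) channel with scale-dependent border cropping, which is omitted.

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return float(10 * torch.log10(max_val ** 2 / mse))
```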

Implications and Future Prospects

The proposed EMT architecture represents a significant stride in adapting Transformer models to SISR tasks under limited computational resources. The integration of PM, which enhances locality without added complexity, and the novel use of SWSA reflect a focused effort to overcome the limitations of existing transformer-based models in real-world applications. The findings hold promise for lightweight SISR solutions on mobile and embedded platforms, a crucial requirement for edge computing scenarios such as real-time video processing.

Looking forward, the conceptual framework and methodologies outlined in EMT could be extended to other low-level vision tasks that require a balance between local feature representation and global context modeling. Further optimizations in SA through more sophisticated windowing strategies or hybrid models incorporating CNN characteristics could pave the way for Transformers' broader adoption beyond high-resource settings.

In summary, the research advances the field of SISR by proposing pragmatic solutions to well-known transformer deficiencies, potentially catalyzing subsequent innovations in both methodological refinements and practical deployments.