Recursive Generalization Transformer for Image Super-Resolution (2303.06373v4)

Published 11 Mar 2023 in cs.CV

Abstract: Transformer architectures have exhibited remarkable performance in image super-resolution (SR). Due to the quadratic computational complexity of the self-attention (SA) in Transformer, existing methods tend to adopt SA in a local region to reduce overheads. However, the local design restricts the global context exploitation, which is crucial for accurate image reconstruction. In this work, we propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images. Specifically, we propose the recursive-generalization self-attention (RG-SA). It recursively aggregates input features into representative feature maps, and then utilizes cross-attention to extract global information. Meanwhile, the channel dimensions of attention matrices (query, key, and value) are further scaled to mitigate the redundancy in the channel domain. Furthermore, we combine the RG-SA with local self-attention to enhance the exploitation of the global context, and propose the hybrid adaptive integration (HAI) for module integration. The HAI allows the direct and effective fusion between features at different levels (local or global). Extensive experiments demonstrate that our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively. Code and pre-trained models are available at https://github.com/zhengchen1999/RGT.

Authors (5)
  1. Zheng Chen
  2. Yulun Zhang
  3. Jinjin Gu
  4. Linghe Kong
  5. Xiaokang Yang
Citations (21)

Summary

Recursive Generalization Transformer for Image Super-Resolution: An Expert Review

The paper "Recursive Generalization Transformer for Image Super-Resolution" introduces an innovative approach to the problem of image super-resolution (SR), leveraging transformer architectures to enhance performance. Traditional convolutional neural networks (CNNs), while dominant in previous SR tasks, often struggle with global context awareness due to their local processing nature. This limitation is particularly pronounced in complex high-resolution scenarios, which demand a comprehensive understanding of global spatial information to achieve accurate reconstruction. The work presented in this paper aims to address these challenges through the Recursive Generalization Transformer (RGT), a novel model that couples the strengths of transformers with specific architectural innovations to efficiently manage global spatial dependency in high-resolution images.

Key Contributions of the Paper

  1. Recursive-Generalization Self-Attention (RG-SA): Central to the RGT is RG-SA, which captures global image context while keeping computational complexity linear in the number of pixels. This matters because the quadratic complexity of vanilla self-attention limits its scalability in high-resolution visual tasks. RG-SA employs a recursive generalization module (RGM) to aggregate input features into compact representative feature maps, significantly reducing spatial redundancy, and then applies cross-attention between the original image features and these compressed representations; the channel dimensions of the attention matrices are additionally scaled to mitigate redundancy in the channel domain (see the sketch after this list).
  2. Hybrid Adaptive Integration (HAI): To integrate global and local features, the RGT employs HAI. This mechanism fuses features produced by RG-SA blocks and local self-attention blocks, which alternate within the network. HAI uses learnable adaptors to rescale and align feature maps across these blocks, so the fusion preserves fine local detail while incorporating broad global context (see the second sketch after this list).
  3. Experimental Validation and Results: The authors conducted extensive evaluations against state-of-the-art methods, demonstrating consistent improvements in reconstruction quality, as measured by PSNR and SSIM, across multiple benchmark datasets. The results show superior performance in both quantitative and qualitative comparisons, indicating that RGT models global dependencies both accurately and efficiently.
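To ground item 1, here is a minimal PyTorch sketch of the RG-SA idea, written from the paper's description rather than its released code: the recursive aggregation is approximated with stride-2 depthwise convolutions, and the `c_ratio` argument stands in for the paper's channel-scaling factor. Operator choices, depths, and hyperparameters here are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class RGSA(nn.Module):
    """Minimal sketch of recursive-generalization self-attention (RG-SA).

    Assumptions (not from the official code): recursive aggregation is
    modeled as repeated stride-2 depthwise convolutions, and `c_ratio`
    scales the channel dimension of the attention matrices to reduce
    channel redundancy. `dim * c_ratio` must be divisible by `num_heads`.
    """

    def __init__(self, dim, num_heads=4, c_ratio=0.5, num_recursions=2):
        super().__init__()
        self.num_heads = num_heads
        self.c = int(dim * c_ratio)        # scaled channel dim for attention
        self.head_dim = self.c // num_heads
        self.scale = self.head_dim ** -0.5

        # Recursive aggregation: each step halves the spatial resolution,
        # distilling the input into a small set of representative tokens.
        self.aggregate = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, stride=2, padding=1, groups=dim)
            for _ in range(num_recursions)
        )
        self.q = nn.Linear(dim, self.c)
        self.kv = nn.Linear(dim, 2 * self.c)
        self.proj = nn.Linear(self.c, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence with N = h * w
        b, n, dim = x.shape
        q = self.q(x)                      # queries from full-resolution tokens

        # Build the representative feature map by recursive downsampling.
        rep = x.transpose(1, 2).reshape(b, dim, h, w)
        for conv in self.aggregate:
            rep = conv(rep)
        rep = rep.flatten(2).transpose(1, 2)            # (B, M, C), M << N
        k, v = self.kv(rep).chunk(2, dim=-1)

        # Cross-attention: N queries attend to only M aggregated tokens,
        # so the cost is O(N * M) instead of O(N^2).
        def heads(t):
            return t.reshape(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = heads(q), heads(k), heads(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, self.c)
        return self.proj(out)              # project back to the input channel dim
```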

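Item 2 can be sketched even more compactly: HAI amounts to a learnable-adaptor residual connection wrapped around each block, so input features at one level (local or global) can be fused directly into the output of the next. The per-channel adaptor and its zero initialization below are assumptions rather than the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class HAI(nn.Module):
    """Sketch of hybrid adaptive integration (HAI): fuse a block's output
    with its input through a learnable adaptor, letting locally- and
    globally-attended features mix without an extra projection."""

    def __init__(self, block, dim):
        super().__init__()
        self.block = block
        # Learnable adaptor; zero init makes the extra path start neutral.
        self.adaptor = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x, *args):
        return self.block(x, *args) + self.adaptor * x

# Hypothetical usage, wrapping the RG-SA sketch above:
layer = HAI(RGSA(dim=64), dim=64)
x = torch.randn(2, 32 * 32, 64)    # batch of two 32x32 feature maps, 64 channels
y = layer(x, 32, 32)               # -> (2, 1024, 64)
```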
Implications and Theoretical Significance

The introduction of RGT reflects a significant step forward in exploiting the transformer architecture to bridge existing gaps in image super-resolution. The recursive aggregation approach gives the network direct access to global spatial context, allowing more accurate and efficient reconstruction than traditional CNN architectures permit. Furthermore, the paper's insights suggest broader applicability of recursive attention mechanisms to other computer vision tasks where high-resolution processing and global information integration are critical.
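To make the efficiency claim concrete, a back-of-the-envelope comparison helps. Using assumed notation (an H x W feature map with C channels, reduced by aggregation to M representative tokens):

```latex
% Vanilla self-attention: all HW tokens attend to all HW tokens.
\Omega(\mathrm{SA}) = \mathcal{O}\!\left((HW)^2 \, C\right)

% RG-SA: HW queries cross-attend to only M aggregated tokens, M \ll HW.
% If the recursion drives M to a size independent of the input resolution
% (consistent with the paper's linear-complexity claim), then
\Omega(\mathrm{RG\text{-}SA}) = \mathcal{O}\!\left(HW \cdot M \cdot C\right) = \mathcal{O}(HW)
```

The attention cost therefore grows linearly with image area rather than quadratically, which is what makes global attention viable at the working resolutions of super-resolution networks.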

Additionally, the recursive feature aggregation and cross-attention mechanisms proposed in RG-SA could inform future developments in transformer designs, encouraging further exploration into recursive modeling strategies and their impact on computational efficiency and model performance.

Future Prospects and Developments

The findings and methods proposed in this paper lay a promising foundation for continued development in high-resolution image processing using transformers. Future research could explore the adaptability of such recursive and hybrid strategies in other areas of computer vision and beyond, potentially expanding into video SR, medical imaging, and other domains that benefit from enhanced resolution and detailed image reconstruction. Moreover, further investigation into scalable architectures and reduced computational overhead could yield even more efficient models capable of tackling large-scale datasets and applications.

In conclusion, the RGT provides a strategic advancement in the field of image super-resolution, effectively harnessing transformer capabilities to overcome existing limitations in global spatial context modeling. Its novel approach presents valuable insights and opens up new avenues for exploration in both practical applications and theoretical improvements in model architecture design.