
IPT-V2: Efficient Image Processing Transformer using Hierarchical Attentions (2404.00633v1)

Published 31 Mar 2024 in cs.CV

Abstract: Recent advances have demonstrated the powerful capability of the transformer architecture in image restoration. However, our analysis indicates that existing transformer-based methods cannot establish exact global and local dependencies simultaneously, both of which are critical for restoring the details and missing content of degraded images. To this end, we present an efficient image processing transformer architecture with hierarchical attentions, called IPT-V2, which adopts a focal context self-attention (FCSA) and a global grid self-attention (GGSA) to obtain adequate token interactions in local and global receptive fields. Specifically, FCSA applies the shifted window mechanism to channel self-attention, which helps capture the local context and mutual interaction across channels, while GGSA constructs long-range dependencies in the cross-window grid and aggregates global information in the spatial dimension. Moreover, we introduce a structural re-parameterization technique into the feed-forward network to further improve the model capability. Extensive experiments demonstrate that the proposed IPT-V2 achieves state-of-the-art results on various image processing tasks, covering denoising, deblurring, and deraining, and obtains a much better trade-off between performance and computational complexity than previous methods. In addition, we extend our method to image generation as a latent diffusion backbone, where it significantly outperforms DiTs.
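The abstract contrasts attention inside local windows with attention over a cross-window grid. The NumPy sketch below illustrates only that difference in how tokens are grouped. It is not the authors' implementation: the function names `window_partition`, `grid_partition`, and `self_attention` are hypothetical, the attention has no learned projections, and the shifted-window and channel-attention details of FCSA are omitted.

```python
import numpy as np

def window_partition(x, p):
    """Group an (H, W, C) map into non-overlapping p x p local windows.

    Returns (num_windows, p*p, C); each group holds spatially adjacent pixels.
    Assumes H and W are divisible by p.
    """
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, C)

def grid_partition(x, g):
    """Group an (H, W, C) map into g x g sets of strided samples.

    Returns (num_groups, g*g, C); pixels in a group are spaced H//g rows
    and W//g columns apart, so each group spans the whole image.
    Assumes H and W are divisible by g.
    """
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

def self_attention(tokens):
    """Plain softmax self-attention over the tokens of each group.

    tokens has shape (groups, N, C). No learned query/key/value
    projections are used; this is for illustration only.
    """
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

# A 4 x 4 single-channel map whose values are the flattened pixel indices.
x = np.arange(16, dtype=np.float64).reshape(4, 4, 1)

local_groups = window_partition(x, p=2)
global_groups = grid_partition(x, g=2)

print(local_groups[0].ravel())   # [0. 1. 4. 5.]   neighbouring pixels
print(global_groups[0].ravel())  # [ 0.  2.  8. 10.] pixels spread over the image

local_out = self_attention(local_groups)    # shape (4, 4, 1)
global_out = self_attention(global_groups)  # shape (4, 4, 1)
```

Both partitions yield groups of the same size, so the attention cost per group is identical. Only the receptive field changes: adjacent pixels in the window case, image-wide strided samples in the grid case.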

Authors (7)
  1. Zhijun Tu (32 papers)
  2. Kunpeng Du (2 papers)
  3. Hanting Chen (52 papers)
  4. Hailing Wang (13 papers)
  5. Wei Li (1121 papers)
  6. Jie Hu (187 papers)
  7. Yunhe Wang (145 papers)