Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution (2404.11273v1)
Abstract: Transformer-based models have achieved remarkable results in low-level vision tasks, including image super-resolution (SR). However, early Transformer-based approaches that rely on self-attention within non-overlapping windows encounter challenges in acquiring global information. To activate more input pixels globally, hybrid attention models have been proposed. Moreover, training by solely minimizing pixel-wise RGB losses, such as L1, has been found inadequate for capturing essential high-frequency details. This paper presents two contributions: i) We introduce convolutional non-local sparse attention (NLSA) blocks to extend the hybrid Transformer architecture and further enlarge its receptive field. ii) We employ wavelet losses to train Transformer models, improving both quantitative and subjective performance. While wavelet losses have been explored previously, demonstrating their effectiveness for training Transformer-based SR models is novel. Our experimental results show that the proposed model achieves state-of-the-art PSNR results as well as superior visual performance across various benchmark datasets.
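The wavelet-loss idea can be illustrated with a short sketch: decompose both the super-resolved output and the ground-truth image with a single-level Haar wavelet transform, then penalize the L1 difference per subband so that high-frequency detail contributes explicitly to the training objective. This is a minimal illustration under stated assumptions, not the paper's exact formulation; the convolution-based Haar transform, the `wavelet_l1_loss` helper, and its per-subband weights are hypothetical choices made for clarity.

```python
import torch
import torch.nn.functional as F


def haar_dwt(x: torch.Tensor):
    """Single-level 2D Haar DWT via strided convolutions (differentiable).

    x: (B, C, H, W) with even H and W.
    Returns the four subbands (LL, LH, HL, HH), each (B, C, H/2, W/2).
    """
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[-0.5, -0.5], [0.5, 0.5]])
    hl = torch.tensor([[-0.5, 0.5], [-0.5, 0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)  # (4, 1, 2, 2)
    kernels = kernels.to(dtype=x.dtype, device=x.device)

    b, c, h, w = x.shape
    # Apply the same four filters to every channel independently.
    x = x.reshape(b * c, 1, h, w)
    out = F.conv2d(x, kernels, stride=2)                  # (B*C, 4, H/2, W/2)
    out = out.reshape(b, c, 4, h // 2, w // 2)
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]


def wavelet_l1_loss(sr: torch.Tensor, hr: torch.Tensor,
                    weights=(0.1, 0.05, 0.05, 0.05)) -> torch.Tensor:
    """L1 loss accumulated over wavelet subbands; weights are hypothetical."""
    sr_bands = haar_dwt(sr)
    hr_bands = haar_dwt(hr)
    return sum(w * F.l1_loss(s, h)
               for w, s, h in zip(weights, sr_bands, hr_bands))


# Usage sketch: combine with the usual pixel-wise L1 during training.
# loss = F.l1_loss(sr, hr) + wavelet_l1_loss(sr, hr)
```

In practice the subband weights trade off fidelity in the low-frequency approximation (LL) against emphasis on the detail bands (LH, HL, HH); the values above are placeholders, not the settings used in the paper.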
Authors: Cansu Korkmaz, A. Murat Tekalp