- The paper introduces LFT, a novel Transformer-based model for light field image super-resolution that effectively integrates angular and spatial information.
- LFT employs specialized Angular and Spatial Transformers to leverage the 4D structure of light field data, capturing both multi-view relationships and long-range spatial dependencies.
- Experimental results show LFT achieves superior super-resolution performance compared to state-of-the-art CNN methods on challenging datasets, demonstrating the potential of Transformers for LF image tasks.
The paper "Light Field Image Super-Resolution with Transformers" addresses the challenge of enhancing the resolution of Light Field (LF) images with a Transformer-based approach, a significant departure from the convolutional neural network (CNN) methods that dominate this domain. The proposed model, named LFT, exploits the information-rich 4D structure of LF images by employing both an angular and a spatial Transformer, integrating complementary information across views together with long-range dependencies within sub-aperture images.
Methodological Framework
The authors introduce a pioneering approach by designing two specialized Transformers integrated within the network:
- Angular Transformer: This component explores and consolidates information across different views, leveraging the angular diversity intrinsic to LF data. Through self-attention, it models the correlations among views, which in turn benefits the super-resolution task.
- Spatial Transformer: This module captures both local features and long-range contextual information within each sub-aperture image, overcoming the limited local receptive fields inherent to convolutional operations.
Together, these Transformers exploit the full 4D structure of LF data, improving on CNN-based models that often under-exploit either the spatial or the angular dimension and therefore fall short of optimal performance.
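The alternating angular/spatial design described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the tensor layout, module names, and the use of standard multi-head attention are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class LFAttentionBlock(nn.Module):
    """Illustrative sketch: alternate self-attention over the angular and
    spatial axes of a light field tensor shaped [B, A, H, W, C], where
    A = U * V is the number of sub-aperture views (hypothetical layout)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.ang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spa_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, a, h, w, c = x.shape
        # Angular attention: each spatial position attends across the A views.
        t = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, a, c)
        t = t + self.ang_attn(t, t, t)[0]
        x = t.reshape(b, h, w, a, c).permute(0, 3, 1, 2, 4)
        # Spatial attention: each view attends across its H*W positions,
        # giving a receptive field spanning the whole sub-aperture image.
        s = x.reshape(b * a, h * w, c)
        s = s + self.spa_attn(s, s, s)[0]
        return s.reshape(b, a, h, w, c)
```

For example, a 5x5 angular grid of 32x32 feature maps would enter as a tensor of shape `[B, 25, 32, 32, C]` and leave with the same shape, so several such blocks can be stacked before upsampling.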
Experimental Evaluation
The empirical studies validate LFT's superior performance over ten state-of-the-art SR methods, ranging from conventional single-image SR models to advanced LF image SR frameworks. Notably, LFT achieves strong results on challenging datasets such as STFgantry, which contains complex scenes with large disparity variations and occlusions. The reported results also show that LFT attains this quality at a reduced computational cost, with fewer parameters and lower FLOPs than competing methods.
A numerical comparison revealed substantial improvements in PSNR and SSIM across datasets, illustrating the effective integration of angular and spatial information via the proposed Transformer architecture. These advances point towards a promising direction not just for LF image super-resolution but potentially for other computer vision tasks where multiple perspectives and long-range information are critical.
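PSNR, one of the two reported metrics, has a simple closed form. The sketch below is a generic reference implementation for images normalized to [0, 1], not code from the paper:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference image and a
    reconstructed image, both with pixel values in [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because PSNR is logarithmic in MSE, each 1 dB of improvement corresponds to roughly a 21% reduction in mean squared error, which is why even fractional-dB gains over prior methods are meaningful.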
Conclusions and Implications
The introduction of Transformer-based architectures into the LF image super-resolution field, as demonstrated in this paper, exemplifies a noteworthy shift towards leveraging more dynamic and inter-connected data relationships. The method's ability to comprehensively model both angular and spatial dependencies provides a notable enhancement in SR performance, offering practical utility in numerous applications from refocusing to depth sensing and beyond.
Future research may refine Transformer models for LF images along axes such as parameter efficiency, architectural innovations, and real-time applicability. Extending such models to broader applications within AI and computer vision, wherever data multidimensionality is a characteristic feature, opens up further avenues for visual data processing.