Compressed Video Super-Resolution Using Spatiotemporal Frequency-Transformer
The paper "Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution" introduces a novel method for enhancing the quality of compressed video using a transformer-based architecture. The method centers on a Frequency-Transformer (FTVSR) that applies self-attention across a joint space-time-frequency domain to restore high-resolution frames from degraded, low-resolution (LR) video sequences.
Video super-resolution (VSR) on compressed data poses substantial challenges because of the quality loss introduced by compression algorithms. Traditional VSR methods often focus on transferring texture from adjacent frames and largely neglect compression-induced degradation. This paper addresses these challenges with a distinct frequency-based framework for video restoration.
Key Contributions
- Frequency-based Representation: Unlike traditional methods that operate in the spatial domain, FTVSR utilizes the Discrete Cosine Transform (DCT) to convert video frames into frequency-based patch representations. This transformation enables the model to differentiate authentic visual textures from compression artifacts effectively.
- Joint Space-Time-Frequency Attention: The proposed architecture investigates various self-attention strategies. Among them, "divided attention"—where space-frequency attention is executed before temporal attention—demonstrates superior performance. This structure facilitates improved visual quality by effectively integrating spatial and temporal information across frequency bands.
- State-of-the-art Performance: In experimental evaluations, FTVSR outperforms current methods on both compressed and uncompressed video datasets. On the REDS and Vid4 benchmarks, FTVSR shows remarkable PSNR gains, improving on prior leading methods by approximately 1.6 dB under compressed conditions.
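To make the frequency-based representation concrete, here is a minimal sketch of converting a frame into per-patch DCT coefficients, the kind of transform FTVSR applies before attention. The patch size of 8 and the orthonormal DCT-II formulation are illustrative choices, not taken from the paper, and the helper names are hypothetical:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def frame_to_dct_patches(frame, patch=8):
    """Split a grayscale frame into non-overlapping patches and apply a
    2D DCT to each, yielding per-patch frequency coefficients."""
    h, w = frame.shape
    d = dct_matrix(patch)
    # (h/p, p, w/p, p) -> (h/p, w/p, p, p)
    patches = frame.reshape(h // patch, patch, w // patch, patch).transpose(0, 2, 1, 3)
    # 2D DCT of each patch: D @ X @ D.T
    return np.einsum('ij,pqjk,lk->pqil', d, patches, d)
```

Low-frequency coefficients (top-left of each patch) capture smooth content, while block-compression artifacts concentrate in characteristic frequency bands, which is what lets the model separate textures from artifacts.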
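The "divided attention" ordering described above can be sketched as two stacked attention passes: one over the space-frequency tokens within each frame, then one over time for each token position. This is a simplified single-head numpy illustration of the attention factorization, not the paper's actual implementation (which includes projections, multiple heads, and normalization layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over the last two axes."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def divided_attention(tokens):
    """tokens: (T, S, C) — T frames, S space-frequency tokens, C channels.
    Stage 1: attend over the S axis within each frame (space-frequency).
    Stage 2: attend over the T axis for each token position (temporal)."""
    x = attention(tokens, tokens, tokens)   # (T, S, C), per-frame attention
    xt = np.swapaxes(x, 0, 1)               # (S, T, C)
    xt = attention(xt, xt, xt)              # temporal attention per position
    return np.swapaxes(xt, 0, 1)            # back to (T, S, C)
```

Factorizing attention this way is much cheaper than full joint attention over all T x S tokens, which is one reason the divided ordering is practical for video.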
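For reference, the PSNR metric behind the reported gains is computed as follows (a standard definition, assuming 8-bit pixel values with peak 255):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between reference and test images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10 * np.log10(peak ** 2 / mse)
```

Because PSNR is logarithmic in mean squared error, a +1.6 dB gain corresponds to roughly a 31% reduction in MSE (a factor of 10^0.16 ≈ 1.45).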
Theoretical and Practical Implications
The theoretical foundation of this approach underscores the potential of frequency domain operations for enhanced signal recovery in vision tasks. By treating each frequency 'fairly', the proposed method minimizes the risk of magnifying compression artifacts, effectively preserving high-frequency details. Practically, this improvement could translate into more accurate and visually appealing consumer video content, particularly at lower bitrates.
Future Directions
Building on the insights obtained from developing frequency-based VSR, potential avenues for extending this work could involve:
- Exploration of alternative frequency-space representations that could yield even finer distinctions between artifacts and textures.
- Integration with more advanced neural architectures or training paradigms, such as self-supervised learning, which might reduce dependency on large-scale labeled video datasets.
- Application to real-time video processing systems, which would require optimizations to maintain efficacy with reduced computational overhead.
In conclusion, the Spatiotemporal Frequency-Transformer sets a new benchmark for handling compressed video, offering a robust solution for improving the visual quality of degraded sequences. Its use of frequency-domain features and attention mechanisms could spur a new wave of video super-resolution research and have a significant impact on media streaming technologies.