Learning Trajectory-Aware Transformer for Video Super-Resolution
The paper "Learning Trajectory-Aware Transformer for Video Super-Resolution" by Chengxu Liu et al. presents a novel approach to enhance the quality of low-resolution video frames by leveraging long-range temporal dependencies. The authors introduce the Trajectory-aware Transformer for Video Super-Resolution (TTVSR), addressing key limitations in existing methods which are primarily constrained to analyzing limited adjacent frames.
Overview
Video Super-Resolution (VSR) is a significant task in computer vision with practical applications in areas such as video surveillance, high-definition television, and satellite imagery. The challenge of VSR lies in exploiting temporal dependencies effectively across entire video sequences rather than relying on narrow temporal windows. Traditional methods typically use a small sliding window of frames (e.g., 5 or 7), which keeps computation manageable but cannot capture long-range dependencies, leading to suboptimal results.
Proposed Approach
TTVSR adapts the Transformer, an architecture originally developed for natural language processing, to video super-resolution. Transformers excel at modeling long-range dependencies through their self-attention mechanism. The paper's core contribution is a Trajectory-aware Transformer designed to learn effectively from extended temporal sequences.
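To make the mechanism concrete, here is a minimal sketch of scaled dot-product self-attention over a flat set of tokens; the function name and toy shapes are illustrative and not taken from the paper:

```python
import torch
import torch.nn.functional as F

def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a set of tokens.

    tokens: (N, C) feature vectors, e.g. one token per spatio-temporal location.
    w_q, w_k, w_v: (C, C) projection matrices.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.t() / (k.shape[-1] ** 0.5)   # (N, N): every token attends to every other
    weights = F.softmax(scores, dim=-1)
    return weights @ v                          # aggregated long-range context

# Toy usage: 64 tokens of dimension 32
tokens = torch.randn(64, 32)
w = [torch.randn(32, 32) for _ in range(3)]
out = self_attention(tokens, *w)                # (64, 32)
```

The quadratic (N, N) score matrix is exactly what makes naive attention over full videos expensive, which motivates the trajectory-based restriction described next.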
Key features of the TTVSR include:
- Trajectory Formation: Video frames are formulated into pre-aligned trajectories of continuous visual tokens. This representation links relevant tokens along the temporal dimension efficiently.
- Self-Attention on Trajectories: Self-attention is computed only along these spatio-temporal trajectories, rather than across all spatio-temporal locations as in a standard Transformer, which sharply reduces computational cost (see the first sketch after this list).
- Cross-Scale Feature Tokenization: This module addresses the scale variations that arise over long-range sequences by tokenizing features at multiple scales, improving the model's ability to exploit detailed texture information (see the second sketch after this list).
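As a rough illustration of the attention-along-trajectories idea, the sketch below restricts each query token of the current frame to attend only to the tokens lying on its trajectory through the previous frames, so the per-query cost scales with the number of frames T rather than with T x H x W. The function name, tensor layouts, and the assumption of precomputed integer trajectory coordinates are illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def trajectory_attention(feats, traj, query_feat):
    """Attend along precomputed per-pixel trajectories (illustrative sketch).

    feats:      (T, C, H, W) features of T past frames.
    traj:       (H*W, T, 2) integer (y, x) coordinates: where each query pixel
                of the current frame is located in every past frame.
    query_feat: (C, H, W) features of the frame being super-resolved.
    """
    T, C, H, W = feats.shape
    flat = feats.permute(0, 2, 3, 1).reshape(T, H * W, C)          # (T, H*W, C)
    idx = traj[..., 0] * W + traj[..., 1]                          # (H*W, T) flat indices
    # Gather the token on each trajectory at every time step -> (H*W, T, C)
    traj_tokens = torch.stack(
        [flat[t].index_select(0, idx[:, t]) for t in range(T)], dim=1)

    q = query_feat.permute(1, 2, 0).reshape(H * W, 1, C)           # (H*W, 1, C)
    scores = (q * traj_tokens).sum(-1) / C ** 0.5                  # (H*W, T): only T scores per query
    weights = F.softmax(scores, dim=-1).unsqueeze(-1)
    return (weights * traj_tokens).sum(1).reshape(H, W, C).permute(2, 0, 1)

# Toy usage with random trajectories
T, C, H, W = 8, 16, 32, 32
feats = torch.randn(T, C, H, W)
traj = torch.stack([torch.randint(0, H, (H * W, T)),
                    torch.randint(0, W, (H * W, T))], dim=-1)
out = trajectory_attention(feats, traj, torch.randn(C, H, W))      # (C, H, W)
```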
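Likewise, a hypothetical sketch of multi-scale tokenization: patches are unfolded at several scales and pooled back to a common size so that tokens with different receptive fields share one embedding dimension. The helper name and parameters (`base_patch`, `scales`) are assumptions rather than the paper's module:

```python
import torch
import torch.nn.functional as F

def cross_scale_tokens(feat, base_patch=4, scales=(1, 2, 4)):
    """Tokenize one feature map at several patch scales (illustrative sketch).

    feat: (C, H, W). For each scale, larger patches are unfolded and then
    average-pooled back to base_patch x base_patch, so tokens from different
    receptive fields share one embedding size.
    """
    C, H, W = feat.shape
    tokens = []
    for s in scales:
        p = base_patch * s
        # (1, C*p*p, L): non-overlapping p x p patches
        patches = F.unfold(feat.unsqueeze(0), kernel_size=p, stride=p)
        L = patches.shape[-1]
        patches = patches.reshape(C, p, p, L)
        # Reduce every patch to the base resolution so all scales align
        pooled = F.avg_pool2d(patches.permute(3, 0, 1, 2), kernel_size=s)  # (L, C, base, base)
        tokens.append(pooled.flatten(1))           # (L, C * base_patch**2)
    return tokens                                  # one token set per scale
```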
Experimental Evaluation
The paper presents robust experimental results: TTVSR outperforms state-of-the-art methods in both quantitative and qualitative comparisons on four widely used VSR benchmarks. Notably, it achieves significant PSNR improvements, for example gaining 0.70 dB over BasicVSR and 0.45 dB over IconVSR on the challenging REDS4 benchmark.
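For context, these gains are measured in PSNR, which is derived from the mean squared error between the super-resolved frame and the ground truth; a minimal sketch, assuming pixel values normalized to [0, 1]:

```python
import torch

def psnr(sr, hr, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a super-resolved and a ground-truth frame."""
    mse = torch.mean((sr - hr) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```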
Implications and Future Directions
The introduction of trajectory-aware mechanisms into Transformer architectures for VSR tasks has several implications:
- Efficiency in Long-Range Modeling: The proposed approach effectively reduces the computational burden associated with long-range modeling in video sequences, which is a substantial step towards making such methods feasible for real-time applications.
- Applications in Other Vision Tasks: The trajectory-aware methodology may extend to other tasks where understanding dynamic content over long time spans is crucial, such as action recognition and video classification.
- Potential for Further Optimization: Future work could explore optimizing the tokenization and trajectory determination phases further, potentially reducing training times and improving scalability across different hardware architectures.
TTVSR marks a significant contribution to the VSR landscape by integrating Transformer models with the notion of trajectories, demonstrating tangible improvements in video quality through effective exploitation of spatial and temporal information. This work paves the way for further exploration of more efficient and robust models for video tasks, strengthening the intersection of Transformer-based architectures and computer vision applications.