- The paper presents a feature-level alignment method that bypasses optical flow to reduce artifacts in high-resolution video reconstruction.
- It employs dynamic offset prediction with both reconstruction and alignment losses to enable robust, end-to-end training.
- TDAN outperforms existing methods such as TOFlow, DUF, and SPMC, achieving higher PSNR and SSIM with a lightweight design.
Temporally Deformable Alignment Network for Video Super-Resolution
The paper presents the Temporally Deformable Alignment Network (TDAN), a novel one-stage temporal alignment approach for video super-resolution (VSR). Because VSR reconstructs high-resolution (HR) video frames from low-resolution (LR) input frames, temporal alignment across frames is vital: objects and the camera move between frames. Traditional methods align frames by warping them with estimated optical flow, so inaccurate flow prediction translates directly into visible artifacts. TDAN addresses these issues by performing alignment at the feature level without relying on optical flow.
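To make the contrast with flow-based warping concrete, the following is a minimal sketch of offset-based feature alignment. It is a deliberate simplification of TDAN: the real network predicts per-sample offsets inside a deformable convolution, whereas this toy version (function names and the single-offset-per-position field are illustrative assumptions, not the paper's API) samples each feature position at one learned fractional offset via bilinear interpolation.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Read a 2-D feature map at a fractional (y, x) location via bilinear interpolation."""
    h, w = feat.shape
    y, x = np.clip(y, 0, h - 1), np.clip(x, 0, w - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def align_feature(support_feat, offsets):
    """Align a supporting-frame feature map toward the reference frame by
    sampling each position at its predicted offset. In TDAN these offsets
    are predicted by a network; here `offsets` is just an (H, W, 2) array
    of (dy, dx) values supplied by the caller.
    """
    h, w = support_feat.shape
    aligned = np.zeros_like(support_feat)
    for i in range(h):
        for j in range(w):
            dy, dx = offsets[i, j]
            aligned[i, j] = bilinear_sample(support_feat, i + dy, j + dx)
    return aligned
```

The key point the sketch illustrates: no explicit flow field between images is ever computed; the sampling locations themselves are the learned quantity, operating on features rather than pixels.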
Methodology
TDAN dynamically predicts sampling offsets for deformable convolution kernels directly from the input features, aligning each supporting frame with the reference frame at the feature level. Because the alignment is implicit, no explicit motion estimation or image-space warping is needed, which removes a common source of artifacts from inaccurate optical flow. The aligned frames are then fed to an SR reconstruction network that predicts the HR frame. Training combines a reconstruction loss (L_sr) with an alignment loss (L_align), enabling robust, end-to-end optimization without additional supervision.
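The two-term objective can be sketched as follows, assuming L1 losses for both terms (consistent with the paper's description); the `weight` balancing factor and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays."""
    return np.mean(np.abs(a - b))

def tdan_loss(aligned_lr_frames, ref_lr, sr_frame, hr_ref, weight=1.0):
    """Combined TDAN-style objective (sketch).

    L_align pulls each aligned supporting LR frame toward the reference LR
    frame, supervising the alignment module without extra labels; L_sr
    compares the super-resolved output with the HR ground truth.
    `weight` balances the two terms and is an assumed hyperparameter.
    """
    l_align = np.mean([l1(f, ref_lr) for f in aligned_lr_frames])
    l_sr = l1(sr_frame, hr_ref)
    return l_sr + weight * l_align
```

Note that L_align is computed against the LR reference frame itself, which is why no additional supervision (such as ground-truth flow) is required for end-to-end training.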
Experimental Results
The paper evaluates TDAN across a range of configurations, demonstrating superior performance over existing methods. TDAN restores image details more accurately than prior VSR approaches such as TOFlow, DUF, and SPMC in both PSNR and SSIM. Notably, it achieves these results with a relatively lightweight model, underscoring its efficiency.
Implications and Future Work
From a practical standpoint, TDAN's feature-level alignment introduces greater flexibility and robustness, removing dependence on optical flow predictions. This adaptability could pave the way for advancements in related video processing tasks, like video denoising and deblurring. Theoretically, TDAN suggests a shift in approach that favors implicit motion handling within learning frameworks, potentially influencing future VSR model designs.
Future work might explore deeper VSR architectures facilitated by more extensive datasets, alongside potential enhancements in temporal fusion methodologies within TDAN’s architecture. Another avenue for investigation is the development of algorithms adept at learning under label noise to further optimize the alignment objectives. These improvements hold promise for enhanced video quality, application to high-resolution datasets, and broader adoption in dynamic real-world scenarios.
In conclusion, TDAN's feature-level approach to temporal alignment represents a significant advance in VSR, indicating a promising direction for ongoing research and for practical application in video enhancement technologies.