- The paper presents CRAFT, a novel cross-attentional transformer architecture that refines correlation volumes to robustly estimate optical flow even under motion blur and large displacements.
- It achieves state-of-the-art performance on Sintel and KITTI benchmarks, demonstrating impressive stability against noise and image shifting attacks.
- Ablation studies confirm that components like SSTrans and cross-frame attention are essential for suppressing spurious correlations and improving pixel matching.
CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow
The paper presents a novel approach to optical flow estimation through a new architecture, the Cross-Attentional Flow Transformer (CRAFT). Optical flow estimation is a fundamental computer vision task: it establishes pixel-wise correspondences across consecutive video frames to recover the 2D motion field. Methods built on convolutional neural networks (CNNs) have advanced the state of the art substantially, yet they still struggle with large displacements and motion blur because the correlation volumes they use for pixel matching are susceptible to noise.
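To make the role of the correlation volume concrete, here is a minimal NumPy sketch of the standard all-pairs dot-product correlation used by RAFT-style methods. The shapes, the `1/sqrt(D)` scaling, and the function name are illustrative assumptions, not the paper's code:

```python
import numpy as np

def correlation_volume(f1, f2):
    """All-pairs dot-product correlation between two feature maps (toy sketch).

    f1, f2: (H, W, D) feature maps for frame 1 and frame 2.
    Returns a 4D volume (H, W, H, W): entry [i, j, k, l] scores how well
    pixel (i, j) of frame 1 matches pixel (k, l) of frame 2.
    """
    H, W, D = f1.shape
    # Dot product over the feature dimension, scaled for stability.
    return np.einsum('ijd,kld->ijkl', f1, f2) / np.sqrt(D)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((4, 5, 8))
f2 = rng.standard_normal((4, 5, 8))
vol = correlation_volume(f1, f2)
print(vol.shape)  # (4, 5, 4, 5)
```

Because every pixel pair contributes a raw dot product, noisy features produce spurious high scores in this volume; that fragility is what CRAFT's attention-based refinements target.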
Architectural Innovations
CRAFT introduces critical innovations to enhance the calculation of correlation volumes, pivotal in matching pixels across frames. Specifically, CRAFT incorporates:
- Semantic Smoothing Transformer (SSTrans): Applied to Frame-2 features, this transformer layer infuses more global context and semantic stability, reducing susceptibility to noise and spurious correlations often encountered with CNN-based approaches.
- Cross-Frame Attention in Correlations: CRAFT replaces the conventional dot-product correlation with an Expanded Attention computation that projects features through learned Query and Key transforms. These learned projections filter out noisy feature interactions, refining the correlation volumes and improving matching accuracy.
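The two components above can be sketched together in a small NumPy toy. This is an illustrative sketch, not the authors' implementation: `self_attention_smooth` stands in for SSTrans (a single self-attention pass over Frame-2 features that mixes in global context), and `attended_correlation` replaces the raw dot product with learned Query/Key projections. All weight shapes, the scaling factor, and the reuse of one weight set for brevity are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_smooth(f, Wq, Wk, Wv):
    """Stand-in for SSTrans: one self-attention pass over a feature map.
    Each pixel's feature becomes a weighted mix of all pixels' features,
    injecting global context and damping local noise."""
    H, W, D = f.shape
    x = f.reshape(H * W, D)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(D), axis=-1)  # (HW, HW) attention weights
    return (attn @ v).reshape(H, W, D)

def attended_correlation(f1, f2, Wq, Wk):
    """Attention-style correlation: project frame-1 features to Queries and
    frame-2 features to Keys before correlating, so the learned projections
    can suppress noisy feature interactions."""
    H, W, D = f1.shape
    q = f1.reshape(H * W, D) @ Wq
    k = f2.reshape(H * W, D) @ Wk
    return ((q @ k.T) / np.sqrt(D)).reshape(H, W, H, W)

D = 8
Wq = rng.standard_normal((D, D)) * 0.1
Wk = rng.standard_normal((D, D)) * 0.1
Wv = rng.standard_normal((D, D)) * 0.1  # one toy weight set, reused for brevity

f1 = rng.standard_normal((4, 5, D))
f2 = rng.standard_normal((4, 5, D))
f2_smooth = self_attention_smooth(f2, Wq, Wk, Wv)  # SSTrans-like smoothing of Frame-2
vol = attended_correlation(f1, f2_smooth, Wq, Wk)  # refined correlation volume
print(vol.shape)  # (4, 5, 4, 5)
```

The design point is that both stages are learned: rather than trusting raw CNN features, the model learns which feature interactions to amplify and which to suppress before the correlation volume is ever formed.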
Empirical Validations
CRAFT demonstrates superior performance against existing state-of-the-art methods, notably RAFT and GMA, through extensive benchmarking on the Sintel and KITTI datasets. Noteworthy results include top performance on the Sintel (Final) and KITTI foreground benchmarks, indicating robustness under challenging conditions such as motion blur and large displacements.
To further stress-test robustness, the authors designed an image shifting attack which artificially enlarges motion magnitudes in input images. Under this attack, CRAFT maintained significantly higher stability in flow estimation compared to RAFT and GMA, underscoring its reliability in real-world scenarios characterized by large and abrupt motion changes.
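The shifting attack itself is simple to describe. Below is a hedged NumPy sketch of the idea (not the paper's exact protocol): translating frame 2 by `(dx, dy)` pixels adds a constant `(dx, dy)` to the true flow from frame 1 to frame 2, artificially enlarging every motion vector. The wrap-around `np.roll` and the function name are assumptions for illustration:

```python
import numpy as np

def shift_attack(frame2, flow_gt, dx, dy):
    """Toy image shifting attack: translate frame 2 and update the flow.

    frame2:  (H, W) or (H, W, C) image for the second frame
    flow_gt: (H, W, 2) ground-truth flow (channel 0 = x-shift, 1 = y-shift)
    Returns the shifted frame and the enlarged ground-truth flow.
    """
    # Wrap-around translation; a real attack would crop or pad borders instead.
    shifted = np.roll(frame2, shift=(dy, dx), axis=(0, 1))
    new_flow = flow_gt + np.array([dx, dy], dtype=flow_gt.dtype)
    return shifted, new_flow

rng = np.random.default_rng(0)
img = rng.random((6, 8))
flow = np.zeros((6, 8, 2))
atk_img, atk_flow = shift_attack(img, flow, dx=3, dy=1)
print(atk_flow[0, 0])  # [3. 1.]
```

A robust estimator should recover the enlarged motion field; the paper reports that CRAFT degrades far less than RAFT and GMA as the shift magnitude grows.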
Detailed Model Evaluation
The paper meticulously deconstructs CRAFT's contributions and performs ablation studies to assess the individual impact of the SSTrans, Cross-Frame Attention, and the GMA module. Removing any of these components results in notable performance degradation, emphasizing their necessity in the architecture.
Additionally, visualization efforts highlight essential improvements in suppressing spurious correlations within computed volumes. CRAFT's attention mechanisms visibly yield smoother and more precise matching of pixels across frames, as illustrated through qualitative analyses of correlation matrices.
Speculative Future Directions
Given these advancements, CRAFT's architecture not only strengthens the robustness and accuracy of optical flow estimation but also opens avenues for further research in video frame analysis under diverse scenarios. The integration of transformers in correlation volume computation suggests potential exploration into hybrid approaches, possibly combining attention mechanisms with other neural architectures for improved contextual understanding and computational efficiency.
Moreover, the robustness against significant shifts hints at possible applications in domains requiring rapid dynamic analysis, such as autonomous driving, robotics, and surveillance systems. Future investigations may delve into broader applications of these architectural insights across other domains of AI and data-driven modeling, where noise reduction and spatial contextualization are critical.
In conclusion, CRAFT is a notable advance in optical flow estimation, using transformer-based refinement of correlation volumes to address longstanding challenges in the field. Through systematic evaluation, the paper establishes improvements that are likely to carry over to other tasks that process dynamic visual data.