- The paper introduces a Global Motion Aggregation module that integrates transformer-based attention with RAFT to overcome occlusion challenges in optical flow.
- It leverages self-similarity to capture long-range dependencies, reducing average end-point error by roughly 13% on the Sintel benchmark.
- The approach advances occlusion handling in computer vision and opens new avenues for applications in tracking, depth estimation, and scene understanding.
Critical Analysis of "Learning to Estimate Hidden Motions with Global Motion Aggregation"
The paper "Learning to Estimate Hidden Motions with Global Motion Aggregation" presents a novel approach to estimating the motion of occluded points in optical flow. The authors propose aggregating global motion features using self-similarities, building on the contemporary transformer framework. This targets a long-standing weakness of traditional methods, which rely on local evidence and brightness constancy assumptions and therefore struggle wherever a point is occluded.
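The brightness constancy assumption mentioned above can be stated in its standard form (this is the textbook formulation, not a quotation from the paper): a pixel at position \(\mathbf{x}\) in the first frame keeps its intensity at its displaced position in the second frame,

```latex
I_1(\mathbf{x}) = I_2\bigl(\mathbf{x} + \mathbf{u}(\mathbf{x})\bigr)
```

where \(\mathbf{u}(\mathbf{x})\) is the flow vector. For an occluded point, no position in the second frame satisfies this constraint, so purely local matching has no direct evidence for that point's motion.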
Methodological Advancements
The core contribution of the paper is the introduction of the Global Motion Aggregation (GMA) module, which enhances the RAFT (Recurrent All-Pairs Field Transforms) model. GMA uses a transformer-style attention mechanism to aggregate motion features globally. This design rests on two key innovations:
- Self-Similarity for Long-Range Dependencies: By employing self-similarity measures on the reference frame, the GMA module identifies long-range dependencies between pixels with similar appearance. Motion information can then propagate from visible to occluded regions, under the assumption that points on the same object tend to move homogeneously.
- Non-Local Aggregation: Unlike convolutional layers, which are restricted to local operations, the attention mechanism aggregates motion vectors non-locally, surpassing the limitations of local interpolation methods.
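The two ideas above can be combined in a few lines: attention weights computed from appearance (context) self-similarity decide where each pixel pools motion features from. The following is a minimal numpy sketch, not the paper's implementation; the real GMA operates on 2-D feature maps with learned projections inside RAFT's recurrent update, and all names and shapes here are illustrative.

```python
import numpy as np

def global_motion_aggregation(context, motion, d_k=32, seed=0):
    """context: (N, Dc) appearance features of frame 1 (N pixels flattened).
    motion:  (N, Dm) per-pixel motion features.
    Returns aggregated motion features of shape (N, Dm)."""
    rng = np.random.default_rng(seed)
    Dc, Dm = context.shape[1], motion.shape[1]
    # Random projections stand in for the learned query/key/value maps.
    Wq = rng.standard_normal((Dc, d_k)) / np.sqrt(Dc)
    Wk = rng.standard_normal((Dc, d_k)) / np.sqrt(Dc)
    Wv = rng.standard_normal((Dm, Dm)) / np.sqrt(Dm)

    Q, K, V = context @ Wq, context @ Wk, motion @ Wv
    # Self-similarity of frame-1 appearance decides where motion is pooled from.
    scores = Q @ K.T / np.sqrt(d_k)                      # (N, N) similarities
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)              # row-wise softmax
    return attn @ V                                      # non-local aggregation

# Toy input: two appearance groups; pixels with identical context features
# (e.g. an occluded pixel and visible pixels of the same object) end up
# pooling the same motion information.
ctx = np.vstack([np.ones((3, 8)), np.zeros((2, 8))])
mot = np.vstack([np.tile([1.0, 0.0], (3, 1)), np.tile([0.0, 1.0], (2, 1))])
out = global_motion_aggregation(ctx, mot)
```

In the paper, the aggregated features are additionally concatenated with the local motion and context features before entering RAFT's GRU update, rather than replacing them.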
Empirical Results and Analysis
The proposed method demonstrated significant improvements in optical flow estimation, particularly in resolving ambiguities caused by occlusion. The authors reported a 13.6% reduction in average end-point error (EPE) on the Sintel Final dataset and a 13.7% reduction on the Sintel Clean dataset, relative to the RAFT baseline. Notably, the GMA module markedly improved flow estimates in occluded regions without degrading performance in non-occluded areas. These findings underscore the method's efficacy against one of the most persistent challenges in optical flow, namely occlusion.
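The EPE metric cited above is simply the per-pixel Euclidean distance between predicted and ground-truth flow vectors, averaged over the image. A minimal sketch (helper name is illustrative):

```python
import numpy as np

def average_epe(flow_pred, flow_gt):
    """flow_pred, flow_gt: (H, W, 2) arrays of (u, v) displacements.
    Returns the mean per-pixel Euclidean distance between them."""
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))

gt = np.zeros((2, 2, 2))
pred = np.zeros((2, 2, 2))
pred[..., 0] = 3.0   # every pixel is off by (3, 4), a distance of 5
pred[..., 1] = 4.0
print(average_epe(pred, gt))  # 5.0
```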
Implications and Future Prospects
The success of the GMA module suggests several implications for optical flow research and related areas in computer vision. Firstly, it underscores the transformative potential of attention mechanisms and transformers in traditional computer vision tasks, typically dominated by convolutional architectures. Secondly, this approach can inspire future innovations that harness self-similarity and long-range dependencies for other vision problems that suffer from local ambiguity issues, such as depth estimation and scene understanding.
Future work could explore how other types of motion information, such as rotational dynamics, might be incorporated into the model. Additionally, the scalability and adaptability of such a transformer-based approach in real-world, resource-constrained applications warrant further investigation, given the quadratic cost of dense attention.
Overall, this paper significantly contributes to the optical flow literature by providing a viable solution to the occlusion problem, offering insights into the broader applicability of non-local aggregation methods in computer vision. The implications for practical applications, such as tracking and activity recognition, could be substantial, as accurate motion predictions remain vital for robust performance in these domains.