- The paper introduces FlowFormer, a transformer-based architecture that constructs and processes a 4D cost volume for more accurate optical flow estimation.
- The methodology leverages an alternate-group transformer layer to efficiently encode high-dimensional cost data and iteratively refines flow predictions via a cost memory decoder.
- Empirical results on the MPI Sintel benchmark show substantial error reductions over prior methods, including a 21.7% improvement over the best published result when evaluated without training on Sintel, demonstrating FlowFormer's strong generalization and effective transfer learning capability.
An Expert Analysis of FlowFormer: A Transformer Architecture for Optical Flow
The paper introduces FlowFormer, a transformer architecture designed for optical flow estimation, the task of inferring per-pixel correspondences between a pair of images. Optical flow underpins many downstream computer vision tasks, such as video inpainting, action recognition, and video super-resolution, making advances in this area valuable across a wide range of applications.
The proposed FlowFormer leverages a transformer-based model to encode and decode the 4D cost volume derived from an image pair. This architecture integrates the strengths of conventional cost volume techniques used in convolutional neural networks (CNNs) with the long-range modeling capabilities of transformers, aiming to improve the accuracy and generalization of optical flow estimation.
Architecture Overview
FlowFormer operates using three main components: a 4D cost volume constructor, a cost volume encoder, and a cost memory decoder.
- 4D Cost Volume Construction: A 4D cost volume is built by computing dot-product similarities between all pairs of features from the two images, so that each source pixel obtains a cost map over the target image. This cost volume is the foundational representation from which motion information is extracted (a minimal construction sketch follows this list).
- Cost Volume Encoder: Encoding the high-dimensional 4D cost data efficiently is a central challenge. FlowFormer tokenizes the per-pixel cost maps and compresses them into a compact latent cost space, the cost memory. The key component is the alternate-group transformer (AGT) layer, which alternates between intra-cost-map and inter-cost-map attention, letting the model process this compressed representation while keeping computation tractable (a toy layer illustrating the alternating pattern follows this list).
- Cost Memory Decoder: In this recurrent decoding stage, FlowFormer refines its flow prediction iteratively. At each step it queries the cost memory with dynamic positional queries, and the retrieved context is used to correct and update the flow estimate, yielding more accurate optical flow (a simplified refinement loop follows this list).
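For concreteness, the all-pairs construction can be sketched in a few lines of PyTorch. The snippet below assumes two feature maps `f1` and `f2` of shape `(B, C, H, W)` from a shared image encoder; the function name, feature sizes, and attention-style scaling are illustrative choices, not the paper's exact implementation.

```python
import torch

def build_cost_volume(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """All-pairs dot-product similarities, returned with shape (B, H, W, H, W)."""
    b, c, h, w = f1.shape
    f1 = f1.reshape(b, c, h * w)                  # (B, C, H*W)
    f2 = f2.reshape(b, c, h * w)                  # (B, C, H*W)
    cost = torch.einsum('bci,bcj->bij', f1, f2)   # similarity of every pixel pair
    cost = cost / (c ** 0.5)                      # scale, as in dot-product attention
    return cost.reshape(b, h, w, h, w)

# Two random feature maps stand in for encoder outputs at reduced resolution.
f1 = torch.randn(1, 256, 48, 64)
f2 = torch.randn(1, 256, 48, 64)
print(build_cost_volume(f1, f2).shape)            # torch.Size([1, 48, 64, 48, 64])
```

Each 2D slice `cost[:, i, j]` is the cost map of source pixel (i, j) over the target image, which is what the encoder subsequently tokenizes.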
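The alternating attention pattern of the encoder can be pictured with a toy layer that first self-attends within each cost map's latent tokens and then across cost maps at the same token index. The tensor layout, module, and token counts below are assumptions for exposition, not the paper's AGT layer.

```python
import torch
import torch.nn as nn

class AlternateGroupAttention(nn.Module):
    """Toy layer: intra-cost-map attention followed by inter-cost-map attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, K, D) -- N source pixels, K latent tokens per cost map
        b, n, k, d = tokens.shape
        x = tokens.reshape(b * n, k, d)              # group by cost map
        x = x + self.intra(x, x, x)[0]               # attend over the K tokens
        x = x.reshape(b, n, k, d).permute(0, 2, 1, 3).reshape(b * k, n, d)
        x = x + self.inter(x, x, x)[0]               # attend across the N cost maps
        return x.reshape(b, k, n, d).permute(0, 2, 1, 3)

tokens = torch.randn(1, 24 * 32, 8, 64)             # small sizes for a quick check
print(AlternateGroupAttention(64)(tokens).shape)    # torch.Size([1, 768, 8, 64])
```

In practice the inter-cost-map attention would itself need an efficient form to scale to full-resolution cost maps; this toy version uses full attention for clarity.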
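Finally, the decoder's iterative refinement can be pictured as a loop in which queries derived from the current flow estimate cross-attend to the cost memory and predict a residual update. The module, shapes, and the simple linear query embedding below are illustrative assumptions rather than the paper's decoder.

```python
import torch
import torch.nn as nn

class RecurrentFlowDecoder(nn.Module):
    """Toy recurrent decoder: query the cost memory, predict a residual flow update."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.to_query = nn.Linear(2, dim)                 # embed current flow as queries
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_delta = nn.Linear(dim, 2)                 # map context to a flow residual

    def forward(self, cost_memory: torch.Tensor, flow: torch.Tensor, iters: int = 8):
        # cost_memory: (B, M, D) latent cost tokens; flow: (B, N, 2) per-pixel flow
        for _ in range(iters):
            q = self.to_query(flow)                       # (B, N, D)
            ctx, _ = self.cross_attn(q, cost_memory, cost_memory)
            flow = flow + self.to_delta(ctx)              # iterative refinement
        return flow

memory = torch.randn(1, 512, 64)
flow0 = torch.zeros(1, 24 * 32, 2)                        # start from zero flow
print(RecurrentFlowDecoder(64)(memory, flow0).shape)      # torch.Size([1, 768, 2])
```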
Results and Contributions
FlowFormer shows noteworthy improvements over previous models:
- On the MPI Sintel benchmark, FlowFormer significantly outperforms existing methods, achieving lower average end-point error (AEPE) on both the clean and final passes (a minimal AEPE computation follows this list).
- Evaluated on the Sintel training set without any fine-tuning on Sintel data, FlowFormer improves on the best published result by 21.7%, underscoring its strong generalization capability.
- Empirically, using an ImageNet-pretrained transformer backbone further enhances performance, indicating that FlowFormer is compatible with pretrained encoders and benefits from transfer learning.
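For reference, AEPE is simply the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal computation, with illustrative tensor shapes, might look like this:

```python
import torch

def aepe(flow_pred: torch.Tensor, flow_gt: torch.Tensor) -> torch.Tensor:
    """Average end-point error for flow fields of shape (B, 2, H, W), in pixels."""
    return torch.norm(flow_pred - flow_gt, dim=1).mean()

pred = torch.zeros(1, 2, 4, 4)
gt = torch.ones(1, 2, 4, 4)       # every pixel's prediction is off by (1, 1)
print(aepe(pred, gt).item())      # ~1.4142, i.e. sqrt(2)
```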
Theoretical and Practical Implications
FlowFormer exemplifies an effective marriage between classical optical flow methodology, historically reliant on optimization techniques and local structure, and modern transformer frameworks known for their strength in modeling long-range context and relationships. Its success suggests several directions for future research and application:
- Enhanced Transfer Learning: By leveraging pretrained backbones effectively, FlowFormer can adapt to a range of vision settings, making it an attractive candidate for tasks where annotated data is limited.
- Scalable Architecture: The efficient encoding strategies employed may serve as a blueprint for other high-dimensional vision tasks requiring long-range dependency modeling.
Conclusion
FlowFormer marks a step forward in optical flow estimation, combining computationally efficient cost-volume representations with the contextual modeling capabilities of transformers. This combination delivers notable gains in accuracy and generalization, underscoring the potential of transformer architectures for traditional pixel-level vision tasks. Future work might explore further integration with other deep learning paradigms to extend the approach to more complex visual scenarios.