
COTR: Correspondence Transformer for Matching Across Images (2103.14167v2)

Published 25 Mar 2021 in cs.CV

Abstract: We propose a novel framework for finding correspondences in images based on a deep neural network that, given two images and a query point in one of them, finds its correspondence in the other. By doing so, one has the option to query only the points of interest and retrieve sparse correspondences, or to query all points in an image and obtain dense mappings. Importantly, in order to capture both local and global priors, and to let our model relate between image regions using the most relevant among said priors, we realize our network using a transformer. At inference time, we apply our correspondence network by recursively zooming in around the estimates, yielding a multiscale pipeline able to provide highly-accurate correspondences. Our method significantly outperforms the state of the art on both sparse and dense correspondence problems on multiple datasets and tasks, ranging from wide-baseline stereo to optical flow, without any retraining for a specific dataset. We commit to releasing data, code, and all the tools necessary to train from scratch and ensure reproducibility.

Authors (5)
  1. Wei Jiang (343 papers)
  2. Eduard Trulls (14 papers)
  3. Jan Hosang (12 papers)
  4. Andrea Tagliasacchi (78 papers)
  5. Kwang Moo Yi (68 papers)
Citations (240)

Summary

  • The paper introduces a transformer-based framework that leverages both global and local image priors for robust correspondence estimation.
  • It employs a recursive multiscale inference pipeline to iteratively refine matches and significantly enhance localization accuracy.
  • The method achieves dataset-agnostic performance, consistently outperforming existing techniques on benchmarks like HPatches, KITTI, and ETH3D.

Overview of "COTR: Correspondence Transformer for Matching Across Images"

The paper introduces the Correspondence Transformer (COTR), a framework that finds matches across image pairs using a transformer architecture. Given two images and a query point in one of them, the network regresses the corresponding point in the other, yielding highly accurate correspondences that can be either sparse or dense. Such correspondences are crucial for numerous computer vision tasks, including camera calibration, optical flow, and visual localization.
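At its core, COTR is a function that maps a query coordinate in one image, conditioned on both images, to a coordinate in the other. The sketch below illustrates this interface in PyTorch; the layer sizes, the toy backbone, and the omission of positional encodings (essential in the real model) are simplifications and assumptions for illustration, not the authors' exact implementation.

```python
# Illustrative sketch of the COTR-style interface: given two images and
# normalized query coordinates in the first, regress the corresponding
# coordinates in the second. Hyperparameters here are assumptions.
import torch
import torch.nn as nn

class CorrespondenceTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        # Toy CNN backbone over the side-by-side image pair (the paper uses a
        # pretrained CNN feature extractor).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=4, padding=1), nn.ReLU(),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        # Embed each normalized 2-D query point as a decoder query token.
        self.query_embed = nn.Linear(2, d_model)
        # Regress the matched location in the second image, in [0, 1]^2.
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, 2), nn.Sigmoid())

    def forward(self, img_a, img_b, queries):
        # img_a, img_b: (B, 3, H, W); queries: (B, N, 2) in [0, 1].
        pair = torch.cat([img_a, img_b], dim=-1)   # concatenate along width
        feats = self.backbone(pair)                # (B, C, h, w)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, h*w, C) context tokens
        q = self.query_embed(queries)              # (B, N, C) query tokens
        decoded = self.transformer(tokens, q)      # (B, N, C)
        return self.head(decoded)                  # (B, N, 2) matched coords
```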

Key Contributions

  1. Functional Correspondence Model: The paper casts correspondence estimation as learning a functional mapping realized by a deep network. Unlike previous methods, COTR uses a transformer to capture both local and global priors in the image data, learning these priors implicitly from data.
  2. Multiscale Inference Pipeline: A defining feature of COTR is its recursive application at inference time. By iteratively zooming in around an estimated correspondence and re-querying the network, the model substantially improves its localization accuracy. This multiscale approach is pivotal to the method's performance; a sketch of the loop follows this list.
  3. Dataset Agnostic Performance: The architecture demonstrates robustness and versatility by outperforming existing methods across a range of datasets and tasks without the need for retraining. This includes evaluations on datasets designed for tasks like wide-baseline stereo and optical flow.
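To make the recursive zoom concrete, here is a minimal sketch of such an inference loop, assuming the hypothetical `model(img_a, img_b, queries)` interface from the earlier sketch. The crop sizes, shrink factor, and number of zoom steps are illustrative assumptions, not the paper's exact schedule.

```python
# Sketch of recursive multiscale inference: estimate a match at full scale,
# then repeatedly crop around the current estimates in both images and
# re-query at a finer scale. All hyperparameters here are illustrative.
import torch

def crop_around(img, center, size):
    """Crop a (3, H, W) tensor around a normalized center; also return the
    (offset, scale) mapping crop-normalized coords back to image coords."""
    _, H, W = img.shape
    half = size // 2
    x0 = min(max(int(center[0] * W) - half, 0), W - size)
    y0 = min(max(int(center[1] * H) - half, 0), H - size)
    crop = img[:, y0:y0 + size, x0:x0 + size]
    offset = torch.tensor([x0 / W, y0 / H])
    scale = torch.tensor([size / W, size / H])
    return crop, offset, scale

def refine_match(model, img_a, img_b, query, num_zooms=3, crop_size=256):
    """query: (2,) normalized coords in img_a; returns refined match in img_b.
    Assumes both images are larger than crop_size on each side."""
    estimate = model(img_a[None], img_b[None], query[None, None])[0, 0]
    for i in range(num_zooms):
        size = max(crop_size // (2 ** i), 64)  # shrink the window each step
        crop_a, off_a, sc_a = crop_around(img_a, query, size)
        crop_b, off_b, sc_b = crop_around(img_b, estimate, size)
        # Express the query point in crop_a's normalized coordinate frame.
        q_local = (query - off_a) / sc_a
        est_local = model(crop_a[None], crop_b[None], q_local[None, None])[0, 0]
        # Map the refined estimate back to full-image coordinates.
        estimate = est_local * sc_b + off_b
    return estimate
```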

Methodological Insights

  • Combining Dense and Sparse Methods: COTR bridges the divide between traditionally sparse and dense correspondence estimation. Because the network answers queries at arbitrary image locations, it is constrained neither to pre-detected keypoints nor to a fixed dense pixel grid; the same model serves both regimes (see the sketch after this list).
  • Transformer Utilization: By employing a transformer, the model exploits self-attention to weigh the image regions most relevant to each query, enabling more precise matching. This choice is central to representing discontinuous correspondence maps (e.g., at depth boundaries), which are difficult for conventional convolutional architectures.
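Under the same assumed per-point interface, sparse and dense matching differ only in which points are queried, as this hedged sketch shows:

```python
# Sparse and dense matching with the same per-point query model; only the
# query set changes. Assumes the hypothetical model(img_a, img_b, queries)
# interface sketched earlier.
import torch

def match_sparse(model, img_a, img_b, keypoints):
    """keypoints: (N, 2) normalized interest-point coords in img_a."""
    return model(img_a[None], img_b[None], keypoints[None])[0]   # (N, 2)

def match_dense(model, img_a, img_b, step=8, chunk=1024):
    """Query a regular grid over img_a for a (semi-)dense correspondence map;
    queries are processed in chunks to bound memory."""
    _, H, W = img_a.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H // step),
                            torch.linspace(0, 1, W // step), indexing="ij")
    grid = torch.stack([xs.flatten(), ys.flatten()], dim=-1)     # (M, 2)
    out = [model(img_a[None], img_b[None], grid[None, i:i + chunk])[0]
           for i in range(0, grid.shape[0], chunk)]
    return torch.cat(out).reshape(*xs.shape, 2)                  # (h, w, 2)
```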

Results and Evaluation

COTR is evaluated on several benchmarks, notably HPatches, KITTI, ETH3D, and the Image Matching Challenge. The results indicate that COTR consistently achieves state-of-the-art performance across tasks:

  • On HPatches, COTR achieves lower Average End Point Error (AEPE) and higher Percentage of Correct Keypoints (PCK) than prior methods, demonstrating its accuracy in both sparse and dense settings (both metrics are defined in the sketch after this list).
  • On KITTI, the model delivers strong accuracy in complex scenes with multiple independent motions, capturing both global and local dynamics in the imagery.
  • On ETH3D, COTR maintains high performance even as the baseline between frames grows, underscoring its robustness.
  • Results on the Image Matching Challenge show that COTR's correspondences translate directly into improved pose estimation accuracy.
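For reference, the two metrics reported on HPatches have standard definitions, sketched below. Thresholds are benchmark-dependent; the value used here is only an example.

```python
# Standard definitions of the two HPatches metrics mentioned above.
# The PCK threshold is benchmark-dependent; 3 px is only an example.
import torch

def aepe(pred, gt):
    """Average End Point Error: mean Euclidean distance (pixels) between
    predicted and ground-truth correspondences; pred, gt: (N, 2)."""
    return torch.linalg.norm(pred - gt, dim=-1).mean()

def pck(pred, gt, threshold=3.0):
    """Percentage of Correct Keypoints: fraction of matches whose end-point
    error is within the pixel threshold."""
    return (torch.linalg.norm(pred - gt, dim=-1) <= threshold).float().mean()
```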

Future Implications

COTR has significant implications for future research and applications in AI and computer vision. The ability to apply a transformer-based approach to learn correspondences without dataset-specific tuning opens pathways toward more generalized vision systems. Future developments might also explore augmenting COTR's dense mappings with more advanced interpolation strategies for even higher precision.

Overall, the Correspondence Transformer offers a substantial advance in image correspondence, proposing a scalable, dataset-agnostic solution that adapts effectively to a variety of matching contexts. Its success signals a broader shift toward applying transformer architectures beyond natural language processing to new computer vision tasks.
