TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification (2105.14432v2)

Published 30 May 2021 in cs.CV

Abstract: Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions, and the generalizability of Transformers is unknown. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. Thus, we further design two naive solutions, i.e. query-gallery concatenation in ViT, and query-gallery cross-attention in the vanilla Transformer. The latter improves the performance, but it is still limited. This implies that the attention mechanism in Transformers is primarily designed for global feature aggregation, which is not naturally suitable for image matching. Accordingly, we propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity computation. Additionally, global max pooling and a multilayer perceptron (MLP) head are applied to decode the matching result. This way, the simplified decoder is computationally more efficient, while at the same time more effective for image matching. The proposed method, called TransMatcher, achieves state-of-the-art performance in generalizable person re-identification, with up to 6.1% and 5.7% performance gains in Rank-1 and mAP, respectively, on several popular datasets. Code is available at https://github.com/ShengcaiLiao/QAConv.

Authors (2)
  1. Shengcai Liao (46 papers)
  2. Ling Shao (244 papers)
Citations (46)

Summary

Deep Dive into TransMatcher: Transforming Person Re-Identification with Efficient Image Matching

The paper "TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification" by Shengcai Liao and Ling Shao explores the utilization of Transformers for image matching in the context of person re-identification (Re-ID). Traditionally, Transformers have demonstrated considerable success in various computer vision tasks like classification and object detection but pose challenges in image matching due to the absence of image-to-image interactions. To address this gap, this research introduces a novel architectural framework, TransMatcher, which leverages a simplified decoder designed explicitly for similarity computation in image matching tasks.

Key Contributions and Methodology

The paper delineates several innovative contributions in the domain of image matching using Transformers:

  1. Transformer Adaptation for Image Matching: The paper systematically investigates how Vision Transformers (ViT) and vanilla Transformers can be adapted for image matching, testing two naive solutions: query-gallery concatenation in ViT and query-gallery cross-attention in the vanilla Transformer. It highlights the limitation of Transformers' global feature aggregation design, which lacks a built-in mechanism for cross-image interaction.
  2. Design of TransMatcher: To address the identified deficiencies, the authors propose TransMatcher, built around a new, simplified decoder. This decoder drops the full attention implementation with its softmax weighting, retaining only the query-key similarity computation needed for matching. Global max pooling (GMP) and an MLP head then decode the matching result, improving both computational efficiency and accuracy (a minimal sketch follows this list).
  3. Performance Evaluation: Rigorous experiments on multiple person Re-ID datasets including CUHK03, Market-1501, and MSMT17 validate TransMatcher's efficacy, showcasing performance improvements of up to 6.1% in Rank-1 accuracy and 5.7% in mAP, thus setting a new benchmark in generalizable person Re-ID.
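
The decoder described in item 2 can be summarized in code. Below is a minimal PyTorch sketch, not the authors' reference implementation (that is available in the QAConv repository linked above); the class name, layer sizes, and the final pooling over query positions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedMatchingDecoder(nn.Module):
    """Hypothetical sketch of a softmax-free matching decoder."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        # Learned projections for the two images' features, as in a
        # Transformer decoder, but with no value projection and no softmax.
        self.q_proj = nn.Linear(feat_dim, hidden_dim)
        self.k_proj = nn.Linear(feat_dim, hidden_dim)
        # MLP head that decodes pooled similarities into a score.
        self.score_head = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, query_feats: torch.Tensor,
                gallery_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (B, Nq, D) spatial features of the query image
        # gallery_feats: (B, Ng, D) spatial features of the gallery image
        q = self.q_proj(query_feats)                # (B, Nq, H)
        k = self.k_proj(gallery_feats)              # (B, Ng, H)
        # Query-key similarity only: the full attention's softmax weighting
        # and value aggregation are dropped, per the paper's simplification.
        sim = torch.einsum("bqh,bkh->bqk", q, k)    # (B, Nq, Ng)
        # Global max pooling over gallery positions: each query location
        # keeps its best local match.
        pooled, _ = sim.max(dim=-1)                 # (B, Nq)
        # Decode per-location similarities into scalar matching scores.
        scores = self.score_head(pooled.unsqueeze(-1)).squeeze(-1)
        return scores.mean(dim=-1)                  # (B,) one score per pair

# Usage: score a batch of query-gallery pairs from flattened feature maps.
decoder = SimplifiedMatchingDecoder()
q = torch.randn(4, 24 * 8, 512)  # e.g. a 24x8 feature map, flattened
g = torch.randn(4, 24 * 8, 512)
print(decoder(q, g).shape)       # torch.Size([4])
```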

The core innovation in TransMatcher lies in performing efficient image matching through explicit query-key similarity computations rather than the softmax-weighted global feature aggregation of standard Transformers. This reorientation is crucial for capturing the cross-image interactions necessary for effective image matching.
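
In equation form, and abstracting away the multi-layer structure, positional encodings, and exact head design of the full model, the contrast can be sketched as:

```latex
% Standard Transformer attention: softmax-weighted aggregation of values
\mathrm{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{d}}\right) V

% TransMatcher's simplified decoder: keep the raw query-key similarities
% and decode them directly into a matching score
S = QK^{\top}, \qquad \mathrm{score} = \operatorname{MLP}\!\big(\operatorname{GMP}(S)\big)
```

Dropping the softmax and the value aggregation turns the decoder from a feature aggregator into a similarity estimator, which is exactly the quantity a matcher needs.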

Implications and Future Directions

The results underscore the potential of Transformer architectures tailored specifically for image matching and metric learning. This suggests both practical enhancements to person re-identification systems and theoretical insight into how global and local feature relations can be exploited in Transformer architectures for matching applications.

Future research may extend TransMatcher to other domains that require robust image matching, such as image retrieval and instance-level recognition. Further work could also optimize the attention mechanisms to balance computational efficiency and model accuracy, enabling scaling to larger datasets and real-time applications.

In sum, TransMatcher represents a significant stride in adapting Transformer architectures to the distinct challenges posed by image matching, with promising applications across diverse recognition and identification paradigms in computer vision.