Self-Supervised Pre-Training for Transformer-Based Person Re-Identification: A Comprehensive Overview
The paper under discussion presents a focused examination of self-supervised pre-training methods for enhancing the performance of transformer-based models in the person re-identification (ReID) domain. Supervised transformer models have demonstrated strong performance on ReID tasks, but they are typically pre-trained on large-scale general-purpose datasets such as ImageNet-21K, leaving a significant domain gap between that data and ReID-specific imagery. This research addresses the domain gap by investigating an alternative pre-training strategy that leverages self-supervised learning (SSL) tailored to ReID, examining both data selection and architectural modifications.
The crux of the proposed approach lies in utilizing the LUPerson dataset, a large collection of unlabeled person images tailored to ReID. The authors evaluate several SSL methods for pre-training Vision Transformers (ViTs) on these unlabeled images. Empirical evaluation reveals that DINO significantly surpasses the other SSL strategies when applied to ReID. Notably, ViT models pre-trained with DINO on LUPerson and then fine-tuned outperform counterparts pre-trained on ImageNet, reflecting the benefits of domain-specific pre-training.
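To make the DINO pre-training paradigm concrete, the following minimal PyTorch sketch illustrates the student-teacher self-distillation loop that DINO relies on. The tiny stand-in encoder, toy data loop, and hyperparameters are illustrative assumptions, not the authors' training code; the paper pre-trains ViT backbones (e.g. ViT-S/16) with a projection head and multi-crop augmentation on LUPerson.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # Cross-entropy between the centered, sharpened teacher distribution
    # and the student distribution; teacher gradients are stopped.
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

# Stand-in encoder (the paper uses a ViT-S/16 backbone plus projection head).
out_dim = 256
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, 512),
                        nn.GELU(), nn.Linear(512, out_dim))
teacher = copy.deepcopy(student)          # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
center = torch.zeros(out_dim)             # running center of teacher outputs

for _ in range(10):                       # toy loop over unlabeled person crops
    v1 = torch.rand(8, 3, 64, 32)         # two augmented views of the same images
    v2 = torch.rand(8, 3, 64, 32)
    loss = 0.5 * (dino_loss(student(v1), teacher(v2), center)
                  + dino_loss(student(v2), teacher(v1), center))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    with torch.no_grad():                 # EMA teacher update and center update
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(0.996).add_(ps, alpha=0.004)
        center = 0.9 * center + 0.1 * torch.cat([teacher(v1), teacher(v2)]).mean(dim=0)
```

Because the teacher is only ever an exponential moving average of the student, no labels are needed; the ReID-relevant structure emerges purely from matching the two views of each unlabeled person crop.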
Furthermore, to reduce the computational burden of pre-training while maintaining accuracy, the authors introduce the Catastrophic Forgetting Score (CFS). CFS estimates the domain gap between pre-training samples and the fine-tuning data, enabling selection of the LUPerson subset most relevant to downstream ReID datasets. This subset selection not only narrows the domain gap but also makes more efficient use of pre-training resources, roughly halving the dataset size without degrading performance.
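The sketch below shows one plausible way to realize this idea in practice: score each pre-training image by the feature-space similarity of its embedding to the downstream ReID data, then keep only the top-ranked portion. The function name `cfs_rank`, the top-k averaging, and the 50% keep ratio are assumptions for illustration, not the paper's exact CFS formulation.

```python
import torch
import torch.nn.functional as F

def cfs_rank(pretrain_feats, downstream_feats, keep_ratio=0.5, topk=10):
    """Score each pre-training image by its mean cosine similarity to its
    top-k nearest downstream features, then keep the highest-scoring fraction."""
    p = F.normalize(pretrain_feats, dim=1)    # (N_pre, D)
    d = F.normalize(downstream_feats, dim=1)  # (N_down, D)
    sims = p @ d.t()                          # pairwise cosine similarities
    scores = sims.topk(topk, dim=1).values.mean(dim=1)
    keep = scores.argsort(descending=True)[: int(keep_ratio * len(scores))]
    return keep, scores

# Toy usage: in practice the features would come from a frozen encoder applied
# to LUPerson images and to the target ReID training set (e.g. Market-1501).
pre = torch.randn(1000, 384)
down = torch.randn(200, 384)
keep_idx, scores = cfs_rank(pre, down)
print(keep_idx.shape)  # torch.Size([500]) -- half of the pre-training pool retained
```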
Complementing the data-driven approach, the paper introduces an architectural enhancement: the IBN-based convolution stem (ICS). ICS addresses the training instability observed in ViTs that stems from their default patch-embedding layer. By combining instance normalization (IN) with batch normalization (BN) in a convolutional stem, ICS encourages domain-invariant features that are crucial for ReID, further narrowing the source-target domain gap.
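As a rough illustration of what an IBN-based convolution stem could look like, the sketch below replaces the usual ViT patch-embedding layer with stacked stride-2 convolutions whose normalization splits channels between instance norm and batch norm. The channel widths, number of stages, and final projection are illustrative assumptions; the paper's exact stem configuration may differ.

```python
import torch
import torch.nn as nn

class IBN(nn.Module):
    """Apply instance norm to half of the channels and batch norm to the rest."""
    def __init__(self, channels):
        super().__init__()
        self.half = channels // 2
        self.inorm = nn.InstanceNorm2d(self.half, affine=True)
        self.bnorm = nn.BatchNorm2d(channels - self.half)

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.inorm(a), self.bnorm(b)], dim=1)

class IBNConvStem(nn.Module):
    """Stacked stride-2 3x3 convolutions with IBN in place of the ViT patchify
    layer; the output is flattened into a sequence of patch tokens."""
    def __init__(self, in_ch=3, embed_dim=384):
        super().__init__()
        dims = [in_ch, 64, 128, 256, embed_dim]
        layers = []
        for i in range(4):   # four stride-2 stages give the usual 16x downsampling
            layers += [nn.Conv2d(dims[i], dims[i + 1], 3, stride=2, padding=1, bias=False),
                       IBN(dims[i + 1]), nn.GELU()]
        layers += [nn.Conv2d(embed_dim, embed_dim, 1)]   # final 1x1 projection
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                    # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2) # (B, num_patches, embed_dim)

tokens = IBNConvStem()(torch.rand(2, 3, 256, 128))
print(tokens.shape)  # torch.Size([2, 128, 384]) -- token sequence fed to the ViT blocks
```

The intuition behind the IN/BN split is the same as in IBN-Net for CNN-based ReID: instance normalization suppresses appearance variation (camera, illumination) while batch normalization preserves discriminative content.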
The efficacy of the combined strategy is affirmed by extensive experimentation across various ReID benchmarks, covering both supervised and unsupervised learning scenarios. The results indicate that the proposed approach achieves state-of-the-art performance on datasets such as Market-1501 and MSMT17, with notable improvements in mean Average Precision (mAP) and Rank-1 accuracy. For instance, under supervised fine-tuning, even the comparatively small ViT-S/16 backbone surpasses counterparts initialized with traditional ImageNet pre-training.
The research contributes to both the theoretical and practical advancement of self-supervised learning for ReID by demonstrating that domain-relevant pre-training on selectively curated data can substantially enhance the performance of transformer-based models. This aligns with a broader trend in AI research, where domain-specific pre-training is increasingly recognized as a viable alternative to one-size-fits-all solutions.
Future investigations could explore the broader applicability of the CFS approach to other open-set recognition tasks, as well as architectural refinements that better exploit the intrinsic spatial and temporal structure of ReID data. Cross-domain transferability, where a model pre-trained on ReID data informs related domains such as action recognition or anomaly detection, also represents a worthwhile avenue for future research. The authors provide reproducible code and models, inviting the research community to build upon their work and explore these frontiers.
In summary, this paper makes a substantive contribution to the person ReID field by presenting a well-rounded approach to reducing the domain gap through selective data sampling and architectural enhancements, validated by robust experimental results.