ViTC-UReID: Unsupervised ReID with Vision Transformers

Updated 11 January 2026
  • The paper introduces ViTC-UReID, a novel unsupervised approach that integrates Vision Transformer-based feature encoding with camera-aware and token-level proxy learning.
  • It employs a two-stage learning process: generating pseudo-labels via DBSCAN clustering and refining feature representations using dual contrastive losses for identity and camera discrimination.
  • The method achieves strong performance on benchmarks such as Market-1501 and MSMT17, with significant gains in mAP and rank-1 accuracy over previous unsupervised methods.

ViTC-UReID is a fully unsupervised person re-identification (ReID) framework that unifies Vision Transformer (ViT) feature encoding, enhanced token-level modeling, and advanced proxy-based instance discrimination. It is designed to address major challenges in ReID—including appearance variation, camera domain shifts, and unlabeled datasets—by leveraging a two-stage learning process: pseudo-label generation through clustering and unsupervised representation learning based on proxy memories and camera-aware contrastive objectives. State-of-the-art results have been demonstrated across Market-1501, DukeMTMC-reID, MSMT17, and CUHK03, showing substantial advances in unsupervised ReID performance (Pham et al., 4 Jan 2026, Zhu et al., 15 Jan 2025).

1. Architectural Components and Workflow

ViTC-UReID applies an iterative process comprising two main stages:

  • Pseudo-label Generation: Patch-token features are extracted for all images using a pretrained Vision Transformer (ViT-B/16). The global CLS features are clustered via DBSCAN to produce “identity” clusters, which are subsequently subdivided by camera ID, yielding camera-aware subclusters. Two proxy memories are constructed: (i) cluster-center memory for each identity and (ii) camera-center memory for each camera within an identity cluster. Uniquely, outlier samples initially unassigned by DBSCAN are not discarded but are forcibly assigned to the nearest cluster centroid in subsequent epochs (Zhu et al., 15 Jan 2025).
  • Unsupervised Representation Learning: For each input image, ViTC-UReID computes an “enhanced” feature by concatenating the global CLS token with the top-K local tokens, selected via self-attention scoring and further processed by an MLP and MaxPool operation. The model applies two contrastive-style losses: ClusterNCE (identity discrimination by cluster proxies) and a camera-aware loss (CameraNCE) over camera-center proxies. The total loss for an image sample $x_i$ is
$$L_i = L_\mathrm{NCE}(v_i, y_i; M) + \lambda \cdot L_\mathrm{CAP}(v_i, p_i; C),$$
where $v_i$ is the enhanced embedding, $y_i$ is the cluster label, and $p_i$ is the camera-cluster label. Parameters are updated via SGD, and proxies and clusters are refreshed after each epoch.
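
A minimal sketch of the enhanced-feature step described above, assuming a PyTorch ViT that exposes per-patch tokens and CLS-attention scores; the module name, layer sizes, and `top_k_ratio` are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class EnhancedFeature(nn.Module):
    """Sketch: concatenate the global CLS token with an MLP+MaxPool summary of the
    top-K patch tokens, ranked by their attention to CLS (dimensions are assumptions)."""

    def __init__(self, dim=768, top_k_ratio=0.4):
        super().__init__()
        self.top_k_ratio = top_k_ratio
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, cls_tok, patch_toks, cls_attn):
        # cls_tok:    (B, D)    global CLS embedding
        # patch_toks: (B, N, D) patch-token embeddings
        # cls_attn:   (B, N)    attention of CLS to each patch (e.g. averaged over heads)
        k = max(1, int(self.top_k_ratio * patch_toks.size(1)))
        top_idx = cls_attn.topk(k, dim=1).indices                       # (B, k)
        top_toks = torch.gather(
            patch_toks, 1, top_idx.unsqueeze(-1).expand(-1, -1, patch_toks.size(-1)))
        local = self.mlp(top_toks).max(dim=1).values                    # MaxPool over the k tokens
        return torch.cat([cls_tok, local], dim=-1)                      # enhanced feature (B, 2D)
```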

2. Proxy-Based and Camera-Aware Learning Objectives

To simultaneously boost feature discrimination and camera invariance, ViTC-UReID introduces two pivotal contrastive objectives:

  • ClusterNCE: Enforces identity-level discrimination by contrasting features against their corresponding cluster proxies:
$$L_\mathrm{NCE}(v_i, y_i; M) = -\log \frac{\exp(v_i \cdot m^+ / \tau)}{\sum_{k=1}^{K} \exp(v_i \cdot m_k / \tau)},$$
with $m^+$ the positive proxy for $y_i$, and $\tau$ a temperature parameter.
  • CameraNCE: Incorporates camera-awareness by contrasting features against proxies specific to both identity and camera:
$$L_\mathrm{CAP}(v_i, p_i; C) = -\frac{1}{|P(i)|} \sum_{b \in P(i)} \log \frac{\exp(v_i \cdot c_{p_i, b} / \tau_c)}{\sum_{k, b'} \exp(v_i \cdot c_{k, b'} / \tau_c)},$$
where $c_{p_i, b}$ is the mean feature of cluster $p_i$ under camera $b$, and $\tau_c$ is the temperature for the camera-proxy loss. The weight $\lambda$ balances the two objectives (empirically set to 0.7 or 1.0) (Pham et al., 4 Jan 2026).
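
A compact sketch of the two objectives, under the assumption that cluster and camera proxies are stored as rows of L2-normalized memory tensors; the variable names (`M`, `C`, `cluster_of_proxy`) and temperatures are illustrative:

```python
import torch
import torch.nn.functional as F

def cluster_nce(v, y, M, tau=0.05):
    """ClusterNCE sketch: cross-entropy over similarities to all K cluster proxies M (K, D)."""
    logits = F.normalize(v, dim=1) @ F.normalize(M, dim=1).t() / tau        # (B, K)
    return F.cross_entropy(logits, y)

def camera_nce(v, p, C, cluster_of_proxy, tau_c=0.07):
    """CameraNCE sketch: C stacks one proxy per (cluster, camera) pair, shape (P, D);
    cluster_of_proxy (P,) maps each camera proxy to its identity cluster. The positives
    for sample i are all camera proxies belonging to its own cluster p_i."""
    sims = F.normalize(v, dim=1) @ F.normalize(C, dim=1).t() / tau_c        # (B, P)
    log_prob = F.log_softmax(sims, dim=1)
    pos_mask = (cluster_of_proxy.unsqueeze(0) == p.unsqueeze(1)).float()    # (B, P)
    # average -log p over the positive camera proxies of each sample, then over the batch
    return (-(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)).mean()

# Total loss per sample, as in Section 1: L_i = cluster_nce(...) + lambda * camera_nce(...)
```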

3. Token-Level Modeling and Multi-Scale Memory (TCMM Variant)

In alternative implementations, such as the TCMM variant (Zhu et al., 15 Jan 2025), ViTC-UReID introduces explicit token-level constraints and a multi-scale memory:

  • ViT Token Constraint: Within each feature sequence, a fraction $\alpha$ of patch tokens is randomly selected to act as "negative" tokens. A constraint loss penalizes these tokens’ similarity to the global CLS token, limiting the impact of spurious or noisy image patches:
$$L_\mathrm{constraint} = -\frac{1}{B} \sum_{b=1}^{B} \sum_{i \in N_b} \log \sigma\left( A_i(t_i, t_{\mathrm{cls}}) / \tau \right)$$
  • Multi-Scale Memory: Dual memory banks maintain both instance-level and prototype-level features. An instance memory stores every feature with its pseudo-label, while the prototype memory tracks cluster centroids. These are momentum-updated each iteration. Losses are imposed both at the prototype (cluster) and instance level, complementing the token constraint loss.

Combined, these mechanisms suppress patch noise and leverage both global and fine-grained memory, reinforcing robust cluster separation and intra-class compactness.
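
A minimal sketch of the token constraint and of a momentum-style memory refresh as described above; the sampling ratio `alpha`, the memory layout, the update convention, and all names are assumptions rather than the TCMM code:

```python
import torch
import torch.nn.functional as F

def token_constraint_loss(patch_toks, cls_tok, alpha=0.3, tau=0.05):
    """Sketch of the ViT token constraint: score a random fraction alpha of patch
    tokens ('negatives') against the CLS token, following the formula above."""
    B, N, D = patch_toks.shape
    n_neg = max(1, int(alpha * N))
    idx = torch.randint(0, N, (B, n_neg), device=patch_toks.device)          # random negative tokens
    neg = torch.gather(patch_toks, 1, idx.unsqueeze(-1).expand(-1, -1, D))   # (B, n_neg, D)
    sim = F.cosine_similarity(neg, cls_tok.unsqueeze(1), dim=-1) / tau       # (B, n_neg)
    return -torch.log(torch.sigmoid(sim)).mean()

def momentum_update(memory, feats, labels, m=0.2):
    """Sketch of the instance/prototype memory refresh: blend new features into their
    memory slots with momentum m (convention assumed), then re-normalize."""
    memory[labels] = m * memory[labels] + (1 - m) * feats
    memory[labels] = F.normalize(memory[labels], dim=1)
    return memory
```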

4. Training Protocol and Hyperparameters

The training process follows these steps:

  1. Pretraining: The ViT-B/16 backbone is pretrained (e.g., on LUPerson).
  2. Epoch Loop (typically $E = 60$; sketched in code after these steps):
    • Extract CLS features for all images.
    • Run DBSCAN clustering (with dataset-adaptive settings: $\epsilon = 0.6$, variable min samples).
    • Sub-cluster identities by camera.
    • Update cluster and camera proxies by averaging.
    • Shuffle dataset, construct minibatches (batch size $B = 128$ or $64$).
    • For each batch, compute enhanced features, cluster/camera proxy losses (and optionally token constraint losses), and backpropagate via SGD.
    • Reduce learning rate at predefined epochs.
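
The epoch loop above can be summarized schematically as follows. This sketch uses scikit-learn's DBSCAN and plain PyTorch; `build_camera_proxies` is a placeholder and `cluster_nce` / `camera_nce` refer to the loss sketches in Section 2, so none of this should be read as the authors' implementation:

```python
import numpy as np
import torch
from sklearn.cluster import DBSCAN

def run_epoch(model, loader, all_feats, all_cams, eps=0.6, min_samples=8, lam=0.7):
    """One training epoch, schematically: cluster, rebuild proxies, then optimize."""
    # 1) Pseudo-labels from DBSCAN on the (L2-normalized) CLS features of all images.
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(all_feats)

    # Re-assign DBSCAN outliers (label -1) to their nearest cluster centroid.
    cluster_ids = [c for c in np.unique(labels) if c != -1]
    centroids = np.stack([all_feats[labels == c].mean(0) for c in cluster_ids])
    for i in np.where(labels == -1)[0]:
        labels[i] = np.argmin(np.linalg.norm(centroids - all_feats[i], axis=1))

    # 2) Cluster proxies M and camera-aware proxies C (mean feature per (cluster, camera)).
    M = torch.tensor(centroids, dtype=torch.float32)
    C, cluster_of_proxy = build_camera_proxies(all_feats, labels, all_cams)  # placeholder

    # 3) Mini-batch optimization with the identity and camera-aware contrastive losses.
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
    labels_t = torch.as_tensor(labels)
    for imgs, idx in loader:
        v = model(imgs)                                  # enhanced features (Section 1)
        y = labels_t[idx]
        loss = cluster_nce(v, y, M) + lam * camera_nce(v, y, C, cluster_of_proxy)
        opt.zero_grad(); loss.backward(); opt.step()
```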

Main protocol hyperparameters are summarized as follows:

| Parameter | Market-1501 | MSMT17 |
|---|---|---|
| Backbone | ViT-B/16 (LUPerson) | ViT-B/16 (LUPerson) |
| Batch size | 128 | 128 |
| Learning rate | 1e-3 (decay ×0.1) | 1e-3 (decay ×0.1) |
| DBSCAN eps / min samples | 0.6 / 8 | 0.6 / 16 |
| Top-K local tokens | 40% of tokens | 40% of tokens |
| Proxy momentum | 0.2 | 0.2 |
| $\lambda$ (camera loss weight) | 0.7 | 1.0 |
| Epochs | 60 | 60 |
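
Purely as an illustration, these settings could be collected into a configuration mapping; every key name here is hypothetical, not part of a released codebase:

```python
# Hypothetical configuration mirroring the table above; key names are illustrative.
CONFIG = {
    "market1501": dict(backbone="vit_b16_luperson", batch_size=128, lr=1e-3, lr_decay=0.1,
                       dbscan_eps=0.6, dbscan_min_samples=8, top_k_ratio=0.4,
                       proxy_momentum=0.2, lambda_cam=0.7, epochs=60),
    "msmt17":     dict(backbone="vit_b16_luperson", batch_size=128, lr=1e-3, lr_decay=0.1,
                       dbscan_eps=0.6, dbscan_min_samples=16, top_k_ratio=0.4,
                       proxy_momentum=0.2, lambda_cam=1.0, epochs=60),
}
```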

5. Quantitative Performance

ViTC-UReID achieves leading results for unsupervised person re-identification, as measured by mean average precision (mAP) and Cumulative Matching Characteristic (CMC) accuracy at ranks 1, 5, and 10:

| Dataset | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|
| Market-1501 | 92.8 | 97.1 | 99.1 | 99.5 |
| MSMT17 | 63.6 | 85.8 | 92.3 | 94.1 |
| CUHK03 (L) | 89.8 | 91.1 | 95.1 | 97.1 |

On Market-1501, mAP and rank-1 exceed all prior fully unsupervised methods, improving over DAPRH by 6.9 mAP points and over the IQAGA UDA approach by over 56 mAP points when no labeled source data is used. Ablation studies confirm that both enhanced local-global features and the camera-aware proxy loss contribute significant independent gains: for example, on MSMT, camera-aware proxy learning alone boosts mAP by +9.6 and rank-1 by +6.5, compared to an identity-only baseline (Pham et al., 4 Jan 2026). TCMM reports Market-1501 performance of mAP 83.2 and rank-1 92.1 (Zhu et al., 15 Jan 2025).

6. Analysis of Key Mechanisms and Ablations

Ablation experiments demonstrate the synergy between cluster-level, camera-aware, and token-level mechanisms:

| Component | Market-1501 mAP / R1 | MSMT17 mAP / R1 |
|---|---|---|
| Baseline (ClusterNCE) | 90.7 / 96.4 | 53.3 / 78.9 |
| + EIR (CLS + local tokens) | 92.5 / 96.5 | 55.7 / 80.6 |
| + CAP (camera-aware loss) | 92.0 / 96.6 | 62.9 / 85.4 |
| Full (EIR + CAP) | 92.8 / 97.1 | 63.6 / 85.8 |

Removing token constraints or memory components in TCMM reduces mAP by 3–4 points on Market-1501. Disabling outlier re-mining (i.e., no longer re-assigning hard DBSCAN outliers to their nearest clusters) decreases mAP by up to a further 1.5 points. This demonstrates that token-level suppression mitigates patch errors, while the memory components and camera-aware proxying are necessary for robust cross-camera matching (Zhu et al., 15 Jan 2025).

7. Significance and Position within ReID Research

ViTC-UReID advances unsupervised ReID by:

  • Leveraging the transformer architecture, whose self-attention aggregates visual cues globally, to outperform convolutional alternatives (e.g., ResNet-50) on large-scale, multi-camera datasets.
  • Introducing camera-aware proxies, which enforce invariance to camera-specific artifacts directly in the representation.
  • Implementing token-level suppression to handle transformer susceptibility to background noise and patch artifacts, as evidenced by TCMM.
  • Providing a fully unsupervised protocol: no labeled target or source data, pseudo-labels derived online via clustering, and no requirement for handcrafted attributes or domain-specific pretraining.

The approach has established new state-of-the-art baselines and demonstrated generalizability across multiple datasets, supporting its utility for real-world surveillance deployments and related transfer-free ReID scenarios (Pham et al., 4 Jan 2026, Zhu et al., 15 Jan 2025).
