ViTC-UReID: Unsupervised ReID with Vision Transformers

Updated 11 January 2026
  • The paper introduces ViTC-UReID, a novel unsupervised approach that integrates Vision Transformer-based feature encoding with camera-aware and token-level proxy learning.
  • It employs a two-stage learning process: generating pseudo-labels via DBSCAN clustering and refining feature representations using dual contrastive losses for identity and camera discrimination.
  • The method achieves strong performance on benchmarks such as Market-1501 and MSMT17, with significant gains in mAP and rank-1 accuracy over previous unsupervised methods.

ViTC-UReID is a fully unsupervised person re-identification (ReID) framework that unifies Vision Transformer (ViT) feature encoding, enhanced token-level modeling, and advanced proxy-based instance discrimination. It is designed to address major challenges in ReID—including appearance variation, camera domain shifts, and unlabeled datasets—by leveraging a two-stage learning process: pseudo-label generation through clustering and unsupervised representation learning based on proxy memories and camera-aware contrastive objectives. State-of-the-art results have been demonstrated across Market-1501, DukeMTMC-reID, MSMT17, and CUHK03, showing substantial advances in unsupervised ReID performance (Pham et al., 4 Jan 2026, Zhu et al., 15 Jan 2025).

1. Architectural Components and Workflow

ViTC-UReID applies an iterative process comprising two main stages:

  • Pseudo-label Generation: Patch-token features are extracted for all images using a pretrained Vision Transformer (ViT-B/16). The global CLS features are clustered via DBSCAN to produce “identity” clusters, which are subsequently subdivided by camera ID, yielding camera-aware subclusters. Two proxy memories are constructed: (i) cluster-center memory for each identity and (ii) camera-center memory for each camera within an identity cluster. Uniquely, outlier samples initially unassigned by DBSCAN are not discarded but are forcibly assigned to the nearest cluster centroid in subsequent epochs (Zhu et al., 15 Jan 2025).
  • Unsupervised Representation Learning: For each input image, ViTC-UReID computes an “enhanced” feature by concatenating the global CLS token with the top-K local tokens, selected via self-attention scoring and further processed by an MLP and MaxPool operation. The model applies two contrastive-style losses: ClusterNCE (identity discrimination by cluster proxies) and a camera-aware loss (CameraNCE) over camera-center proxies. The total loss for an image sample $x_i$ is
$$L_i = L_\mathrm{NCE}(v_i, y_i; M) + \lambda \cdot L_\mathrm{CAP}(v_i, p_i; C),$$
where $v_i$ is the enhanced embedding, $y_i$ is the cluster label, and $p_i$ is the camera-cluster label. Parameters are updated via SGD, and proxies and clusters are refreshed after each epoch.
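
A minimal sketch of the enhanced-feature step described above, assuming a PyTorch ViT that exposes per-patch tokens and CLS-attention scores; the module name, layer sizes, and `top_k_ratio` are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class EnhancedFeature(nn.Module):
    """Sketch: concatenate the global CLS token with an MLP+MaxPool summary of the
    top-K patch tokens, ranked by their attention to CLS (dimensions are assumptions)."""

    def __init__(self, dim=768, top_k_ratio=0.4):
        super().__init__()
        self.top_k_ratio = top_k_ratio
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, cls_tok, patch_toks, cls_attn):
        # cls_tok:    (B, D)    global CLS embedding
        # patch_toks: (B, N, D) patch-token embeddings
        # cls_attn:   (B, N)    attention of CLS to each patch (e.g. averaged over heads)
        k = max(1, int(self.top_k_ratio * patch_toks.size(1)))
        top_idx = cls_attn.topk(k, dim=1).indices                       # (B, k)
        top_toks = torch.gather(
            patch_toks, 1, top_idx.unsqueeze(-1).expand(-1, -1, patch_toks.size(-1)))
        local = self.mlp(top_toks).max(dim=1).values                    # MaxPool over the k tokens
        return torch.cat([cls_tok, local], dim=-1)                      # enhanced feature (B, 2D)
```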

2. Proxy-Based and Camera-Aware Learning Objectives

To simultaneously boost feature discrimination and camera invariance, ViTC-UReID introduces two pivotal contrastive objectives:

  • ClusterNCE: Enforces identity-level discrimination by contrasting features against their corresponding cluster proxies:
$$L_\mathrm{NCE}(v_i, y_i; M) = -\log \frac{\exp(v_i \cdot m^+ / \tau)}{\sum_{k=1}^{K} \exp(v_i \cdot m_k / \tau)},$$
with $m^+$ the positive proxy for $y_i$, and $\tau$ a temperature parameter.
  • CameraNCE: Incorporates camera-awareness by contrasting features against proxies specific to both identity and camera:
$$L_\mathrm{CAP}(v_i, p_i; C) = -\frac{1}{|P(i)|} \sum_{b \in P(i)} \log \frac{\exp(v_i \cdot c_{p_i, b} / \tau_c)}{\sum_{k, b'} \exp(v_i \cdot c_{k, b'} / \tau_c)},$$
where $c_{p_i, b}$ is the mean feature of cluster $p_i$ under camera $b$, and $\tau_c$ is the temperature for the camera-proxy loss. The weight $\lambda$ balances the two objectives (empirically set to 0.7 or 1.0) (Pham et al., 4 Jan 2026).
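
A compact sketch of the two objectives, under the assumption that cluster and camera proxies are stored as rows of L2-normalized memory tensors; the variable names (`M`, `C`, `cluster_of_proxy`) and temperatures are illustrative:

```python
import torch
import torch.nn.functional as F

def cluster_nce(v, y, M, tau=0.05):
    """ClusterNCE sketch: cross-entropy over similarities to all K cluster proxies M (K, D)."""
    logits = F.normalize(v, dim=1) @ F.normalize(M, dim=1).t() / tau        # (B, K)
    return F.cross_entropy(logits, y)

def camera_nce(v, p, C, cluster_of_proxy, tau_c=0.07):
    """CameraNCE sketch: C stacks one proxy per (cluster, camera) pair, shape (P, D);
    cluster_of_proxy (P,) maps each camera proxy to its identity cluster. The positives
    for sample i are all camera proxies belonging to its own cluster p_i."""
    sims = F.normalize(v, dim=1) @ F.normalize(C, dim=1).t() / tau_c        # (B, P)
    log_prob = F.log_softmax(sims, dim=1)
    pos_mask = (cluster_of_proxy.unsqueeze(0) == p.unsqueeze(1)).float()    # (B, P)
    # average -log p over the positive camera proxies of each sample, then over the batch
    return (-(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)).mean()

# Total loss per sample, as in Section 1: L_i = cluster_nce(...) + lambda * camera_nce(...)
```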

3. Token-Level Modeling and Multi-Scale Memory (TCMM Variant)

In alternative implementations, such as the TCMM variant (Zhu et al., 15 Jan 2025), ViTC-UReID introduces explicit token-level constraints and a multi-scale memory:

  • ViT Token Constraint: Within each feature sequence, a fraction $\alpha$ of patch tokens is randomly selected to act as "negative" tokens. A constraint loss penalizes these tokens’ similarity to the global CLS token, limiting the impact of spurious or noisy image patches:
$$L_\mathrm{constraint} = -\frac{1}{B} \sum_{b=1}^{B} \sum_{i \in N_b} \log \sigma\left( A_i(t_i, t_{\mathrm{cls}}) / \tau \right)$$
  • Multi-Scale Memory: Dual memory banks maintain both instance-level and prototype-level features. An instance memory stores every feature with its pseudo-label, while the prototype memory tracks cluster centroids. These are momentum-updated each iteration. Losses are imposed both at the prototype (cluster) and instance level, complementing the token constraint loss.

Combined, these mechanisms suppress patch noise and leverage both global and fine-grained memory, reinforcing robust cluster separation and intra-class compactness.
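
A minimal sketch of the token constraint and of a momentum-style memory refresh as described above; the sampling ratio `alpha`, the memory layout, the update convention, and all names are assumptions rather than the TCMM code:

```python
import torch
import torch.nn.functional as F

def token_constraint_loss(patch_toks, cls_tok, alpha=0.3, tau=0.05):
    """Sketch of the ViT token constraint: score a random fraction alpha of patch
    tokens ('negatives') against the CLS token, following the formula above."""
    B, N, D = patch_toks.shape
    n_neg = max(1, int(alpha * N))
    idx = torch.randint(0, N, (B, n_neg), device=patch_toks.device)          # random negative tokens
    neg = torch.gather(patch_toks, 1, idx.unsqueeze(-1).expand(-1, -1, D))   # (B, n_neg, D)
    sim = F.cosine_similarity(neg, cls_tok.unsqueeze(1), dim=-1) / tau       # (B, n_neg)
    return -torch.log(torch.sigmoid(sim)).mean()

def momentum_update(memory, feats, labels, m=0.2):
    """Sketch of the instance/prototype memory refresh: blend new features into their
    memory slots with momentum m (convention assumed), then re-normalize."""
    memory[labels] = m * memory[labels] + (1 - m) * feats
    memory[labels] = F.normalize(memory[labels], dim=1)
    return memory
```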

4. Training Protocol and Hyperparameters

The training process follows these steps:

  1. Pretraining: The ViT-B/16 backbone is pretrained (e.g., on LUPerson).
  2. Epoch Loop (typically $E = 60$; sketched in code after these steps):
    • Extract CLS features for all images.
    • Run DBSCAN clustering (with dataset-adaptive settings: $\epsilon = 0.6$, variable min samples).
    • Sub-cluster identities by camera.
    • Update cluster and camera proxies by averaging.
    • Shuffle dataset, construct minibatches (batch size $B = 128$ or $64$).
    • For each batch, compute enhanced features, cluster/camera proxy losses (and optionally token constraint losses), and backpropagate via SGD.
    • Reduce learning rate at predefined epochs.
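
The epoch loop above can be summarized schematically as follows. This sketch uses scikit-learn's DBSCAN and plain PyTorch; `build_camera_proxies` is a placeholder and `cluster_nce` / `camera_nce` refer to the loss sketches in Section 2, so none of this should be read as the authors' implementation:

```python
import numpy as np
import torch
from sklearn.cluster import DBSCAN

def run_epoch(model, loader, all_feats, all_cams, eps=0.6, min_samples=8, lam=0.7):
    """One training epoch, schematically: cluster, rebuild proxies, then optimize."""
    # 1) Pseudo-labels from DBSCAN on the (L2-normalized) CLS features of all images.
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(all_feats)

    # Re-assign DBSCAN outliers (label -1) to their nearest cluster centroid.
    cluster_ids = [c for c in np.unique(labels) if c != -1]
    centroids = np.stack([all_feats[labels == c].mean(0) for c in cluster_ids])
    for i in np.where(labels == -1)[0]:
        labels[i] = np.argmin(np.linalg.norm(centroids - all_feats[i], axis=1))

    # 2) Cluster proxies M and camera-aware proxies C (mean feature per (cluster, camera)).
    M = torch.tensor(centroids, dtype=torch.float32)
    C, cluster_of_proxy = build_camera_proxies(all_feats, labels, all_cams)  # placeholder

    # 3) Mini-batch optimization with the identity and camera-aware contrastive losses.
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
    labels_t = torch.as_tensor(labels)
    for imgs, idx in loader:
        v = model(imgs)                                  # enhanced features (Section 1)
        y = labels_t[idx]
        loss = cluster_nce(v, y, M) + lam * camera_nce(v, y, C, cluster_of_proxy)
        opt.zero_grad(); loss.backward(); opt.step()
```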

Main protocol hyperparameters are summarized as follows:

| Parameter | Market-1501 | MSMT17 |
|---|---|---|
| Backbone | ViT-B/16 (LUPerson) | ViT-B/16 (LUPerson) |
| Batch size | 128 | 128 |
| Learning rate | 1e-3 (decay ×0.1) | 1e-3 (decay ×0.1) |
| DBSCAN eps / min samples | 0.6 / 8 | 0.6 / 16 |
| Top-K local tokens | 40% of tokens | 40% of tokens |
| Proxy momentum | 0.2 | 0.2 |
| $\lambda$ (camera loss weight) | 0.7 | 1.0 |
| Epochs | 60 | 60 |
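
Purely as an illustration, these settings could be collected into a configuration mapping; every key name here is hypothetical, not part of a released codebase:

```python
# Hypothetical configuration mirroring the table above; key names are illustrative.
CONFIG = {
    "market1501": dict(backbone="vit_b16_luperson", batch_size=128, lr=1e-3, lr_decay=0.1,
                       dbscan_eps=0.6, dbscan_min_samples=8, top_k_ratio=0.4,
                       proxy_momentum=0.2, lambda_cam=0.7, epochs=60),
    "msmt17":     dict(backbone="vit_b16_luperson", batch_size=128, lr=1e-3, lr_decay=0.1,
                       dbscan_eps=0.6, dbscan_min_samples=16, top_k_ratio=0.4,
                       proxy_momentum=0.2, lambda_cam=1.0, epochs=60),
}
```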

5. Quantitative Performance

ViTC-UReID achieves leading results for unsupervised person re-identification, as measured by mean average precision (mAP) and Cumulative Matching Characteristic (CMC) accuracy at ranks 1, 5, and 10:

| Dataset | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|
| Market-1501 | 92.8 | 97.1 | 99.1 | 99.5 |
| MSMT17 | 63.6 | 85.8 | 92.3 | 94.1 |
| CUHK03 (L) | 89.8 | 91.1 | 95.1 | 97.1 |

On Market-1501, mAP and rank-1 exceed all prior fully unsupervised methods, improving over DAPRH by 6.9 mAP points and over the IQAGA UDA approach by over 56 mAP points when no labeled source data is used. Ablation studies confirm that both enhanced local-global features and the camera-aware proxy loss contribute significant independent gains: for example, on MSMT, camera-aware proxy learning alone boosts mAP by +9.6 and rank-1 by +6.5, compared to an identity-only baseline (Pham et al., 4 Jan 2026). TCMM reports Market-1501 performance of mAP 83.2 and rank-1 92.1 (Zhu et al., 15 Jan 2025).

6. Analysis of Key Mechanisms and Ablations

Ablation experiments demonstrate the synergy between cluster-level, camera-aware, and token-level mechanisms:

| Component | Market-1501 mAP / R1 | MSMT17 mAP / R1 |
|---|---|---|
| Baseline (ClusterNCE) | 90.7 / 96.4 | 53.3 / 78.9 |
| + EIR (CLS + local tokens) | 92.5 / 96.5 | 55.7 / 80.6 |
| + CAP (camera-aware loss) | 92.0 / 96.6 | 62.9 / 85.4 |
| Full (EIR + CAP) | 92.8 / 97.1 | 63.6 / 85.8 |

Removing token constraints or memory components in TCMM reduces mAP by 3–4 points on Market-1501. Disabling outlier re-mining (i.e., no longer re-assigning hard DBSCAN outliers to their nearest clusters) decreases mAP by up to a further 1.5 points. This demonstrates that token-level suppression mitigates patch errors, while the memory components and camera-aware proxying are necessary for robust cross-camera matching (Zhu et al., 15 Jan 2025).

7. Significance and Position within ReID Research

ViTC-UReID advances unsupervised ReID by:

  • Leveraging the transformer architecture, whose self-attention aggregates visual cues globally, to outperform convolutional alternatives (e.g., ResNet-50) on large-scale, multi-camera datasets.
  • Introducing camera-aware proxies, which enforce invariance to camera-specific artifacts directly in the representation.
  • Implementing token-level suppression to handle transformer susceptibility to background noise and patch artifacts, as evidenced by TCMM.
  • Providing a fully unsupervised protocol: no labeled target or source data, pseudo-labels derived online via clustering, and no requirement for handcrafted attributes or domain-specific pretraining.

The approach has established new state-of-the-art baselines and demonstrated generalizability across multiple datasets, supporting its utility for real-world surveillance deployments and related transfer-free ReID scenarios (Pham et al., 4 Jan 2026, Zhu et al., 15 Jan 2025).
