Unsupervised Visible-Infrared Person Re-Identification

Updated 15 December 2025
  • The paper introduces a unified embedding network to align RGB and IR images in unsupervised settings, addressing severe modality gaps without labeled data.
  • It employs multi-level clustering and prototype-based techniques, incorporating intra- and cross-modality associations, optimal transport matching, and bias mitigation strategies.
  • Results on SYSU-MM01 and RegDB benchmarks demonstrate significant improvements using contrastive loss fusion, memory banks, and robust pseudo-label calibration techniques.

Unsupervised Visible-Infrared Person Re-Identification (USVI-ReID) is the task of matching individuals across visible (RGB) and infrared (IR) modalities without any human-provided annotations. This problem is motivated by practical surveillance scenarios, such as around-the-clock monitoring, where RGB and IR imagery are widely used but exhaustive cross-modality labeling is infeasible. USVI-ReID is characterized by a severe modality gap, the absence of ground-truth correspondence, and significant intra- and inter-modality noise, making it one of the canonical unsupervised cross-modality benchmarks in computer vision (Shi et al., 29 Feb 2024, Cheng et al., 27 Apr 2025, Wu et al., 26 Dec 2024, Wang et al., 8 Dec 2025).

1. Problem Formulation and Core Challenges

Given two unlabeled datasets—\mathcal{X}^v of visible images and \mathcal{X}^r of infrared images—the goal of USVI-ReID is to learn a single embedding network f_\theta(\cdot) mapping both modalities into a shared feature space such that instances of the same identity, regardless of modality, are close, while those of different identities are well separated. Unlike supervised VI-ReID, no explicit identity labels or cross-modal pairs are available.
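The formulation above can be made concrete with a minimal numpy sketch: a single (toy, linear) encoder shared by both modalities projects features into one space, and cross-modal retrieval is then cosine-similarity ranking in that space. The `embed` function and all data here are hypothetical stand-ins, not any paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W):
    """Toy stand-in for the shared encoder f_theta: a linear map
    followed by L2 normalization (illustrative only)."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Unlabeled "visible" and "infrared" features (e.g. backbone outputs).
X_v = rng.normal(size=(6, 128))   # 6 visible samples
X_r = rng.normal(size=(4, 128))   # 4 infrared samples
W = rng.normal(size=(128, 64))    # one projection shared by both modalities

Z_v, Z_r = embed(X_v, W), embed(X_r, W)

# Cross-modal retrieval: rank the infrared gallery for each visible query
# by cosine similarity in the shared space.
sim = Z_v @ Z_r.T                  # (6, 4) similarity matrix
ranking = np.argsort(-sim, axis=1) # best-matching IR index first per query
```

The unsupervised objective is then to train the encoder so that same-identity pairs dominate these rankings without ever observing identity labels.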

The primary technical obstacles are the severe visible-infrared modality gap, the complete absence of cross-modal identity correspondence, and pervasive pseudo-label noise introduced by unsupervised clustering.

2. Prototype-Based and Clustering Paradigms

Most USVI-ReID solutions employ a cluster-based pipeline. This entails two levels of clustering and prototype construction: intra-modality clustering to form per-modality pseudo-identities, followed by cross-modality cluster association that links prototypes across the visible-infrared divide.

Recent methodologies extend beyond naive clustering to mitigate fundamental shortfalls such as cluster fragmentation, modality dominance bias, and noisy correspondences; the following sections detail these refinements.

3. Loss Functions and Learning Strategies

USVI-ReID training schedules are dominated by variants of instance-to-prototype (InfoNCE-style) contrastive loss or cross-entropy using pseudo-labels. Three canonical prototype selection strategies—commonality, divergence, and variety—have been explicitly formulated (Shi et al., 29 Feb 2024):

  • Commonality: Centroid-based prototype for each cluster:

\mathbf{c}_k = \frac{1}{|\mathcal{C}_k|} \sum_{x_i \in \mathcal{C}_k} f(x_i)

  • Divergence: Hardest boundary sample within a cluster:

\mathbf{h}_k = \arg\max_{x_i \in \mathcal{C}_k} \| f(x_i) - \mathbf{c}_k \|

  • Variety: Random dynamic prototype to cover intra-cluster diversity.

The total contrastive loss often fuses these elements according to a progressive, stage-wise schedule to first stabilize clusters (commonality), then increase decision-boundary sharpness (divergence and variety) (Shi et al., 29 Feb 2024). In some frameworks, multi-center contrastive losses are used to align both coarse- and fine-grained cluster structure (Shi et al., 15 Sep 2025, Shi et al., 12 Jan 2024).
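The three prototype strategies and the instance-to-prototype contrastive objective can be sketched in a few lines of numpy. This is a toy illustration under assumed data, not the reference implementation of (Shi et al., 29 Feb 2024); the function name `info_nce` and the temperature value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# A single cluster C_k of L2-normalized features (hypothetical data).
feats = rng.normal(size=(8, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

# Commonality: cluster centroid c_k.
c_k = feats.mean(axis=0)

# Divergence: hardest boundary sample h_k, farthest from the centroid.
h_k = feats[np.argmax(np.linalg.norm(feats - c_k, axis=1))]

# Variety: a random dynamic prototype, re-drawn each iteration.
v_k = feats[rng.integers(len(feats))]

def info_nce(query, prototypes, pos_idx, tau=0.05):
    """Instance-to-prototype InfoNCE loss for a single query."""
    logits = prototypes @ query / tau
    logits = logits - logits.max()              # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[pos_idx])

# Example: a query pulled toward its own centroid among K=4 prototypes.
protos = np.stack([c_k] + [rng.normal(size=32) for _ in range(3)])
protos /= np.linalg.norm(protos, axis=1, keepdims=True)
loss = info_nce(feats[0], protos, pos_idx=0)
```

A stage-wise schedule would switch which of `c_k`, `h_k`, `v_k` serves as the positive prototype as training progresses.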

The learning process is further augmented by advanced memory bank architectures, soft label propagation, and dynamic sample weighting.
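A common memory-bank design in cluster-contrast pipelines maintains one prototype per cluster and refreshes it with a momentum update as features arrive. The sketch below assumes that scheme; the momentum value and function name are illustrative, not taken from any specific cited paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Memory bank of K cluster prototypes, kept L2-normalized.
K, d = 5, 32
memory = rng.normal(size=(K, d))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)

def update_memory(memory, feat, k, mu=0.9):
    """Momentum update of prototype k with an incoming feature,
    followed by re-normalization (a common memory-bank scheme)."""
    memory[k] = mu * memory[k] + (1 - mu) * feat
    memory[k] /= np.linalg.norm(memory[k])
    return memory

feat = rng.normal(size=d)
feat /= np.linalg.norm(feat)
memory = update_memory(memory, feat, k=3, mu=0.9)
```

Keeping prototypes normalized means the contrastive logits remain cosine similarities, which stabilizes training as cluster assignments change between epochs.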

4. Cross-Modality Association and Bias Mitigation

A central innovation in recent USVI-ReID methods is modality-aware association. Instead of treating visible and infrared data as distributionally symmetric, state-of-the-art methods construct modality-aware distances and employ optimal transport or bipartite assignment to match clusters:

  • Modality-aware Jaccard distance: For any sample, neighbors are drawn equally from both modalities, the reciprocal neighbor set is balanced, and cluster-linking is performed over this rectified distance matrix to avoid dominance by well-sampled modalities (Wang et al., 8 Dec 2025).
  • Optimal transport-based matching: The assignment problem is regularized by a uniform prior over possible matches, ensuring fairness. Matching is performed in both visible→infrared and infrared→visible directions with entropic regularization, and cross-modality cluster pairs are declared wherever at least one direction yields a high-confidence match (Zhang et al., 17 Jul 2024).
  • Supplementary graph and Hungarian matching: Graph-based association mitigates cluster fragmentation and leverages both feature-space and relational metrics (e.g., mean feature Jaccard) (Huang et al., 20 Nov 2025).
  • Bidirectional consistency: Reciprocal assignment constraints reject noisy, non-mutual matches, increasing correspondence precision (Shi et al., 15 Sep 2025).

Such mechanisms directly address the high propensity for cluster mismatch due to dominance bias or sample imbalance, and are crucial for robust identity transfer across the visible-infrared divide.
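The optimal-transport association step can be illustrated with a minimal Sinkhorn sketch: entropic regularization with uniform priors over clusters, followed by the at-least-one-direction acceptance rule described above. This is a generic Sinkhorn implementation under assumed toy costs, in the spirit of (Zhang et al., 17 Jul 2024), not that paper's code.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropic-regularized OT with uniform marginals (Sinkhorn iterations)."""
    m, n = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.ones(m) / m, np.ones(n) / n   # uniform priors over clusters
    u, v = np.ones(m) / m, np.ones(n) / n
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan

rng = np.random.default_rng(3)
# Hypothetical cost matrix between 4 visible and 4 infrared clusters.
cost = rng.uniform(size=(4, 4))
P = sinkhorn(cost)

# Accept a cross-modality cluster pair (i, j) when it is the best match in
# at least one direction: argmax of row i (v->r) or of column j (r->v).
row_best = P.argmax(axis=1)
col_best = P.argmax(axis=0)
pairs = {(i, int(row_best[i])) for i in range(P.shape[0])} | \
        {(int(col_best[j]), j) for j in range(P.shape[1])}
```

The uniform marginals are what enforce fairness: no single well-sampled cluster can absorb a disproportionate share of the transport mass.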

5. Label Noise Modeling and Robust Optimization

USVI-ReID training is fundamentally susceptible to label noise, both due to unsupervised clustering and inherent ambiguity in cross-modal appearance. Several orthogonal approaches have been proposed:

  • Noisy pseudo-label calibration: Pseudo-label assignments are recalibrated using intra-cluster neighbor statistics, sample agreement, or Beta/Gaussian mixture models fitted to per-sample loss values to probabilistically identify and correct mislabelled samples (Liu et al., 10 Apr 2024, Yin et al., 9 May 2024).
  • Neighbor-guided label refinement: Pseudo-labels are corrected or smoothed by aggregating over neighborhood consensus, and samples with inconsistency above a threshold are dynamically down-weighted in the loss (Teng et al., 16 Dec 2024, Cheng et al., 2023).
  • Dynamic weighting and outlier rejection: Sample weights or loss contributions are adaptively set based on neighbor-purity or model confidence, allowing robust training even in the presence of persistent noise (Teng et al., 16 Dec 2024, Liu et al., 10 Apr 2024).
  • Progressive schedule and multi-stage optimization: Curriculum strategies initially focus on stable core clusters and gradually introduce more aggressive association/contrastive objectives as the feature space matures (Shi et al., 29 Feb 2024, Wu et al., 26 Dec 2024, Yang et al., 11 Dec 2024).

Empirical evidence shows that these noise-aware techniques are essential for high-fidelity cross-modality correspondences, especially in benchmarks with large modality imbalance or extreme identity variation.
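Neighbor-guided refinement with dynamic down-weighting, as described above, can be sketched as follows. The threshold, neighborhood size, and function name are illustrative assumptions, not the exact scheme of any cited work.

```python
import numpy as np

rng = np.random.default_rng(4)

# L2-normalized features with noisy pseudo-labels (hypothetical toy data).
feats = rng.normal(size=(20, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
labels = rng.integers(0, 3, size=20)

def refine_labels(feats, labels, k=5, agree_thresh=0.6):
    """Neighbor-consensus refinement: a sample adopts the majority label of
    its k nearest neighbors when agreement is high; otherwise its loss
    weight is softened instead of relabeling it."""
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)           # exclude self from neighbors
    nn = np.argsort(-sim, axis=1)[:, :k]     # k nearest neighbors per sample
    refined = labels.copy()
    weights = np.ones(len(labels))
    for i in range(len(labels)):
        votes = np.bincount(labels[nn[i]], minlength=labels.max() + 1)
        maj = votes.argmax()
        agree = votes[maj] / k
        if agree >= agree_thresh:
            refined[i] = maj                 # confident consensus: relabel
        else:
            weights[i] = agree               # ambiguous: down-weight in loss
    return refined, weights

refined, weights = refine_labels(feats, labels)
```

The returned `weights` would multiply each sample's contrastive loss term, implementing the dynamic down-weighting described above.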

6. Multi-Scale, Part-Level, and Hierarchical Modeling

To address the limitations of global features and cluster-level prototypes, several recent works have integrated multi-scale or part-level mechanisms:

  • Multi-center/multi-prototype memory: Clusters are further split into subgroups or "part" centers by within-cluster K-means, and multi-center contrastive learning is applied to sharpen intra-identity boundaries and highlight local variations (Shi et al., 15 Sep 2025, Shi et al., 29 Feb 2024).
  • Fine-grained semantic alignment: Part-based features (extracted from convolutional map divisions) are semantically aligned across modalities using query-guided attention mechanisms. Separate memory banks are maintained for each part, and joint contrastive learning is conducted at both global and part-level scales (Cheng et al., 27 Apr 2025).
  • Hybrid memory and collaborative refinement: Memory banks simultaneously store modality-specific, modality-invariant, global, and part-level features, and positive sets are dynamically mined via multi-level neighborhood intersection and part mutual correction (Cheng et al., 27 Apr 2025, Yin et al., 9 May 2024).

These strategies allow the model to accommodate intra-identity diversity, viewpoint, and occlusion, and overcome coarse association errors caused by global descriptor collapse.
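The within-cluster splitting behind multi-center memories can be sketched with a minimal K-means pass: each identity cluster is sub-divided, and each sub-center then acts as an additional positive prototype. The K-means implementation and all data below are illustrative, not any cited method's code.

```python
import numpy as np

rng = np.random.default_rng(5)

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-means (for illustration only)."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers

# One identity cluster of features, split into 2 sub-centers; multi-center
# contrast treats each sub-center as a positive prototype for the cluster's
# samples, sharpening intra-identity boundaries.
cluster_feats = rng.normal(size=(12, 8))
sub_centers = kmeans(cluster_feats, k=2)
```

Part-level variants follow the same pattern, but derive the sub-prototypes from spatial divisions of the convolutional map rather than from feature-space sub-clusters.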

7. Benchmarks and Empirical Results

USVI-ReID research is empirically grounded on two benchmarks:

  • SYSU-MM01: Multi-camera, multimodal dataset (22,258 RGB and 11,909 IR images; 395 training / 96 test identities), evaluated under two main protocols ("all-search" and "indoor-search").
  • RegDB: Paired visible/infrared dataset (412 identities, ~8,000 images), with retrieval reported in both directions (visible→infrared and infrared→visible).

Methods integrating modality-aware matching, label-noise modeling, and multi-level prototypes have established new state-of-the-art performance. For example, (Wang et al., 8 Dec 2025) reports 67.1% Rank-1 and 63.1% mAP on SYSU-MM01 (all-search), and 94.3% Rank-1 and 89.1% mAP on RegDB, outperforming earlier approaches by significant margins. Ablation studies consistently show that bias mitigation, robust cross-modality association, and hierarchical or part-level modeling contribute orthogonally to gains in accuracy and cluster purity.

These results indicate that, while modality discrepancies and noisy supervision present inherent limitations, judicious integration of bias mitigation, robust clustering, and multi-scale feature modeling can close a large proportion of the performance gap to fully supervised VI-ReID, even in large-scale, diverse benchmarks (Cheng et al., 27 Apr 2025, Wu et al., 26 Dec 2024).


References:

(Shi et al., 29 Feb 2024, Cheng et al., 27 Apr 2025, Wu et al., 26 Dec 2024, Wang et al., 8 Dec 2025, Huang et al., 20 Nov 2025, Shi et al., 15 Sep 2025, Zhang et al., 17 Jul 2024, Teng et al., 16 Dec 2024, Shi et al., 12 Jan 2024, Cheng et al., 2023, Yang et al., 11 Dec 2024, Yin et al., 9 May 2024).
