Multi-Level Tracklet Contrastive Constraint
- The paper introduces CTACL, which uses multi-level tracklet contrastive constraints built from weak supervisory signals (camera and tracklet IDs) to enable unsupervised vehicle re-identification.
- It integrates intra-camera and inter-camera mining strategies that improve discrimination, achieving up to an 11% Rank-1 boost on benchmarks like VeRi-776.
- A camera-uniformity domain adaptation module is employed to mitigate view-specific biases, enhancing robustness in cross-camera matching tasks.
Multi-level tracklet contrastive constraint is a technique employed in the Camera-Tracklet-Aware Contrastive Learning (CTACL) framework for unsupervised vehicle re-identification across multi-camera networks. CTACL capitalizes on the availability of “weak” supervisory signals (camera IDs and per-camera tracklet IDs) to circumvent the need for global identity annotations, which are labor-intensive to obtain due to appearance and viewpoint variations. The multi-level constraint formalizes contrastive objectives at both intra-camera and inter-camera levels, enabling the representation learning pipeline to associate vehicles reliably across disparate camera views while mitigating domain biases (Yu et al., 2021).
1. Data Structure and Subdomain Partitioning
CTACL operates on an unlabelled vehicle dataset $X = \{x_1, \dots, x_N\}$ annotated with camera identifiers $c_i \in \{1, \dots, N_c\}$ and per-camera tracklet identifiers $y_i$. The full dataset is decomposed into camera-level subdomains $X_1, \dots, X_{N_c}$, where $X = \bigcup_{c=1}^{N_c} X_c$. Tracklets are formed by temporal grouping of vehicle detections within each camera, which provides localized positive sample sets for contrastive learning.
The encoder $f_\theta$ is instantiated as a ResNet-50 network pretrained on ImageNet, followed by $\ell_2$ normalization to produce features $v_i = f_\theta(x_i) / \lVert f_\theta(x_i) \rVert_2$. All extracted features are organized into a Camera-Tracklet-Aware Memory (CTAM), $\mathcal{M} = \{\mathcal{M}_1, \dots, \mathcal{M}_{N_c}\}$, with $\mathcal{M}_c = \{T_c^1, \dots, T_c^{m_c}\}$ and each tracklet bank $T_c^j$ comprising temporally adjacent features for vehicle $j$ in camera $c$.
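Below is a minimal sketch of how such a memory might be populated, assuming a PyTorch encoder and a data loader that yields `(images, cam_ids, trk_ids)` triples; the function name `build_ctam` and the nested-dictionary layout are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def build_ctam(encoder, loader, device="cuda"):
    """Populate memory[c][j] with L2-normalised features of tracklet j in camera c."""
    memory = {}  # camera id -> {tracklet id -> [feature, ...]}
    encoder.eval()
    with torch.no_grad():
        for images, cam_ids, trk_ids in loader:  # assumed loader format
            feats = F.normalize(encoder(images.to(device)), dim=1)
            for v, c, j in zip(feats, cam_ids.tolist(), trk_ids.tolist()):
                memory.setdefault(c, {}).setdefault(j, []).append(v)
    # stack each tracklet bank into an (n, d) tensor kept on the GPU
    return {c: {j: torch.stack(bank) for j, bank in trks.items()}
            for c, trks in memory.items()}
```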
2. Intra-Camera Level Constraint
At the intra-camera level, the CTACL loss leverages the deterministic separation of positives and negatives via tracklet membership within a single camera. For anchor $v_i$, the positives $P(i) = T_{c_i}^{y_i} \setminus \{v_i\}$ constitute all features within the same tracklet and camera, while the negatives $N(i) = \mathcal{M}_{c_i} \setminus T_{c_i}^{y_i}$ comprise the other features in $\mathcal{M}_{c_i}$ but outside $T_{c_i}^{y_i}$. The intra-camera contrastive loss is formally:

$$\mathcal{L}_{\text{intra}}(v_i) = -\frac{1}{|P(i)|} \sum_{v^+ \in P(i)} \log \frac{\exp(v_i^\top v^+ / \tau)}{\exp(v_i^\top v^+ / \tau) + \sum_{v^- \in N(i)} \exp(v_i^\top v^- / \tau)}$$

where $\tau$ is the temperature hyperparameter (fixed at $0.07$). This explicit tie to tracklet composition within cameras yields high-fidelity intra-camera discrimination.
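A compact PyTorch rendering of this per-anchor loss, under the convention above that all features are $\ell_2$-normalized so dot products equal cosine similarities; `intra_camera_loss` is an illustrative name, not from the paper.

```python
import torch

def intra_camera_loss(anchor, positives, negatives, tau=0.07):
    """L_intra for a single anchor.

    anchor:    (d,)   L2-normalised feature v_i
    positives: (p, d) remaining features of the anchor's tracklet, P(i)
    negatives: (n, d) other features stored for the same camera, N(i)
    """
    pos = positives @ anchor / tau                               # (p,) positive similarities
    neg_lse = torch.logsumexp(negatives @ anchor / tau, dim=0)   # log-sum-exp over N(i)
    # -log[ exp(s+) / (exp(s+) + sum exp(s-)) ], averaged over P(i)
    return -(pos - torch.logaddexp(pos, neg_lse)).mean()
```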
3. Inter-Camera Level Constraint
The inter-camera (cross-domain) constraint extends the contrastive loss by integrating feature relationships across camera subdomains. Positive samples external to the anchor's camera $c_i$ are mined as follows:
- Easy Positives: the $k$ nearest features to $v_i$ in $\mathcal{M} \setminus \mathcal{M}_{c_i}$ (by cosine similarity).
- Hard Positives: for anchor $v_i$, identify the most dissimilar feature within its own tracklet, $\hat{v}_i = \arg\min_{v \in T_{c_i}^{y_i}} v_i^\top v$, then extract $\hat{v}_i$'s $k$ nearest neighbors in $\mathcal{M} \setminus \mathcal{M}_{c_i}$.
Negatives from other cameras are selected as those with low similarity to $v_i$, omitting the top-ranked features to form a "grey zone" and minimize false negatives. Together, the cross-camera positives form $P^{\text{inter}}(i)$ and the negatives $N^{\text{inter}}(i)$. The extended CTACL loss is expressed as:

$$\mathcal{L}_{\text{CTACL}}(v_i) = -\frac{1}{|\tilde{P}(i)|} \sum_{v^+ \in \tilde{P}(i)} \log \frac{\exp(v_i^\top v^+ / \tau)}{\exp(v_i^\top v^+ / \tau) + \sum_{v^- \in \tilde{N}(i)} \exp(v_i^\top v^- / \tau)}$$

where $\tilde{P}(i) = P(i) \cup P^{\text{inter}}(i)$ and $\tilde{N}(i) = N(i) \cup N^{\text{inter}}(i)$.
This two-tier mining paradigm encourages shared representations across views and explicit separation from hard negatives.
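The mining step can be sketched as follows, assuming unit-norm features stored in flat per-camera tensors; the defaults for `k` and `grey`, and the function name itself, are placeholders rather than the paper's exact settings.

```python
import torch

def mine_inter_camera(anchor, tracklet, others, k=5, grey=50):
    """Easy/hard positive and grey-zone negative mining for one anchor.

    anchor:   (d,)   anchor feature v_i
    tracklet: (p, d) the anchor's own tracklet bank T
    others:   (m, d) all features stored for the *other* cameras
    k, grey:  mining size and grey-zone width (illustrative defaults)
    """
    sims = others @ anchor                                # cosine similarities (unit-norm features)
    easy_pos = sims.topk(k).indices                       # k nearest cross-camera features
    hard_anchor = tracklet[(tracklet @ anchor).argmin()]  # least similar feature in own tracklet
    hard_pos = (others @ hard_anchor).topk(k).indices     # its k nearest cross-camera features
    positives = torch.unique(torch.cat([easy_pos, hard_pos]))
    keep = torch.ones(sims.numel(), dtype=torch.bool, device=sims.device)
    keep[sims.argsort(descending=True)[:grey]] = False    # drop the top-`grey` "grey zone"
    keep[positives] = False                               # never reuse positives as negatives
    return positives, keep.nonzero(as_tuple=True)[0]
```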
4. Domain Adaptation Mechanism
Domain bias arises from view-specific artifact learning, which CTACL counters with a camera-uniformity domain adaptation objective. The method constructs a non-parametric camera classifier based on centroid similarities:

$$p(c \mid v_i) = \frac{\exp(v_i^\top \mu_c / \tau)}{\sum_{c'=1}^{N_c} \exp(v_i^\top \mu_{c'} / \tau)}$$

where each centroid $\mu_c$ is the $\ell_2$-normalized mean of the features in $\mathcal{M}_c$. The corresponding domain-adaptation loss enforces maximal uncertainty about the camera label by minimizing the KL divergence to the uniform distribution $\mathcal{U} = (1/N_c, \dots, 1/N_c)$:

$$\mathcal{L}_{\text{DA}}(v_i) = D_{\mathrm{KL}}\big(\mathcal{U} \,\Vert\, p(\cdot \mid v_i)\big)$$
The overall joint objective combines the extended CTACL and DA losses:

$$\mathcal{L} = \mathcal{L}_{\text{CTACL}} + \lambda \mathcal{L}_{\text{DA}}$$

with the weighting $\lambda$ set to its ablated optimum.
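A sketch of the DA term, assuming precomputed unit-norm camera centroids; `domain_adaptation_loss` is an illustrative name.

```python
import math
import torch
import torch.nn.functional as F

def domain_adaptation_loss(feats, centroids, tau=0.07):
    """KL(U || p(camera | v)) with a non-parametric centroid classifier.

    feats:     (b, d)   L2-normalised batch features
    centroids: (N_c, d) L2-normalised per-camera centroids mu_c
    """
    log_p = F.log_softmax(feats @ centroids.t() / tau, dim=1)  # log p(c | v_i)
    # KL(U || p) = -log N_c - (1/N_c) * sum_c log p(c | v_i), per sample
    return (-log_p.mean(dim=1) - math.log(centroids.size(0))).mean()
```

The joint objective is then simply the sum `ctacl_loss + lam * domain_adaptation_loss(...)` for the chosen weighting `lam`.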
5. Training Pipeline and Implementation
The CTACL framework follows a structured pipeline:
- Data Partitioning: collect unlabelled images with camera and tracklet IDs, extract features with $f_\theta$ to populate the CTAM.
- Sampling: for each minibatch (size $256$), select anchors and gather intra-/inter-camera positives and negatives per the mining strategies (mining size $k$, grey-zone threshold).
- Optimization: initialize $f_\theta$ from ImageNet weights and freeze BatchNorm statistics. Warm up for 5 epochs with $\mathcal{L}_{\text{intra}}$ only, then train with the joint objective for 45 epochs, rebuilding the CTAM every 5 epochs.
- Hyperparameters: SGD optimizer with momentum $0.9$ and a step learning-rate schedule; data augmentation via random crop, horizontal flip, and color jitter.
CTAM is stored on GPU as per-camera arrays and updated by a normalized momentum averaging rule:

$$T_c^j[i] \leftarrow \frac{\alpha\, T_c^j[i] + (1-\alpha)\, v_i}{\lVert \alpha\, T_c^j[i] + (1-\alpha)\, v_i \rVert_2}$$

where $\alpha$ is the momentum coefficient.
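A one-line sketch of the slot update, with `alpha` as a placeholder momentum value:

```python
import torch
import torch.nn.functional as F

def update_slot(bank, idx, feat, alpha=0.5):
    """Refresh one CTAM slot with a normalised momentum average.

    bank:  (n, d) tracklet bank T_c^j kept on the GPU
    idx:   row index of the feature being refreshed
    feat:  (d,)  new L2-normalised feature of the same image
    alpha: momentum coefficient (illustrative value)
    """
    with torch.no_grad():  # memory updates carry no gradient
        bank[idx] = F.normalize(alpha * bank[idx] + (1 - alpha) * feat, dim=0)
```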
6. Experimental Evaluation
Experiments validate CTACL on several benchmarks:
| Dataset | Type | #IDs | #Cameras | CTACL+DA Rank-1 / mAP |
|---|---|---|---|---|
| VeRi-776 | Image-based | 776 | 18 | 89.3% / 55.2% |
| VERI-Wild (small) | Image-based | 40,671 | 174 | 79.2% / 65.0% |
| VVeRI-901 | Video-based | 901 | 11 | 38.2% / 29.0% |
CTACL+DA provides a gain of 8–11% in Rank-1 and mAP over CTACL without DA on VeRi-776, and achieves state-of-the-art performance on all tested unsupervised vehicle re-identification scenarios. Ablations confirm the chosen mining size $k$, grey-zone threshold, and DA weight $\lambda$; removing DA results in an 8–10% Rank-1 drop.
7. Comparative Analysis
CTACL is benchmarked against alternative baselines:
- Self-supervised SimCLR-style contrastive learning (no tracklets): Rank-1 23%
- Cross-entropy (CE) classification on tracklets: Rank-1 50%
- CTACL with ground-truth IDs: Rank-1 92% (upper bound)
- CTACL w/o DA: Rank-1 81% (VeRi-776) and 33% (VVeRI-901); with DA: 89% and 38%, respectively
This establishes the efficacy of multi-level tracklet constraint in extracting discriminative representations without global identity supervision.
8. Significance and Plausible Implications
Multi-level tracklet contrastive constraint demonstrates that leveraging multi-camera partitioning and weakly correlated tracklet labels enables unsupervised vehicle re-identification at scale without human annotation (Yu et al., 2021). This suggests broad applicability to other domains involving cross-view or cross-device matching tasks where direct global identity annotation is impractical. A plausible implication is that similar multi-level mining, memory augmentation, and domain adaptation methods could be generalized to person re-identification, multi-view action recognition, and temporal object association tasks, particularly when “free” subdomain and local cluster labels are available. The successful disentanglement of camera-specific and identity-specific features using auxiliary KL-based domain adaptation signals a promising direction for bias correction in unsupervised learning pipelines.