
Multi-Level Tracklet Contrastive Constraint

Updated 29 November 2025
  • The paper introduces CTACL, which utilizes multi-level tracklet contrastive constraints with weak supervisory signals to enable precise vehicle re-identification.
  • It integrates intra-camera and inter-camera mining strategies that improve discrimination, achieving up to an 11% Rank-1 boost on benchmarks like VeRi-776.
  • A camera-uniformity domain adaptation module is employed to mitigate view-specific biases, enhancing robustness in cross-camera matching tasks.

Multi-level tracklet contrastive constraint is a technique employed in the Camera-Tracklet-Aware Contrastive Learning (CTACL) framework for unsupervised vehicle re-identification across multi-camera networks. CTACL capitalizes on the availability of “weak” supervisory signals (camera IDs and per-camera tracklet IDs) to circumvent the need for global identity annotations, which are labor-intensive to obtain due to appearance and viewpoint variations. The multi-level constraint formalizes contrastive objectives at both intra-camera and inter-camera levels, enabling the representation learning pipeline to associate vehicles reliably across disparate camera views while mitigating domain biases (Yu et al., 2021).

1. Data Structure and Subdomain Partitioning

CTACL operates on an unlabelled vehicle dataset $X = \{x_i\}_{i=1}^n$ annotated with camera identifiers $Y^c = \{y^c_i\}_{i=1}^n$ and tracklet identifiers $Y^t = \{y^t_i\}_{i=1}^n$. The full dataset is decomposed into $C$ camera-level subdomains $\{X^i\}_{i=1}^C$, where $X^i = \{x_i \mid y^c_i = i\}$. Tracklets are formed by temporal grouping of vehicle detections within each camera, which provides localized positive sample sets for contrastive learning.
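As a concrete illustration, the following minimal Python sketch groups sample indices by camera and then by tracklet; the function name `partition_by_camera_and_tracklet` is illustrative, not taken from the authors' code.

```python
from collections import defaultdict

def partition_by_camera_and_tracklet(camera_ids, tracklet_ids):
    """Map camera ID -> tracklet ID -> list of sample indices."""
    subdomains = defaultdict(lambda: defaultdict(list))
    for idx, (cam, trk) in enumerate(zip(camera_ids, tracklet_ids)):
        subdomains[cam][trk].append(idx)
    return subdomains

# Toy example: six detections across two cameras
cams = [0, 0, 0, 1, 1, 1]
trks = [0, 0, 1, 0, 0, 0]
parts = partition_by_camera_and_tracklet(cams, trks)
print({c: dict(t) for c, t in parts.items()})
# {0: {0: [0, 1], 1: [2]}, 1: {0: [3, 4, 5]}}
```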

The encoder $\Phi: x \rightarrow z$ is instantiated as a ResNet-50 network pretrained on ImageNet, followed by $\ell_2$ normalization to produce features $z \in \mathbb{R}^{2048}$. All extracted features are organized into a Camera-Tracklet-Aware Memory (CTAM), $M = \{M^1, ..., M^C\}$, with $M^i = \{m^i_1, ..., m^i_{T^i}\}$ and each tracklet bank $m^i_j$ comprising temporally adjacent features for vehicle $j$ in camera $i$.
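A minimal PyTorch sketch of the encoder and memory layout follows. Using torchvision's ResNet-50 with the classifier head replaced by an identity is an assumption consistent with the description above, and the random images are placeholders:

```python
import torch
import torch.nn.functional as F
import torchvision

# Phi: ResNet-50 pretrained on ImageNet, classifier removed -> 2048-d pooled feature
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def encode(images):
    """Phi: x -> z with l2 normalization, so dot products equal cosine similarities."""
    return F.normalize(backbone(images), dim=1)

# CTAM: M = {M^1, ..., M^C}; each M^i maps tracklet j to its feature bank m^i_j
subdomains = {0: {0: [0, 1], 1: [2]}, 1: {0: [3, 4, 5]}}  # camera -> tracklet -> indices
ctam = {
    cam: {trk: encode(torch.randn(len(idxs), 3, 224, 224))  # placeholder images
          for trk, idxs in groups.items()}
    for cam, groups in subdomains.items()
}
```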

2. Intra-Camera Level Constraint

At the intra-camera level, the CTACL loss leverages the deterministic separation of positives and negatives via tracklet membership within a single camera. For anchor $z_i$, the positives are all features in its own tracklet bank $m^{y^c_i}_{y^t_i}$, while the negatives are the remaining features in $M^{y^c_i}$ outside $m^{y^c_i}_{y^t_i}$. The intra-camera contrastive loss is:

$$L_{CTACL}^{intra}(z_i) = -\frac{1}{|m^{y^c_i}_{y^t_i}|} \sum_{\hat{z}_p \in m^{y^c_i}_{y^t_i}} \log \frac{\exp(z_i \cdot \hat{z}_p / \tau)}{\sum_{\hat{z}_a \in M^{y^c_i}} \exp(z_i \cdot \hat{z}_a / \tau)}$$

where $\tau$ is the temperature hyperparameter (fixed at $0.07$). This explicit tie to tracklet composition within cameras yields high-fidelity intra-camera discrimination.
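The following PyTorch sketch implements this loss under assumed tensor conventions: `cam_memory` stacks every feature stored for the anchor's camera, and `pos_mask` marks the rows belonging to the anchor's tracklet (both names are illustrative):

```python
import torch
import torch.nn.functional as F

def intra_camera_loss(z, cam_memory, pos_mask, tau=0.07):
    """-(1/|m|) sum_{p in m} log( exp(z.z_p/tau) / sum_{a in M^{y^c}} exp(z.z_a/tau) )"""
    logits = cam_memory @ z / tau                        # similarity to every camera-memory feature
    log_prob = logits - torch.logsumexp(logits, dim=0)   # log-softmax over the full camera memory
    return -log_prob[pos_mask].mean()                    # average over the anchor's tracklet positives

# Toy usage: 10 unit-norm memory features of dim 8; the first 3 form the anchor's tracklet
mem = F.normalize(torch.randn(10, 8), dim=1)
mask = torch.zeros(10, dtype=torch.bool)
mask[:3] = True
anchor = F.normalize(torch.randn(8), dim=0)
print(intra_camera_loss(anchor, mem, mask))
```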

3. Inter-Camera Level Constraint

The inter-camera (cross-domain) constraint extends the contrastive loss by integrating feature relationships across camera subdomains. Positive samples external to the anchor’s camera are mined as follows:

  • Easy Positives: the $k$ nearest features to $z_i$ in $M \setminus M^{y^c_i}$ under cosine similarity.
  • Hard Positives: for anchor $z_i$, identify the most dissimilar feature within its own tracklet, $\bar{z}$, then extract $\bar{z}$'s $k$ nearest neighbors in $M \setminus M^{y^c_i}$.

Negatives from other cameras are selected as features with low similarity to the anchor, omitting the top $\gamma\%$ most similar ones to form a “grey zone” that minimizes false negatives. Together, the mined positives form $P^+_i$ and the negatives $N_i$. The extended CTACL loss is:

$$L_{CTACL}^{ext}(z_i) = -\frac{1}{|m^{y^c_i}_{y^t_i}| + |P^+_i|} \sum_{\hat{z}_p \in m^{y^c_i}_{y^t_i} \cup P^+_i} \log \frac{\exp(z_i \cdot \hat{z}_p / \tau)}{\sum_{\hat{z}_a \in M^{y^c_i} \cup P^+_i \cup N_i} \exp(z_i \cdot \hat{z}_a / \tau)}$$

This two-tier mining paradigm encourages shared representations across views and explicit separation from hard negatives; the sketch below covers both the mining step and the extended loss.
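In the sketch, `others` stacks all memory features outside the anchor's camera ($M \setminus M^{y^c_i}$), all features are assumed $\ell_2$-normalized, and names such as `mine_cross_camera` are illustrative:

```python
import torch
import torch.nn.functional as F

def mine_cross_camera(z, tracklet, others, k=5, gamma=0.01, n_neg=512):
    """Return easy positives, hard positives, and grey-zone-filtered negatives."""
    sims = others @ z                                     # cosine similarity to the anchor
    easy_pos = others[sims.topk(k).indices]               # k-NN of the anchor across cameras
    z_bar = tracklet[(tracklet @ z).argmin()]             # least-similar feature in own tracklet
    hard_pos = others[(others @ z_bar).topk(k).indices]   # k-NN of that hard in-tracklet feature
    order = sims.argsort(descending=True)
    grey = max(k, int(gamma * len(others)))               # skip the top gamma% ("grey zone")
    negatives = others[order[grey:]][:n_neg]              # low-similarity features as negatives
    return easy_pos, hard_pos, negatives

def extended_loss(z, tracklet, mined_pos, cam_memory, negatives, tau=0.07):
    pos = torch.cat([tracklet, mined_pos], dim=0)         # m U P+
    cand = torch.cat([cam_memory, mined_pos, negatives])  # M^{y^c} U P+ U N (tracklet is inside cam_memory)
    denom = torch.logsumexp(cand @ z / tau, dim=0)
    return -((pos @ z / tau) - denom).mean()

# Toy demo with unit-norm random features
d = 8
trk = F.normalize(torch.randn(4, d), dim=1)                     # anchor's tracklet
cam = torch.cat([trk, F.normalize(torch.randn(16, d), dim=1)])  # M^{y^c}
oth = F.normalize(torch.randn(200, d), dim=1)                   # M \ M^{y^c}
z = trk[0]
ep, hp, neg = mine_cross_camera(z, trk, oth)
print(extended_loss(z, trk, torch.cat([ep, hp]), cam, neg))
```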

4. Domain Adaptation Mechanism

Domain bias arises when the encoder latches onto view-specific artifacts, which CTACL counters with a camera-uniformity domain adaptation objective. The method constructs a non-parametric camera classifier based on centroid similarities:

$$P(y^c = i \mid z) = \frac{\exp(z \cdot \bar{z}^i)}{\sum_{j=1}^C \exp(z \cdot \bar{z}^j)}$$

where each centroid $\bar{z}^i = (1/|M^i|) \sum_{\hat{z} \in M^i} \hat{z}$ is $\ell_2$-normalized. The corresponding domain-adaptation loss enforces maximal uncertainty via the KL divergence to the uniform distribution:

$$L_{DA} = \mathrm{KL}\big(U(y^c) \,\|\, P(y^c \mid z)\big) = \sum_{i=1}^C \frac{1}{C} \log \frac{1/C}{P(y^c = i \mid z)}$$

The overall joint objective combines the extended CTACL loss with the DA loss:

$$L_{total} = L_{CTACL}^{ext} + \lambda L_{DA}$$

with an optimal $\lambda \approx 0.2$.
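A compact sketch of both losses, with `centroids` a $(C \times d)$ tensor of re-normalized per-camera means (all names are assumptions):

```python
import math
import torch
import torch.nn.functional as F

def domain_adaptation_loss(z, centroids):
    """KL(U(y^c) || P(y^c|z)) with P a softmax over centroid similarities."""
    log_p = F.log_softmax(centroids @ z, dim=0)   # log P(y^c = i | z)
    c = centroids.shape[0]
    # sum_i (1/C) * (log(1/C) - log p_i) = -(1/C) sum_i log p_i - log C
    return -log_p.mean() - math.log(c)

def total_loss(l_ext, l_da, lam=0.2):
    """L_total = L_ext + lambda * L_DA, with the reported optimum lambda ~ 0.2."""
    return l_ext + lam * l_da

# Toy check: centroids and feature on the unit sphere
cents = F.normalize(torch.randn(4, 8), dim=1)
z = F.normalize(torch.randn(8), dim=0)
print(domain_adaptation_loss(z, cents))  # >= 0, and 0 iff P is uniform
```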

5. Training Pipeline and Implementation

The CTACL framework follows a structured pipeline:

  • Data Partitioning: collect unlabelled images and extract features with $\Phi$ to populate CTAM.
  • Sampling: for each minibatch (size $256$), select anchors and gather intra- and inter-camera positives/negatives per the mining strategies ($k = 5$, $\gamma = 1\%$).
  • Optimization: initialize $\Phi$ from ImageNet weights and freeze BatchNorm. Warm up for 5 epochs with $L_{CTACL}^{intra}$, then train with the joint objective for 45 epochs, rebuilding CTAM every 5 epochs.
  • Hyperparameters: SGD with a staged learning-rate schedule ($0.1 \to 0.01 \to 0.001$), momentum $0.9$, and data augmentation via cropping, flipping, and color jitter. A runnable schedule skeleton follows below.
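This skeleton illustrates the warm-up/joint-objective staging and the CTAM rebuild cadence; the learning-rate milestones at epochs 20 and 40 are an assumption, as the source states only the staged values:

```python
import torch

model = torch.nn.Linear(2048, 2048)  # stand-in for the ResNet-50 encoder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 40], gamma=0.1)  # 0.1 -> 0.01 -> 0.001

for epoch in range(50):
    if epoch % 5 == 0:
        pass  # rebuild CTAM here by re-encoding the dataset (see the Section 1 sketch)
    joint = epoch >= 5  # first 5 epochs: L_intra warm-up; afterwards: the joint objective L_total
    # ... minibatch loop applying the losses from Sections 2-4 would go here ...
    scheduler.step()
```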

CTAM is stored on GPU as per-camera arrays and updated by a normalized momentum averaging rule:

$$\hat{z}^t = \frac{\hat{z}^{t-1} + z^t}{\|\hat{z}^{t-1} + z^t\|_2}$$
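A minimal sketch of this update: the stored feature and the new feature are summed and re-normalized, keeping every memory entry on the unit sphere.

```python
import torch
import torch.nn.functional as F

def update_memory(stored, new):
    """hat{z}^t = (hat{z}^{t-1} + z^t) / ||hat{z}^{t-1} + z^t||_2"""
    return F.normalize(stored + new, dim=-1)

old = F.normalize(torch.randn(2048), dim=0)
new = F.normalize(torch.randn(2048), dim=0)
print(update_memory(old, new).norm())  # tensor(1.0000)
```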

6. Experimental Evaluation

Experiments validate CTACL on several benchmarks:

Dataset       | Type        | #IDs   | #Cameras | Metrics     | CTACL+DA (Rank-1 / mAP)
------------- | ----------- | ------ | -------- | ----------- | -----------------------
VeRi-776      | Image-based | 776    | 18       | Rank-1, mAP | 89.3% / 55.2%
VeRi-Wild (S) | Image-based | 40,671 | 174      | Rank-1, mAP | 79.2% / 65.0%
VVeRI-901     | Video-based | 901    | 11       | Rank-1, mAP | 38.2% / 29.0%

CTACL+DA provides a gain of roughly 8–11% in Rank-1 and mAP over CTACL without DA on VeRi-776, and achieves state-of-the-art performance on all tested unsupervised vehicle re-identification scenarios. Ablations confirm the mining parameter $k = 5$, grey-zone fraction $\gamma = 1\%$, and DA weight $\lambda = 0.2$ as optimal; removing DA costs roughly 8–10% Rank-1.

7. Comparative Analysis

CTACL is benchmarked against alternative baselines:

  • Self-supervised SimCLR-style contrastive learning (no tracklets): Rank-1 ≈ 23%
  • Cross-entropy (CE) classification on tracklets: Rank-1 ≈ 50%
  • CTACL with ground-truth IDs: Rank-1 > 92% (upper bound)
  • CTACL without DA: Rank-1 ≈ 81% (VeRi-776) and ≈ 33% (VVeRI-901); with DA: ≈ 89% and ≈ 38%

This establishes the efficacy of multi-level tracklet constraint in extracting discriminative representations without global identity supervision.

8. Significance and Plausible Implications

Multi-level tracklet contrastive constraint demonstrates that leveraging multi-camera partitioning and weakly correlated tracklet labels enables unsupervised vehicle re-identification at scale without human annotation (Yu et al., 2021). This suggests broad applicability to other domains involving cross-view or cross-device matching tasks where direct global identity annotation is impractical. A plausible implication is that similar multi-level mining, memory augmentation, and domain adaptation methods could be generalized to person re-identification, multi-view action recognition, and temporal object association tasks, particularly when “free” subdomain and local cluster labels are available. The successful disentanglement of camera-specific and identity-specific features using auxiliary KL-based domain adaptation signals a promising direction for bias correction in unsupervised learning pipelines.

References

Yu et al. (2021). Camera-Tracklet-Aware Contrastive Learning for unsupervised vehicle re-identification.
