End-to-End Re-ID with Comparative Attention

Updated 10 June 2026

End-to-end Re-ID systems are deep learning frameworks that extract discriminative identity representations directly from raw pedestrian images using unified CNN and attention mechanisms.
The soft comparative attention mechanism dynamically focuses on salient local regions via LSTM-based glimpse extraction, enhancing robustness against occlusion and pose variations.
Multi-task training with identification and triplet ranking losses yields matching accuracy that matches or exceeds prior methods on most metrics across benchmarks such as CUHK03 and Market-1501.

End-to-end re-identification (Re-ID) systems address the challenge of cross-view person matching, enabling the identification of individuals across disjoint camera networks as required in video surveillance. The paradigm is characterized by fully trainable deep models that ingest raw pedestrian images and directly output discriminative representations suitable for identity matching, eschewing hand-crafted feature extraction. The End-to-End Comparative Attention Network (CAN) is a canonical architecture in this field, leveraging spatially localized, sequential attention within a triplet-based learning framework to adaptively compare local regions of person images and synthesize robust identity descriptors (Liu et al., 2016).

1. Architecture of End-to-End Comparative Attention Networks

The CAN framework operates on image triplets $\langle I, I^+, I^- \rangle$, denoting an anchor, a positive sample of the same identity, and a negative sample of a different identity. The architecture comprises three parameter-sharing branches, each containing:

A CNN backbone (e.g., truncated AlexNet or VGG-16) that extracts a feature tensor $\mathbf{X} \in \mathbb{R}^{K \times K \times D}$.
A recurrent attention module, implemented using LSTM cells, which predicts spatial attention masks and selectively pools features at each time step ("glimpse").
A glimpse extractor that concatenates LSTM hidden states at selected time steps (typically $t = 2, 4, 8$) to form a comprehensive descriptor $\mathbf{R} = [\mathbf{h}_2; \mathbf{h}_4; \mathbf{h}_8]$, which is then $\ell_2$-normalized to yield $\mathbf{H} \in \mathbb{R}^{3q}$, where $q$ is the LSTM hidden size.
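The descriptor construction in the last item can be illustrated with a minimal sketch in plain Python. The function names (`l2_normalize`, `build_descriptor`) and the toy hidden states are illustrative assumptions, not part of the CAN implementation; only the choice of steps 2, 4, 8 and the normalization come from the text.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0.0 else [v / norm for v in vec]

def build_descriptor(hidden_states, steps=(2, 4, 8)):
    """Concatenate hidden states at the chosen 1-based time steps, then L2-normalize.

    hidden_states: list of per-step hidden vectors; hidden_states[0] is h_1.
    """
    concatenated = []
    for t in steps:
        concatenated.extend(hidden_states[t - 1])
    return l2_normalize(concatenated)

# Toy example: 8 time steps with hidden size q = 2 (hypothetical values).
hidden = [[float(t), 1.0] for t in range(1, 9)]
H = build_descriptor(hidden)  # length 3 * q = 6, unit norm
```

Because every descriptor has unit length, Euclidean distances between descriptors are bounded, which keeps a fixed triplet margin meaningful across images.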

During inference, only two branches (query and gallery) are used, and identity retrieval is performed via Euclidean distance ranking of descriptors.

2. Soft Comparative Attention Mechanism

The soft attention mechanism enables CAN to dynamically focus on salient local regions in each glimpse. For each image branch, the process is as follows:

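Let $\mathbf{X}_i \in \mathbb{R}^D$ denote the feature vector at the $i$th spatial location, $i = 1, \dots, K^2$.
At time $t$, compute energy scores $e_{t,i}$ from the previous LSTM hidden state $\mathbf{h}_{t-1}$ via

$e_{t,i} = \mathbf{w}_i^\top \mathbf{h}_{t-1}$, where $\mathbf{w}_i$ is a learned weight vector for location $i$.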

Normalize energy scores to attention weights using softmax:

$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{K^2} \exp(e_{t,j})}$

Form the glimpse feature with weighted pooling:

$\mathbf{x}_t = \sum_{i=1}^{K^2} \alpha_{t,i} \, \mathbf{X}_i$
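The softmax normalization and weighted pooling steps above can be sketched as follows. This is a dependency-free illustration, not the original implementation: the names `softmax`, `glimpse`, `alpha`, and `x_t` are chosen here for readability, the energy scores are passed in directly rather than predicted by an LSTM, and the $K \times K$ grid is assumed to be flattened into a list of $K^2$ feature vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of energy scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def glimpse(features, energies):
    """Return attention weights and the attention-weighted sum of location features.

    features: list of K*K feature vectors, each of length D.
    energies: list of K*K energy scores for the current time step.
    """
    weights = softmax(energies)
    dim = len(features[0])
    pooled = [0.0] * dim
    for w, x in zip(weights, features):
        for d in range(dim):
            pooled[d] += w * x[d]
    return weights, pooled

# Toy 2x2 feature map (K = 2, D = 3), flattened to 4 locations.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
e_t = [0.0, 0.0, 0.0, 0.0]  # equal energies give uniform attention
alpha, x_t = glimpse(X, e_t)
```

With equal energies the glimpse is simply the mean of the location features; a single dominant energy makes the glimpse approach the feature vector at that location.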

In comparative attention, the anchor and its positive/negative counterparts each produce a glimpse ($\mathbf{x}_t$, $\mathbf{x}_t^{+}$, and $\mathbf{x}_t^{-}$, respectively); their difference, for example $\mathbf{x}_t - \mathbf{x}_t^{+}$ for the anchor-positive pair, is input to the LSTM, so the network can focus attention conditioned on relative local appearance differences.

3. Training Objectives and Multi-Task Loss

CAN training optimizes two concurrent objectives:

Identification Loss (Softmax):

$L_{\text{id}} = -\log p_y$

where $p_c$ is the softmax probability predicted for identity $c$ and $y$ is the ground-truth label.

Triplet Ranking Loss:

$L_{\text{trip}} = \max\left(0,\ \|\mathbf{H} - \mathbf{H}^+\|_2^2 - \|\mathbf{H} - \mathbf{H}^-\|_2^2 + m\right)$

ensuring the anchor-positive distance is less than the anchor-negative distance by margin $m$.

The final loss combines the two, $L = L_{\text{id}} + L_{\text{trip}}$, with equal weighting.
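As a rough illustration of how the two objectives combine, the sketch below computes a softmax cross-entropy term and a hinge-style triplet term and adds them with equal weight. Several details are assumptions made for the example: squared Euclidean distance is used in the triplet term, the identification term is applied to a single set of logits, and the margin value 0.5 is hypothetical.

```python
import math

def identification_loss(logits, label):
    """Softmax cross-entropy, -log p_label, computed stably via log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)

def multitask_loss(logits, label, anchor, positive, negative, margin):
    """Equal-weight sum of the identification and triplet terms."""
    return identification_loss(logits, label) + triplet_loss(
        anchor, positive, negative, margin)

# Well-separated triplet: the hinge is inactive, so only the softmax term remains.
total = multitask_loss(
    logits=[0.0, 0.0], label=0,
    anchor=[1.0, 0.0], positive=[1.0, 0.0], negative=[0.0, 1.0],
    margin=0.5,  # hypothetical margin
)
```

When the negative is already farther from the anchor than the positive by more than the margin, the triplet term contributes no gradient, and training is driven by the identification term alone.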

4. Training and Inference Workflow

The canonical CAN training workflow proceeds as follows:

Pre-train the CNN component on large Re-ID datasets using softmax identification loss.
Attach the attention/LSTM modules and initialize.
For each mini-batch,
- Sample a batch of triplets $\langle I, I^+, I^- \rangle$.
- For each branch, process the image through the shared CNN to obtain $\mathbf{X}$.
- Run the recurrent attention for $T$ glimpses, updating LSTM states with comparative inputs (the glimpse difference for the anchor).
- Concatenate hidden states at specified steps; $\ell_2$-normalize to obtain descriptors.
- Compute the combined loss and perform back-propagation.
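The triplet sampling step in the loop above might look like the following sketch. It draws triplets uniformly at random from identity labels; this is one simple strategy assumed for illustration and is not necessarily the sampling scheme used for CAN. The function name `sample_triplets` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def sample_triplets(labels, num_triplets, rng):
    """Draw (anchor, positive, negative) index triplets from identity labels.

    labels: list where labels[i] is the identity of image i.
    Only identities with at least two images can supply an anchor/positive pair.
    """
    by_identity = defaultdict(list)
    for index, identity in enumerate(labels):
        by_identity[identity].append(index)
    eligible = [pid for pid, idxs in by_identity.items() if len(idxs) >= 2]
    if not eligible or len(by_identity) < 2:
        raise ValueError("need an identity with 2+ images and at least 2 identities")
    triplets = []
    for _ in range(num_triplets):
        pid = rng.choice(eligible)
        anchor, positive = rng.sample(by_identity[pid], 2)
        other = rng.choice([p for p in by_identity if p != pid])
        negative = rng.choice(by_identity[other])
        triplets.append((anchor, positive, negative))
    return triplets

labels = [0, 0, 1, 1, 2]  # toy identity labels for five images
batch = sample_triplets(labels, num_triplets=4, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when comparing training runs.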

At inference:

For any query-gallery pair, extract descriptors through CNN+LSTM.
Calculate Euclidean distance between descriptors.
Rank gallery images by distance and report performance via cumulative matching characteristic (CMC) and mean average precision (mAP).
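The inference steps above can be sketched for a single query as follows. The example is a simplification under stated assumptions: descriptors are given directly as toy vectors, the helper names are invented here, and benchmark-specific protocol details (such as excluding gallery images from the query's own camera) are omitted. Benchmark CMC and mAP figures are averages of these per-query quantities over all queries.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(query, gallery):
    """Return gallery indices sorted by increasing distance to the query."""
    return sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))

def cmc_hit(ranking, gallery_ids, query_id, k):
    """1 if a correct identity appears in the top-k of the ranking, else 0."""
    return int(any(gallery_ids[i] == query_id for i in ranking[:k]))

def average_precision(ranking, gallery_ids, query_id):
    """Mean of the precision values at each rank where a correct match occurs."""
    hits, precisions = 0, []
    for position, i in enumerate(ranking, start=1):
        if gallery_ids[i] == query_id:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy gallery of four descriptors with identity labels (hypothetical data).
gallery = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8], [0.9, 0.1]]
gallery_ids = [7, 3, 7, 3]
query, query_id = [0.8, 0.6], 3

ranking = rank_gallery(query, gallery)
rank1 = cmc_hit(ranking, gallery_ids, query_id, k=1)
ap = average_precision(ranking, gallery_ids, query_id)
```

In this toy case the nearest gallery item belongs to a different identity, so the query misses at Rank-1 but still earns partial credit in average precision because both correct matches appear at ranks 2 and 3.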

5. Experimental Datasets, Implementation, and Performance

Evaluation spans four established benchmarks:

Dataset	#IDs	Camera Views	Special Features
CUHK01	971	2	Tests with 100/486 IDs
CUHK03	1,360	multiple	Manual + DPM-detected crops
Market-1501	1,501	6	Single & multi-query, CMC/mAP
VIPeR	632	2	Challenging, small-scale

Implementation specifics include CNN pre-training, LSTM hidden size 512, a fixed number of glimpses $T$, a triplet margin $m$, online triplet sampling, SGD with momentum 0.9 and weight decay, learning rate 0.001, and data augmentation via translation, flipping, and label shuffling.
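To make the optimizer settings concrete, here is a minimal sketch of one SGD step with momentum and L2 weight decay on scalar parameters. The update convention (velocity accumulates the decayed gradient, then the parameter moves by the learning rate times the velocity) is an assumption; frameworks differ in the exact form. The learning rate 0.001 and momentum 0.9 come from the text, while the weight decay defaults to 0.0 here because its value is not given.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.001, momentum=0.9, weight_decay=0.0):
    """Return updated parameters and velocity after one SGD step.

    Assumed convention: v <- momentum * v + (g + weight_decay * p); p <- p - lr * v.
    """
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        v_next = momentum * v + g + weight_decay * p
        new_velocity.append(v_next)
        new_params.append(p - lr * v_next)
    return new_params, new_velocity

# Two scalar parameters with hypothetical gradients.
params, velocity = [1.0, -2.0], [0.0, 0.0]
grads = [0.5, -0.5]
params, velocity = sgd_momentum_step(params, grads, velocity)
```

Repeated steps with a consistent gradient direction grow the velocity toward `g / (1 - momentum)`, which is why momentum 0.9 effectively amplifies a small learning rate by up to a factor of ten.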

Key results:

CUHK01 (100 IDs): AlexNet-CAN 82.8%, VGG-CAN 87.2% (vs. prior 86.6%).
CUHK03 (labeled): AlexNet 72.3%, VGG 77.6% (prior 75.3%).
Market-1501 single-query: AlexNet mAP 30.3%/Rank-1 55.1%; VGG mAP 35.9%/Rank-1 60.3% (prior mAP 35.7%, Rank-1 61.1%).
VIPeR: AlexNet 41.5%, VGG 47.2%; combining VGG-CAN with LOMO features yields 54.1% (versus 53.5% for SCSP).

Across these benchmarks, the VGG-based CAN matches or exceeds prior results on most metrics (the exception is Market-1501 Rank-1, at 60.3% versus 61.1%), indicating the utility of sequential, local-region comparisons.

6. Significance and Implications

By formulating person re-identification as end-to-end comparative attention across multiple glimpses, CAN demonstrates that adaptively focusing on and contrasting discriminative image regions leads to identity descriptors robust to viewpoint, occlusion, and pose variation. The soft attention mechanism, when combined with LSTM-based sequential modeling, enables both spatial localization and integration of discriminative cues over time. This suggests that further advances in end-to-end Re-ID may benefit from more expressive attention/control mechanisms and integration with advanced backbone architectures (Liu et al., 2016).


References (1)

Liu et al. (2016). End-to-End Comparative Attention Networks for Person Re-identification.
