End-to-End Comparative Attention Network
- The paper introduces an end-to-end comparative attention network that leverages recurrent soft attention and triplet ranking loss to enhance person re-identification.
- It employs a backbone CNN with an LSTM-based attention module to sequentially focus on discriminative regions, effectively handling occlusion, pose, and lighting variations.
- Experimental results on benchmarks like CUHK01 and Market-1501 demonstrate superior accuracy and robustness over traditional pooling and feature extraction methods.
The End-to-End Comparative Attention Network (CAN) is a deep neural architecture specifically designed for person re-identification (re-id) across disjoint camera views. CAN addresses the substantial challenges in re-id posed by appearance variations due to lighting, camera angle, occlusion, and pose. Unlike previous methods that extract frame-level features in a single pass, CAN utilizes a recurrent soft attention mechanism to "glimpse" discriminative regions of person images over multiple steps, adaptively comparing local appearances to robustly determine identity correspondence (Liu et al., 2016).
1. Network Architecture
CAN is composed of three principal modules per branch: a backbone CNN, a comparative attention module (stacked LSTM with soft attention), and a feature aggregator. The backbone is a truncated AlexNet or VGG-16, processing person images resized to 227×227 or 224×224, respectively. The CNN computes a global feature map $X \in \mathbb{R}^{K \times K \times D}$ ($K \times K$ spatial locations with $D$ channels each), e.g., $6 \times 6 \times 256$ for AlexNet at pool5.
Training uses triplets $(I, I^+, I^-)$, where $I$ is an anchor, $I^+$ a same-identity positive, and $I^-$ a negative. Each image in the triplet is processed through parallel, weight-sharing CNNs followed by attention-LSTM modules. At inference, two branches compare query and gallery images.
The comparative attention module consists of:
- Glimpse location network: calculates an attention vector over spatial locations using softmaxed linear projections of the LSTM hidden state.
- Soft attention pooling: computes a spatially weighted sum of the CNN features, $x_t = \sum_i l_{t,i} X_i$, for glimpse $t$, where $l_{t,i}$ is the attention weight and $X_i$ the feature vector at location $i$.
- Recurrent LSTM: updates hidden and cell states with the attended glimpse, propagating adaptive context over $T$ time steps. A final descriptor is constructed by concatenating selected hidden states (e.g., time steps 2, 4, 8), then $\ell_2$-normalizing.
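The descriptor construction in the last bullet can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the helper name `build_descriptor`, the toy hidden size of 2, and the choice of the Euclidean (L2) norm for the normalization step are assumptions made for the example.

```python
import math

def build_descriptor(hidden_states, steps=(2, 4, 8)):
    """Concatenate hidden states at the given 1-indexed time steps, then L2-normalize.

    hidden_states: list of per-time-step hidden vectors (lists of floats).
    The L2 norm is an assumption of this sketch.
    """
    concat = []
    for t in steps:
        concat.extend(hidden_states[t - 1])  # time steps are 1-indexed
    norm = math.sqrt(sum(v * v for v in concat))
    if norm == 0.0:
        return concat  # avoid division by zero for an all-zero vector
    return [v / norm for v in concat]

# Toy example: 8 time steps, hidden size 2 (hypothetical values).
hidden_states = [[float(t), 1.0] for t in range(1, 9)]
descriptor = build_descriptor(hidden_states)
```

With three selected time steps, the descriptor length is three times the hidden size, and its Euclidean norm is 1, so distances between descriptors are comparable across images.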
2. Soft Attention and Glimpse Mechanism
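The core of CAN is a differentiable soft-attention scheme that learns spatial masks over CNN features, enabling end-to-end gradient-based optimization. At each time step $t$, attention weights over the $K^2$ spatial locations of the feature map,

$$l_{t,i} = \frac{\exp(W_i^\top h_{t-1})}{\sum_{j=1}^{K^2} \exp(W_j^\top h_{t-1})}, \qquad i = 1, \dots, K^2,$$

are computed, with $W_i$ learned and $h_{t-1}$ denoting the previous LSTM hidden state.
The attention vector yields the glimpse feature:

$$x_t = \sum_{i=1}^{K^2} l_{t,i} X_i,$$

where $X_i$ is the CNN feature vector at location $i$, allowing the model to softly aggregate local features. Initial LSTM states are functions of the global average-pooled CNN features.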
This multi-glimpse iterative attention process enables CAN to sequentially focus on and compare body regions likely to be salient for re-identifying individuals under severe appearance changes.
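A single glimpse step, as described above, amounts to a softmax over per-location scores followed by a weighted sum of location features. The sketch below illustrates that under stated assumptions: the function names, the toy sizes (4 locations, 3 channels, hidden size 2), and all numeric values are hypothetical, and the LSTM update itself is omitted.

```python
import math

def attention_weights(W, h_prev):
    """Softmax over locations of the linear scores W[i] . h_prev."""
    scores = [sum(w * h for w, h in zip(row, h_prev)) for row in W]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def glimpse(X, weights):
    """Attention-weighted sum of the location features X[i]."""
    D = len(X[0])
    return [sum(l * x[d] for l, x in zip(weights, X)) for d in range(D)]

# Toy feature map flattened to 4 locations with 3 channels each.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
# Hypothetical projection weights, one row per location (hidden size 2).
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.0, 0.1]]
h_prev = [1.0, 2.0]  # previous LSTM hidden state

weights = attention_weights(W, h_prev)
x_t = glimpse(X, weights)
```

Because the weights are a softmax, they are non-negative and sum to 1, so the glimpse is a convex combination of the location features. With an all-zero hidden state the weights become uniform and the glimpse reduces to average pooling.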
3. Training Paradigm and Loss Functions
CAN is trained with two supervised objectives:
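- Triplet ranking loss enforces that anchor and positive descriptors are closer than anchor and negative by at least a margin $\alpha$, using:
$$L_{\text{triplet}} = \max\left(0,\; \|f(I) - f(I^+)\|_2^2 - \|f(I) - f(I^-)\|_2^2 + \alpha\right),$$
where $f(\cdot)$ denotes the final descriptor of an image, $I$ the anchor, $I^+$ the positive, and $I^-$ the negative.
- Identification (softmax) loss encourages each branch to produce discriminative identity predictions for $C$ classes:
$$L_{\text{id}} = -\log p(y \mid I),$$
where $p(c \mid I) = \exp(w_c^\top f(I)) / \sum_{c'=1}^{C} \exp(w_{c'}^\top f(I))$ is a softmax over class weights $w_c$ and $y$ is the ground-truth identity.
The unweighted sum $L = L_{\text{triplet}} + L_{\text{id}}$ is minimized end-to-end. Online triplet selection and stochastic data augmentation are performed during training.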
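The two objectives can be illustrated with a short plain-Python sketch. It is a hedged example rather than the paper's implementation: the function names are hypothetical, squared Euclidean distance is assumed for the triplet term, and the identification term is written as a standard softmax cross-entropy over class logits for a single image.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss: zero once the negative is farther than the positive by the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def identification_loss(logits, label):
    """Softmax cross-entropy for one image; logits has one score per identity class."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_z - logits[label]

def total_loss(anchor, positive, negative, margin, logits, label):
    """Unweighted sum of the two terms."""
    return triplet_loss(anchor, positive, negative, margin) + identification_loss(logits, label)

# Toy descriptors: the positive is farther from the anchor than the negative,
# so the triplet term is active.
loss = total_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0], 0.5, [0.0, 0.0], 0)
```

In the toy call the triplet term is 4 - 1 + 0.5 = 3.5 and the identification term is log 2, since both classes receive equal logits.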
4. Implementation Specifics
CAN uses eight glimpses ($T = 8$), concatenating LSTM hidden states at time steps 2, 4, and 8 for the final descriptor. LSTM hidden and cell sizes are 512. Training uses momentum (0.9), weight decay, and a decaying learning-rate schedule. Batch size is 134 (AlexNet) or 66 (VGG-16), selected by cross-validation.
Images undergo random translation (±5%) and horizontal flipping, with datasets shuffled each epoch for diverse triplets. Both manually cropped and detection-based bounding boxes are supported, with performance evaluated on single-shot and multi-query protocols.
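The augmentation described above can be sketched on a plain nested-list image. This is an illustrative assumption-laden example: the function name `augment`, the interpretation of ±5% as a fraction of each image dimension, the zero fill for pixels shifted in from outside, and the flip probability of 0.5 are all choices made for the sketch, not details stated in the source.

```python
import random

def augment(image, rng, max_shift=0.05, flip_prob=0.5, fill=0):
    """Randomly translate an image by up to max_shift of its size, then maybe flip it.

    image: list of rows (each a list of pixel values); rng: a random.Random instance.
    """
    h, w = len(image), len(image[0])
    dy = rng.randint(-int(h * max_shift), int(h * max_shift))
    dx = rng.randint(-int(w * max_shift), int(w * max_shift))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source pixel for this output position
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = image[sy][sx]
    if rng.random() < flip_prob:
        out = [row[::-1] for row in out]  # horizontal flip
    return out

# Toy 40x20 "image": shifts of up to 2 rows and 1 column.
image = [[x + y for x in range(20)] for y in range(40)]
augmented = augment(image, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when debugging triplet sampling.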
5. Empirical Results Across Benchmarks
CAN achieves state-of-the-art or highly competitive accuracy on major person re-id datasets.
| Dataset | Architecture | Rank-1 Accuracy | Comparison Baselines |
|---|---|---|---|
| CUHK01 (100 IDs) | CAN (AlexNet) | 82.8% | PersonNet (71.1%) |
| CUHK01 (100 IDs) | CAN (VGG-16) | 87.2% | - |
| CUHK01 (486 IDs) | CAN (VGG-16) | 67.2% | DNS (69.1%) |
| CUHK03 (labeled) | CAN (VGG-16) | 77.6% | FT-JSTL+DGD (75.3%) |
| CUHK03 (detected) | CAN (VGG-16) | 69.2% | DNS (54.7%), LOMO+XQDA (46.3%) |
| Market-1501 (single-query) | CAN (VGG-16) | 60.3%, mAP 35.9% | DNS (61.1%, mAP 35.7%) |
| Market-1501 (multi-query) | CAN (VGG-16) | 72.1%, mAP 47.9% | New SOTA |
| VIPeR | CAN (VGG-16) | 47.2% | SCSP (53.5%) |
| VIPeR (feature fusion) | CAN + LOMO | 54.1% | - |
Ablation studies indicate that LSTM-based recurrent attention outperforms average/max pooling and shallow FC alternatives, that end-to-end joint training provides a significant performance gain, and that the best results are obtained with $T = 8$ glimpses and concatenation of the hidden states at time steps 2, 4, and 8. VGG-16 consistently outperforms AlexNet by approximately 5% absolute. The use of soft attention yields a Rank-1 of 72.3%, outperforming alternatives (avg-LSTM 58.3%, max-LSTM 57.9%) (Liu et al., 2016).
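For readers unfamiliar with the metric, Rank-1 accuracy is the fraction of query images whose nearest gallery image shares the query's identity. The sketch below shows one way to compute it from a query-by-gallery distance matrix; the function name and the toy distances and identity labels are hypothetical, and protocol details such as excluding same-camera matches are left out.

```python
def rank1_accuracy(dist, query_ids, gallery_ids):
    """Fraction of queries whose closest gallery item has the same identity.

    dist[q][g] is the distance between query q and gallery item g.
    """
    hits = 0
    for q, row in enumerate(dist):
        best = min(range(len(row)), key=row.__getitem__)  # index of nearest gallery item
        if gallery_ids[best] == query_ids[q]:
            hits += 1
    return hits / len(dist)

# Toy example: 3 queries, 2 gallery images.
dist = [
    [0.1, 0.9],  # nearest is gallery 0 (identity 1): correct
    [0.8, 0.2],  # nearest is gallery 1 (identity 2): correct
    [0.3, 0.6],  # nearest is gallery 0 (identity 1): wrong, query is identity 2
]
accuracy = rank1_accuracy(dist, query_ids=[1, 2, 2], gallery_ids=[1, 2])
```

Here two of the three queries are matched correctly at rank 1, giving an accuracy of 2/3.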
6. Design Rationale and Comparative Context
CAN's iterative attention mechanism simulates human visual perception by adaptively selecting informative person image regions over several spatial glimpses. This enables more robust identity discrimination under occlusion, pose, and viewpoint variation compared to conventional global or hard part-based features. By jointly optimizing triplet and identification losses, CAN harmonizes global identity structure with fine-grained comparative learning, thereby achieving improved generalization.
Comparative results establish that soft attention and recurrent processing (LSTM) are critical components; direct substitution with FC layers or fixed-pooling significantly degrades performance. End-to-end training of the CNN backbone and attention module, as opposed to fixed feature extraction, further improves accuracy by over 25%.
7. Impact and Evaluation
The introduction of CAN set a new benchmark in person re-identification by integrating an end-to-end soft attention framework that adaptively attends to salient body regions across image pairs or triplets. Empirical results across CUHK01, CUHK03, Market-1501, and VIPeR demonstrate its superior accuracy relative to then-contemporary deep and hand-crafted methods, with major gains for larger, more challenging datasets. The approach of recurrent comparative attention has influenced subsequent architectures in video and multi-view recognition domains (Liu et al., 2016).