
End-to-End Comparative Attention Network

Updated 10 June 2026
  • The paper introduces an end-to-end comparative attention network that leverages recurrent soft attention and triplet ranking loss to enhance person re-identification.
  • It employs a backbone CNN with an LSTM-based attention module to sequentially focus on discriminative regions, effectively handling occlusion, pose, and lighting variations.
  • Experimental results on benchmarks like CUHK01 and Market-1501 demonstrate superior accuracy and robustness over traditional pooling and feature extraction methods.

The End-to-End Comparative Attention Network (CAN) is a deep neural architecture specifically designed for person re-identification (re-id) across disjoint camera views. CAN addresses the substantial challenges in re-id posed by appearance variations due to lighting, camera angle, occlusion, and pose. Unlike previous methods that extract frame-level features in a single pass, CAN utilizes a recurrent soft attention mechanism to "glimpse" discriminative regions of person images over multiple steps, adaptively comparing local appearances to robustly determine identity correspondence (Liu et al., 2016).

1. Network Architecture

CAN is composed of three principal modules per branch: a backbone CNN, a comparative attention module (stacked LSTM with soft attention), and a feature aggregator. The backbone is a truncated AlexNet or VGG-16, processing person images resized to 227×227 or 224×224, respectively. The CNN computes a global feature map $\mathbf{X} \in \mathbb{R}^{K \times K \times D}$, e.g., $K=6$, $D=256$ for AlexNet at pool5.

Training uses triplets $\langle I, I^+, I^- \rangle$, where $I$ is an anchor, $I^+$ a same-identity positive, and $I^-$ a negative. Each image in the triplet is processed through parallel, weight-sharing CNNs followed by attention-LSTM modules. At inference, two branches compare query and gallery images.
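As an illustration of how such triplets can be formed from identity labels, here is a minimal sketch. The source does not describe the selection rule, so uniform random sampling is an assumption, and `sample_triplets` is a hypothetical helper name rather than anything from the paper.

```python
import random
from collections import defaultdict

def sample_triplets(labels, num_triplets, seed=0):
    """Return (anchor, positive, negative) index triplets.

    labels[i] is the identity of image i. Sampling is uniform at random,
    which is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)

    # Anchors need at least one other image of the same identity.
    anchor_ids = [pid for pid, idxs in by_id.items() if len(idxs) >= 2]
    if not anchor_ids or len(by_id) < 2:
        raise ValueError("need an identity with 2+ images and a second identity")

    triplets = []
    for _ in range(num_triplets):
        pid = rng.choice(anchor_ids)
        anchor, positive = rng.sample(by_id[pid], 2)   # two distinct images
        neg_pid = rng.choice([p for p in by_id if p != pid])
        negative = rng.choice(by_id[neg_pid])
        triplets.append((anchor, positive, negative))
    return triplets

labels = [0, 0, 1, 1, 2]
triplets = sample_triplets(labels, num_triplets=4)
```

Each returned triplet indexes into the image list; the three images would then pass through the weight-sharing branches.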

The comparative attention module consists of:

  • Glimpse location network: calculates an attention vector $\mathbf{l}_{t-1}$ over $K^2$ spatial locations using softmaxed linear projections of the LSTM hidden state.
  • Soft attention pooling: computes a spatially weighted sum, $\mathbf{A}_t = \sum_{i=1}^{K^2} l_{t-1,i} \mathbf{X}_i$, for glimpse $t$.
  • Recurrent LSTM: updates hidden and cell states with the attended glimpse, propagating adaptive context over $T$ time steps. A final descriptor is constructed by concatenating selected hidden states (e.g., time steps 2, 4, 8), then $\ell_2$-normalizing.

2. Soft Attention and Glimpse Mechanism

The core of CAN is a differentiable soft-attention scheme that learns spatial masks for CNN features, enabling gradient-based optimization end-to-end. At each time step $t$, attention weights

$$l_{t-1,i} = \frac{\exp(\mathbf{W}_i^\top \mathbf{h}_{t-1})}{\sum_{j=1}^{K^2} \exp(\mathbf{W}_j^\top \mathbf{h}_{t-1})}, \quad i = 1, \dots, K^2,$$

are computed, with $\mathbf{W}_i$ learned and $\mathbf{h}_{t-1}$ denoting the previous LSTM hidden state.

The attention vector yields the glimpse feature:

$$\mathbf{A}_t = \sum_{i=1}^{K^2} l_{t-1,i} \mathbf{X}_i,$$

allowing the model to softly aggregate local features. Initial LSTM states are functions of the global average-pooled CNN features.

This multi-glimpse iterative attention process enables CAN to sequentially focus on and compare body regions likely to be salient for re-identifying individuals under severe appearance changes.
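A single glimpse step can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the projection matrix `W`, the random inputs, and the hidden size used here are assumptions chosen only to make the shapes concrete.

```python
import numpy as np

def soft_attention_glimpse(X, h_prev, W):
    """One soft-attention step over a CNN feature map.

    X:      (K, K, D) feature map
    h_prev: (H,) previous LSTM hidden state
    W:      (K*K, H) location projection (hypothetical name)
    Returns the attention weights (K*K,) and the glimpse feature (D,).
    """
    K, _, D = X.shape
    scores = W @ h_prev                    # one score per spatial location
    scores = scores - scores.max()         # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum()               # softmax over the K*K locations
    glimpse = weights @ X.reshape(K * K, D)  # weighted sum of local features
    return weights, glimpse

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6, 256))       # AlexNet pool5-sized map
h_prev = rng.standard_normal(512)
W = 0.01 * rng.standard_normal((36, 512))
weights, glimpse = soft_attention_glimpse(X, h_prev, W)
print(weights.shape, glimpse.shape)        # (36,) (256,)
```

Because every operation is differentiable, gradients can flow from the glimpse back to both the projection and the feature map, which is what makes end-to-end training possible. With a zero hidden state the weights are uniform and the glimpse reduces to average pooling.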

3. Training Paradigm and Loss Functions

CAN is trained with two supervised objectives:

  • Triplet ranking loss enforces that anchor and positive descriptors are closer than anchor and negative by at least a margin $m$, using:

$$L_{\text{trip}} = \max\bigl(0,\; \|f(I) - f(I^+)\|_2^2 - \|f(I) - f(I^-)\|_2^2 + m\bigr)$$

  • Identification (softmax) loss encourages each branch to produce discriminative identity predictions for $C$ classes:

$$L_{\text{id}} = -\log p(y \mid I)$$

where $f(\cdot)$ is the branch descriptor, $y$ is the true identity, and $p(c \mid I) = \operatorname{softmax}_c\bigl(\mathbf{w}_c^\top f(I)\bigr)$ over class weights $\mathbf{w}_c$.

The unweighted sum $L = L_{\text{trip}} + L_{\text{id}}$ is minimized end-to-end. On-line triplet selection and stochastic data augmentation are performed during training.
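The two objectives and their unweighted sum can be sketched as below. Several details are assumptions made for illustration: the squared Euclidean distance, the margin value, summing the identification loss over the three branches, and the function names themselves.

```python
import numpy as np

def triplet_ranking_loss(f_a, f_p, f_n, margin):
    """Hinge loss on squared distances (distance choice is an assumption)."""
    d_pos = np.sum((f_a - f_p) ** 2)
    d_neg = np.sum((f_a - f_n) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

def identification_loss(f, W_cls, label):
    """Softmax cross-entropy for one descriptor; W_cls has shape (C, dim)."""
    logits = W_cls @ f
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[label])

def joint_loss(f_a, f_p, f_n, labels, W_cls, margin=0.5):
    """Unweighted sum of the triplet and identification terms.

    labels = (anchor_id, positive_id, negative_id). Summing the
    identification loss over all three branches is an assumption.
    """
    trip = triplet_ranking_loss(f_a, f_p, f_n, margin)
    ident = sum(identification_loss(f, W_cls, y)
                for f, y in zip((f_a, f_p, f_n), labels))
    return trip + ident

f_a = np.array([1.0, 0.0])
f_p = np.array([1.0, 0.0])
f_n = np.array([0.0, 1.0])
print(triplet_ranking_loss(f_a, f_p, f_n, margin=0.5))  # 0.0
print(triplet_ranking_loss(f_a, f_n, f_p, margin=0.5))  # 2.5
```

In the first call the negative is already farther away than the margin requires, so the hinge is inactive; swapping positive and negative violates the constraint and produces a positive loss.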

4. Implementation Specifics

CAN uses eight glimpses ($T=8$), concatenating LSTM hidden states at time steps 2, 4, and 8 for the final descriptor. LSTM hidden/cell sizes are 512. Training uses momentum (0.9), weight decay, and a decaying learning-rate schedule. Batch size is 134 (AlexNet) or 66 (VGG-16), selected by cross-validation.
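To make the descriptor construction concrete, the following sketch runs eight glimpses through a minimal LSTM cell and concatenates the hidden states at steps 2, 4, and 8. It is a simplification under stated assumptions: a single LSTM layer stands in for the stacked LSTM, the weights are random rather than trained, the tanh-of-linear initial-state mapping is a guess at "functions of the global average-pooled CNN features", and all parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, Wx, Wh, b):
    """Standard LSTM cell; gates stacked as [input, forget, output, candidate]."""
    H = h.shape[0]
    z = Wx @ x + Wh @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def init_params(D=256, H=512, K=6, seed=0):
    """Random stand-in weights (hypothetical names, not trained values)."""
    rng = np.random.default_rng(seed)
    s = 0.01
    return {
        "W_h0": s * rng.standard_normal((H, D)),
        "W_c0": s * rng.standard_normal((H, D)),
        "W_loc": s * rng.standard_normal((K * K, H)),
        "Wx": s * rng.standard_normal((4 * H, D)),
        "Wh": s * rng.standard_normal((4 * H, H)),
        "b": np.zeros(4 * H),
    }

def can_descriptor(X, params, T=8, keep_steps=(2, 4, 8)):
    K, _, D = X.shape
    feats = X.reshape(K * K, D)
    pooled = feats.mean(axis=0)               # global average pooling
    h = np.tanh(params["W_h0"] @ pooled)      # assumed initial-state mapping
    c = np.tanh(params["W_c0"] @ pooled)
    kept = []
    for t in range(1, T + 1):
        scores = params["W_loc"] @ h
        w = np.exp(scores - scores.max())
        w /= w.sum()                          # attention over K*K locations
        glimpse = w @ feats                   # attended feature for step t
        h, c = lstm_step(glimpse, h, c, params["Wx"], params["Wh"], params["b"])
        if t in keep_steps:
            kept.append(h)
    desc = np.concatenate(kept)
    return desc / np.linalg.norm(desc)        # L2 normalisation

X = np.random.default_rng(1).standard_normal((6, 6, 256))
desc = can_descriptor(X, init_params())
print(desc.shape)                             # (1536,)
```

With a hidden size of 512 and three retained steps, the descriptor has 3 × 512 = 1536 dimensions and unit length, so distances between descriptors are directly comparable across images.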

Images undergo random translation (±5%) and horizontal flipping, with datasets shuffled each epoch for diverse triplets. Both manually cropped and detection-based bounding boxes are supported, with performance evaluated on single-shot and multi-query protocols.
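The augmentation described above can be sketched as follows. The zero padding of the vacated border, the 0.5 flip probability, and the helper name `augment` are assumptions; the source only specifies the ±5% translation range and the use of horizontal flipping.

```python
import numpy as np

def augment(image, rng, max_shift=0.05):
    """Randomly translate by up to ±max_shift of each dimension, then maybe flip.

    image: (H, W, C) array. Vacated pixels are zero-filled (an assumption).
    """
    h, w = image.shape[:2]
    max_dy, max_dx = int(max_shift * h), int(max_shift * w)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))

    out = np.zeros_like(image)
    # Source and destination windows for the shifted copy.
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]

    if rng.random() < 0.5:                 # horizontal flip half the time
        out = out[:, ::-1]
    return out

rng = np.random.default_rng(0)
image = rng.random((227, 227, 3))          # AlexNet-sized input
augmented = augment(image, rng)
print(augmented.shape)                     # (227, 227, 3)
```

For a 227×227 input, a 5% range corresponds to shifts of up to 11 pixels in each direction.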

5. Empirical Results Across Benchmarks

CAN achieves state-of-the-art or highly competitive accuracy on major person re-id datasets.

| Dataset | Architecture | Rank-1 accuracy (mAP where reported) | Comparison baselines |
|---|---|---|---|
| CUHK01 (100 IDs) | CAN (AlexNet) | 82.8% | PersonNet (71.1%) |
| CUHK01 (100 IDs) | CAN (VGG-16) | 87.2% | - |
| CUHK01 (486 IDs) | CAN (VGG-16) | 67.2% | DNS (69.1%) |
| CUHK03 (labeled) | CAN (VGG-16) | 77.6% | FT-JSTL+DGD (75.3%) |
| CUHK03 (detected) | CAN (VGG-16) | 69.2% | DNS (54.7%), LOMO+XQDA (46.3%) |
| Market-1501 (single-query) | CAN (VGG-16) | 60.3%, mAP 35.9% | DNS (61.1%, mAP 35.7%) |
| Market-1501 (multi-query) | CAN (VGG-16) | 72.1%, mAP 47.9% | New SOTA |
| VIPeR | CAN (VGG-16) | 47.2% | SCSP (53.5%) |
| VIPeR (feature fusion) | CAN + LOMO | 54.1% | - |
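For readers unfamiliar with the metrics in the table, the sketch below shows how Rank-1 accuracy and mAP are typically computed from a query-gallery distance matrix. It is a simplified illustration: benchmark protocols such as Market-1501 also exclude gallery images taken by the query's own camera, which is omitted here, and the function name is hypothetical.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """dist: (num_query, num_gallery) distances; smaller means more similar."""
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                  # best match first
        matches = gallery_ids[order] == query_ids[q]
        if not matches.any():
            continue                                 # no true match in gallery
        rank1_hits.append(bool(matches[0]))
        hit_ranks = np.flatnonzero(matches) + 1      # 1-based ranks of true matches
        precisions = np.arange(1, len(hit_ranks) + 1) / hit_ranks
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

dist = np.array([[0.1, 0.9, 0.5],
                 [0.2, 0.8, 0.3]])
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 2], gallery_ids=[1, 2, 2])
print(rank1, round(mean_ap, 4))                      # 0.5 0.7917
```

In this toy example the first query retrieves its match at rank 1, while the second finds its two matches at ranks 2 and 3, giving an average precision of (1/2 + 2/3) / 2.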

Ablation studies indicate that LSTM-based recurrent attention outperforms average/max pooling and shallow FC alternatives, that end-to-end joint training provides a significant performance gain, and that the best results are obtained with $T=8$ glimpses and concatenation of hidden states at time steps $\{2, 4, 8\}$. VGG-16 consistently outperforms AlexNet by approximately 5 percentage points. The use of soft attention yields a Rank-1 of 72.3%, outperforming alternatives (avg-LSTM 58.3%, max-LSTM 57.9%) (Liu et al., 2016).

6. Design Rationale and Comparative Context

CAN's iterative attention mechanism simulates human visual perception by adaptively selecting informative person image regions over several spatial glimpses. This enables more robust identity discrimination under occlusion, pose, and viewpoint variation compared to conventional global or hard part-based features. By jointly optimizing triplet and identification losses, CAN harmonizes global identity structure with fine-grained comparative learning, thereby achieving improved generalization.

Comparative results establish that soft attention and recurrent processing (LSTM) are critical components; direct substitution with FC layers or fixed-pooling significantly degrades performance. End-to-end training of the CNN backbone and attention module, as opposed to fixed feature extraction, further improves accuracy by over 25%.

7. Impact and Evaluation

The introduction of CAN set a new benchmark in person re-identification by integrating an end-to-end soft attention framework that adaptively attends to salient body regions across image pairs or triplets. Empirical results across CUHK01, CUHK03, Market-1501, and VIPeR demonstrate its superior accuracy relative to then-contemporary deep and hand-crafted methods, with major gains for larger, more challenging datasets. The approach of recurrent comparative attention has influenced subsequent architectures in video and multi-view recognition domains (Liu et al., 2016).

References (1)
