MCFormer: Multi-Scale Correlation Transformer
- The paper introduces a unified Transformer architecture that models global and local correlations to handle vessel occlusions and intra-identity variations.
- It employs custom Global and Local Correlation Modules to align features across samples, achieving state-of-the-art performance on vessel re-ID benchmarks.
- The approach uses a multi-scale fusion strategy with channel-wise attention and momentum-based memory banks to suppress outlier effects and compensate for missing parts.
The Multi-scale Correlation-aware Transformer Network (MCFormer) is a unified Transformer-based architecture designed for maritime vessel re-identification (Re-ID), specifically addressing the substantial intra-identity variation and frequent local part-missing characteristic of vessel imagery. Unlike pedestrian-centric Re-ID models, MCFormer directly models inter-image global and local correlations across the entire training set via two custom modules—the Global Correlation Module (GCM) and the Local Correlation Module (LCM)—and fuses these correlation-aware features using a multi-scale strategy. The approach achieves state-of-the-art performance on multiple vessel Re-ID benchmarks by robustly suppressing outlier effects due to viewpoint or occlusion and compensating for missing parts via inter-sample feature alignment (Liu, 18 Nov 2025).
1. Architecture Overview and Data Flow
MCFormer operates on a training set $\{x_i\}_{i=1}^{N}$, where each input image is split into $P$ non-overlapping patches. Each image embedding includes a class token $t_{\mathrm{cls}}$ and three part tokens $t_{p_1}, t_{p_2}, t_{p_3}$, with learned positional embeddings added to give an input sequence of length $1 + 3 + P$.
A ViT-B/16 encoder (12 layers, 12 heads, embedding dimension $d = 768$) produces global features $g_i$ (from the class token) and local features $l_i^p$, $p \in \{1, 2, 3\}$ (from the part tokens).
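As a concrete sketch of this token layout, assuming the standard 224x224 ViT-B/16 input (the random arrays stand in for learned embeddings):

```python
import numpy as np

# Token layout assumed by MCFormer's ViT-B/16 encoder.
# Patch size, image size, and embedding dim follow standard ViT-B/16;
# the three part tokens are the MCFormer-specific addition.
patch, img, d = 16, 224, 768
n_patches = (img // patch) ** 2                     # 196 non-overlapping patches

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((n_patches, d))  # flattened, projected patches
cls_token = rng.standard_normal((1, d))             # global class token
part_tokens = rng.standard_normal((3, d))           # three learnable part tokens

tokens = np.concatenate([cls_token, part_tokens, patch_tokens], axis=0)
pos_embed = rng.standard_normal(tokens.shape)       # learned positional embeddings
x = tokens + pos_embed                              # encoder input sequence

print(x.shape)  # (200, 768): 1 class + 3 part + 196 patch tokens
```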
The pipeline comprises three sequential stages:
- Transformer Feature Extraction: Produces global and part-level embeddings.
- Correlation Modeling:
- GCM: Aggregates all global features $\{g_i\}$ into correlated representations $\tilde{g}_i$ via a learned affinity matrix $\bar{A} \in \mathbb{R}^{N \times N}$.
- LCM: Maintains a momentum-updated memory bank of all local part features, aligns each $l_i^p$ with its $k$-nearest positives, and optimizes a clustering loss to yield correlated local descriptors $\tilde{l}_i^p$.
- Multi-scale Fusion: Fuses $\tilde{g}_i$ and $\{\tilde{l}_i^p\}$ in a channel-wise manner, producing a final feature $f_i$ for identity classification and metric learning losses.
2. Global Correlation Module (GCM)
GCM explicitly models global consistency across the dataset. Given global features $G = [g_1, \dots, g_N]^{\top} \in \mathbb{R}^{N \times d}$, linear projections $W_Q$, $W_K$, $W_V$ yield for each $g_i$:
$$q_i = W_Q g_i, \quad k_i = W_K g_i, \quad v_i = W_V g_i.$$
The unnormalized affinity matrix is
$$A_{ij} = \frac{q_i^{\top} k_j}{\sqrt{d}}.$$
Softmax normalization is applied row-wise:
$$\bar{A}_{ij} = \frac{\exp(A_{ij})}{\sum_{j'=1}^{N} \exp(A_{ij'})}.$$
Correlated global features are obtained as weighted sums:
$$\tilde{g}_i = \sum_{j=1}^{N} \bar{A}_{ij} \, v_j.$$
For computational efficiency when $N$ is large, GCM supports low-rank factorization using $m \ll N$ landmark embeddings, reducing complexity from $O(N^2 d)$ to $O(N m d)$. Empirical results indicate that a moderate number of landmarks provides a good tradeoff between accuracy and compute.
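The GCM computation above can be sketched in NumPy; this is the dense $O(N^2)$ form without the landmark factorization, and the weights and sizes are illustrative:

```python
import numpy as np

def gcm(G, Wq, Wk, Wv):
    """Global Correlation Module sketch: attention over all N global features."""
    Q, K, V = G @ Wq, G @ Wk, G @ Wv
    d = Q.shape[1]
    A = Q @ K.T / np.sqrt(d)                  # unnormalized affinity, shape (N, N)
    A = A - A.max(axis=1, keepdims=True)      # numerical stability before exp
    A = np.exp(A)
    A_bar = A / A.sum(axis=1, keepdims=True)  # row-wise softmax normalization
    return A_bar @ V                          # correlated global features

rng = np.random.default_rng(0)
N, d = 8, 16
G = rng.standard_normal((N, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]  # toy projections
G_corr = gcm(G, *W)
print(G_corr.shape)  # (8, 16)
```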
3. Local Correlation Module (LCM)
LCM aligns local (part-wise) features across the mini-batch or dataset to mitigate local occlusions or missing regions. For each part index $p \in \{1, 2, 3\}$ and image $i$, a memory bank entry $M_i^p$ is maintained and updated via
$$M_i^p \leftarrow m \, M_i^p + (1 - m) \, l_i^p,$$
with momentum $m \in [0, 1)$.
For each part feature $l_i^p$, cosine similarities are computed with the bank; the top-$k$ most similar positives, $\mathcal{P}_i^p$, are selected. The per-part clustering loss is
$$\mathcal{L}_{\mathrm{cl}}^{p} = -\frac{1}{k} \sum_{j \in \mathcal{P}_i^p} \log \frac{\exp\!\big(s(l_i^p, M_j^p) / \tau\big)}{\sum_{j'} \exp\!\big(s(l_i^p, M_{j'}^p) / \tau\big)},$$
with $s(\cdot, \cdot)$ the cosine similarity and $\tau$ the temperature. After training, $k$-NN-aligned local descriptors are aggregated and projected via a $1 \times 1$ convolution to $\tilde{l}_i^p$.
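A minimal NumPy sketch of the memory-bank update and an InfoNCE-style per-part clustering loss; the momentum, $k$, and temperature values are illustrative assumptions, and the paper's exact loss form may differ:

```python
import numpy as np

def update_bank(bank, feats, m=0.9):
    """Momentum update of a per-part memory bank (m=0.9 is an assumed value)."""
    return m * bank + (1.0 - m) * feats

def clustering_loss(f, bank, k=3, tau=0.1):
    """Pull a part feature f toward its top-k most cosine-similar bank
    entries via an InfoNCE-style objective (k and tau are assumed values)."""
    f_n = f / np.linalg.norm(f)
    b_n = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b_n @ f_n                           # cosine similarities with the bank
    topk = np.argsort(sims)[-k:]               # indices of the k nearest positives
    logits = sims / tau
    log_denom = np.log(np.exp(logits).sum())   # log of the softmax denominator
    return -np.mean(logits[topk] - log_denom)  # average -log p(positive)

rng = np.random.default_rng(0)
bank = rng.standard_normal((32, 16))           # 32 bank entries, 16-dim features
bank = update_bank(bank, rng.standard_normal((32, 16)))
loss = clustering_loss(rng.standard_normal(16), bank)
```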
4. Multi-Scale Fusion Strategy
MCFormer fuses correlated global and local features, $\tilde{g}$ and $\tilde{l}$, via channel-wise attention gating:
$$f = \alpha \odot \tilde{g} + (1 - \alpha) \odot \tilde{l},$$
where the channel gate $\alpha \in (0, 1)^{d}$ is computed by a two-layer 1x1 convolutional network with ReLU/Sigmoid activations after global average pooling. For multi-scale cases ($S$ different resolutions), the fusion is extended across scales with learned weights $w_s$:
$$f = \sum_{s=1}^{S} w_s \, f_s.$$
Fusion via multi-scale channel attention outperforms addition and concatenation schemes, as shown by ablation.
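A minimal sketch of the channel-wise gating, assuming a sigmoid gate that convexly combines the global and local branches per channel; the bottleneck size and the exact gating form are illustrative:

```python
import numpy as np

def channel_attention_fusion(g, l, W1, W2):
    """Channel-wise attention gating sketch: a two-layer bottleneck with
    ReLU then Sigmoid produces per-channel weights alpha, and the fused
    feature is alpha*g + (1-alpha)*l (an assumed gating form)."""
    z = 0.5 * (g + l)                        # stand-in for global average pooling
    h = np.maximum(W1 @ z, 0.0)              # ReLU bottleneck layer
    alpha = 1.0 / (1.0 + np.exp(-(W2 @ h)))  # sigmoid channel gate in (0, 1)
    return alpha * g + (1.0 - alpha) * l

rng = np.random.default_rng(0)
d = 16
g, l = rng.standard_normal(d), rng.standard_normal(d)
W1 = rng.standard_normal((d // 4, d)) * 0.1  # bottleneck: d -> d/4
W2 = rng.standard_normal((d, d // 4)) * 0.1  # expand back: d/4 -> d
f = channel_attention_fusion(g, l, W1, W2)
print(f.shape)  # (16,)
```

Because the gate is strictly between 0 and 1, each fused channel lies between the corresponding global and local channel values, which is what makes this a gating rather than an additive scheme.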
5. Loss Functions and Optimization
The total loss is
$$\mathcal{L} = \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}} + \lambda_{\mathrm{tri}} \mathcal{L}_{\mathrm{tri}} + \lambda_{\mathrm{cl}} \mathcal{L}_{\mathrm{cl}},$$
where:
- $\mathcal{L}_{\mathrm{id}}$: cross-entropy identity classification loss,
- $\mathcal{L}_{\mathrm{tri}}$: optional batch-hard triplet loss,
- $\mathcal{L}_{\mathrm{cl}}$: sum of the part-level clustering losses.
The scalar coefficients $\lambda_{\mathrm{id}}$, $\lambda_{\mathrm{tri}}$, and $\lambda_{\mathrm{cl}}$ balance the three terms.
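The batch-hard triplet term and the weighted combination can be sketched as follows; the margin and lambda values here are placeholders, not the paper's settings:

```python
import numpy as np

def batch_hard_triplet(dist, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, take the hardest (farthest)
    positive and hardest (closest) negative (margin is an assumed value)."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        pos = dist[i][(labels == labels[i]) & (np.arange(n) != i)]
        neg = dist[i][labels != labels[i]]
        total += max(0.0, pos.max() - neg.min() + margin)
    return total / n

def total_loss(l_id, l_tri, l_cl, lam_id=1.0, lam_tri=1.0, lam_cl=0.5):
    # Weighted combination of the three objectives; lambdas are placeholders.
    return lam_id * l_id + lam_tri * l_tri + lam_cl * l_cl

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 8))              # toy embeddings, 2 per identity
labels = np.array([0, 0, 1, 1, 2, 2])
dist = np.linalg.norm(feats[:, None] - feats[None, :], axis=2)  # pairwise L2
l_tri = batch_hard_triplet(dist, labels)
loss = total_loss(l_id=1.2, l_tri=l_tri, l_cl=0.8)
```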
6. Experimental Evaluation and Ablations
MCFormer is evaluated on VesselReID (1248 IDs, 30,587 images), Warships-ReID (163 IDs, 4,780 images), and BoatReID (107 IDs, 5,523 images), using mAP and CMC (Rank-1,5,10). The model surpasses TransReID, e.g., on VesselReID achieving Rank-1/mAP of 72.8%/63.4% vs. 68.2%/58.7%. Ablation indicates both GCM and LCM contribute complementary gains:
| Method | VesselReID (R1/mAP) | Warships-ReID (R1/mAP) |
|---|---|---|
| ViT-B/16 baseline | 66.7/54.5 | 91.7/77.2 |
| + GCM only | 70.2/62.2 | 95.8/87.3 |
| + LCM only | 70.3/57.0 | 92.9/78.2 |
| GCM+LCM (MCFormer) | 72.8/63.4 | 96.1/88.4 |
Multi-scale channel-attention fusion yields superior accuracy over simple addition or concatenation. The optimal number of part tokens is three, matching the configuration used throughout. GCM's landmark-based complexity reduction shows diminishing returns beyond a moderate number of landmarks. Attention-map analyses reveal that GCM attends to globally salient structures (hull, superstructure), while LCM emphasizes locally distinct regions.
7. Effectiveness and Extensions
Multi-scale correlation modeling is effective in the maritime domain due to:
- Strong intra-identity variation (caused by viewpoint, lighting, weather) requiring global consistency regularization,
- Frequently missing local vessel parts (hull, funnel, mast) due to occlusion, which are best addressed by aggregating aligned local features from similar samples.
Potential extensions include temporal correlation for video-based sequential Re-ID, integrating cross-modal sensor data (SAR or infrared) via a unified memory bank, generalization to other Re-ID domains with similar part-level variability and occlusion (e.g., aerial vehicles, animals), and replacing the static memory bank with a differentiable cache for online/batch updates.
MCFormer demonstrates that Transformer Re-ID architectures benefit from explicit inter-sample correlation modeling at both global and local feature levels, combined via a learnable multi-scale fusion process, achieving robust and discriminative vessel identification in challenging maritime environments (Liu, 18 Nov 2025).