MCFormer: Multi-Scale Correlation Transformer

Updated 25 November 2025
  • The paper introduces a unified Transformer architecture that models global and local correlations to handle vessel occlusions and intra-identity variations.
  • It employs custom Global and Local Correlation Modules to align features across samples, achieving state-of-the-art performance on vessel re-ID benchmarks.
  • The approach uses a multi-scale fusion strategy with channel-wise attention and momentum-based memory banks to suppress outlier effects and compensate for missing parts.

The Multi-scale Correlation-aware Transformer Network (MCFormer) is a unified Transformer-based architecture designed for maritime vessel re-identification (Re-ID), specifically addressing the substantial intra-identity variation and frequently missing local parts characteristic of vessel imagery. Unlike pedestrian-centric Re-ID models, MCFormer directly models inter-image global and local correlations across the entire training set via two custom modules, the Global Correlation Module (GCM) and the Local Correlation Module (LCM), and fuses these correlation-aware features using a multi-scale strategy. The approach achieves state-of-the-art performance on multiple vessel Re-ID benchmarks by robustly suppressing outlier effects due to viewpoint or occlusion and compensating for missing parts via inter-sample feature alignment (Liu, 18 Nov 2025).

1. Architecture Overview and Data Flow

MCFormer operates on a training set \{(x^i, y^i)\}_{i=1}^D, where each input image x^i is split into N non-overlapping patches. Each image embedding includes a class token x_{\mathrm{cls}} and three part tokens \{x_{p_1}, x_{p_2}, x_{p_3}\}; with positional embeddings P, the input sequence is

X = [x_{\mathrm{cls}};\ x_{p_1};\ x_{p_2};\ x_{p_3};\ x_1; \ldots; x_N] + P

A ViT-B/16 encoder (12 layers, 12 heads, d = 768) produces global features g^i (from the class token) and local features \{l^i_1, l^i_2, l^i_3\} (from part tokens).
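
As a concrete illustration, the token layout above can be assembled as in the following minimal PyTorch sketch; the module name, zero initialization, and patch-embedding interface are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class VesselTokenEmbedding(nn.Module):
    """Illustrative token layout: [cls; p1; p2; p3; patch_1..patch_N] + positional embedding P."""
    def __init__(self, num_patches: int, dim: int = 768, num_parts: int = 3):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                 # x_cls
        self.part_tokens = nn.Parameter(torch.zeros(1, num_parts, dim))       # x_{p1}, x_{p2}, x_{p3}
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + num_parts + num_patches, dim))  # P

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (B, N, d) output of a ViT-B/16 patch projection (assumed upstream)
        b = patch_embeddings.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        parts = self.part_tokens.expand(b, -1, -1)
        x = torch.cat([cls, parts, patch_embeddings], dim=1)  # X = [x_cls; x_p1; x_p2; x_p3; x_1 .. x_N]
        return x + self.pos_embed                             # add positional embedding P
```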

The pipeline comprises three sequential stages:

  1. Transformer Feature Extraction: Produces global and part-level embeddings.
  2. Correlation Modeling:
    • GCM: Aggregates all g^i into correlated representations \{u^i\} via a learned affinity matrix A \in \mathbb{R}^{D \times D}.
    • LCM: Maintains a momentum-updated memory bank of all local part features, aligns each with its k-nearest positives, and optimizes a clustering loss to yield correlated local descriptors \{v^i\}.
  3. Multi-scale Fusion: Fuses (u^i, v^i) in a channel-wise manner, producing a final feature z^i for identity classification and metric learning loss.

2. Global Correlation Module (GCM)

GCM explicitly models global consistency across the dataset. Given G = [g^1; \ldots; g^D] \in \mathbb{R}^{D \times d}, linear projections \phi_q, \phi_k, \phi_v yield q^i, k^j, v^j for each g^i:

q^i = \phi_q(g^i), \quad k^j = \phi_k(g^j), \quad v^j = \phi_v(g^j).

The unnormalized affinity matrix is

A_{ij} = \frac{\langle q^i, k^j \rangle}{\sqrt{d}}.

Softmax normalization is applied row-wise:

S(A)_{ij} = \frac{\exp(A_{ij})}{\sum_{j'} \exp(A_{ij'})}.

Correlated global features are obtained as weighted sums:

u^i = \sum_{j=1}^D S(A)_{ij} v^j.

For computational efficiency when D is large, GCM supports low-rank factorization using \ell \ll d landmark embeddings, reducing complexity from O(D^2 d) to O(D^2 \ell). Empirical results indicate \ell = 5 provides a good tradeoff between accuracy and compute.
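
The core GCM computation (affinity, row-wise softmax, weighted aggregation) can be sketched as follows; here the projections \phi_q, \phi_k, \phi_v are modeled as plain linear layers, and the proj_dim argument is only a stand-in for the landmark-based low-rank variant, whose exact factorization is not reproduced.

```python
import torch
import torch.nn as nn

class GlobalCorrelationModule(nn.Module):
    """Sketch of GCM: row-softmax affinity over all global features, then weighted aggregation."""
    def __init__(self, dim: int = 768, proj_dim: int = 0):
        super().__init__()
        # proj_dim > 0 mimics the low-rank variant with a reduced query/key dimension (e.g. ell = 5);
        # proj_dim = 0 keeps the full dimension d.
        inner = proj_dim if proj_dim > 0 else dim
        self.phi_q = nn.Linear(dim, inner)   # phi_q
        self.phi_k = nn.Linear(dim, inner)   # phi_k
        self.phi_v = nn.Linear(dim, dim)     # phi_v
        self.scale = inner ** -0.5           # corresponds to 1/sqrt(d); scaled by the inner dim here

    def forward(self, g: torch.Tensor) -> torch.Tensor:
        # g: (D, d) stacked class-token features g^1 .. g^D (a mini-batch in practice)
        q, k, v = self.phi_q(g), self.phi_k(g), self.phi_v(g)
        affinity = (q @ k.t()) * self.scale   # A_ij = <q^i, k^j> / sqrt(d)
        weights = affinity.softmax(dim=-1)    # row-wise softmax S(A)
        return weights @ v                    # u^i = sum_j S(A)_ij v^j
```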

3. Local Correlation Module (LCM)

LCM aligns local (part-wise) features across the mini-batch or dataset to mitigate local occlusions or missing regions. For each part index p \in \{1,2,3\} and image j, a memory bank M_p = \{w^1_p, \ldots, w^D_p\} is maintained and updated via

w^j_p \leftarrow (1-m)\, w^j_p + m\, l^j_p

with momentum m = 0.2.

For part l^j_p, cosine similarities are computed with the bank; the top-k most similar positives, P^j_p, are selected. The per-part clustering loss is

L^j_p = -\log \frac{\sum_{n \in P^j_p} \exp(s_{j,n}/\tau)}{\sum_{n=1}^D \exp(s_{j,n}/\tau)}

with s_{j,n} the cosine similarity and \tau = 0.07 the temperature. The k-NN-aligned local descriptors \tilde{l}^j_p are then aggregated and projected via a 1 \times 1 convolution to v^j.
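
A minimal sketch of the memory-bank update and the per-part clustering loss is given below, assuming L2-normalized features; the choice of k, the per-batch indexing scheme, and the exclusion of each sample's own bank slot are illustrative assumptions not specified above.

```python
import torch
import torch.nn.functional as F

def update_memory(bank: torch.Tensor, local_feat: torch.Tensor, idx: torch.Tensor, m: float = 0.2):
    """Momentum update w_p^j <- (1 - m) w_p^j + m l_p^j for the batch entries; call under torch.no_grad()."""
    # bank: (D, d) memory bank M_p; local_feat: (B, d); idx: (B,) long indices into the bank
    bank[idx] = F.normalize((1 - m) * bank[idx] + m * local_feat, dim=-1)  # re-normalization is an assumption

def lcm_loss(local_feat: torch.Tensor, bank: torch.Tensor, idx: torch.Tensor,
             k: int = 10, tau: float = 0.07) -> torch.Tensor:
    """Per-part clustering loss: the top-k most similar bank entries act as positives P_p^j."""
    sims = F.normalize(local_feat, dim=-1) @ bank.t() / tau      # (B, D) cosine similarities / temperature
    masked = sims.clone()
    masked.scatter_(1, idx.unsqueeze(1), float('-inf'))          # exclude each sample's own slot (assumption)
    pos_idx = masked.topk(k, dim=1).indices                      # indices of the k nearest positives
    log_num = torch.logsumexp(sims.gather(1, pos_idx), dim=1)    # log sum_{n in P_p^j} exp(s_{j,n}/tau)
    log_den = torch.logsumexp(sims, dim=1)                       # log sum_{n=1..D} exp(s_{j,n}/tau)
    return (log_den - log_num).mean()                            # -log(num/den), averaged over the batch
```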

4. Multi-Scale Fusion Strategy

MCFormer fuses correlated global and local features, u and v \in \mathbb{R}^{C \times H \times W}, via channel-wise attention gating:

z = m(u \oplus v) \otimes u + (1 - m(u \oplus v)) \otimes v

where m(\cdot) is computed by a two-layer 1 \times 1 convolutional network with ReLU/Sigmoid activations and global average pooling. For multi-scale cases (S different resolutions), the fusion is extended across scales with learned weights \alpha_s:

z = \sum_{s=1}^S \alpha_s \left[ m_s(u_s \oplus v_s) \otimes u_s + \left(1 - m_s(u_s \oplus v_s)\right) \otimes v_s \right]
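
A sketch of the single-scale gate m(\cdot) and fusion is shown below; here \oplus is assumed to be element-wise addition, and the channel-reduction ratio inside the gate is an illustrative choice.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Sketch of the channel-wise gate m(.): GAP -> 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global average pooling
            nn.Conv2d(channels, channels // reduction, 1),  # first 1x1 conv
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # second 1x1 conv
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # u, v: (B, C, H, W) correlated global / local feature maps
        m = self.gate(u + v)                # m(u ⊕ v), broadcast over H and W
        return m * u + (1 - m) * v          # z = m ⊗ u + (1 - m) ⊗ v
```

The multi-scale extension would apply one such gate per scale s and combine the per-scale outputs with the learned weights \alpha_s.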

Fusion via multi-scale channel attention outperforms addition and concatenation schemes, as shown by ablation.

5. Loss Functions and Optimization

The total loss is

\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{CE}} + \lambda_2 \mathcal{L}_{\mathrm{tri}} + \lambda_3 \mathcal{L}_{\mathrm{LCM}}

where:

  • \mathcal{L}_{\mathrm{CE}}: Cross-entropy identity classification loss,
  • \mathcal{L}_{\mathrm{tri}}: Optional batch-hard triplet loss,
  • \mathcal{L}_{\mathrm{LCM}}: Sum of part-level clustering losses.

Typical coefficients are \lambda_1 = 1, \lambda_2 = 1, \lambda_3 = 0.1.
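
The objective can be assembled as in the following sketch; the triplet margin value and the assumption that batch-hard mining is performed upstream are illustrative.

```python
import torch.nn.functional as F

def total_loss(logits, labels, anchor, positive, negative, lcm_losses,
               lambda1: float = 1.0, lambda2: float = 1.0, lambda3: float = 0.1,
               margin: float = 0.3):
    """Weighted sum L_total = λ1·L_CE + λ2·L_tri + λ3·L_LCM (margin value is an illustrative choice)."""
    l_ce = F.cross_entropy(logits, labels)                                      # identity classification
    l_tri = F.triplet_margin_loss(anchor, positive, negative, margin=margin)    # batch-hard mining assumed upstream
    l_lcm = sum(lcm_losses)                                                     # sum of per-part clustering losses
    return lambda1 * l_ce + lambda2 * l_tri + lambda3 * l_lcm
```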

6. Experimental Evaluation and Ablations

MCFormer is evaluated on VesselReID (1,248 IDs, 30,587 images), Warships-ReID (163 IDs, 4,780 images), and BoatReID (107 IDs, 5,523 images), using mAP and CMC (Rank-1/5/10). The model surpasses TransReID, e.g., on VesselReID achieving Rank-1/mAP of 72.8%/63.4% vs. 68.2%/58.7%. Ablation indicates both GCM and LCM contribute complementary gains:

Method | VesselReID (R1/mAP) | Warships-ReID (R1/mAP)
ViT-B/16 baseline | 66.7 / 54.5 | 91.7 / 77.2
+ GCM only | 70.2 / 62.2 | 95.8 / 87.3
+ LCM only | 70.3 / 57.0 | 92.9 / 78.2
GCM+LCM (MCFormer) | 72.8 / 63.4 | 96.1 / 88.4

Multi-scale channel-attention fusion yields superior accuracy over simple addition or concatenation. The optimal number of part tokens is three. GCM's landmark-based complexity reduction shows diminishing returns for \ell > 5. Attention-map analyses reveal that GCM attends to globally salient structures (hull, superstructure), while LCM emphasizes locally distinct regions.

7. Effectiveness and Extensions

Multi-scale correlation modeling is effective in the maritime domain due to:

  • Strong intra-identity variation (caused by viewpoint, lighting, and weather), which requires global consistency regularization,
  • Frequent absence of local vessel parts (hull, funnel, mast) due to occlusion, which is best addressed by aggregating aligned local features from similar samples.

Potential extensions include temporal correlation for video-based sequential Re-ID, integrating cross-modal sensor data (SAR or infrared) via a unified memory bank, generalization to other Re-ID domains with similar part-level variability and occlusion (e.g., aerial vehicles, animals), and replacing the static memory bank with a differentiable cache for online/batch updates.

MCFormer demonstrates that Transformer Re-ID architectures benefit from explicit inter-sample correlation modeling at both global and local feature levels, combined via a learnable multi-scale fusion process, achieving robust and discriminative vessel identification in challenging maritime environments (Liu, 18 Nov 2025).
