MCFormer: Multi-Scale Correlation Transformer
- The paper introduces a unified Transformer architecture that models global and local correlations to handle vessel occlusions and intra-identity variations.
- It employs custom Global and Local Correlation Modules to align features across samples, achieving state-of-the-art performance on vessel re-ID benchmarks.
- The approach uses a multi-scale fusion strategy with channel-wise attention and momentum-based memory banks to suppress outlier effects and compensate for missing parts.
The Multi-scale Correlation-aware Transformer Network (MCFormer) is a unified Transformer-based architecture designed for maritime vessel re-identification (Re-ID), specifically addressing the substantial intra-identity variation and frequent local part-missing characteristic of vessel imagery. Unlike pedestrian-centric Re-ID models, MCFormer directly models inter-image global and local correlations across the entire training set via two custom modules—the Global Correlation Module (GCM) and the Local Correlation Module (LCM)—and fuses these correlation-aware features using a multi-scale strategy. The approach achieves state-of-the-art performance on multiple vessel Re-ID benchmarks by robustly suppressing outlier effects due to viewpoint or occlusion and compensating for missing parts via inter-sample feature alignment (Liu, 18 Nov 2025).
1. Architecture Overview and Data Flow
MCFormer operates on a training set $\{x_i\}_{i=1}^{N}$, where each input image is split into $P$ non-overlapping patches. Each image embedding includes a class token $t_{\mathrm{cls}}$ and three part tokens $t_{p_1}, t_{p_2}, t_{p_3}$, with learned positional embeddings added to give an input sequence of length $1 + 3 + P$.
A ViT-B/16 encoder (12 layers, 12 heads, embedding dimension $d = 768$) produces global features $g_i$ (from the class token) and local features $l_i^p$, $p \in \{1, 2, 3\}$ (from the part tokens).
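As a concrete sketch of this token layout, assuming the standard 224x224 ViT-B/16 input (the random arrays stand in for learned embeddings):

```python
import numpy as np

# Token layout assumed by MCFormer's ViT-B/16 encoder.
# Patch size, image size, and embedding dim follow standard ViT-B/16;
# the three part tokens are the MCFormer-specific addition.
patch, img, d = 16, 224, 768
n_patches = (img // patch) ** 2                     # 196 non-overlapping patches

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((n_patches, d))  # flattened, projected patches
cls_token = rng.standard_normal((1, d))             # global class token
part_tokens = rng.standard_normal((3, d))           # three learnable part tokens

tokens = np.concatenate([cls_token, part_tokens, patch_tokens], axis=0)
pos_embed = rng.standard_normal(tokens.shape)       # learned positional embeddings
x = tokens + pos_embed                              # encoder input sequence

print(x.shape)  # (200, 768): 1 class + 3 part + 196 patch tokens
```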
The pipeline comprises three sequential stages:
- Transformer Feature Extraction: Produces global and part-level embeddings.
- Correlation Modeling:
- GCM: Aggregates all global features $\{g_i\}$ into correlated representations $\tilde{g}_i$ via a learned affinity matrix $\bar{A} \in \mathbb{R}^{N \times N}$.
- LCM: Maintains a momentum-updated memory bank of all local part features, aligns each $l_i^p$ with its $k$-nearest positives, and optimizes a clustering loss to yield correlated local descriptors $\tilde{l}_i^p$.
- Multi-scale Fusion: Fuses $\tilde{g}_i$ and $\{\tilde{l}_i^p\}$ in a channel-wise manner, producing a final feature $f_i$ for identity classification and metric learning losses.
2. Global Correlation Module (GCM)
GCM explicitly models global consistency across the dataset. Given global features $G = [g_1, \dots, g_N]^{\top} \in \mathbb{R}^{N \times d}$, linear projections $W_Q$, $W_K$, $W_V$ yield for each $g_i$:
$$q_i = W_Q g_i, \quad k_i = W_K g_i, \quad v_i = W_V g_i.$$
The unnormalized affinity matrix is
$$A_{ij} = \frac{q_i^{\top} k_j}{\sqrt{d}}.$$
Softmax normalization is applied row-wise:
$$\bar{A}_{ij} = \frac{\exp(A_{ij})}{\sum_{j'=1}^{N} \exp(A_{ij'})}.$$
Correlated global features are obtained as weighted sums:
$$\tilde{g}_i = \sum_{j=1}^{N} \bar{A}_{ij} \, v_j.$$
For computational efficiency when $N$ is large, GCM supports low-rank factorization using $m \ll N$ landmark embeddings, reducing complexity from $O(N^2 d)$ to $O(N m d)$. Empirical results indicate that a moderate number of landmarks provides a good tradeoff between accuracy and compute.
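The GCM computation above can be sketched in NumPy; this is the dense $O(N^2)$ form without the landmark factorization, and the weights and sizes are illustrative:

```python
import numpy as np

def gcm(G, Wq, Wk, Wv):
    """Global Correlation Module sketch: attention over all N global features."""
    Q, K, V = G @ Wq, G @ Wk, G @ Wv
    d = Q.shape[1]
    A = Q @ K.T / np.sqrt(d)                  # unnormalized affinity, shape (N, N)
    A = A - A.max(axis=1, keepdims=True)      # numerical stability before exp
    A = np.exp(A)
    A_bar = A / A.sum(axis=1, keepdims=True)  # row-wise softmax normalization
    return A_bar @ V                          # correlated global features

rng = np.random.default_rng(0)
N, d = 8, 16
G = rng.standard_normal((N, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]  # toy projections
G_corr = gcm(G, *W)
print(G_corr.shape)  # (8, 16)
```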
3. Local Correlation Module (LCM)
LCM aligns local (part-wise) features across the mini-batch or dataset to mitigate local occlusions or missing regions. For each part index $p \in \{1, 2, 3\}$ and image $i$, a memory bank entry $M_i^p$ is maintained and updated via
$$M_i^p \leftarrow m \, M_i^p + (1 - m) \, l_i^p,$$
with momentum $m \in [0, 1)$.
For each part feature $l_i^p$, cosine similarities are computed with the bank; the top-$k$ most similar positives, $\mathcal{P}_i^p$, are selected. The per-part clustering loss is
$$\mathcal{L}_{\mathrm{cl}}^{p} = -\frac{1}{k} \sum_{j \in \mathcal{P}_i^p} \log \frac{\exp\!\big(s(l_i^p, M_j^p) / \tau\big)}{\sum_{j'} \exp\!\big(s(l_i^p, M_{j'}^p) / \tau\big)},$$
with $s(\cdot, \cdot)$ the cosine similarity and $\tau$ the temperature. After training, $k$-NN-aligned local descriptors are aggregated and projected via a $1 \times 1$ convolution to $\tilde{l}_i^p$.
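A minimal NumPy sketch of the memory-bank update and an InfoNCE-style per-part clustering loss; the momentum, $k$, and temperature values are illustrative assumptions, and the paper's exact loss form may differ:

```python
import numpy as np

def update_bank(bank, feats, m=0.9):
    """Momentum update of a per-part memory bank (m=0.9 is an assumed value)."""
    return m * bank + (1.0 - m) * feats

def clustering_loss(f, bank, k=3, tau=0.1):
    """Pull a part feature f toward its top-k most cosine-similar bank
    entries via an InfoNCE-style objective (k and tau are assumed values)."""
    f_n = f / np.linalg.norm(f)
    b_n = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b_n @ f_n                           # cosine similarities with the bank
    topk = np.argsort(sims)[-k:]               # indices of the k nearest positives
    logits = sims / tau
    log_denom = np.log(np.exp(logits).sum())   # log of the softmax denominator
    return -np.mean(logits[topk] - log_denom)  # average -log p(positive)

rng = np.random.default_rng(0)
bank = rng.standard_normal((32, 16))           # 32 bank entries, 16-dim features
bank = update_bank(bank, rng.standard_normal((32, 16)))
loss = clustering_loss(rng.standard_normal(16), bank)
```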
4. Multi-Scale Fusion Strategy
MCFormer fuses correlated global and local features, $\tilde{g}$ and $\tilde{l}$, via channel-wise attention gating:
$$f = \alpha \odot \tilde{g} + (1 - \alpha) \odot \tilde{l},$$
where the channel gate $\alpha \in (0, 1)^{d}$ is computed by a two-layer 1x1 convolutional network with ReLU/Sigmoid activations after global average pooling. For multi-scale cases ($S$ different resolutions), the fusion is extended across scales with learned weights $w_s$:
$$f = \sum_{s=1}^{S} w_s \, f_s.$$
Fusion via multi-scale channel attention outperforms addition and concatenation schemes, as shown by ablation.
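A minimal sketch of the channel-wise gating, assuming a sigmoid gate that convexly combines the global and local branches per channel; the bottleneck size and the exact gating form are illustrative:

```python
import numpy as np

def channel_attention_fusion(g, l, W1, W2):
    """Channel-wise attention gating sketch: a two-layer bottleneck with
    ReLU then Sigmoid produces per-channel weights alpha, and the fused
    feature is alpha*g + (1-alpha)*l (an assumed gating form)."""
    z = 0.5 * (g + l)                        # stand-in for global average pooling
    h = np.maximum(W1 @ z, 0.0)              # ReLU bottleneck layer
    alpha = 1.0 / (1.0 + np.exp(-(W2 @ h)))  # sigmoid channel gate in (0, 1)
    return alpha * g + (1.0 - alpha) * l

rng = np.random.default_rng(0)
d = 16
g, l = rng.standard_normal(d), rng.standard_normal(d)
W1 = rng.standard_normal((d // 4, d)) * 0.1  # bottleneck: d -> d/4
W2 = rng.standard_normal((d, d // 4)) * 0.1  # expand back: d/4 -> d
f = channel_attention_fusion(g, l, W1, W2)
print(f.shape)  # (16,)
```

Because the gate is strictly between 0 and 1, each fused channel lies between the corresponding global and local channel values, which is what makes this a gating rather than an additive scheme.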
5. Loss Functions and Optimization
The total loss is
$$\mathcal{L} = \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}} + \lambda_{\mathrm{tri}} \mathcal{L}_{\mathrm{tri}} + \lambda_{\mathrm{cl}} \mathcal{L}_{\mathrm{cl}},$$
where:
- $\mathcal{L}_{\mathrm{id}}$: cross-entropy identity classification loss,
- $\mathcal{L}_{\mathrm{tri}}$: optional batch-hard triplet loss,
- $\mathcal{L}_{\mathrm{cl}}$: sum of the part-level clustering losses.
The scalar coefficients $\lambda_{\mathrm{id}}$, $\lambda_{\mathrm{tri}}$, and $\lambda_{\mathrm{cl}}$ balance the three terms.
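The batch-hard triplet term and the weighted combination can be sketched as follows; the margin and lambda values here are placeholders, not the paper's settings:

```python
import numpy as np

def batch_hard_triplet(dist, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, take the hardest (farthest)
    positive and hardest (closest) negative (margin is an assumed value)."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        pos = dist[i][(labels == labels[i]) & (np.arange(n) != i)]
        neg = dist[i][labels != labels[i]]
        total += max(0.0, pos.max() - neg.min() + margin)
    return total / n

def total_loss(l_id, l_tri, l_cl, lam_id=1.0, lam_tri=1.0, lam_cl=0.5):
    # Weighted combination of the three objectives; lambdas are placeholders.
    return lam_id * l_id + lam_tri * l_tri + lam_cl * l_cl

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 8))              # toy embeddings, 2 per identity
labels = np.array([0, 0, 1, 1, 2, 2])
dist = np.linalg.norm(feats[:, None] - feats[None, :], axis=2)  # pairwise L2
l_tri = batch_hard_triplet(dist, labels)
loss = total_loss(l_id=1.2, l_tri=l_tri, l_cl=0.8)
```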
6. Experimental Evaluation and Ablations
MCFormer is evaluated on VesselReID (1248 IDs, 30,587 images), Warships-ReID (163 IDs, 4,780 images), and BoatReID (107 IDs, 5,523 images), using mAP and CMC (Rank-1,5,10). The model surpasses TransReID, e.g., on VesselReID achieving Rank-1/mAP of 72.8%/63.4% vs. 68.2%/58.7%. Ablation indicates both GCM and LCM contribute complementary gains:
| Method | VesselReID (R1/mAP) | Warships-ReID (R1/mAP) |
|---|---|---|
| ViT-B/16 baseline | 66.7/54.5 | 91.7/77.2 |
| + GCM only | 70.2/62.2 | 95.8/87.3 |
| + LCM only | 70.3/57.0 | 92.9/78.2 |
| GCM+LCM (MCFormer) | 72.8/63.4 | 96.1/88.4 |
Multi-scale channel-attention fusion yields superior accuracy over simple addition or concatenation. The optimal number of part tokens is three, matching the configuration used throughout. GCM's landmark-based complexity reduction shows diminishing returns beyond a moderate number of landmarks. Attention-map analyses reveal that GCM attends to globally salient structures (hull, superstructure), while LCM emphasizes locally distinct regions.
7. Effectiveness and Extensions
Multi-scale correlation modeling is effective in the maritime domain due to:
- Strong intra-identity variation (caused by viewpoint, lighting, weather) requiring global consistency regularization,
- Frequently missing local vessel parts (hull, funnel, mast) due to occlusion, which are best addressed by aggregating aligned local features from similar samples.
Potential extensions include temporal correlation for video-based sequential Re-ID, integrating cross-modal sensor data (SAR or infrared) via a unified memory bank, generalization to other Re-ID domains with similar part-level variability and occlusion (e.g., aerial vehicles, animals), and replacing the static memory bank with a differentiable cache for online/batch updates.
MCFormer demonstrates that Transformer Re-ID architectures benefit from explicit inter-sample correlation modeling at both global and local feature levels, combined via a learnable multi-scale fusion process, achieving robust and discriminative vessel identification in challenging maritime environments (Liu, 18 Nov 2025).