Multi-Scale Patch Embedding (MSPE)
- Multi-Scale Patch Embedding (MSPE) is a strategy that extracts and fuses patches at various scales to capture both fine details and global context in data.
- It employs parallel convolutions and adaptive routing in transformer and capsule architectures to improve robustness across variable resolutions.
- MSPE has practical applications in vision transformers, capsule networks, and ECG denoising, demonstrating significant performance gains in classification and signal processing.
Multi-Scale Patch Embedding (MSPE) is a class of architectural techniques in deep learning that augment models’ ability to encode input signals—whether images or time series—by simultaneously extracting and fusing representations from multiple patch (kernel) sizes or feature-map scales. This approach enhances models’ robustness, context modeling, and invariance to input resolution or signal characteristics, with key applications in vision transformers, capsule networks, and transformer-based sequence models. MSPE has been central to recent developments in adapting transformers to variable-resolution vision tasks (Liu et al., 28 May 2024), improving visual recognition in capsule architectures (Hu et al., 23 Aug 2025), and denoising biomedical signals such as ECG (Zhu et al., 12 Jul 2024).
1. Motivation and Principle of Multi-Scale Patch Embedding
Traditional patch-embedding layers in transformers or capsule networks partition input data into fixed-size patches (e.g., image blocks or fixed-width 1D segments), projecting each into one or more embedding vectors. This imposes a single observation scale and risks losing multi-resolution features and context. Key shortcomings include:
- Limited adaptation to variable input resolution: standard patch embeddings trained at one resolution degrade when the input size changes (Liu et al., 28 May 2024).
- Insufficient context aggregation: fine-scale embeddings are sensitive to noise, while coarse scales may miss fine-grained details (Zhu et al., 12 Jul 2024).
- Poor multi-scale fusion: Naïve concatenation or addition of scale-specific representations is suboptimal for models requiring rich hierarchical context (Hu et al., 23 Aug 2025).
MSPE remedies these issues by extracting parallel embeddings at multiple scales, leveraging distinct kernel sizes or hierarchical feature-maps, and fusing the resulting representations either by concatenation or adaptive routing.
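The general principle can be illustrated with a minimal NumPy sketch (the patch sizes, embedding dimension, and random projection weights below are hypothetical, and concatenating the per-scale token sequences is just one simple fusion choice among those discussed above):

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split a (H, W) image into non-overlapping p x p patches,
    flattened into rows of a (num_patches, p*p) matrix."""
    H, W = img.shape
    patches = img.reshape(H // p, p, W // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

def multi_scale_embed(img, patch_sizes, d, rng):
    """Embed one image at several patch scales in parallel, then fuse
    by concatenating the token sequences from all scales."""
    tokens = []
    for p in patch_sizes:
        W_p = rng.standard_normal((p * p, d)) / np.sqrt(p * p)  # per-scale projection
        tokens.append(patchify(img, p) @ W_p)                   # (H*W/p^2, d) tokens
    return np.concatenate(tokens, axis=0)                       # fused token sequence

img = rng.standard_normal((16, 16))
tokens = multi_scale_embed(img, patch_sizes=[4, 8], d=32, rng=rng)
# 16 tokens at scale 4 plus 4 tokens at scale 8
print(tokens.shape)  # (20, 32)
```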
2. MSPE for Vision Transformers: Architecture and Training
In “MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution” (Liu et al., 28 May 2024), the MSPE module replaces the standard patch-embedding convolution in vision transformers with a set of parallel convolutions $\{f_k\}_{k=1}^{K}$, each with a distinct kernel size and learnable weights $w_k$. For an input image $x$:
- Scale-adaptive kernel resizing: each kernel $w_k$ is resized to fit the nominal target patch size for the test resolution via pseudo-inverse resizing, $\hat{w}_k = (B^{+})^{\top} w_k$, so that $\langle Bx, \hat{w}_k \rangle = \langle x, w_k \rangle$ for any patch $x$, where $B$ is the bilinear resize matrix and $B^{+}$ its Moore–Penrose pseudo-inverse, ensuring each convolution adapts to the current input resolution.
- Embedding fusion: for each scale, the patch embedding is computed and optionally positionally encoded via bilinear interpolation to the corresponding patch grid. During inference at a given test resolution $r$, the embedding branch whose training resolution-group centroid is closest to $r$ is selected.
- Multi-resolution training and selection: The set of candidate resolutions is partitioned into groups; for each group, the associated kernel is specifically optimized during training. The only trainable parameters are the MSPE kernels and biases; all transformer encoder weights are frozen.
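The pseudo-inverse resizing step can be sketched in isolation. The snippet below uses a 1D linear-interpolation matrix as a simplified stand-in for the 2D bilinear resize and verifies the defining property: the token computed from a resized patch with the resized kernel matches the token computed from the original patch with the original kernel.

```python
import numpy as np

def linear_resize_matrix(p, q):
    """Matrix B of shape (q, p): B @ x linearly interpolates a
    length-p signal to length q (1D stand-in for bilinear resize)."""
    B = np.zeros((q, p))
    for i in range(q):
        t = i * (p - 1) / (q - 1)          # source coordinate of output sample i
        lo = int(np.floor(t))
        hi = min(lo + 1, p - 1)
        frac = t - lo
        B[i, lo] += 1 - frac
        B[i, hi] += frac
    return B

p, q = 4, 8
B = linear_resize_matrix(p, q)
rng = np.random.default_rng(1)
w = rng.standard_normal(p)                 # original patch-embedding kernel
w_hat = np.linalg.pinv(B).T @ w            # pseudo-inverse-resized kernel

x = rng.standard_normal(p)                 # any input patch
# Token from the resized patch equals the token from the original patch.
assert np.allclose((B @ x) @ w_hat, x @ w)
```

The identity holds because $B$ has full column rank when upsampling, so $B^{+}B = I$ and $\langle Bx, \hat{w}\rangle = x^{\top}(B^{+}B)^{\top}w = \langle x, w\rangle$.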
MSPE enables robust performance across a wide input-resolution range without retraining. Experimentally, MSPE surpasses both vanilla ViT and FlexiViT on ImageNet-1K classification, particularly in low-resolution regimes (e.g., 56.4% Top-1 at low input resolution, compared to 8.5% for vanilla ViT) (Liu et al., 28 May 2024).
3. MSPE in Capsule Networks: Multi-Scale Patchify Capsule Layer
“MSPCaps: A Multi-Scale Patchify Capsule Network with Cross-Agreement Routing for Visual Recognition” (Hu et al., 23 Aug 2025) leverages MSPE in a capsule network architecture optimized for visual recognition. The pipeline contains three key stages:
- Multi-Scale ResNet Backbone (MSRB): produces three feature maps at progressively coarser spatial scales, each preserving complementary context.
- Patchify Capsule Layer (PatchifyCaps):
- Each feature map is partitioned into non-overlapping $P \times P$ macro-patches.
- Each patch is projected via a convolution into a $d$-dimensional capsule, with learned projection weights.
- Positional embedding and LayerNorm are added to preserve spatial information.
- Capsule Fusion via Cross-Agreement Routing (CAR):
- Instead of naive fusion, CAR blocks adaptively select the most coherent cross-scale part-whole capsule alignments by maximizing agreement between finer and coarser scale votes.
- This two-stage routing first fuses high- and mid-resolution capsules, then merges with low-resolution (global) capsules, producing final class capsules optimized for hierarchically consistent feature representations.
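A minimal NumPy sketch of the patchify step is given below; the channel counts, patch size $P = 2$, and capsule dimension $d = 16$ are hypothetical, and the cross-agreement routing stage is omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def patchify_caps(fmap, P, d, rng):
    """Turn a (C, H, W) feature map into (H/P * W/P) capsules of dim d:
    partition into non-overlapping P x P macro-patches, project with a
    learned weight, add a positional embedding, then LayerNorm."""
    C, H, W = fmap.shape
    patches = (fmap.reshape(C, H // P, P, W // P, P)
                   .transpose(1, 3, 0, 2, 4)
                   .reshape(-1, C * P * P))                 # (N, C*P*P)
    W_proj = rng.standard_normal((C * P * P, d)) / np.sqrt(C * P * P)
    pos = 0.02 * rng.standard_normal((patches.shape[0], d))  # positional embedding
    return layer_norm(patches @ W_proj + pos)                # (N, d) capsules

rng = np.random.default_rng(2)
# Three feature maps at progressively coarser scales (channel counts assumed).
fmaps = [rng.standard_normal(s) for s in [(64, 16, 16), (128, 8, 8), (256, 4, 4)]]
caps = [patchify_caps(f, P=2, d=16, rng=rng) for f in fmaps]
print([c.shape for c in caps])  # [(64, 16), (16, 16), (4, 16)]
```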
Ablations demonstrate a substantial accuracy gain for multi-scale capsule-based patch embeddings over any single-scale alternative, indicating that multi-scale encoding and CAR fusion synergistically enhance classification robustness and feature extraction (Hu et al., 23 Aug 2025).
4. MSPE for Sequential Data: Transformer-Based ECG Denoising
In “ECG Signal Denoising Using Multi-scale Patch Embedding and Transformers” (Zhu et al., 12 Jul 2024), MSPE is adapted for one-dimensional sequential (ECG) data as follows:
- Patch extraction by 1D convolutions: given raw ECG snippets (two channels, length $256$), overlapping “patches” of length $k$ are extracted for several kernel sizes $k$, using zero-padding (“same” padding) and unit stride to preserve sequence length.
- Linear projection and concatenation: each patch is linearly projected into a $2$-dimensional embedding. At each time position, the embeddings from all scales are concatenated to yield the multi-scale token for that position.
- Positional encoding: A learnable (or sinusoidal) encoding is added, producing input for the downstream transformer.
- Hyperparameter choices: four scales, the per-scale embedding dimension, and “same” padding jointly balance the resolution-context tradeoff. Small kernel lengths emphasize high-frequency structure (e.g., muscle artifact), while large kernel lengths enhance global awareness (e.g., baseline drift).
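The steps above can be sketched in NumPy as follows; the kernel sizes and per-scale embedding dimension here are illustrative placeholders rather than the paper's exact values:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ms_patch_embed_1d(sig, kernel_sizes, d, rng):
    """Multi-scale 1D patch embedding: for each kernel length k, extract
    zero-padded ('same') overlapping windows at stride 1, project each to
    a d-dim embedding, then concatenate the scales per time step."""
    C, T = sig.shape
    outs = []
    for k in kernel_sizes:
        pad = (k // 2, k - 1 - k // 2)                  # 'same' padding
        padded = np.pad(sig, ((0, 0), pad))
        win = sliding_window_view(padded, k, axis=1)    # (C, T, k)
        win = win.transpose(1, 0, 2).reshape(T, C * k)  # (T, C*k)
        W_k = rng.standard_normal((C * k, d)) / np.sqrt(C * k)
        outs.append(win @ W_k)                          # (T, d) per scale
    return np.concatenate(outs, axis=1)                 # (T, d * num_scales)

rng = np.random.default_rng(3)
ecg = rng.standard_normal((2, 256))                     # two-channel snippet
tokens = ms_patch_embed_1d(ecg, kernel_sizes=[3, 7, 15, 31], d=8, rng=rng)
print(tokens.shape)  # (256, 32)
```

Because every scale uses "same" padding and unit stride, all four token streams share the same length, so per-position concatenation is well defined.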
MSPE enables the transformer to access signal structure at multiple time scales, thereby improving denoising performance (SNR increases, RMSE decreases) and downstream ECG classification relative to any fixed scale alone (Zhu et al., 12 Jul 2024).
5. Comparative Analysis and Empirical Results
The following table aggregates core characteristics and empirical findings for representative recent MSPE implementations:
| Paper (arXiv ID) | Input Modality | MSPE Structure & Fusion | Key Performance Insights |
|---|---|---|---|
| (Liu et al., 28 May 2024) | Images | K × (multi-scale conv); selection by test resolution | Outperforms vanilla ViT and FlexiViT at low resolution; minimal change to frozen backbone |
| (Hu et al., 23 Aug 2025) | Images | Multi-scale feature maps; PatchifyCaps + Cross-Agreement Routing | Three-scale fusion gives higher accuracy than any single-scale (e.g., 88.71% vs 74.81–87.48% on CIFAR10) |
| (Zhu et al., 12 Jul 2024) | 1D ECG signal | 4 × 1D “same” convolutions; concatenation over scales | Improves SNR/RMSE and classification by fusing scales specializing in high- vs. low-frequency patterns |
Empirical consensus indicates that multi-scale embedding strategies result in more robust representations and consistent accuracy gains, especially under variation in input resolution or signal characteristics.
6. Mechanistic Rationale, Limitations, and Prospective Directions
Single-scale patch embeddings are intrinsically limited in their capacity to represent both fine-grained and coarse-grained features. In contrast, MSPE offers simultaneous access to:
- Fine and coarse context: Small-scale patches enhance sensitivity to local edges, high frequencies, or sharp patterns (e.g., QRS in ECG; corners in images); large patches capture longer-term trends or global context (baseline drift in ECG; large objects/scenes in vision).
- Resolution and invariance: Explicit multi-scale design reduces model reliance on fixed pre-processing and increases adaptability (Liu et al., 28 May 2024).
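This division of labor can be demonstrated on a toy signal: a large averaging kernel (a crude stand-in for a coarse patch scale) recovers slow drift, while the residual of a small kernel isolates a sharp transient. All values below are illustrative.

```python
import numpy as np

# A toy two-component signal: slow baseline drift plus a sharp local spike.
t = np.linspace(0, 1, 512)
drift = np.sin(np.pi * t)          # low-frequency component
spike = np.zeros_like(t)
spike[256] = 5.0                   # high-frequency transient
sig = drift + spike

def moving_avg(x, k):
    """'same'-padded moving average, a stand-in for a length-k conv kernel."""
    pad = (k // 2, k - 1 - k // 2)
    return np.convolve(np.pad(x, pad), np.ones(k) / k, mode="valid")

coarse = moving_avg(sig, 101)      # large kernel: tracks the drift
fine = sig - moving_avg(sig, 5)    # small-kernel residual: keeps the spike

# The large scale recovers the drift; the small scale isolates the transient.
assert np.abs(coarse - drift).max() < 0.5
assert fine[256] > 3.0
```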
Known limitations include simplistic positional encoding interpolation, patch-embedding-only adaptation (leaving frozen transformers suboptimal), and unexplored directions in soft-gated or attention-based scale fusion (Liu et al., 28 May 2024). Future research aims to integrate advanced positional encoding, joint downstream encoder fine-tuning, and improved scale selection mechanisms.
7. Application-Specific Strategies and Outcomes
MSPE methods are effective across vision tasks (classification, segmentation, detection) and sequential modeling, with domain-appropriate adaptations:
- Vision: Multi-scale convolutions or PatchifyCaps permit ViT and capsule architectures to achieve high accuracy across input sizes with limited computational overhead (Liu et al., 28 May 2024, Hu et al., 23 Aug 2025).
- Sequence modeling: Overlapping, multi-scale 1D patches allow transformers to disentangle temporally-localized noise from global drift in physiological signals (Zhu et al., 12 Jul 2024).
Quantitative results demonstrate the generalizability and practical importance of MSPE: for instance, in semantic segmentation, MSPE raises mIoU from ≈3% to ≈40% at reduced input sizes over baseline SETR (Liu et al., 28 May 2024); in ECG denoising, multi-scale patching outperforms any single scale in both SNR and RMSE (Zhu et al., 12 Jul 2024); in visual recognition, multi-scale capsule fusion achieves accuracy not attainable with single-scale capsule or ViT methods (Hu et al., 23 Aug 2025). This suggests that MSPE constitutes an essential component for next-generation robust transformer-based and capsule-based models.