Efficient Multi-scale Transformer (EMFormer)
- EMFormer is a transformer-based architecture that integrates fused multi-scale convolutions, hybrid attention, and gating mechanisms for efficient spatiotemporal modeling.
- It employs accumulative context finetuning and composite loss functions to enhance long-range forecasting and multi-resolution PDE solving while reducing computational cost.
- Empirical results show significant improvements in training speed and accuracy across weather forecasting, vision tasks, and PDE benchmarks compared to traditional methods.
An Efficient Multi-scale Transformer (EMFormer) is a transformer-based neural architecture designed to achieve computationally efficient, scalable multi-scale feature extraction and long-context modeling. The EMFormer paradigm is instantiated in fields such as weather forecasting (Chen et al., 1 Feb 2026) and PDE solving (Luo et al., 24 May 2025), as well as comparable vision and language domains. It combines architectural innovations—such as fused multi-scale convolutions, hybrid attention, and gating mechanisms for heterogeneous input handling—with advanced training strategies like accumulative context finetuning and composite sinusoidal loss functions to enhance accuracy and efficiency across diverse spatiotemporal tasks.
1. Architectural Principles of EMFormer
Conventional multi-scale transformers leverage parallel convolutional branches (e.g., 1×1, 3×3, 5×5 kernels) to capture receptive fields at various spatial scales, at a compute cost that grows with the sum of the kernel areas, $O(\sum_i k_i^2)$ per output element. EMFormer eliminates these redundant branches by fusing their effect into a single "multi-convs" layer, mathematically equivalent to the sum of multiple convolutions but realized as a single kernel with cost $O(k_{\max}^2)$ per output element. Let $W_1, \dots, W_n$ denote the separate kernels; all are zero-padded to a common size $K \times K$ (with $K = \max_i k_i$) and summed, yielding

$$W_{\text{fused}} = \sum_{i=1}^{n} \mathcal{P}(W_i),$$

where $\mathcal{P}$ denotes zero-padding and alignment. Custom back-propagation kernels decouple the gradients for each scale, preserving their independent optimization: because $W_{\text{fused}}$ is a sum, each $W_i$ receives the crop of $\partial \mathcal{L} / \partial W_{\text{fused}}$ restricted to its own support. The overall backbone employs a hierarchical four-stage encoder–decoder structure, interleaving global self-attention and window-based attention blocks. Hybrid attention integrates global and local context, while residual pruning/recovering allows information to flow across scales without excess compute (Chen et al., 1 Feb 2026).
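The kernel-fusion identity can be verified numerically. The NumPy sketch below (naive single-channel "same"-padded convolutions, illustrative 1×1/3×3/5×5 kernels) checks that one pass with the fused kernel reproduces the summed output of the separate branches:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 2-D convolution with 'same' zero padding (single channel, odd kernel)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def fuse_kernels(kernels):
    """Zero-pad every kernel to the largest size and sum them (the 'multi-convs' fusion)."""
    K = max(w.shape[0] for w in kernels)
    fused = np.zeros((K, K))
    for w in kernels:
        p = (K - w.shape[0]) // 2
        fused[p:p + w.shape[0], p:p + w.shape[0]] += w
    return fused

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
kernels = [rng.standard_normal((k, k)) for k in (1, 3, 5)]

multi_branch = sum(conv2d_same(x, w) for w in kernels)   # three forward passes
single_pass = conv2d_same(x, fuse_kernels(kernels))      # one fused pass
assert np.allclose(multi_branch, single_pass)
```

The equivalence follows from the linearity of convolution in the kernel, which is why a single fused pass can replace all branches at the cost of the largest one.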
In PDE applications, the encoder–decoder structure is decoupled: mesh points form the encoder input while query coordinates supply the decoder input, permitting zero-shot resolution changes. A Gated Condition Embedding (GCE) module encodes diverse boundary, geometry, and physical parameters into fixed-dimensional tokens, disambiguating zero-valued inputs from missing ones (Luo et al., 24 May 2025).
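A toy NumPy sketch of the gating idea follows; the condition names (`bc`, `geom`, `nu`), dimensions, and linear projections are hypothetical stand-ins, not the paper's actual GCE layout:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # token dimension (illustrative)

# Hypothetical per-condition projections and a learned "absent" embedding.
W = {name: rng.standard_normal((D, 1)) * 0.1 for name in ("bc", "geom", "nu")}
learned_absent = rng.standard_normal(D) * 0.1

def gce_token(conditions):
    """Encode a dict of (possibly missing) scalar conditions into one D-dim token.

    A missing key maps to the learned 'absent' embedding, so a condition whose
    value happens to be 0.0 is *not* confused with an unspecified condition.
    """
    token = np.zeros(D)
    for name, Wn in W.items():
        if name in conditions:
            token += (Wn @ np.array([conditions[name]])).ravel()  # gate open
        else:
            token += learned_absent                               # gate closed
    return token

t_zero = gce_token({"bc": 0.0, "geom": 1.0, "nu": 0.01})  # bc given as exactly 0
t_missing = gce_token({"geom": 1.0, "nu": 0.01})          # bc not supplied
assert not np.allclose(t_zero, t_missing)  # zero-valued != missing
```

The assertion makes the disambiguation property concrete: the two tokens differ by the learned absent embedding.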
2. Multi-Scale Feature Extraction Mechanisms
EMFormer’s efficiency in multi-scale modeling derives from using a single convolutional operation to synthesize multi-scale cues. This contrasts with earlier methods (e.g., the pyramid pooling or multi-branch schemes in Lawin Transformer (Yan et al., 2022)) and eliminates multiple forward/backward paths.
For structured vision tasks, the backbone downsamples and upsamples feature maps across four spatial resolutions, injecting fused multi-scale information at each stage. In the context of PDEs, spatial locality is preserved during 1D sequence serialization using high-order Hilbert curves, followed by patch grouping and embedding. This patching reduces attention complexity from $O(N^2)$ to $O((N/P)^2)$ for sequence length $N$ and patch size $P$.
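The quadratic saving from patching is easy to quantify; a short arithmetic sketch with illustrative sizes:

```python
# Query-key pair counts before and after patch grouping (illustrative sizes).
N = 65536   # mesh points serialized along the Hilbert curve
P = 64      # patch size: P consecutive points are grouped into one token

pairs_pointwise = N ** 2          # full point-to-point attention, O(N^2)
pairs_patched = (N // P) ** 2     # attention over N/P patch tokens, O((N/P)^2)

print(pairs_pointwise // pairs_patched)  # → 4096, i.e. a factor of P^2
```

Grouping 64 points per token shrinks the attention matrix by a factor of $P^2 = 4096$ here, which is where the near-linear scaling in mesh size comes from.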
3. Accumulative Context Finetuning for Long-Range Forecasting
EMFormer introduces an accumulative context finetuning pipeline to mitigate catastrophic forgetting and error accumulation in auto-regressive, long-horizon settings. The mechanism manages a fixed-length cache of key–value (KV) pairs, updating and pruning it based on blended historical and current attention scores: at each step the two are combined (e.g., as a convex blend $s = \lambda\, s_{\text{hist}} + (1-\lambda)\, s_{\text{cur}}$), with the latest token always preserved. KV pairs with the highest aggregated scores are retained, enforcing both recency and salience. Pseudocode for the cache update is specified in (Chen et al., 1 Feb 2026).
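A minimal Python sketch of such a cache update, with an assumed convex blend and illustrative scores (the paper's pseudocode gives the exact rule):

```python
import numpy as np

def update_kv_cache(cache, new_kv, scores_hist, scores_cur, capacity, lam=0.5):
    """Fixed-capacity KV cache pruned by blended attention scores (sketch).

    `lam` blends historical and current scores; the newest entry is always
    kept. The blending rule, `lam`, and the score bookkeeping here are
    illustrative stand-ins for the paper's exact update.
    """
    cache = cache + [new_kv]
    scores = lam * np.asarray(scores_hist) + (1 - lam) * np.asarray(scores_cur)
    if len(cache) <= capacity:
        return cache
    # Rank all but the newest entry; keep the top (capacity - 1), then the newest.
    top = np.argsort(scores[:-1])[::-1][: capacity - 1]
    kept = sorted(top.tolist())  # preserve temporal order among survivors
    return [cache[i] for i in kept] + [cache[-1]]

# Token ids stand in for (key, value) pairs; one blended score per cached entry.
cache = list(range(5))
out = update_kv_cache(cache, new_kv=5,
                      scores_hist=[0.9, 0.1, 0.8, 0.2, 0.6, 0.0],
                      scores_cur=[0.3, 0.1, 0.6, 0.2, 0.5, 1.0],
                      capacity=4)
print(out)  # → [0, 2, 4, 5]: the highest-scoring entries plus the newest token
```

Note that the newest token bypasses the ranking entirely, matching the "latest token always preserved" rule.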
In the PDE framework, a decoupled encoder–decoder eliminates intra-query dependencies; cross-attention from arbitrary query grids to encoder outputs allows for consistent accuracy even under resolution changes—a property critical for multi-resolution physical simulations.
4. Composite Loss Functions and Optimization
In geospatial and variable-rich tasks, EMFormer balances location-aware and variable-adaptive objectives through a composite loss with a learnable sinusoidal schedule that weights a latitude-weighted MSE $\mathcal{L}_{\text{lat}}$ against a variable-adaptive MSE $\mathcal{L}_{\text{var}}$ with per-variable rates. The schedule's scalar is trained rather than hand-set, seamlessly interpolating between geographic and channel-specific optimization throughout pretraining and finetuning (Chen et al., 1 Feb 2026).
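A minimal sketch of such a blend, assuming a $\sin^2/\cos^2$ weighting (the exact parameterization of the paper's schedule may differ):

```python
import numpy as np

def composite_loss(l_lat, l_var, alpha):
    """Sinusoidal blend of two objectives (an illustrative form).

    `alpha` plays the role of the learnable scalar; the sin^2/cos^2 pair
    keeps both weights in [0, 1] and summing to 1, so the loss smoothly
    interpolates between the two objectives as alpha is trained.
    """
    w = np.sin(alpha) ** 2
    return w * l_lat + (1.0 - w) * l_var

print(composite_loss(1.0, 3.0, 0.0))        # all weight on the variable-adaptive loss
print(composite_loss(1.0, 3.0, np.pi / 2))  # all weight on the latitude-weighted loss
```

Because the two weights always sum to one, the overall loss scale stays stable while the emphasis shifts.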
For PDEs, the relative $L_2$ norm is used:

$$\mathcal{L} = \frac{\lVert \hat{u} - u \rVert_2}{\lVert u \rVert_2},$$

where $u$ is the reference solution and $\hat{u}$ the prediction. Other optimization techniques include AdamW, L-BFGS for physics-only scenarios, and cosine-annealing schedules.
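A straightforward NumPy implementation of the relative $L_2$ error $\lVert \hat{u} - u \rVert_2 / \lVert u \rVert_2$:

```python
import numpy as np

def relative_l2(pred, target, eps=1e-12):
    """Relative L2 error: ||pred - target||_2 / ||target||_2 (eps avoids divide-by-zero)."""
    return np.linalg.norm(pred - target) / (np.linalg.norm(target) + eps)

u = np.array([1.0, 2.0, 2.0])              # reference solution, ||u||_2 = 3
u_hat = u + np.array([0.0, 0.0, 0.3])      # prediction with a 0.3 perturbation
print(relative_l2(u_hat, u))               # → 0.1 (i.e. 10% relative error)
```

Normalizing by the target norm makes errors comparable across PDE benchmarks whose solution magnitudes differ by orders of magnitude.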
5. Computational Efficiency and Scaling Analysis
Fusing multi-scale convolutions yields explicit arithmetic savings: the forward and backward pass is accelerated by a factor of 5.69 relative to naïve multi-branch layers, with the theoretical FLOP count reduced from $O(\sum_i k_i^2)$ to $O(k_{\max}^2)$ per output element. On ERA5 weather prediction, training time dropped from 83 h to 60 h (a 27% reduction), while ImageNet-1K training with EMFormer-B (multi-convs) was 13% faster than with a standard module. Image throughput and latency similarly improve without loss of accuracy (ERA5: 98 ms vs. 115 ms for STCast) (Chen et al., 1 Feb 2026).
In multi-scale PDE solvers, the cost of attention reduces from quadratic to nearly linear in mesh size, due to patching and efficient input embedding. Memory requirements scale as $O((N/P)^2)$ instead of $O(N^2)$ (Luo et al., 24 May 2025).
6. Empirical Results in Core Benchmarks
Weather and Geoscience
- Long-term weather forecasting (1.4° grid): EMFormer with accumulative context finetuning achieved a 6-hour RMSE of 0.0599 (ACC 0.9949) and a 10-day RMSE of 0.5094 (ACC 0.5389), outperforming Pangu-Weather, GraphCast, and OneForecast over 10-day horizons.
- Typhoon track prediction: Mean distance error of 88.49 km on Western Pacific cyclones versus 119.17 km for AIFS, maintaining its advantage at long lead times (Chen et al., 1 Feb 2026).
Vision
- ImageNet-1K classification (EMFormer-T/S/B): Top-1 accuracy of 83.2%, 84.1%, 84.4% at 5.1, 7.4, 12.3 GFLOPs, outperforming or matching ConvNeXt-T and strong multi-scale vision baselines.
- ADE20K semantic segmentation (EMFormer-B): mIoU of 49.6% at 69M parameters/251 GFLOPs, matching or surpassing Swin and Mamba baselines at ∼25% lower FLOPs (Chen et al., 1 Feb 2026).
PDE Solving
- Robust relative accuracy across benchmarks (Poisson, DarcyFlow, ShapeOps, Heat2d). EMFormer achieves competitive or leading relative $L_2$ errors (including on Shape-Car), with substantial inference speedups over alternatives such as Transolver.
- The Gated Condition Embedding module achieves lower error than simple MLP encoders, with a moderate patch size optimal for the accuracy/memory trade-off (Luo et al., 24 May 2025).
7. Context within Multi-Scale Transformers
EMFormer represents a general, scalable abstraction for multi-scale modeling. In contrast to Lawin Transformer (Yan et al., 2022), which employs large window attention and multi-path decoders with moderate efficiency gains, EMFormer targets end-to-end reduction in arithmetic and memory cost at all scales. While FMMformers (Nguyen et al., 2021) decompose attention into near-field (banded) and far-field (low-rank) components to achieve near-linear complexity, EMFormer fuses convolutional intermediates and focuses on fine-grained, application-specific multi-scale trade-offs.
EMFormer’s abstraction is versatile, supporting:
- Arbitrary spatial resolutions and query sets via encoder–decoder decoupling (e.g., in PDEs),
- Salient temporal memory management via accumulative finetuning (e.g., in forecasting),
- Heterogeneous physical/structural input via gating and learned fusion,
- Application-driven loss balancing via continuous, trainable objectives.
A plausible implication is that EMFormer-type approaches will remain foundational as transformer modeling further penetrates space–time forecasting, scientific computing, and multi-task vision.
References
- (Chen et al., 1 Feb 2026) "EMFormer: Efficient Multi-Scale Transformer for Accumulative Context Weather Forecasting"
- (Luo et al., 24 May 2025) "MMET: A Multi-Input and Multi-Scale Transformer for Efficient PDEs Solving"
- (Yan et al., 2022) "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention"
- (Nguyen et al., 2021) "FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention"