SEAFormer: Domain-Adapted Transformer Models
- SEAFormer is an umbrella term for Transformer-based neural architectures that integrate specialized attention modules (e.g., Squeeze-Enhanced Axial, State-Exchange) tailored to diverse domains.
- The models employ domain-specific mechanisms to achieve efficiency and scalability, as demonstrated in mesh-based physics simulations, mobile visual recognition, and combinatorial optimization.
- SEAFormer frameworks set state-of-the-art benchmarks by reducing errors, lowering latency, and improving interpretability across simulations, language modeling, and real-world routing problems.
SEAFormer refers to a set of Transformer-based neural architectures that introduce novel, domain-adapted attention mechanisms to address efficiency or fidelity challenges in various domains, including computer vision, molecular/physics simulation, combinatorial optimization, and language modeling. The term SEAFormer is associated with several major and technically distinct frameworks, each of which leverages a specialized “SEA” (Squeeze-Enhanced Axial, State-Exchange Attention, Sparse/Estimated Attention, or Spatial/Edge-aware Attention) module tailored to its application context.
1. SEAFormer in Physics-based Simulation: State-Exchange Attention Transformers
SEAFormer, as introduced by Esmati et al. (Esmati et al., 2024), is a physics-aware autoregressive architecture for mesh-based dynamical system modeling. The key architectural contributions are (i) a compact Vision Transformer (ViT)-style mesh autoencoder and (ii) the State-Exchange Attention (SEA) module for cross-field information exchange, thereby enabling accurate joint prediction of strongly interdependent physical fields without the error accumulation typical in rollouts.
Mesh Autoencoder Structure:
- The spatial domain is discretized into mesh nodes, which are grouped into spatial patches via threshold-based partitioning; each patch encodes the physical fields of its constituent cells.
- Patch tensors are zero-padded to a uniform size and embedded into a latent space using per-patch MLPs and a multi-head self-attention (MHSA) block, yielding a compact latent code per patch (a minimal sketch of the encode/decode structure follows this list).
- The decoder mirrors the encoder with symmetric MLPs and un-flattening, and the autoencoder is trained with a reconstruction loss between the input and decoded fields.
- Empirical reconstruction errors for cylinder flow and multiphase flow are substantially lower than those of recent baselines.
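A minimal PyTorch sketch of this encode/decode structure is given below; the patch dimensionality, latent width, head count, and MSE reconstruction objective are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeshPatchAutoencoder(nn.Module):
    """Toy ViT-style autoencoder over zero-padded mesh patches.

    Input: (batch, num_patches, patch_dim) flattened patch tensors containing
    the physical fields of each patch. Sizes are illustrative only.
    """
    def __init__(self, patch_dim, latent_dim=128, heads=8):
        super().__init__()
        self.embed = nn.Sequential(                      # per-patch MLP embedding
            nn.Linear(patch_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim))
        self.mhsa = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)
        self.decode = nn.Sequential(                     # symmetric per-patch MLP decoder
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, patch_dim))

    def forward(self, patches):
        z = self.embed(patches)                          # (B, P, latent_dim)
        attn, _ = self.mhsa(z, z, z)                     # mix information across patches
        z = self.norm(z + attn)
        return z, self.decode(z)                         # latent codes, reconstruction

# Reconstruction loss between input patches and decoded patches (illustrative objective).
model = MeshPatchAutoencoder(patch_dim=64)
x = torch.randn(4, 32, 64)                               # 4 samples, 32 patches, 64 values each
latent, recon = model(x)
loss = F.mse_loss(recon, x)
```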
Autoregressive Temporal Modeling:
- Each field group is embedded and passed to a causal, decoder-only Transformer. Two key additions are (a) Adaptive LayerNorm (AdaLN), which conditions on time-invariant system parameters, and (b) SEA for explicit cross-field attention.
- The SEA module treats each physical field as an expert: after self-attention, each expert exchanges information with every other expert via bottlenecked cross-attention, and the sum of the incoming cross-attention messages is injected back into that expert, ensuring multidirectional exchange of state variables (a sketch follows this list).
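A hedged PyTorch sketch of this exchange pattern: each expert's post-self-attention state cross-attends, through a low-rank bottleneck, to every other expert's state, and the summed messages are injected back. The bottleneck width, head count, and exact placement of the projections are assumptions.

```python
import torch
import torch.nn as nn

class StateExchangeAttention(nn.Module):
    """Illustrative cross-field exchange between per-field expert states."""
    def __init__(self, dim, bottleneck=64, heads=4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)            # bottleneck projection
        self.cross = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, expert_states):
        """expert_states: list of (B, T, dim) post-self-attention activations, one per field."""
        outputs = []
        for i, h_i in enumerate(expert_states):
            q = self.down(h_i)
            messages = torch.zeros_like(q)
            for j, h_j in enumerate(expert_states):
                if j == i:
                    continue
                kv = self.down(h_j)
                msg, _ = self.cross(q, kv, kv)            # expert i attends to expert j
                messages = messages + msg
            outputs.append(h_i + self.up(messages))       # inject the summed exchange back
        return outputs
```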
Parameter Conditioning and Decoding:
- Time-invariant parameters are projected and injected after SEA via a two-layer MLP, enabling field-dependent temporal prediction under varied physical regimes.
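As a rough illustration of this conditioning path (the exact placement and layer widths are assumptions), a small MLP can map the time-invariant parameters to per-channel scale and shift terms applied after normalization:

```python
import torch
import torch.nn as nn

class AdaLNConditioning(nn.Module):
    """Adaptive LayerNorm: scale/shift predicted from time-invariant system parameters."""
    def __init__(self, dim, param_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(                         # two-layer conditioning MLP
            nn.Linear(param_dim, dim), nn.SiLU(),
            nn.Linear(dim, 2 * dim))

    def forward(self, x, params):
        # x: (B, T, dim) token states; params: (B, param_dim) time-invariant parameters
        scale, shift = self.mlp(params).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```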
Results and Metrics:
- Table 1 below summarizes rollout mean squared error (MSE) on the canonical cylinder-flow CFD benchmark, reported for the velocity components (u, v) and pressure (p); the proposed model attains the lowest error among all compared models.
| Model | u | v | p | Avg |
|---|---|---|---|---|
| MGN | 98 | 2036 | 673 | 936 |
| MGN-NI | 25 | 778 | 136 | 313 |
| GMR-GMUS Transformer | 4.9 | 89 | 38 | 44 |
| PbGMR-GMUS + RealNVP | 3.8 | 74 | 20 | 32.6 |
| ViT-SEA (Ours) | 0.35 | 10.7 | 0.30 | 3.7 |
SEAFormer's relative error reductions of roughly 88% versus PbGMR-GMUS + RealNVP and 91% versus the GMR-GMUS Transformer establish state-of-the-art performance for high-fidelity physics-based rollouts (Esmati et al., 2024).
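These figures can be checked against the average-MSE column of Table 1:

\[
1 - \frac{3.7}{32.6} \approx 0.887 \quad \text{(vs. PbGMR-GMUS + RealNVP)}, \qquad
1 - \frac{3.7}{44} \approx 0.916 \quad \text{(vs. GMR-GMUS Transformer)}.
\]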
2. SEAFormer for Mobile Visual Recognition: Squeeze-Enhanced Axial Attention
In vision, SEAFormer denotes a family of Squeeze-enhanced Axial Transformer backbones for efficient semantic segmentation, classification, and detection on mobile devices (Wan et al., 2023). The "SEA" block integrates:
Squeeze Axial Attention (SAA):
- Attention computation is linear in the spatial size: the query, key, and value maps are average-pooled ("squeezed") along the horizontal and vertical axes, so attention operates over $H + W$ axial positions rather than all $H \times W$ pixels.
- The context vector for each pixel at position $(i, j)$ combines the corresponding row-wise and column-wise contributions, supplemented by learnable position biases (see the sketch below).
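A simplified PyTorch sketch of the squeeze-axial idea; the position biases and the paper's exact projection layout are omitted, and the head count and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SqueezeAxialAttention(nn.Module):
    """Illustrative squeeze-axial attention: attend over H and W separately
    after average-pooling ("squeezing") the other axis."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        rows = x.mean(dim=3).permute(0, 2, 1)            # squeeze W -> (B, H, C)
        cols = x.mean(dim=2).permute(0, 2, 1)            # squeeze H -> (B, W, C)
        row_ctx, _ = self.row_attn(rows, rows, rows)     # attention along the vertical axis
        col_ctx, _ = self.col_attn(cols, cols, cols)     # attention along the horizontal axis
        # broadcast row context over W and column context over H, then combine
        out = row_ctx.permute(0, 2, 1).unsqueeze(3) + col_ctx.permute(0, 2, 1).unsqueeze(2)
        return out                                        # (B, C, H, W)

x = torch.randn(2, 64, 32, 32)
y = SqueezeAxialAttention(64)(x)
assert y.shape == x.shape
```

Because attention runs over a length-$H$ row sequence and a length-$W$ column sequence instead of all $H \times W$ positions, the cost grows linearly with the spatial size.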
Detail Enhancement (DE):
- Local spatial detail, attenuated by axis-squeezing, is recovered via depthwise separable convolutions and fused with SEA output through elementwise multiplication, as additive fusion is empirically inferior for mIoU.
- The resulting Transformer layer combines SEA attention with DE, followed by residual connections, LayerNorm, an FFN, and a second LayerNorm (a fusion sketch follows below).
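A minimal sketch of the multiplicative detail fusion, assuming a 3×3 depthwise plus pointwise convolution for the local branch; the kernel size, normalization, and activation are illustrative choices.

```python
import torch
import torch.nn as nn

class DetailEnhancement(nn.Module):
    """Recover local detail with a depthwise-separable conv and fuse it
    multiplicatively with the squeeze-axial attention output."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)                  # pointwise conv
        self.act = nn.Sequential(nn.BatchNorm2d(dim), nn.Hardswish())

    def forward(self, x, attn_out):
        # x, attn_out: (B, dim, H, W)
        detail = self.act(self.pw(self.dw(x)))            # local detail branch
        return attn_out * detail                          # elementwise (multiplicative) fusion
```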
Backbone Scaling and Results:
- SEAFormer backbones are provided in Tiny, Small, Base, and Large variants, each tuned by the number of SEA blocks, layer widths, and attention head counts.
- Empirical benchmarks on datasets ADE20K, Cityscapes, Pascal Context, and COCO-Stuff demonstrate state-of-the-art speed/accuracy tradeoffs:
| Backbone | Params | FLOPs | mIoU (ADE20K) | Latency (ms) |
|---|---|---|---|---|
| SeaFormer-T | 1.7 M | 0.6 G | 35.0% | 40 |
| SeaFormer-L | 14.0 M | 6.5 G | 42.7% | 367 |
These results exemplify SEAFormer’s competitive semantic segmentation performance under stringent latency and hardware constraints (Wan et al., 2023).
3. SEAFormer for Large-Scale Combinatorial Optimization: Clustered Proximity and Edge-Aware Transformers
SEAFormer has also been instantiated as a spatial proximity– and edge-aware Transformer for real-world vehicle routing problems (RWVRPs) (Basharzad et al., 27 Jan 2026). This design targets sequence-dependent, constraint-rich optimization at scale, combining two key mechanisms:
1. Clustered Proximity Attention (CPA):
- Efficient sparse attention is realized by partitioning nodes into spatial clusters and restricting attention to a node's local cluster plus the depot, reducing the quadratic $O(N^2)$ attention cost to linear $O(N)$ in the number of nodes.
- Multiple clustering “rounds” (with different mixtures of radial/angular coordinates) preserve global context, and outputs are aggregated.
2. Edge-Aware Module (EAM):
- Edge embeddings over each node's nearest neighbors encode pairwise state information, supporting asymmetric costs and resource constraints; these embeddings are fused residually with the node features (a combined CPA/EAM sketch follows this list).
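A combined sketch of both mechanisms under simplifying assumptions: cluster assignment here is a plain angular binning, neighbor aggregation is a mean over k-nearest-neighbor edge embeddings, and names such as `cluster_attention_mask` are hypothetical rather than the paper's API.

```python
import math
import torch
import torch.nn as nn

def cluster_attention_mask(coords, num_clusters, depot_idx=0):
    """Boolean mask allowing attention only within a node's spatial cluster plus the depot.
    Clustering here is a simple angular binning around the depot (illustrative)."""
    angles = torch.atan2(coords[:, 1], coords[:, 0])                      # (N,)
    cluster = ((angles + math.pi) / (2 * math.pi) * num_clusters).long() % num_clusters
    mask = cluster.unsqueeze(0) == cluster.unsqueeze(1)                   # same-cluster pairs
    mask[:, depot_idx] = True                                             # every node may attend to the depot
    mask[depot_idx, :] = True
    return mask                                                           # (N, N), True = allowed

class EdgeAwareFusion(nn.Module):
    """Residually fuse aggregated nearest-neighbor edge embeddings into node features."""
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.proj = nn.Linear(edge_dim, node_dim)

    def forward(self, node_feats, edge_embeds):
        # node_feats: (N, node_dim); edge_embeds: (N, k, edge_dim) for k nearest neighbors
        return node_feats + self.proj(edge_embeds.mean(dim=1))            # residual fusion
```

The inverted mask (`~mask`) can be passed as the `attn_mask` of a standard attention layer; note that such masking still scores every pair, whereas the actual CPA attends only to gathered in-cluster nodes to obtain the linear cost, and the paper aggregates several clustering rounds over different radial/angular mixtures.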
Empirical Results:
- On classical CVRP and four RWVRP variants, SEAFormer matches or outperforms state-of-the-art neural and heuristic solvers, achieving zero optimality gap on 5K- and 7K-customer benchmarks and becoming the first neural method to solve RWVRPs with more than 1000 nodes effectively.
| Method | Obj (5K) | Gap % | Time (min) |
|---|---|---|---|
| UDC | 139.0 | 0.65 | 15 |
| SEAFormer | 138.1 | 0.00 | 22 |
Ablation studies confirm the necessity of both CPA (for memory) and EAM (for quality, especially under asymmetry) (Basharzad et al., 27 Jan 2026).
4. SEAFormer: Sparse Linear Attention with Estimated Masks
The SEAFormer (SEA: Sparse linear attention with Estimated Attention mask) framework redefines efficient self-attention for long-sequence modeling in language and other domains (Lee et al., 2023):
Mathematical Framework:
- Linear attention (Performer-style): queries and keys are mapped through a kernel feature map $\phi(\cdot)$, so that $\phi(Q)\,(\phi(K)^{\top} V)$ approximates softmax attention in time linear in the sequence length.
- A lightweight CNN/MLP stack estimates a compressed attention mask from the linear-attention features and a projected identity matrix.
- Top-$k$ indices per query are selected from the estimated mask, interpolated back to blocks of the full-resolution sequence, and sparse attention is computed only over these blocks.
- The final output combines the sparse attention result with the linear-attention output, recovering global context (see the sketch below).
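A toy end-to-end sketch of this pipeline: the `elu+1` feature map stands in for Performer features, the mask estimator is reduced to block-pooled keys rather than a CNN, and the 50/50 mixing of the two branches is an illustrative choice rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention with an elu+1 feature map (a simple stand-in for Performer features)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)                      # accumulate keys x values
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

def sparse_topk_attention(q, k, v, block_size=8, topk=2):
    """Estimate importance per key block, keep the top-k blocks for each query,
    and run exact attention only over those blocks. For clarity this toy version
    materializes the full score matrix; a real implementation gathers only the
    selected blocks. Assumes the sequence length is divisible by block_size."""
    B, N, d = k.shape
    kb = k.view(B, N // block_size, block_size, d).mean(dim=2)   # block-pooled keys
    scores = q @ kb.transpose(1, 2) / d ** 0.5                   # (B, N, num_blocks)
    keep = scores.topk(topk, dim=-1).indices                     # top-k blocks per query
    block_mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, keep, True)
    token_mask = block_mask.repeat_interleave(block_size, dim=-1)
    logits = (q @ k.transpose(1, 2)) / d ** 0.5
    logits = logits.masked_fill(~token_mask, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v

# Combine the sparse (local, content-adaptive) and linear (global) branches.
B, N, d = 2, 64, 32
q, k, v = (torch.randn(B, N, d) for _ in range(3))
out = 0.5 * sparse_topk_attention(q, k, v) + 0.5 * linear_attention(q, k, v)
```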
Language Modeling Results:
- On WikiText-2 with an OPT-1.3B backbone, SEAFormer achieves perplexity 13.5, outperforming Performer (30.6) and even the dense teacher (13.9), while using less than half the VRAM (499 MB vs. 1120 MB for vanilla OPT-1.3B).
Interpretability:
- SEAFormer provides a reconstructible, interpretable approximation of the attention matrix, capable of visualizing block- and content-adaptive patterns that static masks (Longformer, BigBird) cannot reproduce (Lee et al., 2023).
5. Implementation and Training Protocols
Key implementation details across SEAFormer variants:
- Physics SEAFormer: mesh autoencoder with 12 layers and 8 heads, no dropout; AdamW optimizer with separate learning rates for the autoencoder and the temporal model.
- Visual SEAFormer: pretrained on ImageNet-1K, then transferred to segmentation in mmseg with Sync-BN, batch sizes of 16–32, a poly learning-rate schedule, and separate weight-decay settings for segmentation and classification.
- Vehicle Routing SEAFormer: encoder with 6 layers and 8 heads, edge embeddings of dimension 32, batch size 64, POMO size 100, a fixed learning rate, FlashAttention kernels, and proactive masking of hard constraints.
- Sparse/Masked SEAFormer: two-phase distillation of a pretrained teacher Transformer into the sparse student, with a multi-component loss, small backbone learning rates, and per-layer knowledge distillation on attention masks and task outputs (a loss sketch follows this list).
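A hedged sketch of what such a per-layer distillation objective might look like, assuming the teacher exposes per-layer attention probabilities; the loss terms, weights, and temperature are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_attn, teacher_attn,
                      alpha=1.0, beta=1.0, temperature=2.0):
    """Multi-component loss: per-layer attention KD plus output-logit KD (illustrative weights)."""
    # Per-layer KD on attention maps (lists of (B, heads, N, N) probability tensors).
    attn_kd = sum(
        F.kl_div(s.clamp_min(1e-9).log(), t, reduction="batchmean")
        for s, t in zip(student_attn, teacher_attn)
    ) / len(student_attn)
    # KD on task outputs via temperature-softened logits.
    out_kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * attn_kd + beta * out_kd
```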
Training times and resource footprints are domain-dependent: the physics SEAFormer requires on the order of hours per dataset on an A100 GPU, while vehicle-routing training uses up to 66 GB of memory per epoch for VRP-1000 (Esmati et al., 2024, Basharzad et al., 27 Jan 2026, Wan et al., 2023, Lee et al., 2023).
6. Comparative Analysis and Thematic Innovations
SEAFormer frameworks are unified by (1) specialized attention operations achieving linear or near-linear computational complexity, (2) explicit domain-tailored mechanisms (cross-field, edge-aware, or cluster-based) for information exchange, and (3) architectures that retain interpretability or global context through novel fusion or mixing strategies. Each instantiation tightly couples efficiency, scalability, and fidelity—whether for mesh-based PDE surrogates, high-resolution vision, multi-thousand-node VRPs, or memory-limited long-sequence modeling.
In summary, SEAFormer is not a single model but a paradigm, leveraging SEA modules to address the respective computational and representation challenges of its target task, consistently setting benchmarks in performance, interpretability, or scalability across diverse research domains (Esmati et al., 2024, Wan et al., 2023, Basharzad et al., 27 Jan 2026, Lee et al., 2023).