MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers

Published 3 Jun 2026 in cs.LG and math.NA | (2606.04366v1)

Abstract: Conventional patchified Transformers operate on uniform spatial partitions, distributing computational effort evenly across the domain irrespective of local features. This inflexible tokenization scheme is inherently limited in its ability to efficiently represent and process solutions to complex PDEs. To address this, we propose MeshTok, an adaptive mesh refinement (AMR)-inspired tokenization and sequence modeling framework. This method selectively refines spatial regions exhibiting sharp gradients, transient features, or multiscale structures, generating a heterogeneous set of multiscale tokens defined on a fixed simulation grid. These tokens are processed within a unified Transformer sequence, enabling the model to simultaneously capture coarse-grained global context and fine-grained local details without requiring specialized architectural components. Although adaptive refinement moderately increases token count, it promotes a more targeted allocation of computational resources to physically informative regions, which we view as a practical inductive bias rather than a formal optimality guarantee. Experimental evaluations across multiple PDE families and benchmark datasets demonstrate that MeshTok consistently improves the efficiency-accuracy trade-off compared to uniform-grid baselines. This suggests adaptive multiscale tokenization as a scalable and generalizable design principle for neural PDE modeling. Code is available at https://github.com/SCAILab-USTC/MeshTok.


Authors (5)

Summary

The paper’s main contribution is the introduction of an adaptive, activity-based tokenization strategy that minimizes token collisions and enhances PDE solver accuracy.
It leverages AMR-inspired refinement to allocate computational resources to regions of high gradient complexity, significantly reducing runtime and error rates.
Empirical results show lower relative $\ell_2$ error and favorable scaling in both 2D and initial 3D experiments compared to uniform patch-based Transformer baselines.

MeshTok: Adaptive Multi-Scale Tokenization for Transformer-Based PDE Modeling

Motivation and Technical Context

Patch-based Transformers for PDE modeling typically tokenize the field on a fixed uniform grid, so compute is spread evenly regardless of local complexity. PDE solutions, however, are often spatially heterogeneous: smooth regions coexist with turbulent or multiscale regions that need higher resolution. Because self-attention cost grows quadratically with the number of tokens, dense uniform tokenization over-resolves low-complexity areas and limits practical scalability.

MeshTok addresses these bottlenecks by drawing on adaptive mesh refinement (AMR) principles. It partitions input fields into multi-scale tokens, refining spatial resolution only in regions of high activity (e.g., sharp gradients, transient structures, and multiscale features). Critically, MeshTok operates on a fixed simulation grid, which keeps it compatible with existing PDE datasets and standard Transformer architectures. The unified token sequence, composed of coarse and fine tokens, is fed into a Transformer backbone to capture both global context and localized phenomena.

Model Architecture

MeshTok is structured as a two-stage pipeline:

Indicator-Guided Refinement: An activity-based indicator, computed from a combination of local gradient magnitude and Laplacian energy, scores each coarse patch for refinement. A refinement ratio $k$ sets the fraction of patches to refine: the top-scoring fraction $k$ is subdivided, typically into $2\times2$ fine tokens in 2D or $2\times2\times2$ in 3D. The indicator is recomputed during autoregressive rollout, so the refinement pattern adapts as the solution evolves.
Unified Transformer Backbone: Both coarse and fine tokens are embedded and merged into a single sequential representation. A geometry-aware, resolution-sensitive FiLM-based positional encoding incorporating spatial coordinates and refinement depth stabilizes optimization and maintains cross-resolution locality. Block-causal self-attention is enforced along the temporal axis, preserving autoregressive structure while enabling dense spatial attention.
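The first stage can be illustrated with a minimal sketch. It is not the authors' implementation: the equal weighting of gradient magnitude and squared Laplacian, the clamped boundary treatment, the function names `patch_scores` and `refine`, and the choice to replace each refined coarse patch with four fine tokens are all assumptions made for illustration.

```python
from math import sqrt

def patch_scores(field, patch):
    """Score each coarse patch by mean gradient magnitude plus mean squared Laplacian.

    The equal weighting of the two terms is an assumption, not taken from the paper.
    """
    h, w = len(field), len(field[0])
    scores = {}
    for pi in range(0, h, patch):
        for pj in range(0, w, patch):
            total = 0.0
            for i in range(pi, pi + patch):
                for j in range(pj, pj + patch):
                    # Neighbour indices are clamped at the domain boundary (assumed).
                    up, dn = max(i - 1, 0), min(i + 1, h - 1)
                    lf, rt = max(j - 1, 0), min(j + 1, w - 1)
                    gx = 0.5 * (field[i][rt] - field[i][lf])
                    gy = 0.5 * (field[dn][j] - field[up][j])
                    lap = (field[up][j] + field[dn][j] + field[i][lf]
                           + field[i][rt] - 4.0 * field[i][j])
                    total += sqrt(gx * gx + gy * gy) + lap * lap
            scores[(pi // patch, pj // patch)] = total / (patch * patch)
    return scores

def refine(field, patch=4, k=0.25):
    """Return tokens as (row, col, size, depth), refining the top fraction k of patches."""
    scores = patch_scores(field, patch)
    n_refine = round(k * len(scores))
    chosen = set(sorted(scores, key=scores.get, reverse=True)[:n_refine])
    half = patch // 2
    tokens = []
    for (bi, bj) in sorted(scores):
        r, c = bi * patch, bj * patch
        if (bi, bj) in chosen:
            # Assumed: a refined patch is replaced by a 2x2 block of fine tokens.
            tokens += [(r + di, c + dj, half, 1) for di in (0, half) for dj in (0, half)]
        else:
            tokens.append((r, c, patch, 0))
    return tokens
```

Calling `refine` again on each predicted frame would mirror the recomputation of the indicator during rollout. Each token carries its origin, size, and refinement depth, which is the information a resolution-aware positional encoding would need.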

Tokens are decoded to reconstruct coarse and fine spatial patches, which are subsequently fused via a lightweight CNN. This ensures global coherence while allowing fine-scale corrections in locally refined regions.
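The block-causal attention pattern used by the backbone can be sketched as a boolean mask. This is an illustrative construction only: the function name `block_causal_mask` and the per-frame token counts are hypothetical, and the paper's actual masking code may differ. Because refinement is adaptive, the number of tokens per frame is allowed to vary.

```python
def block_causal_mask(tokens_per_frame):
    """Build mask[i][j], True when query token i may attend to key token j.

    Tokens attend densely within their own time frame and to all earlier
    frames, but never to later frames.
    """
    # Frame index of every token in the flattened sequence.
    frame_of = [t for t, n in enumerate(tokens_per_frame) for _ in range(n)]
    return [[fj <= fi for fj in frame_of] for fi in frame_of]

# Two frames with 2 and 3 tokens (hypothetical counts).
mask = block_causal_mask([2, 3])
```

Here the first two tokens see only each other, while the three tokens of the second frame see all five tokens.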

Theoretical Analysis

MeshTok's theoretical analysis highlights the representational benefits of AMR-inspired tokenization. Lower bounds indicate that coarse tokenization inevitably discards information needed to distinguish high-complexity regions, independently of backbone capacity. The theorems state that AMR refinement strictly enlarges the realizable function class under compatible encoder-decoder assumptions. If the refinement indicator aligns with local complexity density, MeshTok approaches the optimal approximation error attainable for a given token budget. The efficiency analysis reports that partial AMR (e.g., $k=0.25$) reduces self-attention cost to $\approx 19\%$ of full fine tokenization, with negligible loss in accuracy.
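The $\approx 19\%$ figure can be reproduced with simple arithmetic under one plausible accounting, which is an assumption rather than something stated in the summary: each refined coarse patch is replaced by its fine children (4 in 2D), and attention cost scales with the square of the sequence length. With $N$ coarse patches, partial refinement then yields $(1-k)N + 4kN$ tokens versus $4N$ for full refinement.

```python
def attention_cost_ratio(k, children=4):
    """Attention cost of partial refinement relative to full fine tokenization.

    Assumes each refined patch is replaced by `children` fine tokens and that
    attention cost is proportional to the squared number of tokens.
    """
    partial_tokens = (1.0 - k) + k * children  # tokens per coarse patch
    full_tokens = float(children)
    return (partial_tokens / full_tokens) ** 2

print(attention_cost_ratio(0.25))  # 0.19140625
```

For $k=0.25$ this gives $(1.75/4)^2 \approx 0.19$. If coarse tokens were instead kept alongside their fine children, the same calculation would give $(2/4)^2 = 0.25$, so the replacement model is the one consistent with the reported number.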

Experimental Evaluation

Multi-Task Generalization

MeshTok is pretrained across diverse PDE benchmarks (PDEBench, PDENNEval, The Well) normalized to 128×128 resolution with channel padding and noise augmentation. It consistently achieves the lowest relative $\ell_2$ errors on challenging parametric fluid dynamics and reaction-diffusion tasks, outperforming uniform patch-based Transformer baselines (ViT, DPOT, MPP), neural operators (DeepONet, FNO), and recent PDE foundation models (MoE-POT, BCAT). Improvements are attributed to multi-resolution encoding, block-causal temporal self-attention, and adaptive refinement.
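For reference, the relative $\ell_2$ error is conventionally the norm of the prediction error divided by the norm of the target. The sketch below uses that standard definition on flattened fields; how the paper averages over samples, channels, or time steps is not stated here, so that part is left out.

```python
from math import sqrt

def relative_l2(pred, target):
    """Relative l2 error ||pred - target||_2 / ||target||_2 for flattened fields."""
    num = sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))
    den = sqrt(sum(t ** 2 for t in target))
    return num / den

print(relative_l2([0.0, 0.0], [3.0, 4.0]))  # 1.0
```

A value of 0 means a perfect prediction, and a value of 1 means the error is as large as the target itself, as in the all-zero prediction above.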

Downstream Fine-Tuning and Transfer Learning

MeshTok demonstrates robust transferability via downstream fine-tuning on unseen PDE families (e.g., Black-Scholes-Barenblatt, Reaction-Diffusion, Shear Flow). Pretraining yields significant reductions in sample complexity and error, generalizing more effectively than BCAT, especially in regimes with moderate distribution shifts.

Scaling and Efficiency

Scaling studies across SMALL, BIG, and LARGE configurations indicate that the gains from AMR refinement persist across model capacities. MeshTok’s partial refinement regime captures most of the accuracy benefit of full refinement at a favorable runtime, for example reaching near full-refinement accuracy at less than one-third of the runtime cost. Compute-matched comparisons indicate that adaptive token allocation, not merely a larger token count or parameter count, drives the improved efficiency-accuracy trade-off.

3D Extension

A supplementary 3D magnetohydrodynamics (MHD) experiment corroborates MeshTok’s accuracy-efficiency gains. Here, partial refinement runs three times faster than full refinement while still achieving a significant decrease in relative error, which suggests the approach scales beyond 2D structured grids.

Ablation and Robustness

MeshTok’s refinement strategy is robust to random initialization, hyperparameters, and indicator choice. Activity-based and a posteriori error estimators attain comparable prediction accuracy, with activity-based refinement preferred for computational efficiency. FiLM-based positional encoding outperforms learnable and sinusoidal alternatives across both one-step and long-horizon rollouts.
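As a point of reference for the positional-encoding ablation, FiLM (feature-wise linear modulation) conditions a feature vector through a learned scale and shift. The sketch below applies that idea to a token embedding conditioned on spatial coordinates and refinement depth. The linear conditioning maps, the `1 + scale` parameterization, and all names (`film_position_encoding`, `w_scale`, `w_shift`) are assumptions for illustration; the paper's actual encoding may use a different network and inputs.

```python
def film_position_encoding(embedding, position, w_scale, w_shift):
    """Modulate a token embedding with FiLM: scale * embedding + shift.

    `position` is a conditioning vector, here (x, y, depth). `w_scale` and
    `w_shift` are hypothetical weight matrices with one row per embedding
    channel; a real model would learn them, possibly through an MLP.
    """
    scale = [1.0 + sum(w * p for w, p in zip(row, position)) for row in w_scale]
    shift = [sum(w * p for w, p in zip(row, position)) for row in w_shift]
    return [s * e + b for s, e, b in zip(scale, embedding, shift)]

# A 2-channel embedding for a fine token (depth 1) at normalized position (0.5, 0.25).
out = film_position_encoding(
    embedding=[1.0, 2.0],
    position=(0.5, 0.25, 1.0),
    w_scale=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    w_shift=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
)
print(out)  # [1.5, 3.0]
```

Because the refinement depth enters the conditioning vector, coarse and fine tokens at the same location receive different modulation, which is one way an encoding can be made resolution-sensitive.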

Limitations and Future Directions

MeshTok's current design targets structured Cartesian grids, making extension to unstructured meshes and irregular geometries nontrivial. Uniform refinement is suboptimal for small architectures due to capacity constraints, while large models exhibit diminishing marginal gains from adaptivity. Future research could focus on scale-aware refinement hierarchies, optimizing refinement jointly with backbone capacity, and incorporating uncertainty-aware refinement policies for safety-critical modeling.

Practical and Theoretical Implications

MeshTok establishes adaptive multi-scale tokenization as a scalable inductive bias for neural PDE modeling, demonstrating that selectively allocating computational resources accelerates inference and improves accuracy under fixed compute budgets. Its generalizable principle could benefit high-fidelity surrogate simulation in scientific domains such as climate modeling, materials science, and engineering design, where repeated PDE solves are computationally burdensome.

Theoretically, MeshTok’s AMR-inspired adaptivity enlarges the function class representable by Transformer-based neural operators, with guarantees on function-space coverage under budgeted refinement. Its reduction of attention cost paves the way for more scalable PDE foundation models.

Conclusion

MeshTok proposes a systematic multi-scale tokenization framework integrating adaptive mesh refinement with Transformer backbones for scalable PDE operator learning (2606.04366). By combining activity-guided spatial refinement, coherent positional encoding, and block-causal temporal modeling, MeshTok achieves compelling trade-offs between efficiency and predictive accuracy across heterogeneous PDE benchmarks and downstream transfer settings. Its design supports rapid, accurate surrogate simulation, scalable pretraining, and favorable generalization properties under diverse physical regimes. Extensions to irregular meshes, multilevel refinement, and uncertainty-sensitive modeling remain fruitful directions for advancing adaptive neural PDE solvers.