
Transolver++ Neural PDE Solver

Updated 23 February 2026
  • Transolver++ is a highly parallel neural solver that efficiently resolves PDEs on meshes with up to millions of points using advanced local-adaptive state aggregation.
  • It introduces an adaptive temperature mechanism, Gumbel-Softmax reparameterization, and feedforward collapse to optimize memory use and improve predictive accuracy.
  • Empirical evaluations demonstrate significant error reductions and near-linear scaling with GPUs in both standard benchmarks and industrial CFD simulations.

Transolver++ is a highly parallel and efficient neural solver designed to accurately resolve partial differential equations (PDEs) on geometries with up to millions of mesh points. Addressing key obstacles in industrial-scale simulation, Transolver++ extends the Transolver architecture by introducing a local-adaptive mechanism for state aggregation, an optimized parallelism framework, and architectural memory reductions. These contributions enable the model to scale to mesh sizes two orders of magnitude larger than previously possible and to achieve superior predictive accuracy on both standard PDE benchmarks and real-world computational fluid dynamics (CFD) tasks (Luo et al., 4 Feb 2025).

1. Architectural Innovations

Transolver++ builds upon the “Physics-Attention” approach of Transolver, wherein $M$ physical states $s_j \in \mathbb{R}^C$ are learned from $N$ point features $\{x_i \in \mathbb{R}^C\}$ using slice weights $w_{ij}$:

$$w_{ij} = \mathrm{Softmax}_j\!\left(\frac{W_s x_i}{T_0}\right),\qquad s_j = \frac{\sum_{i=1}^N w_{ij}\, x_i}{\sum_{i=1}^N w_{ij}}$$

Standard attention is then applied to the states, followed by deslicing to obtain updated point features.
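The slice–attend–deslice cycle can be illustrated with a minimal NumPy sketch. The shapes and random weights below are purely illustrative (the names `W_s`, `T0` follow the equations above; this is not the authors' implementation, and the attention step over states is elided):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, C, T0 = 1000, 8, 16, 1.0   # illustrative sizes

x = rng.normal(size=(N, C))       # point features {x_i}
W_s = rng.normal(size=(C, M))     # slice projection

# Slice weights: row-wise softmax over the M states
logits = x @ W_s / T0                              # [N, M]
w = np.exp(logits - logits.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)                  # w_ij, rows sum to 1

# State aggregation: weighted mean of point features per state
s = (w.T @ x) / w.sum(axis=0)[:, None]             # [M, C]

# ...standard attention would update s here...

# Deslice: broadcast states back to points
x_new = w @ s                                      # [N, C]
```

Because attention runs over the $M$ states rather than the $N$ points, its cost is independent of mesh size; the slicing and deslicing steps are linear in $N$.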

Scaling Transolver directly to $N \geq 10^6$ introduces degeneration (uniform slice weights akin to pooling) and memory bottlenecks, as storing per-point embeddings becomes prohibitive. Transolver++ introduces three key architectural elements:

  1. Local-Adaptive Temperature (“Ada-Temp”): Each point learns its own Softmax temperature $\tau_i = T_0 + \mathrm{Linear}_T(x_i)$, so the slice weight computation becomes

$$w_{ij} = \mathrm{Softmax}_j\!\left(\frac{W_s x_i - \log(-\log \epsilon_{ij})}{\tau_i}\right),\qquad \epsilon_{ij} \sim \mathrm{Uniform}(0,1)$$

  2. Slice Reparameterization (Gumbel-Softmax): The sampled noise yields sharper, locally-adaptive (“eidetic”) physical states, enabling the model to represent rapid or slow physics variations by adjusting slice-assignment sharpness per region.
  3. Feedforward Collapse: Unlike Transolver, which used two projections $x \rightarrow \{x, f(x)\}$, Transolver++ eliminates $f(x)$, reducing per-point memory by half without loss of accuracy.
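The Ada-Temp and Gumbel-Softmax elements combine into a single slice-weight computation, sketched below in NumPy. The linear head `W_T` producing the temperature offset and the clamp keeping $\tau_i > 0$ are illustrative assumptions, not details from the source:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, C, T0 = 500, 8, 16, 1.0    # illustrative sizes

x = rng.normal(size=(N, C))
W_s = rng.normal(size=(C, M))    # slice projection
W_T = rng.normal(size=(C,)) * 0.01  # hypothetical Linear_T head

# Ada-Temp: per-point temperature, clamped positive (assumed safeguard)
tau = T0 + np.maximum(x @ W_T, -0.9)           # [N]

# Gumbel noise: -log(-log eps) with eps ~ Uniform(0, 1)
eps = rng.uniform(1e-9, 1.0, size=(N, M))
gumbel = -np.log(-np.log(eps))

# Slice weights with reparameterized, temperature-scaled softmax
logits = (x @ W_s + gumbel) / tau[:, None]     # [N, M]
w = np.exp(logits - logits.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
```

A small $\tau_i$ concentrates each row of `w` on few states (sharp, "eidetic" assignment); a large $\tau_i$ spreads it out (pooling-like behavior).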

A fully optimized parallelism and data partitioning infrastructure is implemented, as detailed in the next section.

2. Parallelism and Scalability

Transolver++ achieves high parallel efficiency via a point-wise parallel data distribution and communication-optimized aggregation. The computation for each Physics-Attention block per GPU kk proceeds as follows:

τ^(k)  ← T0 + Linear_T(x^(k))                       # Ada-Temp
w^(k)  ← GumbelSoftmax(Linear_s(x^(k)) / τ^(k))     # [N_k × M]
n^(k)  ← w^(k).T × x^(k)                            # [M × C]
d^(k)  ← sum over rows of w^(k)                     # [M]
n ← AllReduce_sum(n^(k));  d ← AllReduce_sum(d^(k)) # [M × C], [M]
s  ← n / d[:, None]                                 # [M × C]
s' ← Attention(s)                                   # [M × C]
x'^(k) ← w^(k) × s'                                 # [N_k × C]
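The key property of the steps above is that each GPU only exchanges the small partial sums $n^{(k)}$ and $d^{(k)}$. A single-process NumPy simulation (the partitioning into `K` "ranks" and all names are illustrative; a real implementation would use e.g. NCCL all-reduce) confirms the partitioned aggregation matches the single-device result:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, C, K = 1200, 8, 16, 4      # illustrative sizes, K simulated "GPUs"

x = rng.normal(size=(N, C))
w = rng.uniform(size=(N, M))
w /= w.sum(axis=1, keepdims=True)   # slice weights, rows sum to 1

# Uniform point partition across ranks
parts = np.array_split(np.arange(N), K)

# Per-rank partials, then simulated AllReduce_sum (just a Python sum here)
n = sum(w[p].T @ x[p] for p in parts)     # [M, C]
d = sum(w[p].sum(axis=0) for p in parts)  # [M]
s = n / d[:, None]                        # global states, [M, C]

# Reference: same aggregation computed on one device
s_ref = (w.T @ x) / w.sum(axis=0)[:, None]
```

Only `n` and `d` cross the device boundary, giving the $O(MC + M)$ communication volume per layer regardless of mesh size.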

  • Communication: Each layer all-reduces only $O(MC + M)$ values per GPU, independent of $N$.
  • Load-Balancing: The mesh points are partitioned uniformly, and all necessary weights and features are co-located.
  • Comparison: Traditional tensor/model parallelism and ring attention require $O(N)$ or $O(N^2)$ communication, in contrast to the $O(MC)$ pattern in Transolver++.

This design supports nearly linear scaling with the number of GPUs, as demonstrated empirically up to 32 GPUs.

3. Mathematical Formulation and Complexity

Transolver++ is applicable to a class of steady and time-dependent PDEs, including:

  • Navier–Stokes (2D vorticity),
  • Euler (transonic over airfoils),
  • Stokes/Darcy (elliptic porous flow),
  • Linear/nonlinear elasticity,
  • Plastic forging,
  • Incompressible pipe flow.

States and updates are formalized as follows for $X = [x_1; \ldots; x_N] \in \mathbb{R}^{N \times C}$:

  • Ada-Temp: $\tau_i = T_0 + W_T x_i$
  • Gumbel-Softmax Slice:

$$w_{ij} = \frac{\exp\!\big((W_s x_i + \gamma_{ij})/\tau_i\big)}{\sum_{\ell=1}^M \exp\!\big((W_s x_i + \gamma_{i\ell})/\tau_i\big)},\qquad \gamma_{ij} = -\log(-\log \epsilon_{ij}),\; \epsilon_{ij} \sim \mathrm{Uniform}(0,1)$$

  • State Aggregation: $s_j = \dfrac{\sum_{i=1}^N w_{ij} x_i}{\sum_{i=1}^N w_{ij}}$
  • Attention: $s' = \mathrm{Softmax}(QK^T/\sqrt{C})\, V$
  • Deslice: $\hat{X} = W s'$
  • Loss: Relative L2 for standard PDEs; task-specific composite losses for industrial cases.
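The relative L2 metric used on the standard benchmarks can be written as a one-liner; the function name below is illustrative:

```python
import numpy as np

def relative_l2(pred, true):
    """Relative L2 error: ||pred - true||_2 / ||true||_2."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.linalg.norm(pred - true) / np.linalg.norm(true)
```

Normalizing by the target's norm makes errors comparable across PDEs whose solution magnitudes differ by orders of magnitude.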

Complexity per GPU:

  • Pointwise and slice operations: $O(N_k C^2)$
  • Local state aggregation: $O(N_k M C)$; communication: $O(MC)$
  • Overall memory: $O(N_k C + MC)$
  • Total runtime and memory scale linearly with the global $N$
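A back-of-envelope check makes the communication bound concrete. The sizes below are assumptions drawn from the industrial setting described later (2.5M points, 4 GPUs, $C = 128$, $M = 64$, fp16):

```python
# Illustrative sizes (assumed, not prescribed by the source)
N, K, C, M = 2_500_000, 4, 128, 64

per_gpu_points = N // K                 # N_k under a uniform partition
allreduce_vals = M * C + M              # O(MC + M) values per layer
allreduce_bytes = allreduce_vals * 2    # fp16 = 2 bytes per value
```

The all-reduce payload is about 16 KB per layer, fixed as the mesh grows; only the per-GPU point count `per_gpu_points` scales with $N$.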

4. Local Adaptive Mechanism and Eidetic States

Transolver++'s Ada-Temp module adapts each mesh point’s slice assignment sharpness:

  • Small $\tau_i$ in fast-varying physics sharpens $w_{i\cdot}$, enabling representations over multiple distinct states.
  • Large $\tau_i$ in slow-varying regions smooths assignments, effectively pooling over similar states.

Empirical analysis reveals a 10–100× increase in the Kullback–Leibler divergence between $w_{i\cdot}$ and the uniform distribution compared to Transolver, confirming highly non-uniform, locally-adaptive slice behavior. This enables the solver to maintain fidelity in regions with sharp physical transitions while leveraging smoothing where appropriate.
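The diagnostic behind this claim, $\mathrm{KL}(w_{i\cdot} \,\|\, \mathrm{Uniform}(M))$, is easy to compute; the sketch below (function name and example weights are illustrative) shows that sharper slice assignments score higher:

```python
import numpy as np

def kl_to_uniform(w):
    """KL divergence of a slice-weight distribution from Uniform(1/M)."""
    w = np.asarray(w, dtype=float)
    M = w.shape[-1]
    w = np.clip(w, 1e-12, None)  # guard log(0)
    # KL(w || 1/M) = sum_j w_j * log(w_j * M)
    return np.sum(w * np.log(w * M), axis=-1)

sharp = np.array([0.97, 0.01, 0.01, 0.01])  # eidetic assignment
flat = np.array([0.25, 0.25, 0.25, 0.25])   # pooling-like assignment
```

A uniform row scores zero, so large values indicate the non-degenerate, locally sharp behavior Transolver++ is designed to preserve.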

5. Empirical Evaluation

5.1 Standard PDE Benchmarks

Transolver++ is evaluated on six standard benchmarks, showing consistent improvements in relative L2 error compared to both Transolver and GNOT:

| Benchmark  | GNOT   | Transolver | Transolver++ | % Improvement |
|------------|--------|------------|--------------|---------------|
| Elasticity | 0.0086 | 0.0064     | 0.0052       | –18.8%        |
| Plasticity | 0.0336 | 0.0013     | 0.0011       | –15.3%        |
| Airfoil    | 0.0076 | 0.0053     | 0.0048       | –10.2%        |
| Pipe       | 0.0047 | 0.0033     | 0.0027       | –12.9%        |
| NS2D       | 0.1380 | 0.0900     | 0.0719       | –13.4%        |
| Darcy      | 0.0105 | 0.0058     | 0.0049       | –12.3%        |

The average relative error reduction is 13%.

5.2 Million-Scale Industrial Simulation

On high-fidelity CFD meshes for car (DrivAerNet++) and aircraft:

  • Mesh sizes: 2.5M points (DrivAerNet++ Full) and 0.3M points (Aircraft)
  • Hardware: 4 × NVIDIA A100 40 GB GPUs
  • Runtime: 10–15% faster per batch than baselines
  • Transolver++ reduces surface/volume L2 error, drag, and lift coefficient errors, up to –62.2% (lift, Aircraft) and –41.0% (drag, DrivAerNet++ Full)

| Dataset              | Metric     | Transolver | Transolver++ | Rel. Improvement |
|----------------------|------------|------------|--------------|------------------|
| DrivAerNet++ Full    | Vol. L2    | 0.173      | 0.154        | –11.0%           |
|                      | Surf. L2   | 0.167      | 0.146        | –12.6%           |
|                      | Drag error | 0.061      | 0.036        | –41.0%           |
| DrivAerNet++ Surface | Surf. L2   | 0.145      | 0.110        | –24.1%           |
|                      | Lift error | 0.037      | 0.014        | –62.2%           |
| Aircraft (~300K)     | Surf. L2   | 0.145      | 0.110        | –24.1%           |
|                      | Lift error | 0.037      | 0.014        | –62.2%           |

6. Experimental Design and Infrastructure

  • Data: Standard meshes (Li et al., Wu et al.) and high-fidelity CFD meshes for industrial tasks.
  • Hyperparameters: AdamW optimizer, learning rate $10^{-3}$, weight decay 0.01, 8 layers (benchmarks), 4 layers (industrial), $C = 128$ or $256$, $M = 32$ or $64$ slices, 4 attention heads, batch size 8 (benchmarks) or 1–2 (industrial).
  • Software: PyTorch with NCCL AllReduce, mixed precision (FP16/TF32), tested on NVIDIA A100 40 GB GPUs.
  • Baselines: Baselines incapable of million-scale meshes are compared on subsampled 50K-point data; Transolver++ runs on full meshes.

7. Limitations and Prospects

  • Extension to real-time and long-time integration of time-dependent PDEs remains open.
  • The fixed number of slices MM may be refined by learning MM dynamically.
  • Lessened reliance on high-cost CFD data through PINN-style loss functions is identified as a future direction.
  • Current parallelism assumes homogeneous hardware; model parallel extensions could address scenarios with even larger channel dimensions.
  • Handling fluid–structure interactions and complex moving boundaries is yet to be addressed.

Transolver++ serves as a scalable backbone for neural PDE solvers, advancing the field toward industrial-grade, foundation-level computational physics architectures by delivering linear scaling, significant error reduction, and backbone suitability for future research (Luo et al., 4 Feb 2025).
