Transolver++ Neural PDE Solver

Updated 23 February 2026

Transolver++ is a highly parallel neural solver that efficiently resolves PDEs on meshes with up to millions of points using advanced local-adaptive state aggregation.
It introduces an adaptive temperature mechanism, Gumbel-Softmax reparameterization, and feedforward collapse to optimize memory use and improve predictive accuracy.
Empirical evaluations demonstrate significant error reductions and near-linear scaling with GPUs in both standard benchmarks and industrial CFD simulations.

Transolver++ is a highly parallel and efficient neural solver designed to accurately resolve partial differential equations (PDEs) on geometries with up to millions of mesh points. Addressing key obstacles in industrial-scale simulation, Transolver++ extends the Transolver architecture by introducing a local-adaptive mechanism for state aggregation, an optimized parallelism framework, and architectural memory reductions. These contributions enable the model to scale to mesh sizes two orders of magnitude larger than previously possible and to achieve superior predictive accuracy on both standard PDE benchmarks and real-world computational fluid dynamics (CFD) tasks (Luo et al., 4 Feb 2025).

1. Architectural Innovations

Transolver++ builds upon the “Physics-Attention” approach of Transolver, wherein $M$ physical states $s_j \in \mathbb{R}^C$ are learned from $N$ point features $\{x_i \in \mathbb{R}^C\}$ using slice weights $w_{ij}$:

$w_{ij} = \mathrm{Softmax}_j\left(\frac{W_s x_i}{T_0}\right),\quad s_j = \frac{\sum_{i=1}^N w_{ij}\, x_i}{\sum_{i=1}^N w_{ij}}$

Standard attention is then applied to the states, followed by deslicing to obtain updated point features.

Scaling Transolver directly to $N\geq10^6$ causes two problems: the slice weights degenerate toward uniform values, so slicing behaves like average pooling, and storing per-point embeddings becomes a memory bottleneck. Transolver++ introduces three key architectural elements:

Local-Adaptive Temperature (“Ada-Temp”): Each point learns its own Softmax temperature $\tau_i=T_0+\mathrm{Linear}_T(x_i)$, replacing the single global temperature $T_0$.
Slice Reparameterization (Gumbel-Softmax): The slice weights are computed with the Gumbel-Softmax trick, which perturbs the logits with Gumbel noise before the temperature-scaled Softmax:

$w_{ij} = \mathrm{Softmax}_j\left(\frac{W_s x_i - \log(-\log \epsilon_{ij})}{\tau_i}\right),\quad \epsilon_{ij} \sim \operatorname{Uniform}(0,1)$

Combined with Ada-Temp, this yields sharper, locally adaptive (“eidetic”) physical states, letting the model represent fast or slow physics variations by adjusting the sharpness of slice assignment per region.
Feedforward Collapse: Unlike Transolver, which used two projections $x \rightarrow \{x, f(x)\}$ , Transolver++ eliminates $f(x)$ , reducing per-point memory by half without loss of accuracy.
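The effect of the per-point temperature on the Gumbel-Softmax slice weights can be illustrated in plain Python. This is a minimal sketch rather than the authors' implementation: the function name `gumbel_softmax_weights`, the toy logits, and the clamping of $\epsilon$ away from 0 and 1 are assumptions made here for illustration and numerical safety.

```python
import math
import random

def gumbel_softmax_weights(logits, tau, rng):
    """Slice weights w_i. for one mesh point (hypothetical helper).

    logits: the M entries of W_s x_i; tau: that point's temperature tau_i.
    """
    noisy = []
    for z in logits:
        # eps ~ Uniform(0, 1), clamped so both logarithms stay finite (assumption).
        eps = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Perturbed logit: (z - log(-log eps)) / tau, as in the formula above.
        noisy.append((z - math.log(-math.log(eps))) / tau)
    peak = max(noisy)                       # subtract the max for stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 0.0, -1.0]              # toy W_s x_i with M = 4 slices
# Re-seeding gives both calls the same noise, so only tau differs.
sharp = gumbel_softmax_weights(logits, tau=0.1, rng=random.Random(0))
smooth = gumbel_softmax_weights(logits, tau=10.0, rng=random.Random(0))
print("tau=0.1 :", [round(w, 3) for w in sharp])
print("tau=10.0:", [round(w, 3) for w in smooth])
```

With identical noise, the smaller temperature concentrates the weight on one slice, while the larger temperature spreads it across slices.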

Transolver++ also implements an optimized parallelism and data-partitioning framework, detailed in the next section.

2. Parallelism and Scalability

Transolver++ achieves high parallel efficiency via a point-wise parallel data distribution and communication-optimized aggregation. The computation for each Physics-Attention block per GPU $k$ proceeds as follows:

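τ^(k) ← T0 + Linear_T(x^(k))                      # Ada-Temp, [N_k × 1]
w^(k) ← GumbelSoftmax(Linear_s(x^(k)) / τ^(k))    # slice weights, [N_k × M]
n^(k) ← w^(k).T × x^(k)                           # local numerator, [M × C]
d^(k) ← Σ_i w^(k)[i, :]                           # local denominator, summed over points, [M]
n ← AllReduce_sum(n^(k))                          # global numerator, [M × C]
d ← AllReduce_sum(d^(k))                          # global denominator, [M]
s ← n / d[:, None]                                # physical states, [M × C]
s' ← Attention(s)                                 # attention among states, [M × C]
x'^(k) ← w^(k) × s'                               # deslice, [N_k × C]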

Communication: Each layer all-reduces only $O(MC+M)$ values per GPU, independent of $N$ .
Load-Balancing: The mesh points are partitioned uniformly, and all necessary weights and features are co-located.
Comparison: Traditional tensor/model parallelism and ring-attention require $O(N)$ or $O(N^2)$ communication, in contrast to the $O(MC)$ pattern in Transolver++.

This design supports nearly linear scaling with the number of GPUs, as demonstrated empirically up to 32 GPUs.
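The reason the exchanged data does not grow with $N$ is that the state numerator and denominator are plain sums over points, so per-shard partial sums can be added together. The sketch below simulates this in a single process with plain Python lists. The helpers `shard_sums`, `all_reduce_sum`, and `states` are hypothetical stand-ins (a real run would use a collective such as NCCL all-reduce), and the random toy weights replace the learned slice weights.

```python
import random

def shard_sums(weights, feats):
    """Local numerator [M x C] and denominator [M] for one shard of points."""
    m, c = len(weights[0]), len(feats[0])
    num = [[0.0] * c for _ in range(m)]
    den = [0.0] * m
    for w_i, x_i in zip(weights, feats):
        for j, w in enumerate(w_i):
            den[j] += w
            for ch, x in enumerate(x_i):
                num[j][ch] += w * x
    return num, den

def all_reduce_sum(parts):
    """Stand-in for an all-reduce: element-wise sum of the per-shard results."""
    nums, dens = zip(*parts)
    m, c = len(dens[0]), len(nums[0][0])
    num = [[sum(n[j][ch] for n in nums) for ch in range(c)] for j in range(m)]
    den = [sum(d[j] for d in dens) for j in range(m)]
    return num, den

def states(num, den):
    """Physical states s_j = numerator_j / denominator_j."""
    return [[v / d for v in row] for row, d in zip(num, den)]

rng = random.Random(0)
N, M, C, K = 12, 3, 2, 4                    # toy sizes; K simulated GPUs
feats = [[rng.random() for _ in range(C)] for _ in range(N)]
raw = [[rng.random() + 1e-3 for _ in range(M)] for _ in range(N)]
weights = [[v / sum(row) for v in row] for row in raw]   # each row sums to 1

single = states(*shard_sums(weights, feats))             # all points on one device
size = N // K
parts = [shard_sums(weights[k * size:(k + 1) * size],
                    feats[k * size:(k + 1) * size]) for k in range(K)]
sharded = states(*all_reduce_sum(parts))                 # same states from K shards
payload = M * C + M                                      # values exchanged per shard
print("values all-reduced per shard:", payload)
```

The sharded result matches the single-device result up to floating-point rounding, and the payload depends only on $M$ and $C$.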

3. Mathematical Formulation and Complexity

Transolver++ is applicable to a class of steady and time-dependent PDEs, including:

Navier–Stokes (2D vorticity),
Euler (transonic over airfoils),
Stokes/Darcy (elliptic porous flow),
Linear/nonlinear elasticity,
Plastic forging,
Incompressible pipe flow.

States and updates are formalized as follows for $X=[x_1;\ldots;x_N] \in \mathbb{R}^{N\times C}$ :

Ada-Temp: $\tau_i = T_0 + W_T x_i$
Gumbel-Softmax Slice:

$w_{ij} = \frac{\exp(((W_s x_i)_j + \gamma_{ij})/\tau_i)}{\sum_{\ell=1}^M \exp(((W_s x_i)_\ell + \gamma_{i\ell})/\tau_i)},\quad \gamma_{ij} = -\log(-\log \epsilon_{ij}),\; \epsilon_{ij} \sim \mathrm{Uniform}(0,1)$

State Aggregation: $s_j = \frac{\sum_{i=1}^N w_{ij} x_i}{\sum_{i=1}^N w_{ij}}$
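Attention: $s' = \mathrm{Softmax}(QK^T/\sqrt{C}) V$ , where $Q$, $K$, and $V$ are linear projections of the states $s$
Deslice: $\hat{x}_i = \sum_{j=1}^M w_{ij}\, s'_j$ , i.e. $\hat{X} = W s'$ with $W = [w_{ij}] \in \mathbb{R}^{N\times M}$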
Loss: Relative L2 for standard PDEs; task-specific composite losses for industrial cases.
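The steps listed above can be chained into a single-device forward pass. The following NumPy sketch is illustrative only: it uses one attention head, random untrained projection matrices, no normalization or residual layers, and a small positive floor on $\tau_i$. All of these, along with the name `physics_attention_block`, are assumptions not taken from the source.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def physics_attention_block(x, params, t0, rng):
    """x: [N, C] point features -> ([N, C] updated features, [N, M] weights)."""
    n, c = x.shape
    w_s, w_t, w_q, w_k, w_v = params
    # Ada-Temp: per-point temperature; the 1e-2 floor is an assumption.
    tau = np.maximum(t0 + x @ w_t, 1e-2)                     # [N, 1]
    # Gumbel noise gamma = -log(-log eps), eps kept strictly inside (0, 1).
    eps = rng.uniform(1e-12, 1.0 - 1e-12, size=(n, w_s.shape[1]))
    gamma = -np.log(-np.log(eps))
    w = softmax((x @ w_s + gamma) / tau, axis=1)             # slice weights [N, M]
    s = (w.T @ x) / w.sum(axis=0)[:, None]                   # states [M, C]
    q, k, v = s @ w_q, s @ w_k, s @ w_v
    s_new = softmax(q @ k.T / np.sqrt(c), axis=1) @ v        # attention [M, C]
    return w @ s_new, w                                      # deslice [N, C]

rng = np.random.default_rng(0)
N, M, C = 1000, 8, 16                                        # toy sizes
x = rng.normal(size=(N, C))
params = (
    rng.normal(size=(C, M)) / np.sqrt(C),    # W_s
    rng.normal(size=(C, 1)) * 0.01,          # W_T
    rng.normal(size=(C, C)) / np.sqrt(C),    # W_q
    rng.normal(size=(C, C)) / np.sqrt(C),    # W_k
    rng.normal(size=(C, C)) / np.sqrt(C),    # W_v
)
x_hat, w = physics_attention_block(x, params, t0=0.5, rng=rng)
print(x_hat.shape, w.shape)
```

Attention operates on the $M \times C$ state matrix only, so its cost does not depend on the number of mesh points.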

Complexity per GPU:

Pointwise and slice: $O(N_k C^2)$
State aggregation/local: $O(N_k M C)$ ; communication: $O(MC)$
Overall memory: $O(N_k C + MC)$
Total runtime and memory scale linearly with the global point count $N$
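To give a sense of scale, the orders above can be turned into a back-of-the-envelope count. The sketch below sizes one feature tensor and one slice-weight tensor per GPU and counts the all-reduced values. The combination of 2.5M points, 4 GPUs, $C=256$, $M=64$, and 2-byte values is an assumed pairing of numbers that appear separately in this article; the result is not a measured memory footprint, and `per_gpu_counts` is a hypothetical helper.

```python
def per_gpu_counts(n_points, n_gpus, channels, slices, bytes_per_value=2):
    """Rough per-GPU sizes for one Physics-Attention layer (illustrative only)."""
    n_k = n_points // n_gpus                      # points held by each GPU
    return {
        "points_per_gpu": n_k,
        "feature_bytes": n_k * channels * bytes_per_value,     # one [N_k, C] tensor
        "slice_weight_bytes": n_k * slices * bytes_per_value,  # one [N_k, M] tensor
        "allreduce_values": slices * channels + slices,        # MC + M, no N term
    }

counts = per_gpu_counts(n_points=2_500_000, n_gpus=4, channels=256, slices=64)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Doubling the number of GPUs halves the two per-point tensor sizes and leaves the all-reduce count unchanged.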

4. Local Adaptive Mechanism and Eidetic States

Transolver++'s Ada-Temp module adapts each mesh point’s slice assignment sharpness:

Small $\tau_i$ in fast-varying physics sharpens $w_{i\cdot}$ , enabling representations over multiple distinct states.
Large $\tau_i$ in slow-varying regions smooths assignments, effectively pooling over similar states.

Empirical analysis reveals a 10–100× increase in the Kullback–Leibler divergence between $w_{i\cdot}$ and the uniform distribution compared to Transolver, confirming highly non-uniform, locally-adaptive slice behavior. This enables the solver to maintain fidelity in regions with sharp physical transitions while leveraging smoothing where appropriate.
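The diagnostic mentioned above, the KL divergence between a point's slice weights and the uniform distribution, is simple to compute. The sketch below uses made-up weight vectors purely to show the two extremes; `kl_from_uniform` is a hypothetical helper and the numbers are not results from the paper.

```python
import math

def kl_from_uniform(weights):
    """KL(w || uniform) for one point's slice weights; 0 means fully uniform."""
    m = len(weights)
    # KL = sum_j w_j * log(w_j / (1/m)); terms with w_j = 0 contribute 0.
    return sum(w * math.log(w * m) for w in weights if w > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]          # degenerate, pooling-like assignment
peaked = [0.97, 0.01, 0.01, 0.01]           # sharp assignment to one slice
print("uniform:", kl_from_uniform(uniform))
print("peaked :", kl_from_uniform(peaked))
print("upper bound log(M):", math.log(4))
```

The divergence is bounded above by $\log M$, which is reached when all weight sits on a single slice.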

5. Empirical Evaluation

5.1 Standard PDE Benchmarks

Transolver++ is evaluated on six standard benchmarks, showing consistent improvements in relative L2 error compared to both Transolver and GNOT:

Benchmark	GNOT	Transolver	Transolver++	Rel. change in error (reported)
Elasticity	0.0086	0.0064	0.0052	–18.8%
Plasticity	0.0336	0.0013	0.0011	–15.3%
Airfoil	0.0076	0.0053	0.0048	–10.2%
Pipe	0.0047	0.0033	0.0027	–12.9%
NS2D	0.1380	0.0900	0.0719	–13.4%
Darcy	0.0105	0.0058	0.0049	–12.3%

The reported average relative improvement is 13%.

5.2 Million-Scale Industrial Simulation

Transolver++ is also evaluated on high-fidelity CFD meshes for cars (DrivAerNet++) and aircraft:

Mesh sizes: 2.5M points (DrivAerNet++ Full) and 0.3M points (Aircraft)
Hardware: 4 × NVIDIA A100 40 GB GPUs
Batch runtime: 10–15% faster per batch than baselines
Error reduction: Transolver++ lowers surface and volume L2 errors as well as drag and lift coefficient errors, by up to 62.2% (lift, Aircraft) and 41.0% (drag, DrivAerNet++ Full)

Dataset	Metric	Transolver	Transolver++	Rel. change
DrivAerNet++ Full	Vol. L2	0.173	0.154	–11.0%
DrivAerNet++ Full	Surf. L2	0.167	0.146	–12.6%
DrivAerNet++ Full	Drag error	0.061	0.036	–41.0%
DrivAerNet++ Surface	Surf. L2	0.145	0.110	–24.1%
DrivAerNet++ Surface	Lift error	0.037	0.014	–62.2%
Aircraft (~300K)	Surf. L2	0.145	0.110	–24.1%
Aircraft (~300K)	Lift error	0.037	0.014	–62.2%
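The relative changes in the table follow from the two error columns as (Transolver++ minus Transolver) divided by Transolver. The short sketch below recomputes a few rows from the values printed above; the helper name `relative_change` is an illustrative choice.

```python
def relative_change(baseline, new):
    """Signed relative change in percent; negative means the error went down."""
    return 100.0 * (new - baseline) / baseline

# (dataset, metric, Transolver, Transolver++) copied from the table above.
rows = [
    ("DrivAerNet++ Full", "Vol. L2", 0.173, 0.154),
    ("DrivAerNet++ Full", "Surf. L2", 0.167, 0.146),
    ("DrivAerNet++ Full", "Drag error", 0.061, 0.036),
    ("Aircraft", "Surf. L2", 0.145, 0.110),
    ("Aircraft", "Lift error", 0.037, 0.014),
]
changes = [round(relative_change(base, new), 1) for _, _, base, new in rows]
for (dataset, metric, _, _), change in zip(rows, changes):
    print(f"{dataset:18s} {metric:11s} {change:+.1f}%")
```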

6. Experimental Design and Infrastructure

Data: Standard meshes (Li et al., Wu et al.) and high-fidelity CFD meshes for industrial tasks.
Hyperparameters: AdamW optimizer, learning rate $10^{-3}$ , weight decay $0.01$, 8 layers (benchmarks), 4 layers (industrial), $C=128$ or $256$, $M=32$ or $64$ slices, 4 attention heads, batch size 8 (benchmarks) or 1–2 (industrial).
Software: PyTorch with NCCL AllReduce, mixed precision (FP16/TF32), tested on NVIDIA A100 40 GB GPUs.
Baselines: Baselines that cannot scale to million-point meshes are compared on data subsampled to 50K points; Transolver++ runs on the full meshes.
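The hyperparameters listed above can be collected in a small configuration object. This is a hypothetical sketch: the class name `TrainConfig` and its field names are invented here, and the specific pairing of $C$ and $M$ with each setting is an arbitrary choice, because the text gives only the candidate values (128 or 256, 32 or 64) without saying which setting uses which.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    layers: int
    channels: int                 # C: 128 or 256 in the text
    slices: int                   # M: 32 or 64 in the text
    batch_size: int
    heads: int = 4
    lr: float = 1e-3
    weight_decay: float = 0.01
    optimizer: str = "AdamW"

    def __post_init__(self):
        # Standard multi-head attention splits channels evenly across heads.
        if self.channels % self.heads != 0:
            raise ValueError("channels must be divisible by heads")

# ASSUMPTION: the C/M values chosen for each preset are illustrative only.
benchmark = TrainConfig(layers=8, channels=128, slices=32, batch_size=8)
industrial = TrainConfig(layers=4, channels=256, slices=64, batch_size=1)
print(benchmark)
print(industrial)
```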

7. Limitations and Prospects

Extension to real-time and long-time integration of time-dependent PDEs remains open.
The fixed number of slices $M$ may be refined by learning $M$ dynamically.
Reducing reliance on costly CFD data through PINN-style loss functions is identified as a future direction.
Current parallelism assumes homogeneous hardware; model-parallel extensions could address scenarios with even larger channel dimensions.
Handling fluid–structure interactions and complex moving boundaries is yet to be addressed.

Transolver++ serves as a scalable backbone for neural PDE solvers, moving the field toward industrial-grade, foundation-level computational physics architectures by combining near-linear scaling across GPUs with substantial error reductions (Luo et al., 4 Feb 2025).

References

Luo et al. (4 Feb 2025). Transolver++: An Accurate Neural Solver for PDEs on Million-Scale Geometries.