Transolver++ Neural PDE Solver
- Transolver++ is a highly parallel neural solver that efficiently resolves PDEs on meshes with up to millions of points using advanced local-adaptive state aggregation.
- It introduces an adaptive temperature mechanism, Gumbel-Softmax reparameterization, and feedforward collapse to optimize memory use and improve predictive accuracy.
- Empirical evaluations demonstrate significant error reductions and near-linear scaling with GPUs in both standard benchmarks and industrial CFD simulations.
Transolver++ is a highly parallel and efficient neural solver designed to accurately resolve partial differential equations (PDEs) on geometries with up to millions of mesh points. Addressing key obstacles in industrial-scale simulation, Transolver++ extends the Transolver architecture by introducing a local-adaptive mechanism for state aggregation, an optimized parallelism framework, and architectural memory reductions. These contributions enable the model to scale to mesh sizes two orders of magnitude larger than previously possible and to achieve superior predictive accuracy on both standard PDE benchmarks and real-world computational fluid dynamics (CFD) tasks (Luo et al., 4 Feb 2025).
1. Architectural Innovations
Transolver++ builds upon the “Physics-Attention” approach of Transolver, wherein $M$ physical states $s_j$ are learned from the $N$ point features $x_i$ using slice weights $w_{ij}$: $s_j = \sum_{i} w_{ij} x_i / \sum_{i} w_{ij}$. Standard attention is then applied to the states, followed by deslicing to obtain updated point features.
Scaling Transolver directly to million-scale meshes introduces degeneration (slice weights that collapse toward uniform, akin to pooling) and memory bottlenecks, as storing per-point embeddings becomes prohibitive. Transolver++ introduces three key architectural elements:
- Local-Adaptive Temperature (“Ada-Temp”): Each point $i$ learns its own Softmax temperature $\tau_i = T_0 + \mathrm{Linear}_T(x_i)$, which divides the slice logits before they are normalized.
- Slice Reparameterization (Gumbel-Softmax): The slice weights are computed with the Gumbel-Softmax trick, $w_i = \mathrm{GumbelSoftmax}(\mathrm{Linear}_s(x_i) / \tau_i)$. This ensures sharper, locally adaptive (“eidetic”) physical states, enabling the model to represent rapid or slow physics variations by adjusting slice assignment sharpness per region.
- Feedforward Collapse: Unlike Transolver, which used two separate per-point projections, Transolver++ eliminates one of them, reducing per-point memory by half without loss of accuracy.
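The interplay between a per-point temperature and Gumbel-Softmax slice weights can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the logits stand in for the output of the slice projection, the temperature stands in for the Ada-Temp output, and adding the Gumbel noise before dividing by the temperature (the standard Gumbel-Softmax form) is an assumption. All function names are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_noise(rng):
    """Sample standard Gumbel noise g = -log(-log(u)), u ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))

def slice_weights(logits, tau, rng=None):
    """Slice weights for one point: softmax((logits + g) / tau).

    logits: M slice scores for the point (stand-in for the slice projection).
    tau:    the point's temperature (stand-in for the Ada-Temp output).
    rng:    if None, the Gumbel noise is skipped (plain tempered softmax).
    """
    if tau <= 0:
        raise ValueError("temperature must be positive")
    noise = [gumbel_noise(rng) if rng is not None else 0.0 for _ in logits]
    return softmax([(l + g) / tau for l, g in zip(logits, noise)])

logits = [2.0, 1.0, 0.5, 0.0]             # hypothetical scores for M = 4 slices
sharp = slice_weights(logits, tau=0.2)    # low temperature: close to one-hot
smooth = slice_weights(logits, tau=5.0)   # high temperature: close to uniform
noisy = slice_weights(logits, tau=0.2, rng=random.Random(0))  # with Gumbel noise
```

With the same logits, the low-temperature point concentrates almost all of its weight on one slice, while the high-temperature point spreads its weight nearly evenly, which is the behaviour the local-adaptive mechanism relies on.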
A fully optimized parallelism and data partitioning infrastructure is implemented, as detailed in the next section.
2. Parallelism and Scalability
Transolver++ achieves high parallel efficiency via a point-wise parallel data distribution and communication-optimized aggregation. The computation for each Physics-Attention block per GPU proceeds as follows:
```
τ^(k) ← T0 + Linear_T(x^(k))                        # Ada-Temp
w^(k) ← GumbelSoftmax(Linear_s(x^(k)) / τ^(k))      # [N_k × M]
n^(k) ← w^(k).T × x^(k)                             # [M × C]
d^(k) ← sum over rows of w^(k)                      # [M]
n ← AllReduce_sum(n^(k)); d ← AllReduce_sum(d^(k))  # [M × C], [M]
s ← n / d[:, None]                                  # [M × C]
s' ← Attention(s)                                   # [M × C]
x'^(k) ← w^(k) × s'                                 # [N_k × C]
```
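- Communication: Each layer requires only the two all-reduce operations above, on the $M \times C$ numerators and the $M$ denominators, so the communicated volume per GPU is independent of the number of mesh points $N$.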
- Load-Balancing: The mesh points are partitioned uniformly, and all necessary weights and features are co-located.
- Comparison: Traditional tensor/model parallelism and ring attention incur communication that grows with the number of mesh points, in contrast to the point-count-independent pattern in Transolver++.
This design supports nearly linear scaling with the number of GPUs, as demonstrated empirically on up to 32 GPUs.
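The reason the sharded computation is exact is that both the weighted sums (numerators) and the weight totals (denominators) are additive over points, so summing per-GPU partial results before dividing gives the same states as a single device would. The sketch below simulates this with plain Python lists; `all_reduce_sum` is a hypothetical stand-in for a real collective such as an NCCL all-reduce, and the toy data are made up for illustration.

```python
def partial_sums(w, x):
    """Per-shard numerator [M x C] and denominator [M] from w [N_k x M], x [N_k x C]."""
    M, C = len(w[0]), len(x[0])
    num = [[0.0] * C for _ in range(M)]
    den = [0.0] * M
    for w_i, x_i in zip(w, x):
        for j in range(M):
            den[j] += w_i[j]
            for c in range(C):
                num[j][c] += w_i[j] * x_i[c]
    return num, den

def all_reduce_sum(parts):
    """Stand-in for an all-reduce: element-wise sum of equally shaped nested lists."""
    first = parts[0]
    if isinstance(first, list):
        return [all_reduce_sum([p[i] for p in parts]) for i in range(len(first))]
    return sum(parts)

def states(num, den):
    """Divide each numerator row by its denominator to get the M slice states."""
    return [[v / d for v in row] for row, d in zip(num, den)]

# Hypothetical toy data: N = 4 points, M = 2 slices, C = 2 channels.
w = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
x = [[1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]]

# Single-device reference.
ref = states(*partial_sums(w, x))

# Two simulated GPUs, each holding half of the points.
shards = [(w[:2], x[:2]), (w[2:], x[2:])]
parts = [partial_sums(w_k, x_k) for w_k, x_k in shards]
num = all_reduce_sum([p[0] for p in parts])
den = all_reduce_sum([p[1] for p in parts])
sharded = states(num, den)
```

Here `sharded` agrees with `ref` up to floating-point rounding, and the only data that crosses the simulated GPU boundary is one `M x C` matrix and one length-`M` vector per shard.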
3. Mathematical Formulation and Complexity
Transolver++ is applicable to a class of steady and time-dependent PDEs, including:
- Navier–Stokes (2D vorticity),
- Euler (transonic over airfoils),
- Stokes/Darcy (elliptic porous flow),
- Linear/nonlinear elasticity,
- Plastic forging,
- Incompressible pipe flow.
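States and updates are formalized as follows for points $i = 1, \dots, N$ and slices $j = 1, \dots, M$:
- Ada-Temp: $\tau_i = T_0 + \mathrm{Linear}_T(x_i)$
- Gumbel-Softmax Slice: $w_i = \mathrm{GumbelSoftmax}(\mathrm{Linear}_s(x_i) / \tau_i) \in \mathbb{R}^{M}$
- State Aggregation: $s_j = \sum_{i=1}^{N} w_{ij} x_i / \sum_{i=1}^{N} w_{ij}$
- Attention: $s' = \mathrm{Attention}(s)$
- Deslice: $x'_i = \sum_{j=1}^{M} w_{ij} s'_j$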
- Loss: Relative L2 for standard PDEs; task-specific composite losses for industrial cases.
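For reference, the relative L2 error can be written in a few lines. This sketch assumes the usual definition from the neural-operator literature, the L2 norm of the prediction error divided by the L2 norm of the target; the function name and the sample values are hypothetical.

```python
import math

def relative_l2(pred, target):
    """Relative L2 error ||pred - target||_2 / ||target||_2 for flat sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))
    den = math.sqrt(sum(t ** 2 for t in target))
    if den == 0.0:
        raise ValueError("target has zero norm; relative error is undefined")
    return num / den

# Target of norm 5 with an absolute error of 0.1 in one component: about 2%.
err = relative_l2([3.0, 4.1], [3.0, 4.0])
```

Normalizing by the target norm makes errors comparable across fields with different magnitudes, which is why it suits benchmarks that mix pressure, velocity, and displacement targets.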
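Complexity per GPU, with $N_k$ local points, $M$ slices, and $C$ channels, as read off the tensor shapes in the per-GPU computation of Section 2:
- Pointwise and slice: $O(N_k M C)$ time for the temperature and slice projections over the $N_k$ local points
- State aggregation/local: $O(N_k M C)$; communication: $O(M C)$ values per all-reduce, independent of $N_k$
- Overall memory: $O(N_k (C + M))$ for local point features and slice weights, plus $O(M C)$ for the states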
- Total runtime and memory grow linearly with the global number of mesh points $N$
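A back-of-the-envelope calculation makes the gap between communicated and locally stored data concrete. The sketch below assumes 64 slices, 256 channels, FP32 values, and a 2.5M-point mesh split evenly over 4 GPUs; these numbers are taken from the experimental setup as an illustrative configuration, and the helper names are hypothetical.

```python
def allreduce_payload_bytes(num_slices, channels, bytes_per_value=4):
    """Bytes reduced per block: M x C numerators plus M denominators."""
    return (num_slices * channels + num_slices) * bytes_per_value

def point_feature_bytes(points_per_gpu, channels, bytes_per_value=4):
    """Bytes of per-point features held locally on one GPU: N_k x C values."""
    return points_per_gpu * channels * bytes_per_value

M, C = 64, 256              # assumed slice count and channel width
N, gpus = 2_500_000, 4      # assumed mesh size and GPU count
payload = allreduce_payload_bytes(M, C)    # 65792 bytes, about 64 KiB
local = point_feature_bytes(N // gpus, C)  # 640000000 bytes, about 610 MiB
ratio = local / payload
```

Under these assumptions each block exchanges roughly 64 KiB per GPU while one copy of the local point features occupies several hundred MiB, and the exchanged amount stays fixed as the mesh grows.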
4. Local Adaptive Mechanism and Eidetic States
Transolver++'s Ada-Temp module adapts each mesh point’s slice assignment sharpness:
- A small temperature $\tau_i$ in regions of fast-varying physics sharpens the slice weights $w_i$, enabling representations over multiple distinct states.
- A large temperature $\tau_i$ in slow-varying regions smooths assignments, effectively pooling over similar states.
Empirical analysis reveals a 10–100× increase in the Kullback–Leibler divergence between the learned slice weights and the uniform distribution compared to Transolver, confirming highly non-uniform, locally adaptive slice behavior. This enables the solver to maintain fidelity in regions with sharp physical transitions while leveraging smoothing where appropriate.
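The divergence used as a diagnostic here is straightforward to compute for a single point. The sketch below assumes the direction KL(w ‖ uniform), which the text does not specify, and uses made-up logits with a plain tempered softmax (no Gumbel noise) to show how the temperature moves the value; all names are hypothetical.

```python
import math

def tempered_softmax(logits, tau):
    """Softmax of logits / tau, computed stably."""
    scaled = [v / tau for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_to_uniform(w):
    """KL(w || uniform) = sum_j w_j * log(w_j * M); zero weights contribute 0."""
    m = len(w)
    return sum(p * math.log(p * m) for p in w if p > 0.0)

logits = [2.0, 1.0, 0.5, 0.0]                              # hypothetical slice scores
kl_sharp = kl_to_uniform(tempered_softmax(logits, 0.2))    # low temperature
kl_smooth = kl_to_uniform(tempered_softmax(logits, 5.0))   # high temperature
kl_uniform = kl_to_uniform([0.25] * 4)                     # exactly uniform weights
kl_max = math.log(4)                                       # upper bound, reached by one-hot weights
```

The value is zero for perfectly uniform weights and approaches `log(M)` as the assignment becomes one-hot, so a larger average divergence indicates less degenerate, more selective slices.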
5. Empirical Evaluation
5.1 Standard PDE Benchmarks
Transolver++ is evaluated on six standard benchmarks, showing consistent improvements in relative L2 error compared to both Transolver and GNOT:
| Benchmark | GNOT | Transolver | Transolver++ | % Improvement |
|---|---|---|---|---|
| Elasticity | 0.0086 | 0.0064 | 0.0052 | –18.8% |
| Plasticity | 0.0336 | 0.0013 | 0.0011 | –15.3% |
| Airfoil | 0.0076 | 0.0053 | 0.0048 | –10.2% |
| Pipe | 0.0047 | 0.0033 | 0.0027 | –12.9% |
| NS2D | 0.1380 | 0.0900 | 0.0719 | –13.4% |
| Darcy | 0.0105 | 0.0058 | 0.0049 | –12.3% |
The average relative improvement (“promotion”) across the six benchmarks is about 13%.
5.2 Million-Scale Industrial Simulation
On high-fidelity CFD meshes for car (DrivAerNet++) and aircraft:
- Mesh sizes: $2.5$M (DrivAerNet++ Full) and $0.3$M (Aircraft)
- Hardware: 4 × NVIDIA A100 40 GB GPUs
- Batch runtime: 10–15% faster per batch than baselines
- Accuracy: Transolver++ reduces surface and volume L2 errors as well as drag and lift coefficient errors, by up to 62.2% (lift, AirCraft) and 41.0% (drag, DrivA++ Full)
| Dataset | Metric | Transolver | Transolver++ | Rel. Promotion |
|---|---|---|---|---|
| DrivA++ Full | Vol. L2 | 0.173 | 0.154 | –11.0% |
| DrivA++ Full | Surf L2 | 0.167 | 0.146 | –12.6% |
| DrivA++ Full | Drag error | 0.061 | 0.036 | –41.0% |
| DrivA++ Surface | Surf L2 | 0.145 | 0.110 | –24.1% |
| DrivA++ Surface | Lift error | 0.037 | 0.014 | –62.2% |
| AirCraft (~300K) | Surf L2 | 0.145 | 0.110 | –24.1% |
| AirCraft (~300K) | Lift error | 0.037 | 0.014 | –62.2% |
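The relative promotion column can be reproduced from the two error columns as the signed relative change with respect to Transolver. The short sketch below recomputes four of the table entries; the function name is hypothetical and the rounding to one decimal place is an assumption about how the table was prepared.

```python
def relative_change(baseline, new):
    """Signed relative change in percent; negative means the error went down."""
    return 100.0 * (new - baseline) / baseline

# (Transolver, Transolver++) error pairs copied from the table above.
rows = {
    "DrivA++ Full, Vol. L2": (0.173, 0.154),
    "DrivA++ Full, Surf L2": (0.167, 0.146),
    "DrivA++ Full, Drag error": (0.061, 0.036),
    "AirCraft, Lift error": (0.037, 0.014),
}
changes = {name: round(relative_change(b, n), 1) for name, (b, n) in rows.items()}
# changes == {"DrivA++ Full, Vol. L2": -11.0, "DrivA++ Full, Surf L2": -12.6,
#             "DrivA++ Full, Drag error": -41.0, "AirCraft, Lift error": -62.2}
```

The negative sign therefore denotes an error reduction, so –41.0% on drag means the drag coefficient error fell by about two fifths.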
6. Experimental Design and Infrastructure
- Data: Standard meshes (Li et al., Wu et al.) and high-fidelity CFD meshes for industrial tasks.
- Hyperparameters: AdamW optimizer, learning rate , weight decay $0.01$, 8 layers (benchmarks), 4 layers (industrial), or $256$, or $64$ slices, 4 attention heads, batch size 8 (benchmarks) or 1–2 (industrial).
- Software: PyTorch with NCCL AllReduce, mixed precision (FP16/TF32), tested on NVIDIA A100 40 GB GPUs.
- Baselines: Baselines that cannot handle million-scale meshes are evaluated on data subsampled to $50$K points, whereas Transolver++ runs on the full meshes.
7. Limitations and Prospects
- Extension to real-time and long-time integration of time-dependent PDEs remains open.
- The number of slices is currently fixed; learning it dynamically is a possible refinement.
- Reducing reliance on high-cost CFD data through PINN-style loss functions is identified as a future direction.
- Current parallelism assumes homogeneous hardware; model parallel extensions could address scenarios with even larger channel dimensions.
- Handling fluid–structure interactions and complex moving boundaries is yet to be addressed.
Transolver++ serves as a scalable backbone for neural PDE solvers, moving the field toward industrial-grade, foundation-level computational physics architectures through near-linear scaling with GPU count and significant error reductions (Luo et al., 4 Feb 2025).