On the training of physics-informed neural operators for solving parametric partial differential equations

Published 4 Jun 2026 in cs.LG and physics.comp-ph | (2606.06164v1)

Abstract: Physics-informed neural operators (PINOs) aim to learn solution operators for partial differential equations by using the governing physics as supervision, rather than relying solely on paired input-output simulation data. By incorporating physical constraints into the training objective, PINOs combine the cross-instance generalization of neural operators with the data efficiency of physics-informed learning. Despite this promise, how to train PINOs efficiently and robustly remains less well-understood than the training of either data-driven neural operators or physics-informed neural networks (PINNs). To bridge this gap, we examine key components of the PINO training pipeline, including architecture design, optimizer choice, loss balancing, and collocation-point sampling strategy. We study three representative operator backbones, Deep Operator Network (DeepONet), Fourier Neural Operator (FNO), and Continuous Vision Transformer (CViT), across five diverse parametric PDE systems. Our results show that CViT provides consistently strong and stable performance across the considered benchmarks. Beyond architecture, we find that several optimization pathologies previously identified in PINN training naturally arise in PINOs, including gradient conflicts and causal violation. We also find that mitigation algorithms developed for PINNs remain effective in the PINO setting. We further compare physics-informed and data-driven training under different data regimes, revealing that a carefully designed physics-informed training pipeline can match, and in some cases, outperform purely data-driven neural operators. Taken together, these findings provide a systematic empirical understanding of the optimization challenges in PINO training and inform a practical pipeline for efficient and robust physics-informed operator learning. Code and data are available at https://github.com/NanxiiChen/PI-CViT.

Authors (5)

Summary

The paper demonstrates that adaptive loss weighting and causal strategies improve convergence in training PINOs by balancing gradient conflicts.
It shows that transformer-based architectures, notably PI-CViT, significantly reduce errors across diverse PDE benchmarks compared with the DeepONet and FNO backbones.
The study provides practical recommendations and optimizer enhancements, enabling effective operator learning even without large-scale supervised data.

Physics-Informed Neural Operators for Parametric PDEs: Training Challenges and Solutions

Introduction

Physics-informed neural operators (PINOs) integrate operator learning and physics-informed machine learning for parametric PDEs, coupling the generalization capability of neural operators with the data efficiency and physical fidelity of physics-informed training. The paper "On the training of physics-informed neural operators for solving parametric partial differential equations" (2606.06164) systematically investigates how to train PINOs: neural operator models, such as DeepONet, Fourier Neural Operator (FNO), and the transformer-based CViT, trained using losses derived from the governing physical equations, with or without labeled solution data. This work dissects the interaction of model design, optimization algorithms, loss balancing, and collocation strategies, yielding both practical recommendations and empirical insights into the optimization pathologies that arise in PINO training, several of which were previously identified for PINNs.

Formulation and Training Pipeline

PINOs learn mappings from parametric inputs (initial conditions, coefficients, or other parametric fields) to the corresponding PDE solutions, with the governing equations supplying the supervision. The core training objective is a composite loss comprising the PDE residual, initial and boundary condition penalties, and optionally supervised data losses, each evaluated at sampled points in the spatiotemporal/parametric domain.
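
As a sketch of this composite objective, assume the PDE is written as $\mathcal{N}[u; a] = 0$ for an input function $a$, and let $G_\theta$ denote the neural operator. The symbols $\lambda_r$, $\lambda_{ic}$, $\lambda_{bc}$, $\lambda_d$, $N_a$, and $N_r$ are illustrative notation introduced here, not taken from the paper:

$$
\mathcal{L}(\theta) = \lambda_{r}\,\mathcal{L}_{r}(\theta) + \lambda_{ic}\,\mathcal{L}_{ic}(\theta) + \lambda_{bc}\,\mathcal{L}_{bc}(\theta) + \lambda_{d}\,\mathcal{L}_{d}(\theta),
\qquad
\mathcal{L}_{r}(\theta) = \frac{1}{N_a N_r} \sum_{i=1}^{N_a} \sum_{j=1}^{N_r} \left| \mathcal{N}\big[G_\theta(a_i); a_i\big](x_j, t_j) \right|^2
$$

Here $a_i$ are sampled input functions, $(x_j, t_j)$ are collocation points, and the data term $\mathcal{L}_{d}$ is dropped ($\lambda_d = 0$) in purely physics-informed training.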

Figure 1: The overall PINO training pipeline, integrating a transformer-based encoder-decoder, composite physics losses with gradient- and causality-based weighting, and the SOAP optimizer.

PI-CViT, the transformer-based neural operator, emerges as the most performant backbone. Its design, shown together with DeepONet and FNO for comparison, provides flexible evaluation at arbitrary query coordinates via a vision transformer encoder and cross-attention decoder.

Figure 2: Architectural comparison of DeepONet, FNO, and CViT in the context of physics-informed operator learning.

Optimization Pathologies and Mitigation Strategies

The composite multi-objective PINO loss presents significant optimization challenges, notably gradient imbalance across loss terms (e.g., between residual and initial/boundary losses), temporal causality violations in unsteady problems, and gradient conflicts across objectives.

Gradient Imbalance: In unweighted PINO training, the $L^2$ norms of the gradients of the individual loss terms are heavily skewed, which impedes convergence and biases the solution. Adaptive weighting schemes such as GradNorm rebalance the gradient magnitudes of the loss components and are shown empirically to yield more stable and equitable training dynamics.
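
A minimal sketch of gradient-norm balancing is given below. It assumes the rule commonly used in the PINN literature, where each weight is the sum of all gradient norms divided by that term's own gradient norm, smoothed by a moving average. Whether the paper's GradNorm variant uses exactly this rule is an assumption, and the function name, `momentum` parameter, and example norms are hypothetical.

```python
def balance_weights(grad_norms, old_weights=None, momentum=0.9, eps=1e-12):
    """Return loss weights that equalize the weighted gradient norms.

    grad_norms: L2 norms ||grad L_i|| of each loss term w.r.t. the parameters.
    Assumed rule: w_i = sum_j ||grad L_j|| / ||grad L_i||, optionally smoothed
    by an exponential moving average with the previous weights.
    """
    total = sum(grad_norms)
    new = [total / (g + eps) for g in grad_norms]
    if old_weights is None:
        return new
    return [momentum * w_old + (1.0 - momentum) * w_new
            for w_old, w_new in zip(old_weights, new)]


# Hypothetical gradient norms for the residual, initial, and boundary losses.
norms = [50.0, 0.5, 2.0]
weights = balance_weights(norms)
# After weighting, every term contributes a gradient of the same magnitude.
weighted_norms = [w * g for w, g in zip(weights, norms)]
```

In practice the weights would be recomputed every few hundred steps and treated as constants (no gradient flows through them) when forming the weighted loss.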

Figure 3: Gradient norm trajectories for unweighted vs. GradNorm-weighted loss terms, illustrated for the shallow water PINO benchmark.

Causal Violation: Without explicit temporal ordering, training may prioritize late-time prediction accuracy before short-time characteristics are properly resolved. Causal weighting schemes, partitioning time and assigning progressive weights, drive the optimizer to resolve early dynamics prior to late-time corrections.
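
The sketch below illustrates one way such causal weights can be computed. It assumes the exponential form used in the causal-training literature for PINNs, $w_i = \exp(-\epsilon \sum_{k<i} \mathcal{L}_k)$, where $\mathcal{L}_k$ is the residual loss on the $k$-th time segment. The tolerance `epsilon` and the example losses are hypothetical, and the paper's exact scheme may differ.

```python
import math


def causal_weights(segment_losses, epsilon=1.0):
    """Weight each time segment by how well earlier segments are resolved.

    w_i = exp(-epsilon * sum_{k<i} L_k), so the first segment always has
    weight 1 and later segments are suppressed until earlier losses shrink.
    """
    weights = []
    cumulative = 0.0
    for loss in segment_losses:
        weights.append(math.exp(-epsilon * cumulative))
        cumulative += loss
    return weights


# Hypothetical residual losses on four consecutive time segments.
early_unresolved = causal_weights([2.0, 1.0, 1.0, 1.0])
early_resolved = causal_weights([0.01, 1.0, 1.0, 1.0])
```

Once the first segment is resolved, the weight on the second segment rises toward 1, which reproduces the sequential error resolution described above.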

Figure 4: PDE residual loss by time segment under causal weighting, demonstrating sequential temporal error resolution.

Gradient Conflict: Misaligned gradients across loss terms are a further obstacle for first-order optimizers (e.g., Adam), because an update that reduces one loss can increase another. Adaptive optimizers that use second-order information, notably SOAP, precondition the update direction with curvature estimates, which improves gradient alignment and convergence.
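
A common way to diagnose gradient conflict is the cosine similarity between the gradients of two loss terms, with a negative value indicating that the terms pull the parameters in opposing directions. The sketch below shows only this diagnostic, not SOAP itself; the gradient vectors are hypothetical flattened parameter gradients.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # convention chosen here: a zero gradient conflicts with nothing
    return dot / (n1 * n2)


def in_conflict(g1, g2):
    """Two loss gradients conflict when their inner product is negative."""
    return cosine_similarity(g1, g2) < 0.0


# Hypothetical gradients of the residual and boundary losses.
grad_residual = [1.0, 2.0, -1.0]
grad_boundary = [-1.0, -1.0, 0.5]
score = cosine_similarity(grad_residual, grad_boundary)
```

Tracking this score during training, before and after preconditioning, is one way to check whether an optimizer change actually improves alignment.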

Empirical Results and Ablation Insights

Architecture Comparisons

Across five canonical PDE benchmarks (Burgers', wave, shallow water, ice melting, and lid-driven cavity), PI-CViT achieves the lowest error and the most consistent test accuracy, with especially stark improvements on nonlinear advection-diffusion and stiff phase-field problems.

Figure 5: Distribution of relative $L^2$ errors across 100 test samples for each operator architecture and PDE benchmark.
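
For reference, the relative $L^2$ error of a prediction is usually defined as $\|u_{\text{pred}} - u_{\text{true}}\|_2 / \|u_{\text{true}}\|_2$. The sketch below assumes this standard definition on a flattened solution field; how the paper aggregates over fields and samples is not specified here, and the example values are hypothetical.

```python
import math


def relative_l2_error(pred, true):
    """||pred - true||_2 / ||true||_2 for flattened solution values."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den


# Hypothetical reference and predicted values at three grid points.
true = [1.0, 2.0, 2.0]
pred = [1.0, 2.0, 2.3]
err = relative_l2_error(pred, true)  # numerator 0.3, denominator 3.0
```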

Qualitative solution analysis further demonstrates that PI-CViT maintains solution coherence for all predicted fields, even at late times or with stiff dynamics, where other backbones accumulate significant error and structural artifacts.

Figure 6: Burgers' equation—comparison of predicted velocity component $v_1$ across neural operator backbones.

Figure 7: Wave equation—predicted displacement field for each operator and ground truth.

Figure 8: Shallow water equations—predicted free surface height $h$ at key timepoints.

Figure 9: Ice melting (Allen–Cahn)—phase-field interface evolution for each neural operator, showing sharper accuracy for PI-CViT.

Figure 10: Lid-driven cavity—streamlines and velocity magnitude, with failures of FNO on non-periodic boundaries.

Ablation Studies

Optimizer Effect: SOAP consistently outperforms Adam on all but the stiffest phase-field problems, with substantial reductions in test error.

Figure 11: Test error convergence for Adam vs. SOAP across architectures and benchmarks.

Loss Weighting and Causality: Only the full pipeline—combining both GradNorm and causal weighting—consistently achieves minimum final error. Removal of either component yields pronounced degradation in convergence or solution accuracy, depending on the problem's structure.

Figure 12: Ablation of GradNorm and causal weighting in PI-CViT training.

Collocation and Data Mixing: Purely physics-informed training with free collocation sampling typically matches or outperforms supervised (data-driven) regimes, even at large labeled data scales. Simply combining physics and data-driven losses does not automatically yield better results, and can sometimes degrade performance due to unresolved gradient conflicts.

Figure 13: Relative $L^2$ error across training regimes and data scales for several problems, highlighting the key role of free collocation.

Gradient Computation: Automatic differentiation (AD) for residual evaluation outperforms fixed-grid finite difference (FD) or spectral residual computations, both in accuracy and computational efficiency, especially in problems where temporal resolution is critical.
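
The accuracy gap between AD and fixed-grid FD can be illustrated on a toy function. The sketch below uses a hand-written forward-mode dual number rather than a real AD framework, and compares it with a central difference on a coarse step; the function $u(t) = \sin(3t)$, the step `h`, and all names are assumptions chosen for illustration. It says nothing about computational cost.

```python
import math


class Dual:
    """Number a + b*eps with eps**2 = 0; the eps part carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a + a' eps)(b + b' eps) = ab + (a b' + a' b) eps
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def dual_sin(x):
    """sin with the chain rule applied to the derivative part."""
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def ad_derivative(t):
    """d/dt sin(3t) by forward-mode AD (exact up to rounding)."""
    return dual_sin(3.0 * Dual(t, 1.0)).deriv


def fd_derivative(t, h):
    """d/dt sin(3t) by a central difference with grid spacing h."""
    return (math.sin(3.0 * (t + h)) - math.sin(3.0 * (t - h))) / (2.0 * h)


t0 = 0.5
exact = 3.0 * math.cos(3.0 * t0)
ad_error = abs(ad_derivative(t0) - exact)
fd_error = abs(fd_derivative(t0, h=0.1) - exact)
```

The AD derivative carries no truncation error, whereas the FD error scales with $h^2$ and therefore depends on the temporal grid resolution.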

Figure 14: Test error for PI-CViT under AD vs. FD residual evaluation on two benchmarks.

Additional Practical Strategies

Hard-encoding periodicity in decoder coordinate embeddings is critical for high-fidelity periodic PDE solutions.
Time Feature-wise Linear Modulation (FiLM) improves decoding in scenarios where time serves as a global modulator of a fixed spatial template (e.g., interface motion, wave propagation), but not in highly space-time-coupled systems.
Initial training should "warm up" with lower resampling rates of collocation/input points to avoid optimizer collapse at $t=0$.
Combining physics losses and labeled data requires careful scheduling or additional conflict-mitigation—straightforward additive loss schedules are often counterproductive.
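
Two of the strategies above, periodic coordinate embeddings and FiLM, can be sketched in a few lines. The sketch assumes a Fourier-feature embedding with period `period` and a FiLM layer of the form $\gamma(t) \odot h + \beta(t)$. In a real model $\gamma$ and $\beta$ would be produced from $t$ by a small network, whereas here they are passed in directly as hypothetical values, and the function names are illustrative.

```python
import math


def periodic_embedding(x, period, num_harmonics=2):
    """Map x to [cos(2*pi*k*x/P), sin(2*pi*k*x/P)] for k = 1..num_harmonics.

    Any function of these features is exactly periodic in x with period P.
    """
    feats = []
    for k in range(1, num_harmonics + 1):
        angle = 2.0 * math.pi * k * x / period
        feats.extend([math.cos(angle), math.sin(angle)])
    return feats


def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * h + beta, element by element."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


# The embedding agrees at x = 0 and x = period, so the decoder cannot
# produce a jump across the periodic boundary.
left = periodic_embedding(0.0, period=1.0)
right = periodic_embedding(1.0, period=1.0)

# Hypothetical time-dependent scale and shift applied to spatial features.
modulated = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```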

Implications, Contradictory Claims, and Outlook

Strong Numerical Results:

On Burgers', PI-CViT achieves $0.78\%$ test error, outperforming PI-FNO ($9.67\%$) and PI-DeepONet ($34.5\%$), improvements by factors of more than $12\times$ and $44\times$, respectively.
On the Allen–Cahn phase-field (ice melting) problem, PI-CViT likewise attains a lower test error than both PI-FNO and PI-DeepONet.
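
The improvement factors on Burgers' follow directly from the reported test errors; the short calculation below only restates that arithmetic.

```python
# Reported relative test errors on Burgers' (in percent).
errors = {"PI-CViT": 0.78, "PI-FNO": 9.67, "PI-DeepONet": 34.5}

# Ratio of each baseline error to the PI-CViT error.
factors = {
    name: err / errors["PI-CViT"]
    for name, err in errors.items()
    if name != "PI-CViT"
}
# PI-FNO is roughly 12.4x worse and PI-DeepONet roughly 44.2x worse.
```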

Bold Claims:

A well-resolved, purely physics-informed training pipeline with free collocation sampling matches or surpasses data-driven neural operator training even at large supervised data scales.
Adding labeled data to physics-informed pipelines does not necessarily improve performance; indeed, it may degrade accuracy due to gradient conflicts unless conflicts are explicitly resolved.

Contradictory Observations:

The common belief that physics-informed and supervised (data-driven) losses can be beneficially combined additively is not generally supported; hybrid losses often require explicit mitigation of underlying optimization pathologies.

Implications and Future Directions:

For scientific computing, robust data-free operator learning enables deployment in settings where labeled solutions are unavailable or expensive, significantly broadening the applicability of neural operators.
There is a need for further algorithmic developments that coordinate the optimization of physics and data-driven objectives, potentially leveraging task-level decoupling, curriculum schedules, or conflict-free gradient projections.
Extending PINOs to irregular domains, non-Euclidean geometries, and high-dimensional problems will require more geometry-aware backbones, sampling designs, and novel physical constraint enforcement techniques.
For PDEs with strong multi-field coupling, interfaces, or conserved quantities (e.g., turbulence, phase-field fracture), variational or energy-based loss formulations may further enhance physical fidelity and solver robustness.

Conclusion

This work provides a comprehensive empirical study of PINO training, identifying critical factors impacting performance and convergence. The results establish transformer-based architectures, advanced optimizer schemes, and adaptive weighting as central to achieving high accuracy and robust optimization for physics-informed operator learning on challenging parametric PDEs. The findings challenge simple hybridization of physics and data-driven losses, emphasizing the nuanced design required for high-fidelity scientific ML solvers. Progress in this domain underpins advances in simulation-based design, control, and discovery in scientific applications, and will catalyze further developments in the theory and practice of operator learning.
