
Knowledge-Distilled PINNs for Real-Time PDE Solvers

Updated 22 December 2025
  • KD-PINNs form a teacher–student framework that transfers the high physical accuracy of overparameterized models to compact, inference-efficient student networks.
  • They combine physics-informed losses with MSE-based distillation to achieve sub-10 ms CPU latency while preserving accurate PDE solutions.
  • Empirical benchmarks, including the Black–Scholes and Burgers equations, show up to a 6.9× speedup with minimal RMSE increase, enabling real-time scientific computing.

Knowledge-Distilled Physics-Informed Neural Networks (KD-PINN) are a neural surrogate modeling paradigm for partial differential equations (PDEs) that employs knowledge distillation to train a compact, inference-efficient student network under the simultaneous supervision of physics-informed losses and guidance from a higher-capacity teacher PINN. The core motivation is to transfer the physical accuracy of overparameterized PINNs to smaller models, achieving ultra-low-latency (sub-10 ms) inference suitable for real-time scientific computing, while preserving the underlying physical fidelity.

1. Framework and Methodological Foundations

The KD-PINN framework is structured around a teacher–student paradigm tailored to the physics-informed neural network (PINN) context (Bounja et al., 15 Dec 2025). A teacher PINN (a high-capacity, fully connected neural network) is first trained by minimizing a composite objective comprising PDE residual losses and, where applicable, observational/initial/boundary value losses:

$$L_{\rm Teacher}(\theta_T) = L_{\rm PDE}(\theta_T) + \lambda_{\rm BC}\, L_{\rm BC}(\theta_T) + \lambda_T\, L_T(\theta_T)$$

where, for interior collocation points $\{x_i\}_{i=1}^{N_f}$,

$$L_{\rm PDE}(\theta) = \frac{1}{N_f}\sum_{i=1}^{N_f} \left| \mathcal{N}[u_\theta](x_i) \right|^2$$

and the data loss and boundary/terminal condition terms parallel standard PINN formulations.

A reduced-complexity student PINN is then trained to jointly minimize

$$L_{\rm Student}(\phi) = \lambda_{\rm PDE}\, L_{\rm PDE}(\phi) + \lambda_{\rm data}\, L_{\rm data}(\phi) + \lambda_{\rm KD}\, L_{\rm KD}(\phi)$$

The knowledge distillation loss is given by a mean-squared error (MSE) between student and teacher outputs at randomly sampled auxiliary locations:

$$L_{\rm KD}(\phi) = \frac{1}{N_k} \sum_{k=1}^{N_k} \left| u_S(x_k; \phi) - u_T(x_k; \theta_T) \right|^2$$

Alternative formulations using soft targets (KL divergence or tempered softmaxes) are also possible, but the MSE surrogate is found appropriate in all tested PDEs. This loss provides direct function-value supervision, complementing the sometimes sparse or high-variance signals from physics-based gradients.
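The sketch below shows, in PyTorch, how such a student objective could be assembled. The layer sizes follow the Black–Scholes configuration reported in Section 2, but the 1D viscous Burgers residual, the loss weights, and all helper names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    # Plain fully connected network with tanh activations.
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.Tanh())
    return nn.Sequential(*layers)

teacher = mlp([2, 50, 50, 50, 1])  # assumed pre-trained on the physics loss, then frozen
student = mlp([2, 20, 20, 20, 1])  # reduced-complexity student

def burgers_residual(net, xt, nu=0.01 / torch.pi):
    # PDE residual N[u] = u_t + u*u_x - nu*u_xx at collocation points xt = (x, t).
    xt = xt.clone().requires_grad_(True)
    u = net(xt)
    du = torch.autograd.grad(u, xt, torch.ones_like(u), create_graph=True)[0]
    u_x, u_t = du[:, :1], du[:, 1:]
    u_xx = torch.autograd.grad(u_x, xt, torch.ones_like(u_x), create_graph=True)[0][:, :1]
    return u_t + u * u_x - nu * u_xx

def student_loss(x_f, x_d, u_d, x_k, lam_pde=1.0, lam_data=1.0, lam_kd=1.0):
    l_pde = burgers_residual(student, x_f).pow(2).mean()   # physics residual term
    l_data = (student(x_d) - u_d).pow(2).mean()            # boundary/initial/data term
    with torch.no_grad():
        u_teacher = teacher(x_k)                           # frozen teacher targets
    l_kd = (student(x_k) - u_teacher).pow(2).mean()        # MSE distillation term
    return lam_pde * l_pde + lam_data * l_data + lam_kd * l_kd
```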

Hyperparameter schedules (for the distillation weight $\lambda_{\rm KD}$ or temperature $T$) may be employed, and curriculum learning strategies (e.g., ramping the physics-loss weight) are shown to improve student extrapolation in out-of-distribution settings (the KD-PINN$^+$ variant).
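A minimal sketch of one such schedule, assuming a linear warm-up of the physics-loss weight over a fixed number of steps (the exact schedule used in the paper is not reproduced here):

```python
def physics_weight(step, warmup_steps=5_000, lam_max=1.0):
    # Linearly ramp lambda_PDE from 0 to lam_max over warmup_steps iterations.
    return lam_max * min(1.0, step / warmup_steps)
```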

2. Implementation and Training Protocols

Standard settings employ plain multilayer perceptrons with tanh or SiLU nonlinearities. For example, in the Black–Scholes benchmark, the teacher is a [2, 50, 50, 50, 1] MLP and the student is [2, 20, 20, 20, 1], with similar scaling across other canonical PDEs (e.g., viscous Burgers, Allen–Cahn, and 2D Navier–Stokes) (Bounja et al., 15 Dec 2025). Physics-informed losses are computed at large batches of collocation and boundary/initial points, while the distillation set is sampled independently in the same domain.

Optimization employs the Adam optimizer, with learning rates of $10^{-3}$–$10^{-4}$ and batch sizes of $10^3$–$10^4$ for collocation and distillation points. Teacher models undergo full physics-based training, while students are trained by jointly minimizing the PDE, data, and distillation losses. In the enhanced KD-PINN$^+$ setting, residual-weighted collocation sampling and curriculum physics schedules are used to improve robustness outside the training domain.
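A training-loop sketch under these settings, reusing the networks, `student_loss`, and `physics_weight` from the earlier sketches; `sample_collocation` and `sample_data` are hypothetical samplers over the PDE domain, and the step count and batch sizes are illustrative:

```python
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(20_000):
    x_f = sample_collocation(4_096)        # interior collocation batch
    x_d, u_d = sample_data(1_024)          # boundary/initial-condition batch
    x_k = sample_collocation(4_096)        # independent distillation batch
    loss = student_loss(x_f, x_d, u_d, x_k,
                        lam_pde=physics_weight(step))  # curriculum ramp on the physics term
    opt.zero_grad()
    loss.backward()
    opt.step()
```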

3. Performance: Latency, Accuracy, and Speedup Regimes

KD-PINNs achieve ultra-low-latency inference (sub-10 ms) on CPU for a wide class of PDEs. Empirical results demonstrate speed-ups between $4.8\times$ (Navier–Stokes, latency-optimized student) and $6.9\times$ (1D Burgers) relative to their teacher PINN baselines (Bounja et al., 15 Dec 2025). For example, on the Black–Scholes PDE:

| Model   | RMSE    | rel-$L_2$ | Latency (ms) | Speed-up |
|---------|---------|-----------|--------------|----------|
| Teacher | 2.29e-3 | 1.02e-2   | 48.87        | —        |
| Student | 2.29e-3 | 1.02e-2   | 7.22         | 6.77×    |

The distilled student matches or improves upon the accuracy of the teacher in-domain (RMSE change $<0.1\%$ in Black–Scholes; $<1\%$ in the latency-optimized Navier–Stokes case) despite the drastic model compression. In all tested cases, the mean RMSE increase stays below $0.64\%$ (Bounja et al., 15 Dec 2025).

For Burgers' equation, the teacher achieves a latency of $32.1$ ms with $3.5\times 10^{-2}$ RMSE, versus $4.6$ ms and $4.1\times 10^{-2}$ RMSE for the student; for Allen–Cahn, the student achieves a $4.87\times$ speed-up with only a $10\%$ RMSE increase.

4. Theoretical and Empirical Analysis of Latency Reduction

The latency reduction stems from the reduced FLOP count of the student vis-à-vis the teacher:

  • If the student shrinks hidden widths from $d_\ell$ to $d'_\ell$, the FLOP ratio per forward pass is $R_{\rm FLOPs} = \left(\sum_\ell d_{\ell-1} d_\ell\right) / \left(\sum_\ell d'_{\ell-1} d'_\ell\right)$, approximately $6$ in the reported cases (a worked check follows this list).
  • Taking into account Amdahl’s law (non-compute overheads) and the roofline model (arithmetic intensity), the practical speedup is bounded by $1/f \approx 20$ for $f = 5\%$ overhead, and by $R_{\rm FLOPs}\cdot\min(1, {\rm AI}_S / {\rm AI}_T)$. Empirical speedups of $6.8$–$6.9\times$ already approach this regime.
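A worked check of this ratio for the Black–Scholes teacher/student sizes quoted in Section 2, counting only the dense multiply–accumulates of a forward pass (bias adds and activations ignored):

```python
def dense_flops(sizes):
    # Sum of d_{l-1} * d_l over all linear layers.
    return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))

teacher_flops = dense_flops([2, 50, 50, 50, 1])  # 100 + 2500 + 2500 + 50 = 5150
student_flops = dense_flops([2, 20, 20, 20, 1])  # 40 + 400 + 400 + 20 = 860
print(teacher_flops / student_flops)             # ~5.99, i.e. R_FLOPs ≈ 6
```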

Student latency is further minimized by favoring TorchScript or torch.compile-optimized kernels and by batching inference over large input sets (e.g., $>10^4$ points). These optimizations aggregate to an average inference latency of $5.3$ ms on CPU for distilled models, qualifying as “ultra-low-latency” for real-time domains (Bounja et al., 15 Dec 2025).
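A minimal sketch of this inference path, assuming the trained `student` from the Section 1 sketch; scripting versus `torch.compile` and the query-set size are illustrative choices:

```python
import time
import torch

scripted = torch.jit.script(student.eval())   # or: torch.compile(student)
x_eval = torch.rand(20_000, 2)                # >10^4 query points in a single batch
with torch.inference_mode():
    scripted(x_eval)                          # warm-up call
    t0 = time.perf_counter()
    u = scripted(x_eval)
    latency_ms = (time.perf_counter() - t0) * 1e3
```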

5. Training Dynamics, Regularization, and Robustness

The KD-PINN training protocol reveals substantial regularization benefits from distillation. The distillation loss $L_{\rm KD}$ initially dominates and provides a strong, low-variance gradient, accelerating student convergence and smoothing the training landscape relative to direct physics-informed training (which exhibits sign oscillations and frequent loss spikes due to high-order derivative terms). Quantitative loss traces show a $10^3\times$ decay in $L_{\rm KD}$ over the first $10^3$ iterations, with monotonic student loss decay and suppressed variance.

Crucially, student residuals and errors are largely decorrelated from the teacher's on the test domain ($\rho \approx -0.18$), indicating both error correction and filtering of spurious teacher errors. Further, student predictions achieve $R^2 > 0.999$ against analytic solutions.
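Such a decorrelation can be checked as below, assuming a test grid `x_test` and an analytic reference `u_exact` (both hypothetical placeholders):

```python
with torch.inference_mode():
    e_teacher = (teacher(x_test) - u_exact).flatten()
    e_student = (student(x_test) - u_exact).flatten()
# Pearson correlation between teacher and student pointwise errors (the paper reports ≈ -0.18).
rho = torch.corrcoef(torch.stack([e_teacher, e_student]))[0, 1]
```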

Enhanced KD-PINN$^+$ protocols integrating out-of-domain residual sampling, Huber losses, and curriculum physics schedules achieve an $80$–$94\%$ reduction in relative extrapolation error versus the plain teacher (Bounja et al., 15 Dec 2025).
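A hedged sketch of a KD-PINN$^+$-style distillation term, swapping the MSE for a Huber loss and drawing auxiliary points from an enlarged (out-of-domain) box; the `delta` value and the sampling range are assumptions:

```python
huber = nn.HuberLoss(delta=0.1)

def kd_plus_loss(n_points=4_096, lo=-1.5, hi=1.5):
    # Sample auxiliary points beyond the training box and distill with a robust loss.
    x_out = lo + (hi - lo) * torch.rand(n_points, 2)
    with torch.no_grad():
        u_t = teacher(x_out)
    return huber(student(x_out), u_t)
```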

6. Interpretations, Limitations, and Position in the Neural PDE Ecosystem

KD-PINNs constitute a non-intrusive, workflow-agnostic approach to compressing generic PINN models, immediately compatible with any physics-informed loss, boundary/initial condition setup, and neural architecture (given an appropriate teacher model). In comparative context, the speedups are of the same order as those achieved by specialized low-rank surrogates (e.g., LordNet: $40\times$ over classical solvers (Huang et al., 2022)), latent-space autoencoders (LNO, CALM-PDE, LE-PDE), and RNN-based Neural-PDE surrogates (Hu et al., 2020), though KD-PINNs maintain the physics-informed directness and flexibility of standard PINNs.

The main limitations are inherited from PINN methodology: performance may degrade for very stiff or high-dimensional problems, and training requires access to a well-trained teacher. Asymptotic accuracy remains bounded by that of the teacher unless regularization and out-of-domain correction are supplied.

7. Summary Table: KD-PINN Benchmarks

| PDE | Teacher RMSE | Student RMSE | Teacher Latency (ms) | Student Latency (ms) | Speedup |
|---|---|---|---|---|---|
| Black–Scholes | $2.29\times 10^{-3}$ | $2.29\times 10^{-3}$ | $48.9$ | $7.2$ | $6.77\times$ |
| Burgers | $3.49\times 10^{-2}$ | $4.16\times 10^{-2}$ | $32.1$ | $4.6$ | $6.92\times$ |
| Allen–Cahn | $9.13\times 10^{-2}$ | $1.00\times 10^{-1}$ | $25.7$ | $5.3$ | $4.87\times$ |
| Navier–Stokes | $1.31\times 10^{-1}$ | $1.30\times 10^{-1}$ | $19.4$ (lat-opt) | $4.1$ (lat-opt) | $4.76\times$ |

All results from (Bounja et al., 15 Dec 2025).


KD-PINNs demonstrate that knowledge distillation can compress the inference workload of PINNs enough to achieve sub-10 ms CPU latencies while preserving high-fidelity physical accuracy, with robust regularizing effects and scalability across canonical PDE models. They offer a practical pathway for deploying real-time neural PDE solvers within a physics-informed machine learning ecosystem.
