Physics-Guided Transformer (PGT)

Updated 5 June 2026

Physics-Guided Transformer (PGT) is a neural architecture that embeds PDE-based attention, integrating Green’s function biases to enforce physical laws.
It employs physics-aware self-attention, FiLM-modulated sinusoidal decoders, and adaptive loss weighting to enhance stability and data efficiency.
Empirical results show substantial error reductions relative to PINN and SIREN baselines on benchmarks such as the heat equation and Navier–Stokes.

A Physics-Guided Transformer (PGT) is a neural architecture that integrates the mathematical structure of governing physical laws—especially those expressed as partial differential equations (PDEs)—directly into the attention mechanism, representation, and training objectives of transformer-based models. This design enables physically consistent field reconstruction, enhanced stability and generalization under severe data sparsity, and physically meaningful inductive biases that are unobtainable by architectures relying solely on soft PDE penalty terms or purely data-driven representations (Zeraatkar et al., 30 Mar 2026).

1. Architectural Foundations: Embedding Physics into Self-Attention

The core architectural innovation of PGT is the replacement of standard content-only attention with physics-aware attention. In the canonical formulation for parabolic PDEs (e.g., the heat equation), the self-attention logit between two context tokens is augmented by the log of the Green's function (the heat kernel) associated with the PDE:

$\Gamma_{ij} = \log G(x_i - x_j, t_i - t_j) = -\frac{\|x_i - x_j\|^2}{4 \alpha \Delta t_{ij}} - \frac{d}{2} \log(4\pi \alpha \Delta t_{ij})$

where $\Delta t_{ij} = t_i - t_j$, $\alpha$ is the diffusivity, and $d$ is the spatial dimension. Strict causality is enforced by setting $\Gamma_{ij} = -\infty$ for $\Delta t_{ij} \le 0$. The self-attention thus becomes:

$\text{Attn}(Q,K,V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}} + \Gamma\right) V$

This mechanism ensures that only causally valid, physically local contributions (in the diffusion sense) are emphasized, embedding diffusion locality and temporal causality at the representational level. In the limit of infinite diffusivity, attention becomes uniform; in the vanishing diffusivity regime, attention recovers a Dirac kernel (Zeraatkar et al., 30 Mar 2026).
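The bias and the biased softmax can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `heat_kernel_log_bias` and `physics_attention` are hypothetical, tokens are plain Python lists, and a query with no causally admissible key (for example, the earliest token) is given a zero output, a convention the source does not specify.

```python
import math

def heat_kernel_log_bias(xs, ts, alpha):
    """Gamma[i][j] = log G(x_i - x_j, t_i - t_j) for the heat kernel.

    xs: list of d-dimensional coordinate tuples; ts: list of times.
    Entries with t_i - t_j <= 0 stay at -inf (strict causality).
    """
    n, d = len(xs), len(xs[0])
    gamma = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dt = ts[i] - ts[j]
            if dt > 0:
                r2 = sum((a - b) ** 2 for a, b in zip(xs[i], xs[j]))
                gamma[i][j] = (-r2 / (4 * alpha * dt)
                               - 0.5 * d * math.log(4 * math.pi * alpha * dt))
    return gamma

def physics_attention(Q, K, V, gamma):
    """softmax(Q K^T / sqrt(d_k) + Gamma) V, computed row by row."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + gamma[i][j]
                  for j, k in enumerate(K)]
        m = max(logits)
        if m == -math.inf:
            # Assumed convention: no admissible key -> zero output vector.
            out.append([0.0] * len(V[0]))
            continue
        w = [math.exp(l - m) for l in logits]  # exp(-inf) evaluates to 0.0
        s = sum(w)
        out.append([sum(w[j] / s * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out

# Three 1D tokens: two at t = 0 and one later token that attends to both.
xs = [(0.0,), (0.1,), (1.0,)]
ts = [0.0, 0.0, 1.0]
gamma = heat_kernel_log_bias(xs, ts, alpha=0.1)
```

With zero queries and keys, the attention weights of the last token are proportional to the heat kernel itself, so the spatially closer earlier token receives the larger weight.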

2. Context-Conditioned Implicit Decoding: FiLM-Modulated SIREN

Decoding in PGT uses a sinusoidal-activated implicit neural representation (SIREN) whose spectral characteristics are adaptively controlled. For any query coordinate $(x, t)$, the model first encodes it with cross-attention against the physics-conditioned context tokens. The resulting feature vector $g(q)$, together with the global context token, modulates the SIREN decoder via Feature-wise Linear Modulation (FiLM): at every hidden layer,

$h_{\ell+1} = \sin \left( \omega_{\ell} \odot ( \alpha_{\ell} \odot z_\ell + \beta_\ell ) \right)$

where $(\alpha_\ell, \beta_\ell, \omega_\ell)$ are learned from the local-global context. This allows spatially adaptive tuning of amplitude, shift, and effective frequency response, facilitating representation of both smooth and high-frequency field content (Zeraatkar et al., 30 Mar 2026).
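A single FiLM-modulated sine layer can be sketched as follows. This is an illustration under assumptions rather than the paper's code: the source does not define $z_\ell$, so it is taken here to be the usual affine pre-activation $W_\ell h_\ell + b_\ell$, and the names `linear` and `film_sine_layer` are hypothetical. In the full model the modulation vectors would be produced from the local-global context; here they are passed in directly.

```python
import math

def linear(W, b, h):
    """Affine map W h + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]

def film_sine_layer(h, W, b, alpha, beta, omega):
    """h_next = sin(omega * (alpha * z + beta)), applied elementwise.

    z = W h + b is an assumed definition of the pre-activation z_l.
    alpha scales, beta shifts, omega sets the effective frequency.
    """
    z = linear(W, b, h)
    return [math.sin(w * (a * zi + s))
            for zi, a, s, w in zip(z, alpha, beta, omega)]

# Identity weights make the effect of the modulation easy to read off.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
h = [0.5, -0.25]

plain = film_sine_layer(h, W, b, [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
modulated = film_sine_layer(h, W, b, [2.0, 1.0], [0.1, 0.0], [3.0, 30.0])
```

With unit scale, zero shift, and unit frequency the layer reduces to a plain sine activation, while a large `omega` entry raises the frequency of that channel only.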

3. Training Objectives and Automatic Loss Reweighting

PGT addresses the gradient imbalance and stability issues typical of Physics-Informed Neural Networks (PINNs), which traditionally aggregate data-fit, PDE residual, boundary, and initial condition losses using hand-tuned scalar weights, by introducing a learned uncertainty weight $\sigma_k$ for each loss component. A standard form of such an uncertainty-weighted objective is

$\mathcal{L} = \sum_{k} \left( \frac{1}{2\sigma_k^2} \mathcal{L}_k + \log \sigma_k \right), \quad k \in \{\text{data}, \text{PDE}, \text{BC}, \text{IC}\}$

where $\mathcal{L}_{\text{data}}$, $\mathcal{L}_{\text{PDE}}$, $\mathcal{L}_{\text{BC}}$, and $\mathcal{L}_{\text{IC}}$ correspond to observational, equation-residual, boundary, and initial condition losses, respectively. As the $\sigma_k$ are trainable, the model adaptively balances gradient magnitudes and eliminates the need for manual $\lambda$-weight selection (Zeraatkar et al., 30 Mar 2026).
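The balancing effect of learned uncertainty weights can be seen in a short sketch. It assumes the common homoscedastic-uncertainty form $\sum_k \left( \frac{1}{2\sigma_k^2} \mathcal{L}_k + \log \sigma_k \right)$, parameterized by $s_k = \log \sigma_k$ for numerical stability; the exact expression used in PGT may differ, and the function names are hypothetical.

```python
import math

def uncertainty_weighted_loss(losses, log_sigmas):
    """Sum over k of 0.5 * exp(-2 s_k) * L_k + s_k, with s_k = log(sigma_k)."""
    return sum(0.5 * math.exp(-2.0 * s) * L + s
               for L, s in zip(losses, log_sigmas))

def grad_log_sigma(loss_k, s_k):
    """Derivative of one term with respect to s_k: 1 - exp(-2 s_k) * L_k."""
    return 1.0 - math.exp(-2.0 * s_k) * loss_k

# Four components: data, PDE residual, boundary, initial condition.
losses = [1.0, 2.0, 3.0, 4.0]
total = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0, 0.0])

# The gradient vanishes where sigma_k^2 equals L_k, i.e. s_k = 0.5 * log(L_k).
s_opt = 0.5 * math.log(4.0)
g_opt = grad_log_sigma(4.0, s_opt)
```

Under this form the stationary point of each weight satisfies $\sigma_k^2 = \mathcal{L}_k$, so components with larger losses are automatically down-weighted while the $\log \sigma_k$ term prevents the weights from collapsing to zero.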

4. Empirical Results and Ablations

PGT achieves state-of-the-art performance under severe data constraints on canonical benchmark problems:

1D Heat Equation: with a small set of randomly sampled points,
- PGT attains a low PDE residual together with a relative L2 error 38× better than PINN and 90× better than SIREN.
- The error decays monotonically during training, in contrast with PINN/SIREN plateaus.

2D Incompressible Navier–Stokes (Cylinder Wake): with a limited number of scattered points,
- PGT attains both a low relative L2 error and a low PDE residual.
- Competing methods fail on either the data fit or the residual under comparably limited budgets.
- Ablations show: removing the physics bias $\Gamma$ degrades accuracy; omitting the explicit PDE loss triples the residual; disabling FiLM or sinusoidal activations causes significant performance loss.
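For reference, the two metrics quoted above can be computed as in the following sketch. The relative L2 error uses its standard definition $\|u_{\text{pred}} - u_{\text{true}}\|_2 / \|u_{\text{true}}\|_2$. The residual is evaluated here with central finite differences for the 1D heat equation $u_t = \alpha u_{xx}$, which is an assumption made for illustration: the source does not state how residuals are computed, and the function names are hypothetical.

```python
import math

def relative_l2(pred, true):
    """Relative L2 error between two equally long lists of field values."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den

def heat_residual_fd(u, x, t, alpha, h=1e-3):
    """Finite-difference estimate of u_t - alpha * u_xx at the point (x, t)."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h ** 2
    return u_t - alpha * u_xx

def exact_solution(x, t, alpha=0.1):
    """A known solution of u_t = alpha * u_xx, used as a sanity check."""
    return math.exp(-alpha * math.pi ** 2 * t) * math.sin(math.pi * x)

# The residual of an exact solution should be near zero (up to FD error).
res = heat_residual_fd(lambda x, t: exact_solution(x, t, 0.1), 0.3, 0.5, 0.1)
```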

These results confirm that the synergy of physics-guided attention, spectral FiLM modulation, and robust uncertainty-weighted training is essential to achieving both physical consistency and data fidelity in extreme data-sparse regimes (Zeraatkar et al., 30 Mar 2026).

5. Relationship to Other Physics-Guided Transformers

PGT shares a guiding philosophy with other operator-based and physics-injected transformer architectures but is distinguished by hard-coding the exact Green’s function kernel into the attention bias. In contrast:

- PGOT uses spectrum-preserving geometric attention and physics slicing but focuses on geometric detail preservation rather than explicit Green's function bias (Zhang et al., 29 Dec 2025).
- GeoTransolver performs physics-aware attention over mesh slices with persistent geometry/context injection, but does not explicitly hard-code PDE fundamental solutions (Adams et al., 23 Dec 2025).
- Physics-Guided Multimodal Transformers for weather and climate integrate physical knowledge at input, model, and output levels via loss constraints and projected embeddings, not via PDE kernel biases (Han et al., 19 Apr 2025).
- GAN-augmented Physics-Informed Transformers utilize adversarial selection of high-residual points and causal penalties but retain content-only or spatially masked attention (Zhang et al., 15 Jul 2025).

PGT thus provides a general template for incorporating physically meaningful attention biases in operator learning.

6. Limitations, Scalability, and Generalization

While PGT’s embedding of Green's function structure ensures strict compliance with diffusion or convection-dominated causality, the quadratic complexity of attention ($O(N^2)$ in the number of tokens $N$) and the size of the FiLM hypernetworks lead to substantial computational overhead relative to PINNs or operator networks such as FNOs. Sparse or hierarchical attention, kernel factorization, and mixed-precision training are identified as directions for reducing these barriers.
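A back-of-the-envelope calculation makes the quadratic cost concrete. The sketch below counts the bytes needed to store one dense $N \times N$ bias or attention matrix; the function name is hypothetical, and the figure is per matrix, so it assumes nothing about how many heads or layers the model uses.

```python
def attention_matrix_bytes(n_tokens, bytes_per_entry=4):
    """Bytes for one dense n_tokens x n_tokens matrix (float32 by default)."""
    return n_tokens * n_tokens * bytes_per_entry

# 10,000 context tokens in float32: 4 * 10^8 bytes, i.e. 400 MB per matrix.
full_precision = attention_matrix_bytes(10_000)

# Half precision (2 bytes per entry) halves the footprint.
half_precision = attention_matrix_bytes(10_000, bytes_per_entry=2)
```

Doubling the number of tokens quadruples this figure, which is why sparse or factorized alternatives become attractive for large domains.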

Extension of the PGT framework to turbulent, multiphysics, hyperbolic, and stochastic PDEs is mathematically tractable (e.g., by exchanging the heat kernel for the wave-equation kernel to enforce light-cone causality), but empirical validation in highly nonlinear, high-dimensional, or noisy regimes is an open area. Similarly, integrating high-resolution outputs from Earth System Models or large unstructured domains will require memory-efficient, possibly distributed architectural modifications (Zeraatkar et al., 30 Mar 2026).
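The wave-kernel substitution mentioned above can be illustrated in one spatial dimension, where the retarded Green's function of $u_{tt} = c^2 u_{xx}$ is $G(x, t) = \frac{1}{2c} H(ct - |x|)$, with $H$ the Heaviside step function. The sketch is an assumption-laden illustration of the idea, not something taken from the source: the function name is hypothetical, and only the 1D case is covered.

```python
import math

def wave_kernel_log_bias_1d(xs, ts, c):
    """Gamma[i][j] = log G(x_i - x_j, t_i - t_j) for the 1D wave equation.

    Inside the light cone (dt > 0 and |dx| <= c * dt) the bias is the
    constant log(1 / (2c)); everywhere else it is -inf.
    """
    n = len(xs)
    inside = -math.log(2.0 * c)
    gamma = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dt = ts[i] - ts[j]
            if dt > 0 and abs(xs[i] - xs[j]) <= c * dt:
                gamma[i][j] = inside
    return gamma

# The token at (x=0.5, t=1) can see (x=0, t=0) but not (x=2, t=0) when c = 1.
gamma_wave = wave_kernel_log_bias_1d([0.0, 2.0, 0.5], [0.0, 0.0, 1.0], c=1.0)
```

Unlike the heat kernel, this bias is flat inside the cone, so it acts purely as a causal mask and leaves the ranking of admissible tokens to the content term.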

7. Conceptual Impact and Future Directions

PGT reframes the core principle of physics-informed machine learning: rather than treat governing equations as soft loss terms appended during training, it embeds the fundamental structure of PDEs as architectural priors in the attention mechanism. This leads to models that are more stable, generalizable, and physically transparent than those trained with unconstrained or weakly regularized objectives. A plausible broader implication is that bridging architectural design and PDE theory in this fashion will allow neural surrogate models to serve as reliable scientific simulators and experiment-guidance tools, even in extreme data-scarce regimes. Moreover, as demonstrated in related work on physics foundation models and geometry-aware transformers, expansion of these principles to 3D, multi-physics, and multimodal workflows is under active investigation (Zeraatkar et al., 30 Mar 2026, Zhang et al., 29 Dec 2025, Han et al., 19 Apr 2025, Wiesner et al., 17 Sep 2025).