Physics-Informed Multimodal Foundation Model
- PI-MFM is a framework that integrates multimodal data encoding with physics-aware objectives to create universal surrogates for complex physical systems.
- The architecture leverages dual spatial-spectral tokenization, cross-modal fusion with FiLM conditioning, and state-space backbones for efficient PDE operator learning.
- Combined training objectives—including PDE residuals, boundary conditions, and data-fit losses—yield robust zero-shot transfer and heightened accuracy in sparse or noisy data regimes.
A Physics-Informed Multimodal Foundation Model (PI-MFM) is an architectural and training paradigm for universal surrogates of physical systems, emphasizing both multimodal data integration and explicit enforcement of governing physical laws during pretraining and adaptation. PI-MFM frameworks generalize classical data-driven operator-learning by incorporating physics-aware objectives and specialized token fusion mechanisms. Recent instantiations, particularly PDE-FM and related models, target scalable and data-efficient modeling of heterogeneous partial differential equation (PDE) domains, enabling robust transfer and zero-shot generalization, especially in regimes with sparse or noisy supervision (Zhu et al., 28 Dec 2025, Soares et al., 26 Nov 2025).
1. Architectural Components and Input Encoding
PI-MFM architectures combine modular multimodal encoding with symbolic physics integration. The pipeline typically comprises:
- Dual Spatial–Spectral Tokenization: Input fields are normalized and projected by a shallow convolutional stem into a shared latent space. Spatial patches are encoded by shallow ConvNets and projected to tokens $z_i$. Spectral content is captured by a truncated FFT per channel, stacking global low-frequency information; mathematically, the truncation retains only the low-wavenumber modes $\{\mathcal{F}(u)_{\mathbf{k}} : \|\mathbf{k}\|_\infty \le k_{\max}\}$.
- Physics-Aware Conditioning: Boundary-condition and physical metadata are injected via FiLM: small MLPs map the metadata to scale and shift vectors $(\gamma, \beta)$ that modulate each patch token, $z_i \mapsto \gamma \odot z_i + \beta$ (a minimal sketch follows this list).
- Cross-Modal Fusion and Dual Encoder: Spatial and spectral streams are encoded separately (ConvNeXt blocks, MLPs) and fused through bidirectional cross-attention.
- State-Space Backbone (Mamba): Tokens traverse layers of a linear-time state-space model (MambaLayer), expressed as discretized linear ODEs of the form $h_k = \bar{A} h_{k-1} + \bar{B} x_k$, $y_k = C h_k$, achieving $\mathcal{O}(N)$ compute per layer in the token count $N$ (Soares et al., 26 Nov 2025).
- Operator-Theoretic Decoder (FNO Head): Final tokens are reshaped and upsampled onto the original grid, with a Fourier Neural Operator layer projecting to the target physical field $u$, ensuring smoothness and global coherence.
- Symbolic PDE Encoding (for PINN-Style PI-MFM): PDEs are encoded as prefix-notation token sequences, e.g., `add u_t mul q u_x` for $u_t + q\,u_x = 0$, parsed into expression trees for loss assembly. Symbol and data streams are cross-attended before decoding (Zhu et al., 28 Dec 2025).
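As a minimal illustration of the FiLM conditioning step referenced above, the following PyTorch sketch (module and dimension names are hypothetical, not the published implementation) maps a metadata vector to per-channel scale/shift parameters and applies them to patch tokens:

```python
import torch
import torch.nn as nn

class FiLMConditioner(nn.Module):
    """Produce per-channel scale/shift from physics metadata and apply it to tokens."""
    def __init__(self, meta_dim: int, token_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(meta_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 2 * token_dim),   # concatenated (gamma, beta)
        )

    def forward(self, tokens: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, token_dim); meta: (batch, meta_dim)
        gamma, beta = self.mlp(meta).chunk(2, dim=-1)
        # Broadcast over the patch dimension: z -> gamma * z + beta
        return gamma.unsqueeze(1) * tokens + beta.unsqueeze(1)

# Toy usage: modulate 196 patch tokens with an 8-dim metadata vector (e.g., BC type, viscosity).
film = FiLMConditioner(meta_dim=8, token_dim=128)
z = film(torch.randn(4, 196, 128), torch.randn(4, 8))
```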
2. Physics-Informed Training Objectives and Loss Construction
PI-MFM leverages composite objective functions combining data-fit and physics-prior terms:
- PDE Residual Loss: At collocation points $\{(t_m, x_m)\}_{m=1}^{M}$, residuals are computed by applying the parsed PDE operator $\mathcal{F}$ to model predictions $u_\theta$: $\mathcal{L}_{\mathrm{res}} = \frac{1}{M}\sum_{m=1}^{M} \big|\mathcal{F}[u_\theta](t_m, x_m)\big|^2$.
- Initial/Boundary Condition Losses: Enforce correct field values and derivatives at initial/boundary points, e.g., $\mathcal{L}_{\mathrm{ic}} = \frac{1}{N_{\mathrm{ic}}}\sum_{j} \big|u_\theta(0, x_j) - u_0(x_j)\big|^2$.
- Data-Fit Loss: Where solution samples are available, a standard mean-squared-error term is used: $\mathcal{L}_{\mathrm{data}} = \frac{1}{N_{\mathrm{data}}}\sum_{k} \big|u_\theta(t_k, x_k) - u(t_k, x_k)\big|^2$.
- Total Training Objective: A linear combination $\mathcal{L} = \lambda_{\mathrm{res}}\mathcal{L}_{\mathrm{res}} + \lambda_{\mathrm{ic}}\mathcal{L}_{\mathrm{ic}} + \lambda_{\mathrm{bc}}\mathcal{L}_{\mathrm{bc}} + \lambda_{\mathrm{data}}\mathcal{L}_{\mathrm{data}}$, with weights potentially cosine-annealed; in practice one reference weight is fixed to 1 and the others typically lie in the range 1–10. A minimal weighting sketch follows this list.
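A minimal sketch of the weighted combination (the weight values and the cosine schedule below are illustrative defaults, not the papers' exact settings):

```python
import math
import torch

def total_loss(l_res: torch.Tensor, l_ic: torch.Tensor, l_bc: torch.Tensor,
               l_data: torch.Tensor, w_res: float = 1.0, w_ic: float = 5.0,
               w_bc: float = 5.0, w_data: float = 1.0) -> torch.Tensor:
    """Linear combination of physics and data terms; each l_* is a scalar tensor."""
    return w_res * l_res + w_ic * l_ic + w_bc * l_bc + w_data * l_data

def cosine_annealed(w_max: float, step: int, total_steps: int) -> float:
    """Optional cosine schedule decaying a loss weight from w_max toward 0."""
    return 0.5 * w_max * (1.0 + math.cos(math.pi * step / total_steps))
```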
Multiple recent works employ a hybrid spatial–spectral loss, adding frequency-weighted penalties to enhance high-frequency fidelity (Soares et al., 26 Nov 2025); a sketch of such a penalty appears below. Physics invariants (conservation penalties), e.g., mass or energy, are optionally included.
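The following is a minimal sketch of a frequency-weighted spectral penalty; the specific weighting $(1 + \alpha |\mathbf{k}|)$ is an illustrative choice, not the exact published form:

```python
import torch

def spectral_weighted_loss(pred: torch.Tensor, target: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """Frequency-weighted L2 loss that penalizes errors more at high wavenumbers.

    pred, target: (batch, channels, H, W) real-valued fields.
    """
    err = torch.fft.rfft2(pred - target, norm="ortho")           # complex spectrum of the error
    ky = torch.fft.fftfreq(pred.shape[-2], device=pred.device)   # wavenumbers along H
    kx = torch.fft.rfftfreq(pred.shape[-1], device=pred.device)  # one-sided wavenumbers along W
    kmag = torch.sqrt(ky[:, None] ** 2 + kx[None, :] ** 2)       # (H, W//2 + 1)
    weight = 1.0 + alpha * kmag                                   # emphasize high frequencies
    return (weight * err.abs() ** 2).mean()
```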
3. Derivative Computation and Physics Loss Assembly
Automatic differentiation (AD) and finite-difference (FDM) schemes underpin loss construction:
- AD Strategy: Vectorized Jacobian–vector products (JVPs) produce all time/space derivatives required by the symbolic loss trees; the cost of forward-mode AD scales with the number of coordinate directions and derivative orders requested, while reverse-mode AD is prohibitive for batchwise collocation evaluation.
- FDM Strategy: Forward passes at shifted collocation points yield staggered grids; central-difference stencils then reconstruct the required first- and higher-order derivatives.
- Tradeoffs: AD is hyperparameter-free but memory-intensive; FDM demands explicit step-size tuning, balancing truncation error ($\mathcal{O}(h^2)$ for central stencils) against round-off error that grows as the step size shrinks. Empirically, FDM (float32) and AD (float16) achieve comparable relative errors but differ by roughly 2× in runtime (Zhu et al., 28 Dec 2025); see the combined sketch after the pseudocode below.
- Loss Assembly Pseudocode:
```
Input: batch {(u0_b, s_b)}_{b=1}^B, collocation points TX = {(t_m, x_m)}_{m=1}^M
1. U = G_θ(u0_b, s_b)(t_m, x_m) ∈ ℝ^{B×M}
2. For each derivative type d ∈ D: compute U_d (via AD or FDM)
3. Parse s_b → expression tree T_b
4. Evaluate T_b on [U, U_{u_t}, U_{u_x}, …] → residuals
5. Loss = mean_{b,m} (residual[b,m])^2
```
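The pseudocode above can be made concrete with a short, self-contained sketch. The surrogate below is a placeholder MLP standing in for $G_\theta$, the prefix vocabulary is reduced to a few operators, and derivatives are obtained with forward-mode JVPs via `torch.func.jvp`; all names are illustrative, not the reference implementation of Zhu et al. (28 Dec 2025).

```python
import torch
from torch.func import jvp

# --- Step 1: surrogate predictions at collocation points -------------------
# Placeholder for G_theta(u0, s): here just an MLP mapping (t, x) -> u.
surrogate = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)

def u_fn(tx: torch.Tensor) -> torch.Tensor:
    return surrogate(tx).squeeze(-1)      # (M,) predicted values

tx = torch.rand(256, 2)                   # collocation points (t_m, x_m)

# --- Step 2: derivatives via forward-mode JVPs (one pass per direction) ----
e_t = torch.zeros_like(tx); e_t[:, 0] = 1.0
e_x = torch.zeros_like(tx); e_x[:, 1] = 1.0
u, u_t = jvp(u_fn, (tx,), (e_t,))
_, u_x = jvp(u_fn, (tx,), (e_x,))

# --- Steps 3-4: parse the prefix-encoded PDE and evaluate the residual -----
BINARY = {"add": torch.add, "sub": torch.sub, "mul": torch.mul}

def parse_prefix(tokens: list):
    """Recursively build a nested-tuple expression tree from prefix tokens."""
    tok = tokens.pop(0)
    if tok in BINARY:
        return (tok, parse_prefix(tokens), parse_prefix(tokens))
    return tok                            # leaf: field name or numeric constant

def evaluate(tree, fields: dict) -> torch.Tensor:
    if isinstance(tree, tuple):
        op, lhs, rhs = tree
        return BINARY[op](evaluate(lhs, fields), evaluate(rhs, fields))
    return fields[tree] if tree in fields else torch.tensor(float(tree))

# Example PDE string: "add u_t mul 0.5 u_x"  ->  u_t + 0.5 * u_x = 0
tree = parse_prefix("add u_t mul 0.5 u_x".split())
residual = evaluate(tree, {"u": u, "u_t": u_t, "u_x": u_x})

# --- Step 5: mean-squared residual over collocation points -----------------
loss_res = residual.pow(2).mean()
```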
4. Pretraining, Transfer, and Adaptation
PI-MFM pretraining employs diverse, high-resolution physics datasets, typically spanning domains such as hydrodynamics (Navier–Stokes), reaction–diffusion, radiative turbulence, viscoelasticity, linear acoustics, and astrophysics (MHD). Datasets such as The Well include both 2D and 3D PDE regimes (Soares et al., 26 Nov 2025):
- Sampling Strategies: Batches are sampled with per-dataset weights chosen to balance contributions from datasets of very different sizes. Training utilizes both small and large batch/epoch configurations for ablation and full-scale comparison.
- Domain Adaptation: Transfer to new physics domains involves learning only new input/output adapters (lightweight convolutional layers) and adjusting the FiLM metadata injections. The backbone and tokenizer remain frozen, supporting rapid, architecture-invariant adaptation (Soares et al., 26 Nov 2025).
- Zero-Shot Physics-Informed Fine-Tuning: Pretrained models can be adapted to entirely new PDE families or regimes using only the physics residual and initial/boundary loss terms (i.e., $\lambda_{\mathrm{data}} = 0$). Empirically, zero-shot adaptation reaches low relative error within 3k gradient steps, outperforming physics-only training from scratch (Zhu et al., 28 Dec 2025). A sketch of the adapter-only transfer recipe follows this list.
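The adapter-only transfer recipe can be sketched as follows (a hypothetical minimal setup: the backbone is frozen and only fresh convolutional adapters, here 1×1 convolutions as an illustrative choice, receive gradients):

```python
import torch
import torch.nn as nn

def build_adapted_model(backbone: nn.Module, in_ch: int, out_ch: int,
                        latent_ch: int) -> nn.Module:
    """Wrap a frozen pretrained backbone with trainable input/output adapters."""
    for p in backbone.parameters():
        p.requires_grad_(False)                   # backbone and tokenizer stay frozen
    in_adapter = nn.Conv2d(in_ch, latent_ch, kernel_size=1)
    out_adapter = nn.Conv2d(latent_ch, out_ch, kernel_size=1)
    return nn.Sequential(in_adapter, backbone, out_adapter)

# Toy usage: nn.Identity() stands in for the pretrained PI-MFM backbone.
model = build_adapted_model(nn.Identity(), in_ch=3, out_ch=2, latent_ch=3)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```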
5. Experimental Benchmarks and Quantitative Evaluation
Empirical evaluation encompasses sparse, partial, and noisy-label regimes, as well as cross-domain robustness:
- Sparse-Label Supervision: PI-MFM substantially reduces test errors relative to data-only training on coarse label grids, with the largest gains at the lowest label resolutions (Zhu et al., 28 Dec 2025). Data-efficient function-pair learning reaches <1% error with far fewer labeled samples than the >20,000 required for data-only training.
- Physical Consistency: Weighted spectral losses enhance high-frequency accuracy. Conservation constraint terms (mass, energy) further regularize solutions.
- State-of-the-Art Comparison Table (Soares et al., 26 Nov 2025):
| Dataset | FNO | TFNO | CNextU-net | PhysiX | PI-MFM |
|---|---|---|---|---|---|
| Shear Flow | 1.1890 | 1.4720 | 0.8080 | 0.0700 | 0.0345 |
| Rayleigh–Bénard | 0.8395 | 0.6566 | 0.6699 | 0.1470 | 0.0415 |
| Turbulence Gravity Cooling | 0.2429 | 0.2673 | 0.2096 | — | 0.0796 |
| Acoustic Scattering | 0.5062 | 0.5057 | 0.0153 | 0.0960 | 0.0487 |
| Viscoelastic Instability | 0.7212 | 0.7102 | 0.2499 | 0.2370 | 0.5204 |
Ablation studies confirm performance gains from the Mamba backbone, spectral tokenization, cross-attention fusion, and FiLM conditioning (Soares et al., 26 Nov 2025). The primary failure mode remains viscoelastic turbulence, suggesting a need for additional inductive biases or explicit memory mechanisms.
6. Limitations, Trade-Offs, and Future Directions
Key limitations of PI-MFM include:
- Computational Cost: AD yields high memory/runtime demand; FDM requires step-size tuning. For high-dimensional or large-domain problems, physics residual evaluation overhead remains significant (Zhu et al., 28 Dec 2025, Soares et al., 26 Nov 2025).
- Domain Coverage: Most experiments focus on 1D time-dependent or periodic 2D/3D PDEs; extension to nonperiodic, complex boundary geometries remains an open challenge.
- Physics Representation: Integration of nonlocal, delayed, or elastic behaviors (e.g., viscoelastic turbulence) is suboptimal without specialized architectural components. Purely convolutional surrogates sometimes outperform in stationary or highly stiff regimes.
Future research directions include adaptive collocation (dynamic point resampling), higher-order gradient regularization, curriculum-based multi-physics pretraining, meta-learning of physics weights, and hybridization with traditional solvers (finite-element or mesh-based integration). Uncertainty quantification, federated/distributed training, and retrieval-augmented physics modules also present viable paths forward (Zhu et al., 28 Dec 2025, Soares et al., 26 Nov 2025, Farhadloo et al., 20 Feb 2025).
7. Theoretical Context and Extensions
PI-MFM generalizes the concept of Physics-Guided Foundation Models (PGFM) by fusing large-scale multimodal pretraining, physics-constrained loss regularization, and physics-aware architectural biases (Farhadloo et al., 20 Feb 2025). The use of symbolic PDE encoding and automatic physics-loss assembly builds on techniques from physics-informed neural networks (PINNs) and DeepONets but extends them to the multimodal, foundation-model scale. Product-of-Experts fusion for unsupervised disentanglement, as utilized in PIMA (Trask et al., 2022), can serve as a template for scalable embedding of scientific fingerprints.
A plausible implication is that PI-MFM models, by leveraging physics loss as regularizer and transfer enabler, mark a significant step toward universal, robust, and data-efficient multi-operator solvers for scientific discovery and simulation. However, practical deployment at full scale will require new advances in physics-token integration, operator decoding, and adaptive training across multi-domain, multi-modal regimes.