Descriptor-Based Physics-Informed ML
- Descriptor-based PIML is a framework that maps high-dimensional data to compact, physically interpretable descriptors, retaining key mechanistic insights.
- It integrates explicit physical laws and symmetries into ML architectures using techniques like PDE residual minimization and operator-algebraic enforcement.
- The approach enhances predictive efficiency and interpretability across domains such as condensed matter, fluid dynamics, and materials science.
Descriptor-based analysis in physics-informed machine learning (PIML) is a methodological paradigm in which physical phenomena are encoded using low-dimensional, interpretable sets of mathematical descriptors, which are then embedded within learning architectures that explicitly encode physical structure, symmetries, and conservation laws. This blending of compact, domain-tailored feature design with strong physical priors enables efficient, generalizable, and interpretable surrogate modeling across a broad spectrum of condensed matter, fluid, and materials systems.
1. Foundations of Descriptor-Based Physics-Informed ML
Descriptor-based PIML proceeds from the premise that high-dimensional raw data or simulations can be mapped to reduced, physically interpretable representations which retain essential mechanistic information. These descriptors can be geometric (e.g., local coordination, curvature), topological (e.g., ring statistics), statistical (e.g., accessible volume distributions), or symmetry-invariant polynomial features derived from group-theoretical expansions. Their judicious selection encodes the relevant physics, drastically constraining the phase space presented to machine learning regressors or classifiers.
In conjunction, the learning architecture—whether a neural surrogate or operator-based method—incorporates explicit forms of governing physical laws, often through penalization of PDE residuals, conservation law enforcement, or even strong imposition of operator algebraic structure. This framework allows the model to extrapolate to unobserved regimes, process domain-shifted data robustly, and yield predictions consistent with fundamental principles (Zhu et al., 17 Feb 2026, Rampal et al., 20 Aug 2025, Hawthorne et al., 2 Apr 2026, Zhang et al., 2022, Trask et al., 2020).
2. Construction and Mathematical Structure of Descriptors
Descriptor formulation is highly domain-specific but governed by the imperative to encode key driving physics at appropriate scales:
- Anatomical and flow surrogates: In vascular modeling, descriptors include centerline lumen area, curvature, stenosis, and time-resolved contrast metrics, assembled into a single descriptor vector (Zhu et al., 17 Feb 2026).
- Atomic/molecular environments: For battery materials and ionic transport, descriptors involve coordination numbers, accessible volume, tortuosity, and derived single-particle diffusivities (Rampal et al., 20 Aug 2025).
- Low-dimensional materials/topology: In 2D carbon networks, C2DTD concatenates local geometric statistics, a radial distribution function signature, and primitive ring-count fractions, producing a 70-dimensional, invariant vector capturing both short- and medium-range order and explicit network topology (Hawthorne et al., 2 Apr 2026).
- Group-theoretical invariants: In generalized force-field modeling, the descriptor is constructed via irreducible-representation expansion and computation of invariants (power spectrum, bispectrum coefficients) of local field neighborhoods, preserving required physical symmetries (Zhang et al., 2022).
- Exterior calculus on graphs: Metrics (e.g., learned mesh weights) are associated with combinatorial chains and cochains, encoding discrete analogues of gradient, divergence, and curl for field-theoretic ML (Trask et al., 2020).
Across all these, descriptors are constructed to be invariant under relevant group actions (rotations, translations, permutation of atom indices, internal symmetries), normalized to ensure comparability, and kept as compact as possible to minimize overfitting risk in low-data regimes.
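The invariance requirement can be checked directly on a toy example. The sketch below is illustrative only and is not taken from any of the cited works: the function names (`toy_descriptor`, `rigid_move`), the cutoff, the bin settings, and the 5-atom cluster are all assumptions. It builds a small descriptor from coordination numbers and a pairwise-distance histogram, then evaluates it on a rotated, translated, and re-indexed copy of the same cluster.

```python
import math

def coordination_numbers(points, cutoff):
    """Number of neighbors within `cutoff` of each point."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= cutoff)
        for i, p in enumerate(points)
    ]

def distance_histogram(points, bins, r_max):
    """Normalized histogram of pairwise distances (a crude radial signature)."""
    hist = [0] * bins
    n_pairs = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            hist[min(int(d / r_max * bins), bins - 1)] += 1
            n_pairs += 1
    return [h / n_pairs for h in hist]

def toy_descriptor(points, cutoff=1.2, bins=4, r_max=3.2):
    """Mean coordination number followed by the distance histogram."""
    cn = coordination_numbers(points, cutoff)
    return [sum(cn) / len(cn)] + distance_histogram(points, bins, r_max)

def rigid_move(points, angle, shift):
    """Rotate about the z-axis by `angle`, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in points]

# Hypothetical 5-atom cluster: a unit square with one atom above its center.
cluster = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),
           (0.0, 1.0, 0.0), (0.5, 0.5, 2.0)]

# Same cluster after a rigid motion, with the atom order reversed.
moved = list(reversed(rigid_move(cluster, 0.7, (3.0, -2.0, 1.0))))

print(toy_descriptor(cluster))
print(toy_descriptor(moved))
```

Both calls print the same vector because the descriptor depends only on interatomic distances and on order-independent aggregates. The descriptors in the cited works are considerably richer, but they rely on the same principle.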
3. Integration of Physical Laws and Symmetry
Physics-enforcement within the ML pipeline is accomplished through a variety of mechanisms:
- PDE-constrained Training: Composite loss functions combine standard data-fidelity terms, physics-based residuals penalizing violation of the governing equations (e.g., Navier-Stokes, conservation of mass and momentum), and boundary condition penalties. These are minimized over network parameters, ensuring surrogates are congruent with physical law (Zhu et al., 17 Feb 2026).
- Operator-algebraic enforcement: Neural operators and data-driven exterior calculus models are constructed so that conservation (e.g., $\nabla \cdot \mathbf{u} = 0$ for incompressibility, $d \circ d = 0$ for exterior calculus) is satisfied to machine precision, giving structural guarantees. Trainable metric parameters are introduced in the operator definitions, but the exactness of the sequence is preserved (Trask et al., 2020).
- Symmetry-preserving descriptors: Group-theoretical approaches guarantee invariance under point-group and internal symmetries, yielding local energies or forces that respect the structure of the original Hamiltonian (Zhang et al., 2022).
- Hybridization with computational physics: Descriptor-based input reduces the dimensionality, but the network is explicitly tasked with learning solution families consistent with the underlying physics across the descriptor space (e.g., different vessel geometries or local environments) (Zhu et al., 17 Feb 2026, Rampal et al., 20 Aug 2025).
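The composite loss used in PDE-constrained training can be sketched on a deliberately small problem. The example below is a minimal illustration under stated assumptions, not the setup of the cited works: it uses a 1D Poisson problem, $-u'' = f$ on $[0, 1]$ with zero boundary values, in place of Navier-Stokes; a one-parameter family `make_candidate` in place of a neural network; and a finite-difference second derivative in place of automatic differentiation. The weights `w_data`, `w_pde`, and `w_bc` are hypothetical names.

```python
import math

def second_derivative(u, x, h=1e-3):
    """Central finite-difference estimate of u''(x); a stand-in for autodiff."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h)

def composite_loss(u, f, data, collocation, w_data=1.0, w_pde=1.0, w_bc=1.0):
    """Weighted data, PDE-residual, and boundary terms for -u'' = f on [0, 1]."""
    loss_data = sum((u(x) - y) ** 2 for x, y in data) / len(data)
    loss_pde = sum((-second_derivative(u, x) - f(x)) ** 2
                   for x in collocation) / len(collocation)
    loss_bc = u(0.0) ** 2 + u(1.0) ** 2  # homogeneous Dirichlet conditions
    total = w_data * loss_data + w_pde * loss_pde + w_bc * loss_bc
    return total, (loss_data, loss_pde, loss_bc)

def source(x):
    """Right-hand side chosen so that u(x) = sin(pi x) is the exact solution."""
    return math.pi ** 2 * math.sin(math.pi * x)

def make_candidate(amplitude):
    """One-parameter surrogate family (assumed here instead of a network)."""
    return lambda x: amplitude * math.sin(math.pi * x)

exact = make_candidate(1.0)
data = [(x, exact(x)) for x in (0.25, 0.5, 0.75)]   # synthetic observations
collocation = [i / 10 for i in range(1, 10)]        # interior residual points

loss_good, parts_good = composite_loss(exact, source, data, collocation)
loss_bad, parts_bad = composite_loss(make_candidate(0.5), source, data, collocation)

print("exact amplitude:", loss_good)
print("wrong amplitude:", loss_bad, parts_bad)
```

The candidate with the wrong amplitude satisfies the boundary conditions yet is penalized heavily by the residual term, which is how the physics term steers training when observations are sparse.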
4. Architectures and Algorithms
Several distinct neural and operator network types are employed in descriptor-based PIML:
- Physics-Informed Neural Networks (PINNs): Feedforward MLPs (8–12 layers; 50–100 neurons) take concatenated descriptor and coordinate input, with outputs corresponding to field variables (e.g., velocity, pressure). The training objective enforces both data fit and PDE residual minimization (Zhu et al., 17 Feb 2026).
- Physics-Informed Neural Operators (PINOs/FNOs): These architectures "lift" descriptors and boundary conditions to high-dimensional latent spaces, process them via spectral (Fourier) convolutions, and project back to physical observables, facilitating rapid, geometry-agnostic predictions. The input channel structure is especially suited for families of problems parameterized by descriptors (Zhu et al., 17 Feb 2026).
- XGBoost and Random Forests: For environments where interpretability and inferential clarity are critical, gradient-boosted trees operating on compact descriptor vectors are favored, as in C2DTD modeling of 2D carbons (Hawthorne et al., 2 Apr 2026) and transport mode classification/regression in hard carbon (Rampal et al., 20 Aug 2025).
- Graph-based Neural Surrogates: In scientific domains mapped to discrete topology, learned metric matrices modulate discrete exterior calculus operators, with surrogate models trained via constrained optimization to guarantee conservation and exact sequences (Trask et al., 2020).
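The claim that trainable metrics can coexist with exact structure can be illustrated on a two-triangle mesh. The sketch below assumes one simple parametrization, diagonal positive weights applied as `diag(1/w_out) @ op @ diag(w_in)`, which is not necessarily the one used by Trask et al.; the mesh, the weight values, and the helper names are hypothetical. Because the edge weights cancel in the product, the weighted curl of the weighted gradient stays zero up to floating-point rounding for any positive weights.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def weighted(op, w_out, w_in):
    """diag(1 / w_out) @ op @ diag(w_in) for positive weight lists."""
    return [[op[i][j] * w_in[j] / w_out[i] for j in range(len(w_in))]
            for i in range(len(w_out))]

# Two triangles (0,1,2) and (0,2,3) sharing the edge (0,2).
# Oriented edges: e0=(0,1), e1=(1,2), e2=(0,2), e3=(2,3), e4=(0,3).
D0 = [[-1, 1, 0, 0],     # node-to-edge coboundary (discrete gradient)
      [0, -1, 1, 0],
      [-1, 0, 1, 0],
      [0, 0, -1, 1],
      [-1, 0, 0, 1]]
D1 = [[1, 1, -1, 0, 0],  # edge-to-face coboundary (discrete curl)
      [0, 0, 1, 1, -1]]

# Stand-ins for learned metric weights; any positive values would do.
w_nodes = [0.5, 2.0, 1.3, 0.7]
w_edges = [1.1, 0.4, 2.5, 0.9, 1.7]
w_faces = [0.8, 1.6]

GRAD = weighted(D0, w_edges, w_nodes)
CURL = weighted(D1, w_faces, w_edges)

# Largest entry of CURL @ GRAD; zero up to rounding regardless of the weights.
exactness_defect = max(abs(v) for row in matmul(CURL, GRAD) for v in row)
print(exactness_defect)
```

The design point is that the weights only rescale the combinatorial operators, so the identity `D1 @ D0 = 0` carries over to the weighted pair without any penalty term in the loss.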
5. Benchmarking, Metrics, and Validation
Performance assessment in descriptor-based PIML leverages both domain-specific and general metrics:
- Regression accuracy: MAE or RMSE of target observables against reference data (e.g., FFR differences on the order of 0.03, energy RMSE in C2DTD on the order of 0.14 eV) (Zhu et al., 17 Feb 2026, Hawthorne et al., 2 Apr 2026).
- Feature and mode classification: Cross-validated accuracy and feature importance scoring in diffusion-mode prediction or vacancy-type categorization (e.g., classification accuracy ∼89% in Na-transport) (Rampal et al., 20 Aug 2025).
- Calibration and uncertainty: Coverage probability for confidence intervals, ECE, and reliability curves to assess deployment readiness and quality control (Zhu et al., 17 Feb 2026).
- Domain generalizability: Pearson/Spearman coefficients for structure-transport/property relationships, validation over broad parameter sweeps (physical density, defect levels, etc.) (Rampal et al., 20 Aug 2025, Hawthorne et al., 2 Apr 2026).
- Computational efficiency: Inference latency is a key metric for clinical and high-throughput deployment (e.g., PINO on the order of 1 s vs. CFD on the order of 200 s for FFR) (Zhu et al., 17 Feb 2026).
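Several of the metrics listed above are short enough to write out. The sketch below gives plain-Python versions of MAE, RMSE, and expected calibration error (ECE) with equal-width confidence bins. The binning convention (left-closed bins) is an assumption, and all numbers are made-up illustrative values rather than results from the cited studies.

```python
import math

def mae(pred, ref):
    """Mean absolute error."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

def rmse(pred, ref):
    """Root-mean-square error."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted mean of |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:  # empty bins contribute nothing
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(accuracy - mean_conf)
    return ece

# Hypothetical surrogate predictions vs. reference values.
pred = [0.80, 0.75, 0.90]
ref = [0.82, 0.70, 0.90]
print(mae(pred, ref), rmse(pred, ref))

# Hypothetical classifier confidences and whether each prediction was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 1, 0, 1, 0]
print(expected_calibration_error(confidences, correct))
```

In this toy case the model is overconfident in both occupied bins (0.95 stated vs. 0.75 observed, 0.65 stated vs. 0.50 observed), and the ECE is the sample-weighted average of those two gaps.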
6. Interpretability, Generalizability, and Inductive Bias
A principal advantage of descriptor-based PIML lies in interpretability and mechanistic transparency:
- Interpretability: Descriptors map directly to physical attributes (e.g., stenosis, ring fractions, coordination numbers), enabling sensitivity and ablation analysis (e.g., a small set of topological descriptors dominates energy variance in graphene) (Hawthorne et al., 2 Apr 2026).
- Inductive Bias: Explicit embedding of dominant physics, multi-scale organization, and symmetry requirements constrains the hypothesis space, curbing overfitting and supporting robust operation in small-data, defect-rich, or extrapolative regimes (Hawthorne et al., 2 Apr 2026, Zhang et al., 2022).
- Robustness: Physics-informed loss and architecture stabilize against domain shift, ensuring physically plausible output even in out-of-distribution scenarios (Zhu et al., 17 Feb 2026).
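An ablation analysis of the kind mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the surrogate is a hypothetical two-descriptor linear model rather than a trained regressor, the data are synthetic, and ablation is implemented by replacing one descriptor with its dataset mean and recording the increase in mean squared error.

```python
def mse(model, X, y):
    """Mean squared error of `model` (a callable on one row) over a dataset."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(X)

def ablation_importance(model, X, y):
    """Increase in MSE when each descriptor is replaced by its dataset mean."""
    baseline = mse(model, X, y)
    scores = []
    for j in range(len(X[0])):
        mean_j = sum(row[j] for row in X) / len(X)
        X_ablated = [row[:j] + [mean_j] + row[j + 1:] for row in X]
        scores.append(mse(model, X_ablated, y) - baseline)
    return scores

def toy_model(row):
    """Hypothetical surrogate: strongly driven by descriptor 0, weakly by 1."""
    return 3.0 * row[0] + 0.1 * row[1]

# Synthetic descriptor vectors and noiseless targets from the same model.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [toy_model(row) for row in X]

scores = ablation_importance(toy_model, X, y)
print(scores)
```

Because each descriptor has a direct physical meaning, a large score for one entry can be read as a statement about the corresponding physical attribute, which is the interpretability argument made above. With tree ensembles, built-in importance scores or permutation importance would play the same role.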
7. Domain-Specific Applications and Design Principles
Descriptor-based PIML has demonstrated efficacy across diverse scientific domains:
| Domain | Key Descriptor Types | Physics Constraints/Architecture |
|---|---|---|
| Coronary flow and FFR | Lumen area, curvature, stenosis, contrast metrics | PINN/PINO, Navier-Stokes, BCs |
| Hard carbon Na-ion transport | CN, AV, tortuosity, diffusivity per trajectory | ML-IP+MD, mode clustering, XGBoost |
| 2D Carbon materials | Local geom., radial signature, ring fractions | Invariant, interpretable vector |
| Correlated electron systems | Bispectrum, group invariants of neighborhoods | NN/kernel, symmetry-exact |
| PDE surrogates on graphs | Learned cochain metrics, boundary observables | DDEC, exact conservation |
Critical design principles emerging from current research include: explicit encoding of known energetic or mechanistic drivers, multi-scale channel separation, minimality in feature space dimensionality, physical normalization, enforced invariances, and preservation of interpretability for inference and error analysis (Hawthorne et al., 2 Apr 2026, Zhu et al., 17 Feb 2026).
8. Theoretical Guarantees and Future Directions
Strong guarantees of conservation, uniqueness and well-posedness are attainable via algebraic enforcement within the descriptor-based, physics-informed learning paradigm. For instance, the data-driven exterior calculus approach ensures exact boundary and conservation properties by construction, even under imperfect fitting (Trask et al., 2020). Descriptor-based PIML continues to expand toward broader classes of materials, multi-physics problems, and automated synthesis of descriptors, offering a unified framework for physics-compatible, data-driven modeling in the physical sciences.