Universal ML Models with Atomic Fragments
- Machine learning universal models using atomic fragments decompose complex atomic systems into localized environments for efficient and transferable property prediction.
- Atomic fragments are encoded via methods like SOAP, ACE, and learned embeddings, providing invariant and graph-based features for robust kernel and deep neural network architectures.
- These models achieve state-of-the-art results in energy, force, and molecular generation tasks while highlighting challenges in long-range interactions and out-of-distribution generalization.
Machine learning universal models using atomic fragments constitute a principled, physically grounded, and scalable approach for representing, learning, and predicting properties of molecules, materials, and extended condensed matter systems. The key unifying idea is the decomposition of complex atomic systems into localized environments or “atomic fragments,” which serve as universal building blocks. These fragments are mathematically encoded as feature vectors, tensors, or graph-based embeddings and are input to statistical or neural architectures that can generalize across chemical domains, sizes, and configurations.
1. Atomic Fragment Representations and Extraction
Atomic fragment representations formalize the local environment of an atom as the fundamental unit for ML in chemistry and materials science. Across methods, the typical atomic fragment comprises the central atom and all neighbors within a fixed cutoff radius, capturing both chemical identity and spatial arrangement.
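A standard mathematical form of this decomposition (used, for example, by SOAP-GAP and most universal interatomic potentials) writes a global property such as the total energy as a sum of learned per-fragment contributions:

$$
E\big(\{\mathbf{r}_j, Z_j\}\big) \;=\; \sum_{i=1}^{N} \varepsilon\big(\mathcal{X}_i\big),
\qquad
\mathcal{X}_i = \big\{(\mathbf{r}_j - \mathbf{r}_i,\; Z_j) : |\mathbf{r}_j - \mathbf{r}_i| < r_{\mathrm{cut}}\big\},
$$

where $\varepsilon$ is a learned function of the local environment $\mathcal{X}_i$ and $r_{\mathrm{cut}}$ is the cutoff radius.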
Canonical approaches include:
- Smooth Overlap of Atomic Positions (SOAP): Represents a local atomic environment by projecting a Gaussian-smeared neighbor density onto a basis of radial functions and spherical harmonics, yielding a rotationally invariant “power spectrum” vector for each atom. This forms the basis for kernel methods and Gaussian process regression (Bartok et al., 2017, Rosenbrock et al., 2017); a minimal power-spectrum sketch appears after this list.
- Cluster and Graph-Based Fragments: The atomic cluster expansion (ACE) encodes environments as products of radial and angular basis functions, forming star- and tree-graph invariants (Lysogorskiy et al., 25 Aug 2025).
- Cartesian Moments: The CAMP framework uses symmetric Cartesian tensors constructed from neighbor positions and applies tensor products to systematically encode higher-body correlations (Wen et al., 2024).
- Learned Embeddings: Approaches such as SkipAtom map discrete atom types to continuous vectors via unsupervised prediction tasks over crystal graphs, allowing compositional or structure-agnostic pooling (Antunes et al., 2021).
- Fragment-Based and "Amon" Selection: For molecules, relevant atom-in-molecule fragments (“amons”) are selected on-the-fly, preserving chemical hybridization and topology. These are saturated with hydrogens, relaxed, and used to reconstruct quantum properties of larger systems (Huang et al., 2017).
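As a concrete illustration of the SOAP-style encoding above, the following minimal NumPy/SciPy sketch computes a rotationally invariant power spectrum for a single environment. The unnormalized Gaussian radial basis, hard cutoff, and single-species treatment are illustrative simplifications; production codes such as dscribe or librascal use orthonormalized bases and smooth cutoff functions.

```python
# Minimal SOAP-style power spectrum for one atomic environment.
# Illustrative simplifications: unnormalized Gaussian radial basis,
# hard cutoff, single chemical species.
import numpy as np
from scipy.special import sph_harm  # Y_lm(theta, phi); theta = azimuth

def power_spectrum(neighbors, r_cut=5.0, n_max=4, l_max=3, sigma=0.5):
    """neighbors: (N, 3) Cartesian offsets from the central atom."""
    r = np.linalg.norm(neighbors, axis=1)
    xyz, r = neighbors[r < r_cut], r[r < r_cut]
    theta = np.arctan2(xyz[:, 1], xyz[:, 0])          # azimuthal angle
    phi = np.arccos(np.clip(xyz[:, 2] / r, -1, 1))    # polar angle
    # Expansion coefficients c_{nlm} of the Gaussian-smeared neighbor density.
    c = np.zeros((n_max, l_max + 1, 2 * l_max + 1), dtype=complex)
    for n, r0 in enumerate(np.linspace(0.0, r_cut, n_max)):
        g_n = np.exp(-((r - r0) ** 2) / (2 * sigma ** 2))  # radial basis g_n(r)
        for l in range(l_max + 1):
            for m in range(-l, l + 1):
                c[n, l, m + l] = np.sum(g_n * np.conj(sph_harm(m, l, theta, phi)))
    # Rotation-invariant power spectrum p_{nn'l} = sum_m c_{nlm} c*_{n'lm}.
    p = np.einsum("nlm,klm->nkl", c, np.conj(c)).real
    return p.ravel()

env = np.random.default_rng(0).normal(size=(12, 3)) * 2.0  # toy neighborhood
print(power_spectrum(env).shape)  # (n_max * n_max * (l_max + 1),) = (64,)
```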
Table: Representative fragment encoding schemes
| Method | Fragment definition | Representation type |
|---|---|---|
| SOAP / GAP (Bartok et al., 2017) | Atom + neighbors (cutoff) | SOAP vector (power spectrum) |
| ACE / GRACE (Lysogorskiy et al., 25 Aug 2025) | Star/tree-graph clusters (ordered products) | Tensor invariants |
| CAMP (Wen et al., 2024) | Atom + neighbors (cutoff) | Cartesian tensors/moments |
| SkipAtom (Antunes et al., 2021) | Atom type/context pairs from database graphs | Low-dim embeddings |
| AML (Huang et al., 2017) | Subgraphs preserving hybridization/topology | aSLATM or atomic spectra |
Fragment definitions are chosen to balance physical completeness, invariance to global symmetry operations, and computational feasibility.
2. Model Architectures: From Kernels to Deep Message Passing
ML models leveraging atomic fragments fall into two main classes:
Kernel-Based (Gaussian/Kernel Ridge Regression, GPR)
SOAP-based models assign an energy contribution to each atomic fragment and define similarity using positive-definite kernels (e.g., the dot product of normalized SOAP vectors raised to a small integer power $\zeta$). These kernels are aggregated over all pairs of atomic environments in two structures to define global similarities. Energies, forces, or other properties are then regressed via kernel methods (Bartok et al., 2017, Rosenbrock et al., 2017, Huang et al., 2017).
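A minimal sketch of this kernel pipeline (illustrative, not any specific published implementation) builds an average-kernel global similarity from per-atom descriptor vectors and regresses total energies with kernel ridge regression; `zeta` and `lam` are illustrative hyperparameters:

```python
# Fragment-based kernel ridge regression on per-atom descriptor vectors.
# Each "structure" is an array of SOAP-like vectors, one per atom.
import numpy as np

def env_kernel(p1, p2, zeta=2):
    """(normalized dot product)^zeta between two atomic environments."""
    return (p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2))) ** zeta

def structure_kernel(A, B, zeta=2):
    """Average-kernel global similarity: mean over all environment pairs."""
    return np.mean([[env_kernel(a, b, zeta) for b in B] for a in A])

def fit_krr(structures, energies, lam=1e-6, zeta=2):
    """Solve (K + lam*I) alpha = E for the regression weights."""
    K = np.array([[structure_kernel(A, B, zeta) for B in structures]
                  for A in structures])
    return np.linalg.solve(K + lam * np.eye(len(K)), np.asarray(energies))

def predict_energy(new, structures, alpha, zeta=2):
    k = np.array([structure_kernel(new, B, zeta) for B in structures])
    return k @ alpha
```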
Deep Learning and Message-Passing Network Approaches
Recent progress has been driven by deep neural architectures that encode atomic fragments via invariant or equivariant graph neural networks (GNNs), as the following examples show (a minimal message-passing sketch appears after the list):
- ACE/GRACE: Graph-network architectures recursively build higher-body symmetric tensor features and propagate them through message passing, capturing complex many-body effects efficiently (Lysogorskiy et al., 25 Aug 2025).
- CAMP: Uses layers of message passing over symmetric Cartesian moment tensors, allowing systematically improvable body order and angular detail (Wen et al., 2024).
- UMA: Employs an eSCN/SO(2)-equivariant architecture augmented with a Mixture-of-Linear-Experts (MoLE) design, scaling up capacity while maintaining fast inference. Atomic fragments are expanded in a radial–angular basis for all relevant tasks (Wood et al., 30 Jun 2025).
- Transformer-based (PET): Adopts point–edge transformers with attention over atomic graphs, designed for highly general properties such as the electronic density of states (DOS) (How et al., 24 Aug 2025).
- Fragment-based Molecular Generative Models: Auto-regressively add molecular fragments learned from a synthetically motivated library to generate molecules with targeted properties, leveraging graph convolutions for bonding and atom selection (Seo et al., 2021).
- LLMs: Unique atomic identifiers provide anchor points for substructure reasoning over SMILES or SELFIES strings; LLMs can select chemically meaningful fragments with chain-of-thought prompts (Hassen et al., 18 Oct 2025).
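To make the message-passing pattern concrete, the following PyTorch sketch implements one invariant layer that gates per-edge messages by interatomic distance with a smooth cosine cutoff; it is deliberately much simpler than the equivariant tensor machinery of GRACE, CAMP, or UMA, and all layer sizes are illustrative.

```python
# One invariant message-passing layer over an atomic graph.
import torch
import torch.nn as nn

class InvariantMPLayer(nn.Module):
    def __init__(self, dim=64, r_cut=5.0):
        super().__init__()
        self.r_cut = r_cut
        self.filter = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, h, pos, edge_index):
        src, dst = edge_index                        # (2, E) edges within cutoff
        d = (pos[src] - pos[dst]).norm(dim=-1, keepdim=True)
        # Smooth cosine cutoff keeps messages differentiable at r_cut.
        fc = 0.5 * (torch.cos(torch.pi * d / self.r_cut) + 1.0) * (d < self.r_cut)
        msg = self.filter(d) * fc * h[src]           # distance-gated messages
        agg = torch.zeros_like(h).index_add_(0, dst, msg)
        return h + self.update(torch.cat([h, agg], dim=-1))
```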
3. Universality and Generalization: Scaling Laws, Alignment, and Transfer
Universal fragment-based models aim to achieve chemically and configurationally agnostic transfer across broad domains:
- Scaling Laws: UMA empirically measures validation loss as a function of model and dataset size, fits the resulting curves to empirical power-law scaling relations, and identifies compute-optimal regimes for both dense and MoLE architectures (Wood et al., 30 Jun 2025); a toy power-law fit is sketched after this list. GRACE demonstrates that a single model can span 89 elements without retraining (Lysogorskiy et al., 25 Aug 2025).
- Representational Alignment: An important insight from cross-model studies is that high-performing models, regardless of modality (3D, graph, string), converge in their learned fragment embeddings. Representational alignment between models can be quantified by metrics such as CKNNA and distance correlation, with higher alignment scores indicating more universal representations (Edamadaka et al., 3 Dec 2025).
- Transferability: Fine-tuning or adapter-based approaches (e.g., LoRA in PET-MAD-DOS) enable rapid adaptation of generic fragment models to new chemistries or tasks with minimal bespoke data, without loss of generality (How et al., 24 Aug 2025).
- Active Learning and Coverage: AML (Huang et al., 2017) and SOAP-GAP (Bartok et al., 2017) demonstrate that local fragment dictionaries expand on-the-fly to ensure systematic convergence and error control, with overall cost scaling with the number of unique environments encountered.
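For illustration, an empirical power-law scaling curve $L(N) \approx a\,N^{-\alpha}$ can be fit to (model size, validation loss) measurements by log-log least squares; the data points below are synthetic, not UMA's actual measurements:

```python
# Fit L(N) ~= a * N^(-alpha) to synthetic (parameter count, loss) pairs.
import numpy as np

N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])        # model parameter counts
L = np.array([0.80, 0.55, 0.38, 0.27, 0.19])   # validation losses (synthetic)

slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha, a = -slope, np.exp(intercept)
print(f"L(N) ~= {a:.2f} * N^(-{alpha:.2f})")
# Repeating the fit across dataset sizes maps out compute-optimal regimes.
```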
4. Practical Applications and Benchmarks
Universal fragment-based approaches have demonstrated competitive or state-of-the-art results across tasks and scales, spanning:
- Energy and Force Prediction: MLIPs such as UMA (Wood et al., 30 Jun 2025), DPA-Semi (Liu et al., 2023), CAMP (Wen et al., 2024) and GRACE (Lysogorskiy et al., 25 Aug 2025) approach or match DFT accuracy for diverse materials, phases, and temperatures, enabling reliable MD simulations of large systems.
- Electronic Structure: PET-MAD-DOS predicts the electronic DOS of molecules, solids, and alloys, enabling derived thermodynamic and spectroscopic properties without explicit DFT calculations (How et al., 24 Aug 2025); a sketch of one such DOS post-processing step appears after the benchmark table below.
- Molecular Generation: FMGM controls multiple chemical properties and generates synthesizable molecules through fragment-by-fragment assembly, showing generalization even to unseen fragments (Seo et al., 2021).
- Interpretability and Discovery: The LER approach maps importance weights to local fragment types, facilitating physical insights into which structural motifs (e.g., dislocations, stacking faults) control macroscopic properties (Rosenbrock et al., 2017).
- Chemical Reasoning: Atom-anchored LLM pipelines exploit fragment referents as attention targets, substantially improving zero-shot retrosynthesis performance of LLMs in reaction site identification (Hassen et al., 18 Oct 2025).
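To illustrate the atom-anchoring idea, the sketch below uses RDKit to stamp each atom of a SMILES string with a unique map number, giving a language model stable per-atom identifiers to reason over; the prompt wording is purely illustrative and not the pipeline of Hassen et al.:

```python
# Atom-anchored prompting sketch: give each atom a unique map number so an
# LLM can reference atoms unambiguously. Requires rdkit (pip install rdkit).
from rdkit import Chem

def atom_anchored_smiles(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(atom.GetIdx() + 1)  # 1-based unique identifiers
    return Chem.MolToSmiles(mol)

anchored = atom_anchored_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
prompt = (f"Molecule: {anchored}\n"
          "Which mapped atoms belong to the ester group? "
          "Answer with their map numbers.")
print(prompt)
```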
Table: Selected performance highlights
| Task/Domain | Model/Framework | Key Metric |
|---|---|---|
| Formation energy (solids) | UMA (Wood et al., 30 Jun 2025), GRACE (Lysogorskiy et al., 25 Aug 2025) | MAE ≈ 0.02 eV/atom |
| DOS prediction (general) | PET-MAD-DOS (How et al., 24 Aug 2025) | RMSE ≈ 0.15 states/(eV·electron) |
| Molecular design/generation | FMGM (Seo et al., 2021) | Validity 97%, Uniqueness 93% |
| Retrosynthesis reactions | Atom-anchored LLM (Hassen et al., 18 Oct 2025) | Actionable reactant accuracy |
| Grain boundary energy/mobility | SOAP+LER (Rosenbrock et al., 2017) | Classification accuracy |
| Amons/AML transfer to biomolecules | AML (Huang et al., 2017) | Protein MAE (kcal/mol; 77–602 heavy atoms) |
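As an example of the DOS post-processing referenced above (a generic numerical recipe, not PET-MAD-DOS code), the sketch below fills a synthetic DOS with Fermi-Dirac occupations, locates the chemical potential by bisection, and evaluates the electronic entropy:

```python
# Chemical potential and electronic entropy from a (synthetic) DOS g(E).
import numpy as np

kB = 8.617333e-5  # Boltzmann constant in eV/K

def trapz(y, x):  # version-safe trapezoidal rule
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def fermi(E, mu, T):
    return 1.0 / (np.exp(np.clip((E - mu) / (kB * T), -60, 60)) + 1.0)

def chemical_potential(E, g, n_electrons, T, lo=-10.0, hi=10.0):
    # Bisection on N(mu) = integral of g(E) f(E; mu, T) dE = n_electrons.
    for _ in range(80):
        mu = 0.5 * (lo + hi)
        lo, hi = (mu, hi) if trapz(g * fermi(E, mu, T), E) < n_electrons else (lo, mu)
    return mu

def electronic_entropy(E, g, mu, T):
    f = np.clip(fermi(E, mu, T), 1e-12, 1 - 1e-12)
    return -kB * trapz(g * (f * np.log(f) + (1 - f) * np.log(1 - f)), E)

E = np.linspace(-10, 10, 2001)            # energy grid (eV)
g = np.exp(-0.5 * (E / 3.0) ** 2)         # synthetic Gaussian DOS (states/eV)
n_el = trapz(g[E < 0], E[E < 0])          # fill states below E = 0 at T = 0
mu = chemical_potential(E, g, n_el, T=300.0)
print(f"mu = {mu:.3f} eV, S_el = {electronic_entropy(E, g, mu, 300.0):.3e} eV/K")
```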
5. Limitations and Open Challenges
Major limitations and active directions include:
- Long-Range Interactions: Most models use cutoff-based fragments, neglecting explicit long-range electrostatics and van der Waals dispersion; future schemes must augment local features with charge or multipole fields (Liu et al., 2023, Lysogorskiy et al., 25 Aug 2025).
- Out-of-Distribution (OOD) Generalization: Models collapse to low-information representations on OOD inputs, indicating that even foundation models remain fundamentally limited by training data diversity and inductive bias (Edamadaka et al., 3 Dec 2025).
- Fragment Granularity: Simple atom-type embeddings (e.g., SkipAtom (Antunes et al., 2021), Bag-of-Atoms) do not resolve oxidation or electronic state unless made explicit; geometric fragment representations may require extension to handle excited states, reactivity, or rare environment types.
- Efficiency and Scalability: Implementation bottlenecks persist for high-rank tensors (CAMP (Wen et al., 2024)), or full-graph message passing at scale; optimized data structures and distributed pipelines are under active development.
- Benchmarks and Evaluation: Universal metrics for transferability, such as representational alignment or cross-domain performance, continue to be refined; no single metric suffices for all application regimes (Edamadaka et al., 3 Dec 2025).
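For concreteness, a rough mutual k-nearest-neighbor alignment score between two models' fragment embeddings (in the spirit of CKNNA, though not its exact definition) can be sketched as follows:

```python
# Mutual k-NN overlap between two embeddings X, Y of the same fragments.
import numpy as np

def knn_indices(Z, k):
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self-neighbors
    return np.argsort(d, axis=1)[:, :k]

def knn_alignment(X, Y, k=10):
    """Mean fraction of shared k-NN sets; 1.0 = identical neighborhoods."""
    nx, ny = knn_indices(X, k), knn_indices(Y, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nx, ny)]))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))               # model A's fragment embeddings
Y = X @ rng.normal(size=(16, 16))            # model B: a linear re-embedding
print(knn_alignment(X, Y))                   # higher = more aligned geometries
```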
6. Outlook and Perspectives
Universal fragment-based models are transitioning from specialized algorithms to general scientific foundation models. The convergence of representations across modalities and tasks demonstrates that a universal statistical description of matter is attainable through atomic environments, provided the models, data, and architectures are sufficiently diverse and expressive (Edamadaka et al., 3 Dec 2025).
Future directions include:
- Integrating explicit long-range and multi-scale interactions in fragment features (Lysogorskiy et al., 25 Aug 2025, Liu et al., 2023).
- Extending fragment libraries dynamically via active learning or unsupervised discovery (Huang et al., 2017).
- Increasing the chemical and structural diversity of pretraining datasets to drive alignment and transfer (Wood et al., 30 Jun 2025).
- Developing rapid fine-tuning and knowledge distillation techniques for task-adapted, resource-efficient models (How et al., 24 Aug 2025).
- Unifying symbolic and neural representations for chemically interpretable, substructure-aware reasoning (e.g., atom-anchored LLM frameworks) (Hassen et al., 18 Oct 2025).
The fragment-centric paradigm is a cornerstone for scalable, interpretable, and transferable machine learning in atomistic simulation, material discovery, and chemical informatics.