Papers
Topics
Authors
Recent
Search
2000 character limit reached

Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials

Published 17 Apr 2026 in cs.DC and cs.LG | (2604.15821v1)

Abstract: Universal Machine Learning Interatomic Potentials (uMLIPs), pre-trained on massively diverse datasets encompassing inorganic materials and organic molecules across the entire periodic table, serve as foundational models for quantum-accurate physical simulations. However, uMLIP training requires second-order derivatives, which lack corresponding parallel training frameworks; moreover, scaling to the billion-parameter regime causes explosive growth in computation and communication overhead, making its training a tremendous challenge. We introduce MatRIS-MoE, a billion-parameter Mixture-of-Experts model built upon invariant architecture, and {Janus}, a pioneering high-dimensional distributed training framework for uMLIPs with hardware-aware optimizations. Deployed across two Exascale supercomputers, our code attains a peak performance of 1.2/1.0 EFLOPS (24\%/{35.5\%} of theoretical peak) in single precision at over 90\% parallel efficiency, compressing the training of billion-parameter uMLIPs from weeks to hours. This work establishes a new high-water mark for AI-for-Science (AI4S) foundation models at Exascale and provides essential infrastructure for rapid scientific discovery.

Summary

  • The paper introduces a novel MatRIS-MoE architecture combined with the Janus framework, dramatically accelerating billion-parameter uMLIP training.
  • It employs a sparse, element-aware MoE routing strategy with second-order optimization and FS-3D parallelism to enhance performance and efficiency.
  • Results demonstrate quantum-accurate predictions across diverse simulation domains, achieving >90% parallel efficiency and near-linear weak scaling at Exascale.

Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials

Introduction

Universal Machine Learning Interatomic Potentials (uMLIPs) are increasingly central in quantum-accurate molecular simulations, offering transferable surrogate models in lieu of expensive quantum mechanical calculations. The unprecedented scaling of uMLIPs, from millions to over a billion parameters, promises robust applicability across heterogeneous domains (organic molecules, inorganic materials, catalytic interfaces, etc.), but also poses a formidable challenge: efficient, accurate, and scalable training given the prohibitive computational and communication costs instigated by second-order optimizations, single-precision requirements, and extreme throughput demands.

This paper presents MatRIS-MoE, a billion-parameter Mixture-of-Experts (MoE) uMLIP based on the invariant MatRIS framework, alongside Janus, a highly-optimized, hardware-aware parallel framework enabling efficient second-order training at Exascale. The system achieves 1.2 EFLOPS peak performance at >90% parallel efficiency, compressing billion-parameter uMLIP training times from weeks to hours and defining a new operational regime for foundation models in scientific simulation.

Technical Advances in MatRIS-MoE Architecture

To address the limitations inherent in existing uMLIPs, the authors scale the invariant MatRIS architecture—originally optimized for Pareto-efficiency in 10M-parameter models—to the billion-parameter regime through several architectural and algorithmic innovations.

MatRIS-MoE represents atomic configurations as graphs under periodic boundary conditions, leveraging a pipeline of feature embedding, invariant message passing, and readout blocks. Within the interaction blocks, a transition is made from O(N)O(N) separable attention to compute-intensive multi-head self-attention, improving GPU utilization while enhancing many-body expressiveness. Critically, the architecture introduces a sparse, element-aware MoE routing strategy, with routing determined by chemical identity and task context, permitting both specialization across chemical domains and compute/memory efficiency. Figure 1

Figure 1: Architecture overview of MatRIS and core components: periodic graph construction, invariant message passing, separable attention, and refinement.

Task alignment in multi-domain datasets is enabled through per-task/dataset embeddings, charge and spin encodings, and global feature vectors, all supporting robust cross-domain performance. For robust optimization, a multi-task, robust loss is employed, leveraging batch-wise normalized per-task outlier mitigation to stabilize training across highly heterogeneous samples.

Scalable Training with the Janus Framework

The central technical contribution is the development of Janus, a hybrid-parallel training framework uniquely capable of supporting double-backward (second-order) uMLIP optimization at scale—in contrast to previous works focused on first-order LLM training.

Janus implements FS-3D (Fully Sharded 3 Dimensions), integrating:

  • FSDP (fully sharded data parallelism): Reduces static memory usage.
  • FSGP (fully sharded graph parallelism): Splits atomic graphs across devices.
  • FSEP (fully sharded expert parallelism): Distributes MoE expert parameters, exploiting just-in-time expert routing/planning for load balancing and communication reduction.

This enables sharded execution, memory savings, expert sparsification, and guarantees balanced resource utilization. Crucially, Janus supports a double-backward-aware lifecycle with multi-phase, pipelined computation and deferred gradient synchronization, maximizing overlap and minimizing idle time in Exascale settings. Figure 2

Figure 2: System overview: model architecture, framework-level (FSDP, FSEP, FSGP) and platform-level optimizations, and FS-3D parallelization.

The system provides hardware-specific optimizations: pipelined gradient synchronization, atom-type-aware FP16 communication compression for MoE routing (compensating for bandwidth constraints), and tailored kernel engineering for graph and MoE operators. On ARM-based platforms (LineShine), further optimizations in memory movement (SDMA engines), and async MPI scheduling leverage hardware characteristics for maximal throughput. Figure 3

Figure 3: MatRIS-MoE execution timeline; block-level breakdown of operator computation, prefetching, communication, and gradient synchronization.

Performance and Scaling Results

Model Configurations and Scaling

Three MatRIS-MoE variants—Small (S, 1.09B total params), Medium (M, 2.47B), and Large (L, 11.5B)—were benchmarked on multi-domain datasets comprising 473M atomic configurations (3.6T edges). All variants share a six-layer interaction block backbone with sparse MoE, multi-head attention, and task-aware embeddings.

Convergence studies demonstrate that, with appropriate learning rate scaling, batch sizes up to 15K can be stably accommodated, with negligible accuracy loss. Figure 4

Figure 4: Convergence behavior under varying batch sizes; robust optimization at scale.

Accuracy

Despite the complexity of the tasks (molecules, crystals, catalysis, MOFs), the 1B-parameter MatRIS-MoE achieves SOTA or near-SOTA across domains, delivering robust out-of-the-box accuracy without domain-specific retraining. Notably, the model maintains quantum-accurate prediction fidelity (energies, forces, magnetic moments) required for long-timescale and thermodynamically consistent MD simulations. Figure 5

Figure 5: Out-of-box cross-domain accuracy of MatRIS-MoE on diverse benchmarks; lighter color denotes higher accuracy.

Strong and Weak Scaling

Strong scaling experiments demonstrate that both MatRIS-MoE (M) and (L) maintain >50% parallel efficiency up to Exascale full-machine scale, with throughput scaling nearly linearly in weak scaling tests (1.2 EFLOPS on LineShine, 1.05 EFLOPS on CNIS, >90% parallel efficiency). Figure 6

Figure 6: Strong scaling—relative parallel efficiency and PFLOPS throughput at fixed problem size as machine scale increases.

Figure 7

Figure 7: Weak scaling—throughput improvement as both problem size and resources scale proportionally; near-linear performance up to 1.2 EFLOPS.

With system-level and hardware-specialized optimizations, per-step training time is reduced by 2.7–5×, and sustained performance, accounting for I/O and initialization, reaches 762 PFLOPS (CNIS) and 1.03 EFLOPS (LineShine).

Scientific and Practical Implications

The MatRIS-MoE model, enabled by Janus, provides a transformative simulation engine for a broad spectrum of high-value tasks:

  • Accurate polymorph ranking and structural relaxation with quantum-level agreement.
  • High-fidelity molecular dynamics (e.g., MOFs), sustaining structural integrity and preserving thermodynamic properties over long trajectories.
  • Direct catalytic energy profile prediction along reaction coordinates.
  • Robust generalization and speedup (orders of magnitude over DFT) for practical screening and discovery. Figure 8

    Figure 8: Applicability: (a) energy ranking, (b) molecular distribution functions, (c) relaxation, (d) molecular dynamics, (e) catalytic reaction profiles.

On the systems side, the work defines billion-parameter uMLIP training as a flagship AI-for-Science Exascale workload, highlighting the necessity—and limitations—of current HPC co-design. The roadmap for further scaling points to the continued need for algorithm–framework–system co-design: improved hardware utilization, memory compression for second-order graphs, and communication reduction. These elements will be critical as universal atomic foundation models expand beyond 10B parameters and underpin AI-driven discovery pipelines in clean energy, advanced materials, and pharmaceuticals.

Conclusion

The development of MatRIS-MoE and the Janus framework marks a significant departure from LLM-inspired architectures and frameworks ill-suited for second-order, high-fidelity scientific MLIPs. The demonstrated billion-parameter, multi-domain uMLIP efficiency and accuracy at Exascale redefine feasible model complexity, training acceleration, and cross-domain generalization for atomistic simulation. Future research will likely see scaling to even larger models, deeper integration with traditional simulation pipelines, and increasingly sophisticated hybrid HPC/AI programming frameworks to maximize efficiency, scientific impact, and accessibility.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 5 tweets with 34 likes about this paper.