MLIP Arena: Physics-Based MLIP Benchmark

Updated 4 July 2026

MLIP Arena is an open benchmark platform for assessing machine learning interatomic potentials through simulation workflows rather than static regression.
It emphasizes physics awareness, chemical reactivity, and stability under extreme conditions with evaluations including tortuosity, conservative-force checks, and energy jump analysis.
The platform supports diverse ASE-compatible models and flexible pipelines, enabling reliable validation for downstream molecular and materials simulations.

MLIP Arena is an open benchmark platform for machine learning interatomic potentials (MLIPs) that evaluates them as interatomic potentials for physics and chemistry rather than as regressors against a fixed density functional theory dataset. It is distributed as a Python package with an online leaderboard, and it emphasizes physics awareness, chemical reactivity, stability under extreme conditions, and predictive capability for thermodynamic properties and physical phenomena. Its defining premise is that low energy or force error on static reference data is not sufficient to establish reliability for molecular and materials simulation workflows, especially under distribution shift or in reactive and finite-temperature regimes (Chiang et al., 25 Sep 2025).

1. Rationale and scope

MLIP Arena was introduced in response to three limitations of prevailing MLIP evaluation practice: data leakage in standard train/validation/test splits, dependence on specific density functional theory references, and the weak connection between static regression errors and downstream simulation utility (Chiang et al., 25 Sep 2025). The platform therefore shifts evaluation from the question of whether a model reproduces a particular dataset to the question of whether it behaves like a physically sound potential across diverse structures, conditions, and workflows.

The benchmark is explicitly reference-agnostic in design. In some tasks, density functional theory remains part of the workflow, but typically to define symmetry properties, qualitative stability, or relative energetics rather than to provide a single universal target for mean absolute error minimization. The platform instead stresses smooth asymptotic behavior, conservative forces, rotational equivariance, stable molecular dynamics, and correct qualitative responses in defect migration, adsorption, vibrational stability, and phase-transition settings (Chiang et al., 25 Sep 2025).

This orientation places MLIP Arena within a broader move toward task-based validation of foundation MLIPs. A plausible implication is that it functions less as a replacement for conventional energy-and-force benchmarks than as a higher-level validation layer for models intended for deployment in atomistic simulation.

2. Platform architecture and benchmarking philosophy

MLIP Arena is implemented as an open Python package built around Atomic Simulation Environment calculators and databases, with workflows orchestrated through Prefect (Chiang et al., 25 Sep 2025). The benchmark definitions are therefore executable simulation pipelines rather than static evaluation scripts. Typical workflows include geometry optimization, equation-of-state scans, molecular dynamics under specified ensembles, transition-state searches, phonon calculations, and Widom insertion calculations.

The platform emphasizes fairness and transparency through standardized settings, open code, and multi-metric reporting. Hardware assumptions are documented, including runs on 1 CPU core and 1 A100 GPU with explicit timeouts and retries, so that runtime efficiency and scaling are benchmarked under controlled conditions rather than treated as anecdotal side information (Chiang et al., 25 Sep 2025). Intermediate states are cached, which makes large workflow suites reproducible and restartable.

A central design choice is multi-metric rank aggregation. Rather than collapsing performance into a single scalar, MLIP Arena combines rankings across several task-specific diagnostics, such as smoothness, monotonicity under compression, conservative-force deviation, dynamical stability, or classification accuracy for physical phenomena. This reduces the extent to which a model can be optimized for one narrow metric while failing badly on another (Chiang et al., 25 Sep 2025).
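The aggregation idea can be illustrated with a minimal sketch. It assumes a plain mean-of-ranks scheme, hypothetical model names, and made-up metric values; the platform's actual aggregation rule may differ.

```python
from statistics import mean

# Hypothetical per-model metric values (illustrative numbers, not benchmark results).
# Each metric maps to (values, higher_is_better).
metrics = {
    "tortuosity": ({"model_a": 1.02, "model_b": 1.40, "model_c": 1.10}, False),
    "conservation_deviation": ({"model_a": 0.05, "model_b": 0.01, "model_c": 0.30}, False),
    "completed_fraction": ({"model_a": 0.97, "model_b": 0.95, "model_c": 0.60}, True),
}

def rank_models(values, higher_is_better):
    """Return {model: rank} with rank 1 as best; ties are broken by model name."""
    ordered = sorted(values, key=lambda m: (-values[m] if higher_is_better else values[m], m))
    return {model: position + 1 for position, model in enumerate(ordered)}

def aggregate_ranks(metric_table):
    """Average each model's rank over all metrics (lower is better)."""
    per_metric = [rank_models(vals, hib) for vals, hib in metric_table.values()]
    models = per_metric[0].keys()
    return {m: mean(r[m] for r in per_metric) for m in models}

scores = aggregate_ranks(metrics)
leaderboard = sorted(scores, key=scores.get)  # best model first
```

In this toy table, `model_b` has the best conservative-force value but ranks last on tortuosity. It therefore cannot top the aggregate on the strength of one metric alone.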

The platform is model-agnostic at the calculator interface. It includes ready-made adapters for several open-source and open-weight MLIPs, but any model that implements the ASE Calculator interface can be inserted into the same workflows and evaluated under the same conditions (Chiang et al., 25 Sep 2025).

3. Task families and diagnostic metrics

MLIP Arena organizes its evaluations into four broad categories: asymptotic analyses under off-equilibrium conditions, stability and reactivity from molecular dynamics simulations, robustness to distribution shifts, and thermodynamic properties plus phenomenological case studies (Chiang et al., 25 Sep 2025).

Category	Representative tasks	Representative metrics
Asymptotic analyses	WBM equation-of-state scans; homonuclear diatomic PECs	Tortuosity, energy jump, derivative flips, Spearman rank, conservative deviation
Stability and reactivity	RM24 NVT/NPT ramps; hydrogen combustion	Completed-trajectory fraction, valid steps, steps per second, water formation, reaction enthalpy, COM drift
Distribution shifts	NVE energy conservation; rotated-force tests	Differential entropy, sliding-window energy drift, equivariance MAE
Thermodynamics and phenomena	Vacancy NEB, MOF adsorption, 2D phonons, BaZrO$_3$ tilts	Path asymmetry, barrier asymmetry, adsorption classification, macro F1, PES shape

The asymptotic suite combines 1,000 crystalline structures from the WBM database with homonuclear diatomic potential-energy curves across the periodic table (Chiang et al., 25 Sep 2025). For bulk crystals, the platform performs a 0 K optimization, then applies volumetric strains from $-20\%$ to $+20\%$ around the relaxed volume, and separately runs an energy–volume scan from $-49\%$ to $+75\%$. The Birch–Murnaghan form is used as a reference model for the equation of state,

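$E(V) = E_0 + \frac{9 B V_0}{16}\,(\eta^2 - 1)^2\,\big[6 + B'(\eta^2 - 1) - 4\eta^2\big], \quad \eta = (V_0/V)^{1/3},$

where $E_0$ and $V_0$ are the equilibrium energy and volume, $B$ is the bulk modulus, and $B'$ is its pressure derivative.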
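As a minimal sketch, the equation of state can be evaluated over the $\pm 20\%$ strain window as follows. It uses the standard third-order Birch–Murnaghan convention $\eta = (V_0/V)^{1/3}$, so $\eta > 1$ corresponds to compression. The parameter values are illustrative assumptions, not benchmark data.

```python
def birch_murnaghan(volume, e0, b0, b0_prime, v0):
    """Third-order Birch-Murnaghan energy at a given volume."""
    eta2 = (v0 / volume) ** (2.0 / 3.0)  # eta^2 with eta = (V0 / V)^(1/3)
    x = eta2 - 1.0
    return e0 + 9.0 * b0 * v0 / 16.0 * x ** 2 * (6.0 + b0_prime * x - 4.0 * eta2)

# Illustrative parameters (assumed): eV, eV/A^3, dimensionless, A^3.
E0, B0, B0_PRIME, V0 = -5.0, 0.6, 4.5, 20.0

# Volumetric strains from -20% to +20% around the relaxed volume, in 5% steps.
strains = [s / 100.0 for s in range(-20, 21, 5)]
curve = [
    (V0 * (1.0 + s), birch_murnaghan(V0 * (1.0 + s), E0, B0, B0_PRIME, V0))
    for s in strains
]
```

A model's energy-volume scan can then be compared against such a fitted reference curve, for instance by checking that the predicted minimum sits at the relaxed volume.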

For diatomics, interatomic distances are sampled from $0.9\,r_{\text{cov}}$ to $3.1\,r_{\text{vdw}}$, or to 6 Å when no van der Waals radius is available (Chiang et al., 25 Sep 2025).

A key smoothness metric is tortuosity, defined for a one-dimensional potential-energy curve as

$\text{Tortuosity} = \frac{\sum_{r_i} |E(r_i) - E(r_{i+1})|} {|E(r_{\text{min}}) - E(r_{\text{eq}})| + |E(r_{\text{eq}}) - E(r_{\text{max}})|}.$

A smooth single-well potential has tortuosity 1, while larger values indicate oscillations, kinks, or numerical artifacts (Chiang et al., 25 Sep 2025). The benchmark also counts derivative flips and evaluates short-range monotonicity by Spearman rank coefficients in compressed regions. Conservative-force consistency is tested through

$\left\langle \left| \mathbf{F}(\mathbf{r})\cdot\frac{\mathbf{r}}{\|\mathbf{r}\|} + \nabla_r E \right|\right\rangle_{r = \|\mathbf{r}\|},$

which vanishes for a perfectly conservative force field (Chiang et al., 25 Sep 2025).
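The smoothness and conservation diagnostics above can be sketched for a one-dimensional diatomic curve. This is an illustration under stated assumptions: a reduced Lennard-Jones potential stands in for a model, the equilibrium point is taken as the lowest sampled energy, the energy gradient is estimated by central differences, and the "rippled" and "biased" variants are artificial perturbations rather than real model outputs.

```python
def lj_energy(r):
    """Reduced Lennard-Jones energy, used here as a smooth reference curve."""
    return 4.0 * (r ** -12 - r ** -6)

def lj_radial_force(r):
    """Analytic radial force F_r = -dE/dr for the same potential."""
    return 4.0 * (12.0 * r ** -13 - 6.0 * r ** -7)

def tortuosity(energies):
    """Total variation of E(r) divided by the variation of an ideal single well."""
    total_variation = sum(abs(a - b) for a, b in zip(energies, energies[1:]))
    e_eq = min(energies)  # equilibrium taken as the lowest sampled energy
    return total_variation / (abs(energies[0] - e_eq) + abs(e_eq - energies[-1]))

def derivative_flips(energies):
    """Count sign changes of the finite-difference slope along the curve."""
    signs = [1 if b > a else -1 for a, b in zip(energies, energies[1:]) if b != a]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def conservation_deviation(energy_fn, radial_force_fn, distances, h=1e-5):
    """Mean |F_r + dE/dr|, with dE/dr estimated by central differences."""
    residuals = []
    for r in distances:
        de_dr = (energy_fn(r + h) - energy_fn(r - h)) / (2.0 * h)
        residuals.append(abs(radial_force_fn(r) + de_dr))
    return sum(residuals) / len(residuals)

grid = [0.9 + 0.02 * i for i in range(106)]  # 0.9 to 3.0 in reduced units
smooth = [lj_energy(r) for r in grid]
# Artificial ripple standing in for a non-smooth model prediction.
rippled = [e + (0.05 if i % 2 else -0.05) for i, e in enumerate(smooth)]

smooth_report = (tortuosity(smooth), derivative_flips(smooth))
rippled_report = (tortuosity(rippled), derivative_flips(rippled))

consistent = conservation_deviation(lj_energy, lj_radial_force, grid)
# Hypothetical direct-force model whose forces are 5% too large everywhere.
biased = conservation_deviation(lj_energy, lambda r: 1.05 * lj_radial_force(r), grid)
```

The smooth curve gives a tortuosity of essentially 1 with a single slope sign change at the well minimum. The rippled curve raises both numbers. The biased force field yields a clearly nonzero conservation deviation even though its energies are unchanged.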

The molecular-dynamics suite uses RM24 random-mixture structures to probe finite-temperature and high-pressure robustness. NVT trajectories ramp from 300 K to 3000 K over 10 ps for 120 structures, and NPT trajectories ramp from 300 K to 3000 K and from 0 GPa to 500 GPa over 10 ps for 80 structures (Chiang et al., 25 Sep 2025). Runtime efficiency is reported as steps per second and fit to a power law, which separates the scaling exponent from the prefactor.

The reactive benchmark uses the gas-phase hydrogen-combustion system

$2\,\mathrm{H}_2 + \mathrm{O}_2 \rightarrow 2\,\mathrm{H}_2\mathrm{O},$

simulated for about 1 ns in NVT while tracking ignition behavior, water formation, reaction enthalpy, COM drift, and throughput (Chiang et al., 25 Sep 2025).
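The power-law fit of throughput mentioned for the RM24 runs can be sketched as a linear regression in log-log space. The sketch assumes that the independent variable is the number of atoms and that the form is `rate = a * N**b`; the timings are synthetic, and the paper's exact parameterization may differ.

```python
import math

def fit_power_law(sizes, rates):
    """Least-squares fit of rate = a * size**b in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(r) for r in rates]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), slope

# Synthetic timings (assumed values): throughput falling as 5000 * N**-0.8.
atom_counts = [50, 100, 200, 400, 800]
steps_per_second = [5000.0 * n ** -0.8 for n in atom_counts]

prefactor, exponent = fit_power_law(atom_counts, steps_per_second)
```

Reporting the prefactor and exponent separately distinguishes a model that is fast only for small cells from one that scales well to larger systems.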

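Robustness to distribution shift is quantified with a differential-entropy measure over local atomic environments derived from QUESTS-like radial and bond-angle descriptors (Chiang et al., 25 Sep 2025). For a new environment $\mathbf{Y}$ relative to a reference set $\{\mathbf{X}_i\}$, the surprise is

$\delta\mathcal{H}(\mathbf{Y} \mid \{\mathbf{X}_i\}) = -\ln\Big[\sum_{i} K_h(\mathbf{Y}, \mathbf{X}_i)\Big],$

where $K_h$ is a smoothing kernel of bandwidth $h$ that measures the similarity between descriptors.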

This is combined with 5 ps NVE runs and random-rotation tests to examine how energy conservation and equivariance degrade as a trajectory enters more surprising regions (Chiang et al., 25 Sep 2025).
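A minimal sketch of the surprise measure follows. It assumes a QUESTS-style estimator, the negative logarithm of summed Gaussian kernel similarities, with toy two-dimensional descriptors and an arbitrary bandwidth. The descriptors, the bandwidth, and the function names are illustrative assumptions.

```python
import math

def gaussian_kernel(x, y, bandwidth):
    """Gaussian similarity between two descriptor vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def differential_entropy(new_env, reference_envs, bandwidth=0.1):
    """Surprise of new_env relative to reference_envs: -ln(sum of kernel values).

    Larger values mean the environment is less well covered by the reference set.
    """
    overlap = sum(gaussian_kernel(new_env, ref, bandwidth) for ref in reference_envs)
    return -math.log(overlap)

# Toy 2D descriptors (hypothetical; real descriptors are higher-dimensional).
reference = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
familiar = differential_entropy((0.05, 0.05), reference)
novel = differential_entropy((0.6, 0.6), reference)
```

An environment surrounded by reference points receives a negative value, while one far from every reference point receives a large positive value. Plotting energy drift or equivariance error against this quantity shows how a model degrades as a trajectory leaves familiar territory.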

The thermodynamic and phenomenological suite covers vacancy formation and migration in 57 FCC and 57 HCP elemental solids, CO$_2$ adsorption in 20 MOFs, dynamical stability of 505 monolayers from C2DB, and a second-order phase transition in a BaZrO$_3$ supercell (Chiang et al., 25 Sep 2025). For vacancy diffusion, path asymmetry is measured from a normalized NEB profile, and barrier asymmetry compares the forward and backward migration barriers. For MOF adsorption, Widom insertion yields Boltzmann-averaged insertion energetics, and predictions are also classified into the “general,” “post-combustion flue gas,” and “DAC” regimes (Chiang et al., 25 Sep 2025).
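For orientation, the textbook Widom-insertion estimators can be sketched as follows. The sketch assumes a rigid framework, an ideal-gas adsorbate, and the convention that the heat of adsorption is positive for favorable binding. The insertion energies are hypothetical, and the paper's exact expressions and units may differ.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K

def widom_averages(insertion_energies, temperature):
    """Return (boltzmann_average, heat_of_adsorption) from insertion energies in eV.

    boltzmann_average = <exp(-dE / kT)>, proportional to the Henry coefficient.
    heat_of_adsorption = kT - <dE exp(-dE / kT)> / <exp(-dE / kT)>, in eV.
    """
    kt = K_B * temperature
    weights = [math.exp(-de / kt) for de in insertion_energies]
    boltzmann_average = sum(weights) / len(weights)
    weighted_energy = sum(de * w for de, w in zip(insertion_energies, weights)) / sum(weights)
    return boltzmann_average, kt - weighted_energy

# Hypothetical host-guest interaction energies from random trial insertions (eV).
samples = [-0.30, -0.25, -0.05, 0.10, 0.80, -0.28, 0.02, 1.50]
henry_factor, q_st = widom_averages(samples, temperature=300.0)
```

Because the average is Boltzmann-weighted, a few strongly binding insertion sites dominate the result. Small systematic errors in a model's host-guest energies can therefore shift a material from one adsorption regime to another.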

4. Model coverage and empirical findings

MLIP Arena natively supports a broad set of open-source models, including MACE-MP(M), MACE-MPA, MatterSim, SevenNet, EquiformerV2 variants, eSEN, eSCN, M3GNet, CHGNet, ORB, ORBv2, DeepMD, and ALIGNN, while remaining open to arbitrary ASE-compatible calculators (Chiang et al., 25 Sep 2025). This breadth is central to the benchmark’s function: it is designed to compare models with different architectural priors, including equivariant energy-based models, direct-force predictors, and universal materials foundation models.

A central empirical result is that no single model dominates all task categories (Chiang et al., 25 Sep 2025). On equation-of-state and diatomic smoothness tests, MACE-MPA, eSEN, MACE-MP(M), and MatterSim rank strongly on tortuosity, monotonicity, and conservative-force diagnostics, whereas ORBv2, ALIGNN, and some EquiformerV2 variants can exhibit energy jumps, gradient flips, or weak short-range repulsion. On RM24 runtime tests, ORBv2 is often the fastest model and MatterSim also performs strongly, but several models without sufficiently stiff short-range cores fail under the high-pressure NPT protocol.

The benchmark also isolates systematic failure modes that do not follow from conventional static error reporting. Direct-force models show large deviations between forces and energy gradients, strong growth of energy drift with differential entropy, and large COM drift in thermostatted reactive dynamics; in the hydrogen-combustion task, their COM drift over 1 ns is far larger than that of conservative gradient-based models (Chiang et al., 25 Sep 2025). Some models complete combustion trajectories but produce incorrect reaction enthalpies or fail to ignite; ORB and ORBv2 are specifically reported as very fast yet largely non-reactive in this system.

In defect migration and dynamical-stability case studies, MACE-MP(M) and MatterSim often provide the best balance of NEB symmetry, barrier quality, and 2D phonon classification performance, while MatterSim stands out in CO$_2$-adsorption classification (Chiang et al., 25 Sep 2025). The 2D-materials task remains difficult for all tested models, with best macro F1 only around 0.42, and several models display a strong bias toward predicting instability. In the BaZrO$_3$ tilt scan, MACE-MP(M), MatterSim, CHGNet, and SevenNet qualitatively reproduce the quartic-to-quadratic transition, whereas M3GNet and ORBv2 miss or distort essential symmetry features.
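The macro F1 score used for the classification tasks averages per-class F1 with equal weight, so a model biased toward one label is penalized even when that label is common. A minimal sketch with hypothetical labels (not benchmark data):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical labels: a model biased toward predicting "unstable".
truth = ["stable", "stable", "stable", "unstable", "unstable", "stable"]
preds = ["unstable", "unstable", "stable", "unstable", "unstable", "unstable"]
score = macro_f1(truth, preds)
```

Here the biased predictor recovers every unstable case but misses most stable ones, and the macro average falls below 0.5 as a result.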

These outcomes reinforce the benchmark’s central argument: a model can be fast, or accurate on a standard dataset, yet still fail on reactivity, conservation, asymptotic behavior, or symmetry-sensitive materials physics (Chiang et al., 25 Sep 2025).

5. Relation to adjacent benchmark programs

MLIP Arena belongs to a broader family of benchmarking efforts that evaluate MLIPs inside simulation workflows rather than only on static train/test splits. In adsorption science, MLIP-MC defines an “adsorption arena” in which Widom insertion and GCMC are run with standardized MOFs, adsorbate, thermodynamic conditions, and protocols, making isosteric heats, isotherms, and interaction energies directly comparable across pluggable ASE backends (Edwards et al., 14 Feb 2026). That work showed that current universal MLIPs exhibit systematic adsorption biases and that training data composition, especially the presence of MOF–adsorbate and multi-adsorbate configurations, dominates architecture in determining performance.

For molecular kinetics, Landscape17 provides a testing and training arena based on complete kinetic transition networks for six multi-minimum molecules, including minima, transition states, steepest-descent paths, energies, forces, and Hessian eigenspectra (Cărare et al., 22 Aug 2025). Applied to contemporary architectures, it shows that all tested models miss over half of the reference transition states and generate stable unphysical structures across the potential-energy surface. Pathway-data augmentation improves global kinetics, but spurious minima remain abundant.

At the model-adaptation level, “Fine-tuning MLIP foundation models: strategies for accuracy and transferability” evaluates seven fine-tuning strategies across five chemically diverse benchmarks and three foundation-model generations, with an explicit focus on accuracy, forgetting, short-range repulsion, and transferability (Tompa et al., 10 Jun 2026). That study finds that foundation quality, correct $E_0$ initialization, and hyperparameters often matter more than the fine-tuning method itself; naive fine-tuning is usually best for narrow single-system deployment, whereas multihead replay is the only tested strategy that consistently preserves out-of-distribution robustness and many-body short-range repulsion.

At the workflow-automation level, MLIPilot frames MLIP optimization itself as an arena-like process in which tool-calling LLM agents edit training code, launch jobs, and are judged by a fixed physically constrained scorecard covering accuracy, drift, and throughput (Osaro et al., 29 May 2026). This extends the arena idea from model evaluation to auditable automated MLIP development.

Taken together, these efforts suggest an emerging ecosystem of specialized arenas: adsorption-specific, kinetics-specific, fine-tuning-specific, and now broad physics-and-chemistry evaluation under a single open platform.

6. Significance, limitations, and outlook

The significance of MLIP Arena lies in its reframing of what it means for an MLIP to be “good.” Rather than privileging one dataset or one scalar error, it tests whether a model is smooth, conservative, stable, equivariant where appropriate, reactive when necessary, and computationally efficient enough for realistic use (Chiang et al., 25 Sep 2025). In that sense, it serves as a benchmark for deployability as much as for predictive accuracy.

The platform also exposes a common misconception in MLIP evaluation: that success on static energy-and-force benchmarks implies readiness for production molecular dynamics or materials discovery. MLIP Arena demonstrates that this implication often fails under short-range compression, high temperature and pressure, reactive dynamics, OOD rotations, vacancy migration, adsorption, and phonon-based stability analysis (Chiang et al., 25 Sep 2025). This suggests that future model development will likely need stronger architectural priors, better treatment of asymptotic regimes, and training objectives that explicitly account for conservation, symmetry, and finite-temperature behavior.

The paper positions the platform as extensible. Proposed future directions include more diverse chemistries, coupling to experimental observables, active-learning loops guided by differential entropy, and additional phenomenon-oriented tasks such as crack propagation, dislocation mobility, and complex catalysis (Chiang et al., 25 Sep 2025). Given the platform’s ASE-and-Prefect structure, these extensions can be added as new workflows without changing the overall benchmarking philosophy.

In the broader development of MLIPs, MLIP Arena marks a transition from dataset-centric ranking toward behavior-centric validation. For foundation models intended for open-ended atomistic simulation, that transition is likely to be decisive.