PhononBench: AI-Driven Phonon Benchmark

Updated 31 December 2025

PhononBench is a benchmark suite and dataset framework that standardizes the evaluation of phonon properties and dynamical stability in AI-generated crystal structures.
It employs universal ML interatomic potentials and finite-displacement phonon calculations to assess stability by confirming the absence of imaginary phonon modes.
The framework supports high-throughput screening and detailed benchmarking of both harmonic and anharmonic phonon properties, guiding advancements in AI-driven materials discovery.

PhononBench is a benchmark suite and dataset framework for standardized, large-scale, and physically rigorous assessment of phonon-related properties and dynamical stability in crystals, with a principal focus on AI-generated structures. It functions as both a repository and a protocol for computational materials discovery, combining universal machine-learning interatomic potentials (uMLIPs), finite-displacement phonon calculations, and automated workflows. A crystal is classified as physically viable when its phonon spectrum contains no imaginary modes, which indicates that the structure lies at a local minimum of the potential-energy surface; this test complements thermodynamic criteria (e.g., formation enthalpy) rather than relying on them alone. The resulting infrastructure supports the evaluation and guidance of generative models for crystal design, high-throughput stability screening, and precision tests of fundamental physics using phononic resonators.

1. Benchmark Foundations: Purpose, Metrics, and Scope

PhononBench establishes dynamical stability as the cardinal metric for evaluating whether a generated structure is truly a local minimum of the underlying potential-energy surface, as distinct from conventional thermodynamic metrics such as energy above hull. Dynamical stability is operationally defined by the absence of imaginary phonon frequencies throughout the Brillouin zone, as obtained by diagonalizing the dynamical matrix $D_{ij}(\mathbf{q})$: $D_{ij}(\mathbf{q}) = \frac{1}{\sqrt{m_i\,m_j}} \sum_{\ell} \Phi_{ij}(0,\ell) \, e^{-i\mathbf{q}\cdot\mathbf{R}_\ell}$, where $i$ and $j$ label atoms in the unit cell, $\Phi_{ij}(0,\ell)$ is the force-constant tensor coupling atom $i$ in the reference cell to atom $j$ in cell $\ell$, $m_i$ is the mass of atom $i$, and $\mathbf{R}_\ell$ is a lattice vector. A crystal is declared stable when the eigenvalues $\omega^2(\mathbf{q})$ of $D(\mathbf{q})$ satisfy $\omega^2(\mathbf{q}) \ge 0$ for all $\mathbf{q}$. The three acoustic modes have $\omega^2 = 0$ at $\Gamma$, so numerical implementations typically apply a small tolerance instead of a strict inequality.
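As a minimal illustration of this criterion, the sketch below evaluates $D(q)$ for a one-dimensional monatomic chain. This is a toy model and not part of PhononBench: the function names, the spring constants, and the tolerance are all assumptions chosen for illustration. The force constants are stored as a mapping from lattice offset $\ell$ to $\Phi(0,\ell)$.

```python
import cmath
import math


def dynamical_matrix_1d(q, force_constants, mass, a=1.0):
    """D(q) = (1/m) * sum_l Phi(0, l) * exp(-i q R_l), with R_l = l * a."""
    total = sum(phi * cmath.exp(-1j * q * (l * a))
                for l, phi in force_constants.items())
    return total / mass


def omega_squared_on_grid(force_constants, mass, a=1.0, n_q=201):
    """Sample omega^2(q) on a uniform grid over the first Brillouin zone."""
    qs = [-math.pi / a + (2.0 * math.pi / a) * i / (n_q - 1) for i in range(n_q)]
    # For one atom per cell D(q) is 1x1, so its eigenvalue is D(q) itself;
    # it is real because Phi(0, l) = Phi(0, -l).
    return [dynamical_matrix_1d(q, force_constants, mass, a).real for q in qs]


def is_dynamically_stable(omega2, tol=1e-8):
    """Stable if no omega^2 is negative beyond a small numerical tolerance."""
    return min(omega2) >= -tol


# Nearest-neighbour springs only (K1 = 1): omega^2 = 2 * (1 - cos q) >= 0.
stable_chain = {0: 2.0, 1: -1.0, -1: -1.0}

# Adding a negative second-neighbour spring (K2 = -0.5, hypothetical value)
# makes omega^2 negative at small q, i.e. imaginary phonon frequencies.
unstable_chain = {0: 1.0, 1: -1.0, -1: -1.0, 2: 0.5, -2: 0.5}

for name, fc in [("stable", stable_chain), ("unstable", unstable_chain)]:
    w2 = omega_squared_on_grid(fc, mass=1.0)
    print(name, "min omega^2 =", min(w2), "stable:", is_dynamically_stable(w2))
```

In a real crystal $D(\mathbf{q})$ is a $3N\times3N$ Hermitian matrix for $N$ atoms per cell and must be diagonalized numerically at each $\mathbf{q}$; the one-dimensional case only shows the structure of the test.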

PhononBench uses the dynamical-stability rate as its primary evaluation metric: $\text{stability rate} = \frac{\#\,\text{phonon-stable crystals}}{\#\,\text{successfully relaxed structures}} \times 100\%$. Because the denominator counts only successfully relaxed structures, the metric normalizes over differences among models in production volume, relaxation success, and compliance, and it converges reliably with sample sizes exceeding ~4000 per model (Han et al., 24 Dec 2025).
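The arithmetic can be sketched as follows, using the overall counts reported later in this article (28,119 phonon-stable crystals out of 108,843 relaxed structures). The standard-error helper is an added assumption, not part of the source: it treats each structure as an independent binomial trial, which gives a rough sense of why a few thousand samples per model already yield a tight estimate.

```python
import math


def stability_rate(n_stable, n_relaxed):
    """Percentage of successfully relaxed structures that are phonon-stable."""
    if n_relaxed <= 0:
        raise ValueError("n_relaxed must be positive")
    if not 0 <= n_stable <= n_relaxed:
        raise ValueError("n_stable must lie between 0 and n_relaxed")
    return 100.0 * n_stable / n_relaxed


def binomial_standard_error(rate_percent, n_relaxed):
    """Standard error in percentage points, assuming independent outcomes."""
    p = rate_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_relaxed)


overall = stability_rate(28_119, 108_843)
print(f"overall stability rate: {overall:.2f}%")  # 25.83%
print(f"approx. standard error at n=4000: "
      f"{binomial_standard_error(overall, 4000):.2f} percentage points")
```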

2. Computational Methodologies: Datasets, Force Fields, and Workflow

PhononBench incorporates six leading generative models (CrystaLLM, MatterGen, DiffCSP, InvDesFlow-AL, CrystalFlow, and CrystalFormer), which together produce over 221,000 crystals, of which 108,843 are successfully relaxed and evaluated. Relaxation and phonon computations rely on the MatterSim universal machine-learning interatomic potential, a deep-learning model pretrained on 17 million DFT data points covering the first 89 elements and a broad range of temperatures (0–5000 K) and pressures (0–1000 GPa) (Han et al., 24 Dec 2025, Loew et al., 2024).

Phonon spectra are extracted via the finite-displacement protocol (Phonopy): (1) each unit cell is expanded to a $2\times2\times2$ supercell; (2) atomic displacements of $0.01$ Å are applied; (3) MatterSim computes the resulting forces; (4) the forces are converted into symmetrized force-constant matrices $\Phi$; (5) phonon band structures are interpolated along high-symmetry paths. With MatterSim, the error in the maximum phonon frequency ( $\text{MAE}(\omega_{\text{max}}) = 17$ K) is smaller than the spread between DFT exchange-correlation functionals (PBE versus PBEsol, $\sim33$ K), and dynamical-stability classification reaches 95% accuracy (Loew et al., 2024).
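Steps (1) to (4) can be illustrated on a toy problem without any external package. In the sketch below, a periodic one-dimensional chain stands in for the supercell and a hypothetical anharmonic spring model (`toy_forces`) stands in for MatterSim; neither is the actual PhononBench pipeline, which uses Phonopy on three-dimensional crystals. The central-difference scheme and all parameter values other than the 0.01 displacement are assumptions.

```python
import cmath
import math


def toy_forces(u, k=1.0, g=0.5):
    """Forces on a periodic chain with bond energy 0.5*k*x**2 + g*x**3/3.

    u[i] is the displacement of atom i; x is the stretch of a bond.
    This toy force field is a stand-in for a real interatomic potential.
    """
    n = len(u)
    forces = []
    for i in range(n):
        x_right = u[(i + 1) % n] - u[i]   # stretch of the bond to the right
        x_left = u[i] - u[i - 1]          # stretch of the bond to the left
        forces.append((k * x_right + g * x_right ** 2)
                      - (k * x_left + g * x_left ** 2))
    return forces


def finite_displacement_force_constants(n_cells=8, delta=0.01):
    """Phi(0, l) = -dF_l/du_0, estimated by displacing atom 0 by +/- delta."""
    plus = [0.0] * n_cells
    minus = [0.0] * n_cells
    plus[0], minus[0] = delta, -delta
    f_plus, f_minus = toy_forces(plus), toy_forces(minus)
    phi = {}
    for l in range(n_cells):
        # Map the atom index to the nearest periodic image of the lattice offset.
        offset = l if l <= n_cells // 2 else l - n_cells
        phi[offset] = -(f_plus[l] - f_minus[l]) / (2.0 * delta)
    return phi


def omega_squared(q, phi, mass=1.0, a=1.0):
    """Eigenvalue of the 1x1 dynamical matrix built from the force constants."""
    d = sum(p * cmath.exp(-1j * q * l * a) for l, p in phi.items()) / mass
    return d.real


phi = finite_displacement_force_constants()
grid = [-math.pi + 2.0 * math.pi * i / 100 for i in range(101)]
print("Phi(0,0) =", phi[0], " Phi(0,1) =", phi[1])
print("min omega^2 on grid =", min(omega_squared(q, phi) for q in grid))
```

The central difference cancels the quadratic force term of the toy model, so the recovered force constants are the harmonic ones; the supercell must be large enough that force constants beyond its half-width are negligible.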

A summary table illustrates the core dataset and workflow parameters:

Component	Methods/Models	Key Figures
Generative models	6 (CrystaLLM, MatterGen, etc.)	221,000 candidates, 108,843 relaxed
Evaluation engine	MatterSim (uMLIP, Phonopy)	17M DFT points, full BZ phonon spectra
Stability metric	Dynamical-stability rate	$25.83\%$ overall, $41.0\%$ for MatterGen
Reference data	DFT, Phonopy, OQMD	10,000–72,870 structures/crystals

3. Benchmarking of ML Interatomic Potentials for Phonons

PhononBench further systematizes benchmarking of universal MLIPs for harmonic and anharmonic phonon properties, thermodynamic parameters, and dynamical stability. Notable model comparisons include M3GNet, CHGNet, MACE, SevenNet, MatterSim, ORB, OMat24, and EquiformerV2 (Loew et al., 2024, Anam et al., 3 Sep 2025).

Phonon-related metrics comprise the mean absolute error in the maximum phonon frequency ( $\text{MAE}(\omega_{\text{max}})$ ) and the errors in entropy, Helmholtz free energy, and heat capacity, all measured against ab initio references. Dynamical stability is further quantified via confusion matrices with four categories: True Stable (TS), False Unstable (FU), True Unstable (TU), and False Stable (FS). MatterSim and EquiformerV2 models achieve near-DFT accuracy: MatterSim records $\text{MAE}(\omega_{\text{max}})=17$ K and TS accuracy $=95\%$ (Loew et al., 2024), while EquiformerV2(FT) yields the best second-order IFC RMSE ( $1.16\,\text{eV/Å}^2$ ) and the best reproduction of dynamical stability ( $81\%$ no-imaginary-modes) (Anam et al., 3 Sep 2025).
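The two kinds of metric can be sketched in a few lines. The category definitions below are an assumed reading of the labels (for example, "False Unstable" is taken to mean a structure that the reference finds stable but the model predicts unstable), and the input values in the usage lines are placeholders, not benchmark data.

```python
def mean_absolute_error(predicted, reference):
    """MAE between predicted and reference values, e.g. omega_max in K."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def stability_confusion(reference_stable, predicted_stable):
    """Count TS, FU, TU, FS from per-structure stability flags (booleans)."""
    if len(reference_stable) != len(predicted_stable):
        raise ValueError("inputs must be of equal length")
    counts = {"TS": 0, "FU": 0, "TU": 0, "FS": 0}
    for ref, pred in zip(reference_stable, predicted_stable):
        if ref and pred:
            counts["TS"] += 1      # stable in reference, predicted stable
        elif ref and not pred:
            counts["FU"] += 1      # stable in reference, predicted unstable
        elif not ref and not pred:
            counts["TU"] += 1      # unstable in reference, predicted unstable
        else:
            counts["FS"] += 1      # unstable in reference, predicted stable
    return counts


# Placeholder inputs for illustration only.
mae = mean_absolute_error([510.0, 300.0], [500.0, 320.0])
counts = stability_confusion([True, True, False, False],
                             [True, False, False, True])
print("MAE:", mae, "confusion:", counts)
```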

Key performance table (harmonic phonons, sample metrics):

Model	$\text{MAE}(\omega_{\text{max}})$ [K]	TS (%)
MatterSim	17	95
EquiformerV2(FT)	N/A	81
PBEsol (DFT)	33	97

A plausible implication is that force RMSE alone is insufficient; careful IFC fitting and stability testing are required for robust phonon screening.

4. High-Throughput Dynamical-Stability Assessment of AI-Generated Crystals

PhononBench exposes widespread failures in generative models to yield physically valid crystals. Across all generated structures, only 25.83% are dynamically stable. MatterGen tops the cohort at 41.0%; CrystaLLM yields only 3.0% (Han et al., 24 Dec 2025). In property-targeted generation (e.g., controlling band gap), dynamical-stability rates remain low (23.5% at $E_g=0.5$ eV; 11.6% at $E_g=4.5$ eV; overall 15.6%). For space-group-conditioned generation by CrystalFormer, higher symmetry (cubic) crystals reach 49.2% stability versus 17% in triclinic (average 34.4%).

PhononBench identifies 28,119 crystals as phonon-stable across the Brillouin zone. Elemental distribution in this pool matches expected chemical trends, e.g., dominance of O, Li, F and absence of noble gases, reflecting the biases in the generative models.

5. Standardization of Anharmonic Phonon Benchmarks

PhononBench protocols for anharmonic phonon characterization utilize irreducible-derivative extraction of force constants up to quartic order, "noiseless" IDMD and "real-world" DFTMD datasets, and rigorous training/validation of MLIPs including GAP, BPNN, and E(3)-equivariant GNNs (Bandi et al., 2024). The standardized suite incorporates benchmarking of phonon dispersion, self-energy diagrams (bubble, loop, sunset), linewidths, lineshifts, and BTE-derived thermal conductivity ( $\kappa(T)$ ), reporting RMSE metrics up to 5th order.

GNN models recover phonon dispersions within 1 cm $^{-1}$ , capture linewidths and lineshifts to within 5–15%, and estimate $\kappa(T)$ to within 10% of DFT—outperforming GAP and BPNN in all cases.

6. Algorithmic and Hardware Considerations: High-Performance Phonon Computation

PhononBench also encompasses the benchmarking of algorithmic frameworks for large-scale phonon calculations. FourPhonon_GPU, an OpenACC-enabled extension of the FourPhonon package, is evaluated in this role for three- and four-phonon scattering-rate and thermal-conductivity calculations across heterogeneous CPU–GPU platforms (Guo et al., 1 Oct 2025). Hybrid CPU–GPU workflows accelerate scattering-rate kernels by 25× and full workflows by 10×: phase-space filtering and channel enumeration run on the CPU, while the weighted phase space (WP) and scattering rates ( $\Gamma$ ) are evaluated in massively parallel fashion on GPUs. Mode-by-mode parallelization delivers up to 7× speedup under constrained memory, whereas the "all-modes" strategy offers higher throughput at the cost of more than 80 GB of memory on A100 GPUs.

A plausible implication is that perfect occupancy and minimal warp divergence on GPUs require intricate partitioning of irregular, symmetry-filtered channel lists.

7. Expansion to Fundamental Physics: Lorentz Invariance Tests

The PhononBench protocol and hardware are further extended to precision tests of Lorentz invariance in the phonon sector using rotating quartz bulk-acoustic-wave (BAW) resonators (Goryachev et al., 2018). By continuously comparing orthogonally mounted oven-controlled crystal oscillators (OCXOs) on a turntable at 1 Hz, the setup achieves a fractional frequency stability of $\sigma_y(1\,\text{s})\sim1\times10^{-13}$ and a projected sensitivity to neutron-sector Standard-Model Extension (SME) coefficients of $10^{-16}$ GeV, a 100-fold improvement over prior acoustic-phonon experiments. The demodulated least-squares algorithm extracts harmonics of the normalized frequency difference and sets bounds on Lorentz-violating parameters. Future enhancements aim at cryogenic operation (Q $\sim10^9$ ) and expanded data sets to probe higher-dimension operators.
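The demodulation step can be sketched as a projection of the sampled frequency difference onto sine and cosine references at a chosen harmonic of the rotation frequency. This is a simplified illustration under stated assumptions, not the analysis code of Goryachev et al.: the function names, the default harmonic, the sampling parameters, and the synthetic signal are all hypothetical. For uniformly sampled data spanning a whole number of rotation periods, this projection coincides with an ordinary least-squares fit of a constant plus one sine and one cosine term.

```python
import math


def demodulate(samples, sample_rate_hz, rotation_hz, harmonic=2):
    """Return (sine, cosine) amplitudes at harmonic * rotation_hz.

    Assumes uniform sampling over an integer number of rotation periods,
    so that the sine, cosine, and constant terms are mutually orthogonal.
    """
    n = len(samples)
    w = 2.0 * math.pi * harmonic * rotation_hz
    s = sum(y * math.sin(w * k / sample_rate_hz) for k, y in enumerate(samples))
    c = sum(y * math.cos(w * k / sample_rate_hz) for k, y in enumerate(samples))
    return 2.0 * s / n, 2.0 * c / n


def synthetic_signal(n, sample_rate_hz, rotation_hz, harmonic, a_sin, a_cos, offset):
    """Noise-free test signal with known amplitudes (illustration only)."""
    w = 2.0 * math.pi * harmonic * rotation_hz
    return [offset
            + a_sin * math.sin(w * k / sample_rate_hz)
            + a_cos * math.cos(w * k / sample_rate_hz)
            for k in range(n)]


# 10 s of data at 100 samples per second, i.e. 10 full turns at 1 Hz.
y = synthetic_signal(1000, 100.0, 1.0, 2, a_sin=3e-13, a_cos=-1e-13, offset=5e-13)
amp_sin, amp_cos = demodulate(y, 100.0, 1.0, harmonic=2)
print("recovered amplitudes:", amp_sin, amp_cos)
```

In an actual experiment the recovered amplitudes would themselves be time series, fitted in a second stage to slower modulations before any bound on Lorentz-violating coefficients is quoted.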

8. Prospects and Future Directions

PhononBench highlights systematic deficiencies in generative models, with low dynamical-stability rates prevailing. Suggested future directions include integration of phonon-based penalties and uMLIP feedback into generation/training objectives, co-optimization of thermodynamic and dynamical criteria, "on-the-fly" phonon screening in active learning, and exploration of coordinate and diffusion representations that better preserve potential wells (Han et al., 24 Dec 2025). For MLIPs, recommendations comprise expanding training to include supercell distortions, explicit IFC derivatives, physics-informed loss, and joint energy+force objectives, as well as protocols for up-to-3rd-order OLS IFC fitting and BTE solution with at least $10^8$ three-phonon channels (Anam et al., 3 Sep 2025).

PhononBench thus constitutes both a rigorous evaluation toolset and a repository for physically vetted crystal structures, offering a foundation for the next generation of physically viable AI-driven crystal design and materials discovery.