MACE4IR: ML Model for Molecular IR Spectroscopy
- MACE4IR is a machine-learning foundation model for molecular IR spectroscopy that uses symmetry-adapted message passing to ensure predictions obey rotational, translational, and permutation invariance.
- It is trained on 10 million DFT-computed molecular geometries across nearly 80 elements, providing robust performance and transferability for diverse chemical systems.
- The model supports both harmonic analysis and MD-based methodologies to predict IR spectra efficiently, offering DFT-like accuracy at a fraction of the computational cost.
MACE4IR is a machine-learning foundation model purpose-built for molecular infrared (IR) spectroscopy. It is built on the MACE (Message passing with ACcurate and equivariant features) architecture, a message-passing neural network that rigorously enforces the rotational, translational, and permutation symmetries inherent to molecular systems. MACE4IR is trained on a dataset of 10 million molecular geometries with DFT-computed energies, forces, and dipole moments sampled from the QCML dataset, covering nearly 80 different elements and a wide diversity of chemical space. The model enables accurate, efficient prediction of energies, forces, dipole moments, and full IR spectra for organic, inorganic, and metal-containing molecules at a computational cost far lower than DFT. It is intended as a foundation model for experimental and theoretical spectroscopy in chemistry, biology, and materials science.
1. Model Architecture and Symmetry Principles
MACE4IR leverages the symmetry-adaptive MACE neural network architecture, itself an advanced equivariant message-passing framework. Each atom is encoded as a feature vector, with tensor-based message passing performed across atomic neighborhoods. The updates are designed so that all predictions are equivariant or invariant under the relevant physical symmetry groups—rotations, translations, and permutations.
Scalars such as total energies are modeled as $L = 0$ (fully rotationally invariant) features. Vectorial and tensorial properties, such as atomic forces and molecular dipole moments, employ higher-order equivariant features ($L \geq 1$), i.e., their transformation rules mirror the physical action of rotations. The architecture decomposes information transfer into contributions classified by angular momentum quantum number $L$, and implements deep multi-layered representations with explicit control over channel dimensionality and cutoff.
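The difference between invariant and equivariant outputs can be checked numerically. The sketch below does not use MACE4IR or the MACE library; `toy_energy` and `toy_dipole` are hypothetical stand-ins (a pairwise-distance energy and a point-charge dipole) chosen only because they have the same symmetry behavior as a scalar and a vector prediction.

```python
import numpy as np

def toy_energy(positions):
    """Invariant scalar: sum of inverse pairwise distances (stand-in for an energy)."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    upper = np.triu_indices(len(positions), k=1)
    return np.sum(1.0 / dist[upper])

def toy_dipole(positions, charges):
    """Equivariant vector: point-charge dipole about the centroid (stand-in for a dipole)."""
    centered = positions - positions.mean(axis=0)
    return charges @ centered

def random_rotation(rng):
    """Random proper rotation matrix built from a QR decomposition."""
    q_mat, r_mat = np.linalg.qr(rng.normal(size=(3, 3)))
    q_mat = q_mat * np.sign(np.diag(r_mat))  # make the factorization unique
    if np.linalg.det(q_mat) < 0:             # enforce det = +1 (no reflection)
        q_mat[:, 0] = -q_mat[:, 0]
    return q_mat

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 3))
charges = np.array([0.4, -0.4, 0.2, -0.3, 0.1])  # arbitrary, net charge zero

R = random_rotation(rng)
shift = np.array([1.0, -2.0, 0.5])
pos_t = pos @ R.T + shift          # rotate, then translate every atom
perm = rng.permutation(len(pos))   # relabel the atoms

# Energy: unchanged by rotation, translation, and relabeling.
print(np.isclose(toy_energy(pos_t), toy_energy(pos)))
print(np.isclose(toy_energy(pos[perm]), toy_energy(pos)))
# Dipole: rotates with the molecule, unchanged by translation and relabeling.
print(np.allclose(toy_dipole(pos_t, charges), R @ toy_dipole(pos, charges)))
print(np.allclose(toy_dipole(pos[perm], charges[perm]), toy_dipole(pos, charges)))
```

A symmetry-adapted network guarantees these identities by construction, so they do not have to be learned from augmented data.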
MACE4IR consists of two principal modules:
- MACE-EF (Energies and Forces): Predicts total potential energy and atom-wise forces. This enables its use as a general-purpose machine-learned interatomic potential (MLIP).
- MACE-D (Dipole Prediction): Predicts molecular dipole moments, necessary for IR activity and the calculation of absorption intensities.
Both modules are built on the same symmetry-adapted MACE philosophy but are trained independently on their respective targets from the same structural inputs.
2. Dataset and Training Procedure
Training utilizes a filtered subset of the QCML dataset, comprising 10 million geometries. The dataset's elemental coverage spans approximately 80 elements, with molecules sampled from diverse classes:
- Organic compounds and small biomolecules
- Inorganic clusters
- Organometallic and transition metal complexes
- Atmospheric and environmental molecules
For MACE-EF, the model is trained on DFT reference total energies and atom-resolved gradients (forces). MACE-D uses the same molecular geometries with corresponding DFT dipole moments as targets. The performance metrics are mean absolute errors (MAEs), which improve with model and training set size. For the largest model trained on the full dataset:
Quantity | Reported MAE |
---|---|
Energy | 2.1 meV/atom |
Forces | 30 meV/Å |
Dipole Moments | 23 mD (millidebye/cart) |
Training on both energetic and dipolar targets allows the two modules together to encode force fields and spectroscopically relevant observables with high fidelity.
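For orientation, the error metrics in the table can be computed along the following lines. This is a minimal sketch: the function names and the toy numbers are made up for illustration, and the normalization conventions (per atom for energies, per Cartesian component for vectors) are assumptions rather than the documented evaluation protocol.

```python
import numpy as np

def energy_mae_per_atom(e_pred, e_ref, n_atoms):
    """Mean absolute energy error, with each molecule's error divided by its atom count."""
    return float(np.mean(np.abs(e_pred - e_ref) / n_atoms))

def component_mae(pred, ref):
    """Mean absolute error over all Cartesian components (forces or dipoles)."""
    return float(np.mean(np.abs(pred - ref)))

# Toy data (hypothetical values, energies in eV, dipoles in Debye).
e_ref = np.array([-10.01, -20.00])
e_pred = np.array([-10.00, -20.02])
n_atoms = np.array([2, 4])

dip_ref = np.array([[0.00, 0.00, 1.85]])
dip_pred = np.array([[0.01, -0.02, 1.88]])

mae_e = energy_mae_per_atom(e_pred, e_ref, n_atoms)
mae_d = component_mae(dip_pred, dip_ref)
print(f"energy MAE: {1000 * mae_e:.2f} meV/atom")
print(f"dipole MAE: {1000 * mae_d:.2f} mD")
```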
3. Infrared Spectra Prediction Methodologies
MACE4IR predicts IR spectra through two principal methodologies, exploiting its joint learning of energies, forces, and dipoles:
- Harmonic Analysis:
  - The vibrational frequencies and normal modes are obtained by diagonalizing the mass-weighted Hessian matrix, computed as the second derivative of the predicted total energy with respect to nuclear coordinates.
  - IR intensities are derived from the derivative of the predicted dipole moment with respect to each normal coordinate. The analytic formula for the integrated absorption intensity of mode $k$ is:

    $$A_k = \frac{N_A}{12\,\epsilon_0 c^2} \left| \frac{\partial \boldsymbol{\mu}}{\partial Q_k} \right|^2$$

    where $Q_k$ is a mass-weighted normal coordinate, $\boldsymbol{\mu}$ is the dipole vector, $N_A$ is Avogadro’s number, $c$ the speed of light, and $\epsilon_0$ the vacuum permittivity.
- MD-Based Spectra:
  - The model enables efficient molecular dynamics (MD) simulations by acting as an MLIP, generating long-timescale molecular trajectories at DFT-like quality.
  - Dipole moments are predicted at each time step; the time-autocorrelation function of these dipoles is Fourier transformed (consistent with the Wiener–Khinchin theorem) to produce IR spectra that inherently include anharmonic and temperature-dependent effects.
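The harmonic route described above can be sketched with any pair of callables that return an energy and a dipole. The example below is an illustration under stated assumptions, not the MACE4IR implementation: `toy_bond_energy` and `toy_bond_dipole` are hypothetical stand-ins for the two predictors (a harmonic diatomic with fixed point charges), derivatives are taken by finite differences, and all quantities are in arbitrary units, so the intensities omit the constant prefactor built from the physical constants named above.

```python
import numpy as np

def hessian_fd(energy_fn, x0, h=1e-3):
    """Central finite-difference Hessian of energy_fn at flattened coordinates x0."""
    n = x0.size
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            def displaced(si, sj):
                x = x0.copy()
                x[i] += si * h
                x[j] += sj * h
                return energy_fn(x)
            hess[i, j] = hess[j, i] = (
                displaced(1, 1) - displaced(1, -1)
                - displaced(-1, 1) + displaced(-1, -1)
            ) / (4 * h * h)
    return hess

def harmonic_ir(energy_fn, dipole_fn, x0, masses, h=1e-3):
    """Return mass-weighted Hessian eigenvalues and |d mu / d Q_k|^2 for each mode."""
    m3 = np.repeat(masses, 3)                       # one mass per Cartesian coordinate
    hess_mw = hessian_fd(energy_fn, x0, h) / np.sqrt(np.outer(m3, m3))
    eigenvalues, modes = np.linalg.eigh(hess_mw)    # eigenvalue = squared angular frequency
    # Cartesian dipole derivatives, shape (3N, 3).
    dmu_dx = np.array([
        (dipole_fn(x0 + h * e) - dipole_fn(x0 - h * e)) / (2 * h)
        for e in np.eye(x0.size)
    ])
    # Chain rule: d mu / d Q_k = sum_i (d mu / d x_i) * modes[i, k] / sqrt(m_i).
    dmu_dq = (modes / np.sqrt(m3)[:, None]).T @ dmu_dx
    return eigenvalues, np.sum(dmu_dq**2, axis=1)

# Hypothetical stand-ins for the energy and dipole predictors.
def toy_bond_energy(x, k=2.0, r0=1.0):
    p = x.reshape(-1, 3)
    return 0.5 * k * (np.linalg.norm(p[1] - p[0]) - r0) ** 2

def toy_bond_dipole(x, charges=np.array([0.5, -0.5])):
    return charges @ x.reshape(-1, 3)

x_eq = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])   # equilibrium geometry
masses = np.array([1.0, 3.0])
eigenvalues, intensities = harmonic_ir(toy_bond_energy, toy_bond_dipole, x_eq, masses)

# The single vibrational mode is the largest eigenvalue; the rest are
# translations and rotations with (near-)zero eigenvalues.
print("stretch frequency:", np.sqrt(eigenvalues[-1]))
print("stretch intensity:", intensities[-1])
```

For this toy diatomic the stretch eigenvalue should equal the force constant divided by the reduced mass, and the intensity the squared charge divided by the reduced mass, which gives a quick sanity check on the chain rule. A production workflow would typically use analytic or automatic derivatives instead of finite differences and project out translations and rotations before assigning modes.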
Spectra predicted by both methodologies are reported to agree closely with DFT calculations and experimental data, in both peak positions and absorption intensities.
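The MD-based route reduces to a signal-processing step once a dipole trajectory is available. The following sketch assumes a synthetic trajectory in place of model-predicted dipoles; `ir_spectrum_from_dipoles` is a hypothetical helper, and frequency-dependent prefactors, quantum correction factors, and conversion to wavenumbers are deliberately left out.

```python
import numpy as np

def ir_spectrum_from_dipoles(mu, dt):
    """Line shape from a dipole trajectory mu of shape (n_steps, 3) sampled every dt."""
    mu = mu - mu.mean(axis=0)                  # remove the static dipole
    n = len(mu)
    # Wiener-Khinchin: the autocorrelation is the inverse FFT of the power
    # spectrum. Zero-padding to 2n avoids circular wrap-around.
    ft = np.fft.rfft(mu, n=2 * n, axis=0)
    acf = np.fft.irfft(np.abs(ft) ** 2, n=2 * n, axis=0)[:n].sum(axis=1)
    acf /= np.arange(n, 0, -1)                 # number of overlapping samples per lag
    window = np.hanning(2 * n)[n:]             # taper the noisy long-lag tail
    spectrum = np.fft.rfft(acf * window).real
    freqs = np.fft.rfftfreq(n, d=dt)
    return freqs, spectrum

# Synthetic trajectory: one oscillating dipole component (hypothetical data).
n_steps, dt = 4096, 0.5                        # e.g. 0.5 fs time step
f0 = 100 / (n_steps * dt)                      # oscillation frequency in cycles per time unit
t = np.arange(n_steps) * dt
mu_traj = np.zeros((n_steps, 3))
mu_traj[:, 0] = np.cos(2 * np.pi * f0 * t)

freqs, spectrum = ir_spectrum_from_dipoles(mu_traj, dt)
peak_freq = freqs[np.argmax(spectrum)]
print("input frequency:", f0, "recovered peak:", peak_freq)
```

With real trajectories the peak widths and shifts come from the dynamics themselves, which is how anharmonic and temperature effects enter the spectrum.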
4. Computational Efficiency and Scaling
DFT methods, and especially AIMD, are computationally intensive: running a nanosecond simulation of a small biomolecule can require thousands of CPU hours. By comparison, MACE4IR, executed on GPU hardware, completes similar simulations in minutes to hours. The computational cost scales linearly, $\mathcal{O}(N)$, in system size, whereas DFT-based approaches exhibit quartic or worse scaling for hybrid functionals. This efficiency enables high-throughput applications, large-molecule simulations, and rapid spectral screening.
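To make the scaling contrast concrete, the arithmetic below compares how cost grows under idealized power laws. It is a back-of-the-envelope sketch: `relative_cost` is a hypothetical helper, and prefactors, hardware differences, and real-world deviations from the asymptotic exponents are ignored.

```python
def relative_cost(size_ratio, exponent):
    """Factor by which cost grows when system size grows by size_ratio, for cost ~ N**exponent."""
    return size_ratio ** exponent

# Growing a system tenfold under linear versus quartic scaling.
linear_growth = relative_cost(10, 1)
quartic_growth = relative_cost(10, 4)
print(f"linear: x{linear_growth}, quartic: x{quartic_growth}")
```

Under these assumptions a tenfold larger molecule costs ten times more with a linear-scaling model and ten thousand times more with a quartic-scaling method, so the gap widens rapidly with system size.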
5. Generality and Scope of Chemical Application
The extensive elemental and molecular coverage of the QCML training set—nearly 80 elements and broad chemical diversity—confers on MACE4IR robust transferability. The model performs well across:
- Elemental variety: Organics, metallo-organics, inorganics, and environmental molecules alike.
- Chemical environments: Different bonding motifs, coordination numbers, and charge states.
- Complex systems: Metal complexes, disordered systems, and molecules relevant for biomolecular and materials science problems.
The model is not restricted to a narrow chemical subspace, unlike most prior MLIPs for spectroscopy, enabling predictive capability across wide-ranging research domains.
6. Practical and Scientific Applications
MACE4IR is positioned to impact several scientific and technological fields owing to its efficiency and accuracy:
- Chemistry: Identification of unknown compounds, assignment of spectral bands, and analysis of reaction mechanisms via computed spectra.
- Materials Science: High-throughput screening of candidate materials, especially amorphous and nanostructured systems. Enables rapid exploration of chemical space in search of novel compounds with spectra-driven property targets.
- Biology: Modeling and assignment of IR spectra for biomolecules, including those too large or diverse for direct DFT treatment; important for protein-ligand interactions and conformational analyses.
- Spectroscopy-Driven Experimentation: Interpretation and deconvolution of complex experimental IR data.
A plausible implication is that the deployment of MACE4IR as a general-purpose molecular foundation model could standardize computational spectral assignment and enable new experimental–theoretical workflows previously inaccessible due to computational constraints.
7. Significance and Future Perspectives
The combination of symmetry-enforced architecture, broad and diverse quantum chemical training, and multifaceted output (energy, force, dipole) establishes MACE4IR as a state-of-the-art solution for rapid, accurate IR spectral prediction. These foundational capabilities suggest that MACE4IR could form the basis for modular expansion, such as coupling with generative design tools, inverse molecular design, or interpretation of experimental high-throughput spectroscopy.
Continued curation of larger, more diverse datasets may further enhance the model’s accuracy and generality. Additionally, integration into active learning and on-the-fly feedback frameworks opens pathways for autonomous discovery in chemistry and materials science where spectroscopic properties are central.
In summary, MACE4IR is a foundation model for molecular IR spectroscopy that unifies equivariant machine learning, extensive quantum chemical training, and physico-chemical interpretability, realizing practical and scientific advances for computational spectroscopy and beyond (Bhatia et al., 26 Aug 2025).