Materials Project Trajectory Dataset
- The MPtrj dataset is a harmonized collection of DFT geometry-optimization trajectories from the Materials Project, facilitating MLIP training and benchmarking at scale.
- It harmonizes trajectories across four exchange-correlation functionals, applying rigorous convergence filtering and attaching comprehensive metadata for reproducible atomistic modeling.
- The dataset supports efficient, large-scale experimentation and fine-tuning of MLIPs, improving performance in relaxation and stability-prediction tasks.
The Materials Project Trajectory (MPtrj) dataset, as integrated and distributed within the LeMat-Traj corpus, comprises a harmonized, high-quality collection of density functional theory (DFT) geometry optimization trajectories sourced from the Materials Project. It facilitates both large-scale training of machine learning interatomic potentials (MLIPs) and the benchmarking and fine-tuning of models for atomistic modeling tasks, with rigorous standards in data formatting, functional coverage, and metadata completeness (Ramlaoui et al., 28 Aug 2025).
1. Dataset Scope and Composition
The MPtrj dataset aggregates all full geometry-optimization trajectories published by the Materials Project under CC-BY-4.0. The subset as included in LeMat-Traj is stratified by exchange-correlation functional, capturing a wide range of crystal chemistries. The table below summarizes its size and breakdown by functional:
| Functional | Trajectories | Atomic Frames |
|---|---|---|
| PBE | 195,721 | 3,649,785 |
| PBESol | 39,981 | 309,873 |
| SCAN | 7,756 | 180,528 |
| r2SCAN | 37,888 | 516,576 |
Trajectories are grouped by chemical formula, and fields such as “chemical_formula” and “elements” permit stratification (e.g., by oxide, battery, or intermetallic classes). Notably, MPtrj features a high representation of oxides (TiO₂, LiFePO₄, etc.) and battery-relevant compounds, providing a chemical counterweight to the bimetallic prevalence in other datasets.
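Since each record carries a “chemical_formula” and an “elements” field, chemistry-based stratification reduces to simple filtering. A minimal sketch, using illustrative stand-in record dicts rather than actual dataset entries:

```python
# Stratify trajectory records by chemistry using the "elements" field.
# These record dicts are illustrative stand-ins for dataset entries.
records = [
    {"chemical_formula": "TiO2", "elements": ["Ti", "O"]},
    {"chemical_formula": "LiFePO4", "elements": ["Li", "Fe", "P", "O"]},
    {"chemical_formula": "NiAl", "elements": ["Ni", "Al"]},
]

def is_oxide(rec):
    """An entry counts as an oxide if oxygen appears among its elements."""
    return "O" in rec["elements"]

oxides = [r["chemical_formula"] for r in records if is_oxide(r)]
```

The same pattern extends to battery-relevant or intermetallic classes by testing for the corresponding element sets.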
2. DFT Calculation Parameters
All MPtrj trajectories originate from full crystal relaxations performed with DFT and harmonize four major exchange-correlation functionals: PBE, PBESol, SCAN, and r2SCAN. Computational parameters are standardized following the Materials Project API: projector augmented-wave (PAW) pseudopotentials, Γ-centered k-point meshes (~25 k-points per Å⁻¹), and a plane-wave cutoff near 520 eV. Rigorous convergence filtering is enforced via:
- Energy difference between final and penultimate step ΔE ≤ 2×10⁻² eV.
- Maximum atomic force at the final step ∥F∥∞ ≤ 0.2 eV/Å.
These thresholds produce trajectories that are “reasonably converged,” yet retain moderate-force frames critical for training force-sensitive MLIPs.
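The two thresholds above can be expressed as a simple per-trajectory predicate. A minimal sketch, assuming a trajectory is summarized by its sequence of total energies and the largest force component at the final step:

```python
def passes_convergence(energies, max_final_force,
                       de_tol=2e-2, force_tol=0.2):
    """Apply the MPtrj convergence filter: the energy change between the
    final and penultimate steps must not exceed de_tol (eV), and the
    largest atomic force component at the final step must not exceed
    force_tol (eV/Å)."""
    if len(energies) < 2:
        return False
    de = abs(energies[-1] - energies[-2])
    return de <= de_tol and max_final_force <= force_tol

# A trajectory within both thresholds is kept ...
keep = passes_convergence([-10.0, -10.5, -10.51], 0.15)
# ... while one with a large residual final force is discarded.
drop = passes_convergence([-10.0, -10.5, -10.51], 0.5)
```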
3. Data Representation and Metadata
Dataset entries adhere to an extended OPTIMADE/JSON schema and are distributed as HuggingFace datasets in a JSON-Lines format, supporting interoperability and efficient downstream usage. Each atomic configuration record encodes:
- atomic_numbers : per-atom atomic numbers Zᵢ
- atomic_positions (Å) : Cartesian coordinates rᵢ
- cell (Å) : lattice matrix
- energy (eV) : total DFT energy
- forces (eV/Å) : per-atom forces Fᵢ, with Fᵢ = −∇ᵢE
- relaxation_step (int) : step in trajectory
- relaxation_number (int) : coarse/fine re-relaxation index
- functional (str) : "PBE", "PBESol", "SCAN", or "r2SCAN"
- source (str) : "MaterialsProject"
- task_id or trajectory_id (str) : original run identifier
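A single frame following this schema can be written out as a plain record. The sketch below uses placeholder values throughout (the identifier, coordinates, and energies are illustrative, not actual dataset entries):

```python
# An illustrative record following the per-frame schema above; all
# numerical values and the task_id are placeholders, not real entries.
frame = {
    "atomic_numbers": [22, 8, 8],           # Ti, O, O
    "atomic_positions": [[0.0, 0.0, 0.0],
                         [1.2, 0.0, 0.0],
                         [0.0, 1.2, 0.0]],  # Å, Cartesian coordinates
    "cell": [[4.6, 0.0, 0.0],
             [0.0, 4.6, 0.0],
             [0.0, 0.0, 3.0]],              # Å, lattice matrix
    "energy": -26.4,                        # eV, total DFT energy
    "forces": [[0.01, 0.0, 0.0],
               [-0.005, 0.0, 0.0],
               [-0.005, 0.0, 0.0]],         # eV/Å, per-atom forces
    "relaxation_step": 7,
    "relaxation_number": 0,
    "functional": "PBE",
    "source": "MaterialsProject",
    "task_id": "mp-000000",                 # placeholder identifier
}
```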
Energies may be recast per-atom, E/N (with N the number of atoms in the cell), to facilitate normalization and analysis across varying cell sizes.
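The per-atom normalization is a one-line helper; `energy_per_atom` below is a hypothetical name used for illustration:

```python
def energy_per_atom(total_energy_ev, atomic_numbers):
    """Normalize a frame's total DFT energy (eV) by its atom count,
    so cells of different sizes become directly comparable."""
    return total_energy_ev / len(atomic_numbers)

# Three-atom cell with a total energy of -26.4 eV.
e_pa = energy_per_atom(-26.4, [22, 8, 8])
```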
4. Quality Filtering and Harmonization
Quality control is multi-layered:
- Frames missing energy or force data are excluded.
- Trajectories with ΔE(final–penultimate) > 2×10⁻² eV are discarded.
- Trajectories with final step ∥F∥∞ > 0.2 eV/Å are discarded.
- All entries are validated against the OPTIMADE schema.
Harmonization ensures all energies (eV), forces (eV/Å), and distances (Å) are consistent across the dataset. Frames are grouped by functional, permitting training of either functional-specific or multi-fidelity MLIP models.
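Because every frame carries a functional tag, building functional-specific splits reduces to a bucketing pass. A minimal sketch with illustrative frame dicts:

```python
from collections import defaultdict

def group_by_functional(frames):
    """Bucket harmonized frames by their functional tag, so either
    functional-specific or multi-fidelity models can draw from the
    appropriate splits."""
    groups = defaultdict(list)
    for frame in frames:
        groups[frame["functional"]].append(frame)
    return dict(groups)

# Illustrative frames; real records carry the full schema.
frames = [{"functional": "PBE"},
          {"functional": "r2SCAN"},
          {"functional": "PBE"}]
splits = group_by_functional(frames)
```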
5. Programmatic Access and Data Layout
MPtrj is accessible directly as HuggingFace datasets and through the LeMaterial-Fetcher Python API. Loading the Materials Project PBE subset, for instance, employs:
```python
from lematerial_fetcher import LeMatFetcher

fetcher = LeMatFetcher()
mptrj_pbe = fetcher.load_split(source="MaterialsProject", functional="PBE")
```
Data organization follows a modular directory structure with Apache Arrow columnar files:
```
LeMat-Traj/
├── PBE/
│   ├── mp_pbe.arrow
│   ├── alexandria_pbe.arrow
│   └── oqmd_pbe.arrow
├── PBESol/
├── SCAN/
└── r2SCAN/
```
This enables efficient memory-mapped I/O for large-scale processing.
6. Practical Integration and Performance Benchmarks
MPtrj facilitates several atomistic modeling workflows:
- On Matbench Discovery benchmarks, a graph neural network (GNN) potential trained solely on MPtrj (PBE) achieves an F1 score ≃ 0.694, compared to 0.575 for high-force-only datasets like OMat24.
- Fine-tuning an OMat24-pretrained model on MPtrj/PBE increases stability-prediction F1 to ≃ 0.772, demonstrating its utility for near-equilibrium refinement.
- A recommended training workflow involves initial pre-training on a broad high-force molecular dynamics or active-learning dataset, followed by fine-tuning on MPtrj, to optimize low-force performance in geometry optimization.
For self-supervised learning, the “relaxation_step” field may be exploited for contrastive or masked-reconstruction objectives along optimization paths.
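One way to exploit the “relaxation_step” field is to pair consecutive frames along a trajectory, e.g. as positive pairs for a contrastive objective. A minimal sketch with illustrative frame dicts (`adjacent_step_pairs` is a hypothetical helper name):

```python
def adjacent_step_pairs(frames):
    """Pair frames that are consecutive along a relaxation trajectory,
    ordered by their relaxation_step; such pairs can serve as positives
    for contrastive or masked-reconstruction objectives."""
    ordered = sorted(frames, key=lambda f: f["relaxation_step"])
    return list(zip(ordered, ordered[1:]))

# Illustrative trajectory with steps arriving out of order.
traj = [{"relaxation_step": 2},
        {"relaxation_step": 0},
        {"relaxation_step": 1}]
pairs = adjacent_step_pairs(traj)
```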
7. Significance and Recommended Practice
The MPtrj subset within LeMat-Traj delivers nearly 5 million converged, low-to-moderate force crystal frames, harmonized in a fully OPTIMADE-compliant schema across four DFT functionals. This design supports both functional-specific and multi-fidelity MLIP development, enabling:
- Fine-tuning of pre-trained models for relaxation tasks.
- Exploration of self-supervised and amortized-optimization algorithms.
- Efficient, large-scale reproducible experimentation leveraging standardized APIs and data representation (Ramlaoui et al., 28 Aug 2025).
A plausible implication is that the harmonized nature and broad functional coverage of MPtrj will facilitate cross-comparative studies of MLIP generalization and transferability across chemical and functional spaces.