Materials Project Trajectory Dataset
- The MPtrj dataset is a harmonized collection of DFT geometry-optimization trajectories from the Materials Project, facilitating MLIP training and benchmarking at scale.
- It harmonizes trajectories across four exchange-correlation functionals, applying rigorous convergence filtering and attaching comprehensive metadata for reproducible atomistic modeling.
- The dataset supports efficient, large-scale experimentation and fine-tuning of MLIPs, improving performance in relaxation and stability-prediction tasks.
The Materials Project Trajectory (MPtrj) dataset, as integrated and distributed within the LeMat-Traj corpus, comprises a harmonized, high-quality collection of density functional theory (DFT) geometry optimization trajectories sourced from the Materials Project. It facilitates both large-scale training of machine learning interatomic potentials (MLIPs) and the benchmarking and fine-tuning of models for atomistic modeling tasks, with rigorous standards in data formatting, functional coverage, and metadata completeness (Ramlaoui et al., 28 Aug 2025).
1. Dataset Scope and Composition
The MPtrj dataset aggregates all full geometry-optimization trajectories published by the Materials Project under CC-BY-4.0. The subset as included in LeMat-Traj is stratified by exchange-correlation functional, capturing a wide range of crystal chemistries. The table below summarizes its size and breakdown by functional:
| Functional | Trajectories | Atomic Frames |
|---|---|---|
| PBE | 195,721 | 3,649,785 |
| PBESol | 39,981 | 309,873 |
| SCAN | 7,756 | 180,528 |
| r2SCAN | 37,888 | 516,576 |
Trajectories are grouped by chemical formula, and fields such as “chemical_formula” and “elements” permit stratification (e.g., by oxide, battery, or intermetallic classes). Notably, MPtrj features a high representation of oxides (TiO₂, LiFePO₄, etc.) and battery-relevant compounds, providing a chemical counterweight to the bimetallic prevalence in other datasets.
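Since each record carries a “chemical_formula” and an “elements” field, chemistry-based stratification reduces to simple filtering. A minimal sketch, using illustrative stand-in record dicts rather than actual dataset entries:

```python
# Stratify trajectory records by chemistry using the "elements" field.
# These record dicts are illustrative stand-ins for dataset entries.
records = [
    {"chemical_formula": "TiO2", "elements": ["Ti", "O"]},
    {"chemical_formula": "LiFePO4", "elements": ["Li", "Fe", "P", "O"]},
    {"chemical_formula": "NiAl", "elements": ["Ni", "Al"]},
]

def is_oxide(rec):
    """An entry counts as an oxide if oxygen appears among its elements."""
    return "O" in rec["elements"]

oxides = [r["chemical_formula"] for r in records if is_oxide(r)]
```

The same pattern extends to battery-relevant or intermetallic classes by testing for the corresponding element sets.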
2. DFT Calculation Parameters
All MPtrj trajectories originate from full crystal relaxations performed with DFT and harmonize four major exchange-correlation functionals: PBE, PBESol, SCAN, and r2SCAN. Computational parameters are standardized following the Materials Project API: projector augmented-wave (PAW) pseudopotentials, Γ-centered k-point meshes (~25 k-points per Å⁻¹), and a plane-wave cutoff near 520 eV. Rigorous convergence filtering is enforced via:
- Energy difference between final and penultimate step ΔE ≤ 2×10⁻² eV.
- Maximum atomic force at the final step ∥F∥∞ ≤ 0.2 eV/Å.
These thresholds produce trajectories that are “reasonably converged,” yet retain moderate-force frames critical for training force-sensitive MLIPs.
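The two thresholds above can be expressed as a simple per-trajectory predicate. A minimal sketch, assuming a trajectory is summarized by its sequence of total energies and the largest force component at the final step:

```python
def passes_convergence(energies, max_final_force,
                       de_tol=2e-2, force_tol=0.2):
    """Apply the MPtrj convergence filter: the energy change between the
    final and penultimate steps must not exceed de_tol (eV), and the
    largest atomic force component at the final step must not exceed
    force_tol (eV/Å)."""
    if len(energies) < 2:
        return False
    de = abs(energies[-1] - energies[-2])
    return de <= de_tol and max_final_force <= force_tol

# A trajectory within both thresholds is kept ...
keep = passes_convergence([-10.0, -10.5, -10.51], 0.15)
# ... while one with a large residual final force is discarded.
drop = passes_convergence([-10.0, -10.5, -10.51], 0.5)
```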
3. Data Representation and Metadata
Dataset entries adhere to an extended OPTIMADE/JSON schema and are distributed as HuggingFace datasets in a JSON-Lines format, supporting interoperability and efficient downstream usage. Each atomic configuration record encodes:
- atomic_numbers : per-atom atomic numbers Zᵢ
- atomic_positions (Å) : Cartesian coordinates rᵢ
- cell (Å) : lattice matrix
- energy (eV) : total DFT energy
- forces (eV/Å) : per-atom forces Fᵢ, with Fᵢ = −∇ᵢE
- relaxation_step (int) : step in trajectory
- relaxation_number (int) : coarse/fine re-relaxation index
- functional (str) : "PBE", "PBESol", "SCAN", or "r2SCAN"
- source (str) : "MaterialsProject"
- task_id or trajectory_id (str) : original run identifier
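A single frame following this schema can be written out as a plain record. The sketch below uses placeholder values throughout (the identifier, coordinates, and energies are illustrative, not actual dataset entries):

```python
# An illustrative record following the per-frame schema above; all
# numerical values and the task_id are placeholders, not real entries.
frame = {
    "atomic_numbers": [22, 8, 8],           # Ti, O, O
    "atomic_positions": [[0.0, 0.0, 0.0],
                         [1.2, 0.0, 0.0],
                         [0.0, 1.2, 0.0]],  # Å, Cartesian coordinates
    "cell": [[4.6, 0.0, 0.0],
             [0.0, 4.6, 0.0],
             [0.0, 0.0, 3.0]],              # Å, lattice matrix
    "energy": -26.4,                        # eV, total DFT energy
    "forces": [[0.01, 0.0, 0.0],
               [-0.005, 0.0, 0.0],
               [-0.005, 0.0, 0.0]],         # eV/Å, per-atom forces
    "relaxation_step": 7,
    "relaxation_number": 0,
    "functional": "PBE",
    "source": "MaterialsProject",
    "task_id": "mp-000000",                 # placeholder identifier
}
```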
Energies may be recast per-atom, E/N (with N the number of atoms in the cell), to facilitate normalization and analysis across varying cell sizes.
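The per-atom normalization is a one-line helper; `energy_per_atom` below is a hypothetical name used for illustration:

```python
def energy_per_atom(total_energy_ev, atomic_numbers):
    """Normalize a frame's total DFT energy (eV) by its atom count,
    so cells of different sizes become directly comparable."""
    return total_energy_ev / len(atomic_numbers)

# Three-atom cell with a total energy of -26.4 eV.
e_pa = energy_per_atom(-26.4, [22, 8, 8])
```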
4. Quality Filtering and Harmonization
Quality control is multi-layered:
- Frames missing energy or force data are excluded.
- Trajectories with ΔE(final–penultimate) > 2×10⁻² eV are discarded.
- Trajectories with final step ∥F∥∞ > 0.2 eV/Å are discarded.
- All entries are validated against the OPTIMADE schema.
Harmonization ensures all energies (eV), forces (eV/Å), and distances (Å) are consistent across the dataset. Frames are grouped by functional, permitting training of either functional-specific or multi-fidelity MLIP models.
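Because every frame carries a functional tag, building functional-specific splits reduces to a bucketing pass. A minimal sketch with illustrative frame dicts:

```python
from collections import defaultdict

def group_by_functional(frames):
    """Bucket harmonized frames by their functional tag, so either
    functional-specific or multi-fidelity models can draw from the
    appropriate splits."""
    groups = defaultdict(list)
    for frame in frames:
        groups[frame["functional"]].append(frame)
    return dict(groups)

# Illustrative frames; real records carry the full schema.
frames = [{"functional": "PBE"},
          {"functional": "r2SCAN"},
          {"functional": "PBE"}]
splits = group_by_functional(frames)
```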
5. Programmatic Access and Data Layout
MPtrj is accessible directly as HuggingFace datasets and through the LeMaterial-Fetcher Python API. Loading the Materials Project PBE subset, for instance, employs:
```python
from lematerial_fetcher import LeMatFetcher

fetcher = LeMatFetcher()
mptrj_pbe = fetcher.load_split(source="MaterialsProject", functional="PBE")
```
Data organization follows a modular directory structure with Apache Arrow columnar files:
```
LeMat-Traj/
├── PBE/
│   ├── mp_pbe.arrow
│   ├── alexandria_pbe.arrow
│   └── oqmd_pbe.arrow
├── PBESol/
├── SCAN/
└── r2SCAN/
```
This enables efficient memory-mapped I/O for large-scale processing.
6. Practical Integration and Performance Benchmarks
MPtrj facilitates several atomistic modeling workflows:
- On Matbench Discovery benchmarks, a graph neural network (GNN) potential trained solely on MPtrj (PBE) achieves an F1 score ≃ 0.694, compared to 0.575 for high-force-only datasets like OMat24.
- Fine-tuning an OMat24-pretrained model on MPtrj/PBE increases stability-prediction F1 to ≃ 0.772, demonstrating its utility for near-equilibrium refinement.
- A recommended training workflow involves initial pre-training on a broad high-force molecular dynamics or active-learning dataset, followed by fine-tuning on MPtrj, to optimize low-force performance in geometry optimization.
For self-supervised learning, the “relaxation_step” field may be exploited for contrastive or masked-reconstruction objectives along optimization paths.
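One way to exploit the “relaxation_step” field is to pair consecutive frames along a trajectory, e.g. as positive pairs for a contrastive objective. A minimal sketch with illustrative frame dicts (`adjacent_step_pairs` is a hypothetical helper name):

```python
def adjacent_step_pairs(frames):
    """Pair frames that are consecutive along a relaxation trajectory,
    ordered by their relaxation_step; such pairs can serve as positives
    for contrastive or masked-reconstruction objectives."""
    ordered = sorted(frames, key=lambda f: f["relaxation_step"])
    return list(zip(ordered, ordered[1:]))

# Illustrative trajectory with steps arriving out of order.
traj = [{"relaxation_step": 2},
        {"relaxation_step": 0},
        {"relaxation_step": 1}]
pairs = adjacent_step_pairs(traj)
```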
7. Significance and Recommended Practice
The MPtrj subset within LeMat-Traj delivers nearly 5 million converged, low-to-moderate force crystal frames, harmonized in a fully OPTIMADE-compliant schema across four DFT functionals. This design supports both functional-specific and multi-fidelity MLIP development, enabling:
- Fine-tuning of pre-trained models for relaxation tasks.
- Exploration of self-supervised and amortized-optimization algorithms.
- Efficient, large-scale reproducible experimentation leveraging standardized APIs and data representation (Ramlaoui et al., 28 Aug 2025).
A plausible implication is that the harmonized nature and broad functional coverage of MPtrj will facilitate cross-comparative studies of MLIP generalization and transferability across chemical and functional spaces.