
JARVIS-Leaderboard: Materials Design Benchmark

Updated 28 November 2025
  • JARVIS-Leaderboard is a large-scale, open benchmarking platform that standardizes evaluation processes for reproducible data-driven materials design.
  • It integrates diverse methodologies including AI, electronic structure, force fields, quantum computation, and experimental techniques with a rigorous, automated submission workflow.
  • The platform enforces strict submission requirements and CI-driven re-computation to provide transparent, community-verified performance comparisons across various materials classes.

The JARVIS-Leaderboard is a large-scale, open, and extensible benchmarking platform targeting reproducibility, fair comparison, and accelerated progress in data-driven materials design. Conceived and maintained as part of the Joint Automated Repository for Various Integrated Simulations (JARVIS) infrastructure at NIST, the JARVIS-Leaderboard supports benchmarking of artificial intelligence, electronic structure, force-field, quantum computation, and experimental methods across diverse materials classes and data modalities. It offers a rigorously standardized, community-driven evaluation ecosystem, underpinned by transparent data, code, automated scoring, and versioned comparisons, enabling reliable performance visibility and fostering best practices in the computational materials science community (Choudhary et al., 2023, Wines et al., 2023).

1. Motivation, Scope, and Objectives

The primary motivation for JARVIS-Leaderboard is the acute reproducibility crisis in materials science, where surveys report that more than 70% of published results cannot be reproduced. The field's inherent diversity, spanning quantum methods, ML/AI, force fields, and experimental techniques, complicates comparison and fosters methodological silos. Without unified benchmarks, progress assessment and SOTA identification become intractable, encouraging method overfitting and a lack of transparency.

JARVIS-Leaderboard addresses these issues by:

  • Providing a common platform for benchmarking disparate methods and data modalities side-by-side.
  • Enforcing rigorous standards for submission (reference DOIs, split protocols, traceable metadata, reproducible scripts) to maximize reproducibility.
  • Allowing open, community-driven expansion through GitHub workflows, continuous integration, and open data publication (Choudhary et al., 2023, Wines et al., 2023).

2. System Architecture and Submission Workflow

JARVIS-Leaderboard's software infrastructure is hosted on GitHub (usnistgov/jarvis_leaderboard) and organized into two principal directories: “benchmarks/” (reference datasets and task definitions) and “contributions/” (user- or team-submitted prediction results, metadata, and scripts). All datasets are curated via JARVIS-Tools and hosted externally (e.g., on Figshare) for stable access.

The submission workflow is as follows (a minimal packaging sketch for step 3 appears after the list):

  1. Download the benchmark definition via jarvis_populate_data.py and retrieve entry IDs and target values.
  2. Run the candidate method on the specified data split to generate predictions.
  3. Package results as results.csv.zip and record method/hardware/software provenance in metadata.json; optional Docker or YAML for environment reproducibility.
  4. Locally validate results with jarvis_server.py.
  5. Submit via jarvis_upload.py, which forks the repo, runs automated CI checks, and initiates a pull request.
  6. Upon administrator validation, the submission is merged and the leaderboard website is automatically rebuilt with updated results (Choudhary et al., 2023).
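As a sketch of step 3, the snippet below packages predictions into results.csv.zip and writes a minimal metadata.json. The column names and metadata fields shown are assumptions for illustration; the authoritative schema is defined in the usnistgov/jarvis_leaderboard repository.

```python
import json
import zipfile

import pandas as pd

# Hypothetical predictions keyed by benchmark entry ID (schema assumed;
# consult the benchmark definition for the authoritative column names).
predictions = {"JVASP-1002": 1.12, "JVASP-816": 0.03}

# Write predictions as results.csv and compress to results.csv.zip.
df = pd.DataFrame(
    {"id": list(predictions.keys()), "prediction": list(predictions.values())}
)
df.to_csv("results.csv", index=False)
with zipfile.ZipFile("results.csv.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("results.csv")

# Record method/hardware/software provenance (illustrative fields only).
metadata = {
    "model_name": "my_alignn_variant",       # hypothetical method name
    "team_name": "example_team",             # placeholder
    "date_submitted": "2025-11-28",
    "software_used": "python-3.11, alignn",  # assumed provenance fields
    "hardware_used": "1x A100 GPU",
}
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Local validation with jarvis_server.py and upload via jarvis_upload.py then proceed as in steps 4 and 5.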

The pipeline is engineered for transparent, auditable execution with CI-driven re-computation of metrics on every codebase update.

3. Benchmark Categories, Data Modalities, and Task Types

JARVIS-Leaderboard encompasses five major benchmarking domains:

  • AI: Structure-to-property regression/classification from atomic structures (JARVIS-DFT 3D, QM9), atomistic images (STEM/STM), spectra, and scientific text. Methods include descriptor-based ML (CFID, MagPie, MatMiner), GNNs (ALIGNN, CGCNN, CHGNet, M3GNET), and LLMs (ChemNLP, OPT, GPT, T5).
  • Electronic Structure (ES): DFT (multiple functionals), many-body perturbation theory (GW₀, G₀W₀), quantum Monte Carlo, and tight-binding. Properties benchmarked include formation energies, band gaps, elastic moduli, phonon/optical spectra, superconducting critical temperatures, adsorption energies, and dielectric constants.
  • Force Fields (FF): Both classical (LJ, EAM, REBO, AMBER, CHARMM) and MLFFs (DeepMD, SNAP, ALIGNN-FF, M3GNET). Test quantities include energies, forces, stress tensors, mechanical moduli, adsorption isotherms, and free energy surfaces.
  • Quantum Computation (QC): Quantum algorithms (VQE, VQD) on Wannier- or DFT-derived Hamiltonians, compared against analytic/classical values; measured by eigenvalue error (a toy example follows below), circuit depth/gate count, and simulation fidelity.
  • Experiments (EXP): Inter-laboratory round-robin measurements (e.g., CO₂ isotherms on ZSM-5, XRD, magnetometry, spectroscopy) to establish systematic baseline variability and to enable cross-comparison with computational predictions (Choudhary et al., 2023, Wines et al., 2023).

Sub-categories further specify the modality, e.g., SinglePropertyPrediction, ImageClass, EigenSolver.
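As a concrete toy instance of the QC figures of merit, the sketch below compares a hypothetical variational eigenvalue estimate against exact diagonalization of a small Hermitian Hamiltonian; the matrix and the VQE value are invented for illustration and are not from the platform's benchmarks.

```python
import numpy as np

# Toy 2x2 Hermitian Hamiltonian (a single-band Wannier-style model;
# values are illustrative only).
H = np.array([[0.0, 0.5],
              [0.5, 1.0]])

exact_ground = np.linalg.eigvalsh(H)[0]  # classical reference eigenvalue

vqe_estimate = -0.2050                   # hypothetical VQE output
eigenvalue_error = abs(vqe_estimate - exact_ground)
print(f"eigenvalue error: {eigenvalue_error:.4f} (exact: {exact_ground:.4f})")
```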

4. Evaluation Metrics and Scoring Protocols

JARVIS-Leaderboard mandates standardized evaluation criteria:

  • Regression: Mean Absolute Error (MAE) is the canonical metric:

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

This is supplemented by RMSE, MSE, Pearson $r$, and, for multi-output tasks, Multi-MAE (binwise or vector metrics).

  • Classification: Accuracy, $F_1$, precision, and recall.
  • Text generation: ROUGE metrics.
  • Difficulty-normalized assessment: The ratio $\mathrm{MAD}/\mathrm{MAE}$, where MAD is the mean absolute deviation of the targets from their mean, contextualizes a score against a trivial baseline: a model that always predicts the dataset mean achieves $\mathrm{MAD}/\mathrm{MAE} = 1$, so higher values indicate genuine predictive skill (see the sketch after this list).
  • Experimental/quantum tasks: Task-specific figures of merit (e.g., eigenvalue error, fidelity, reproducibility statistics).
  • Submission requirements: Results must match test split IDs and adhere to evaluation schema for each benchmark. Scoring is automated and version-controlled on every leaderboard update (Choudhary et al., 2023, Wines et al., 2023).
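The regression metrics above are straightforward to reproduce. The following minimal numpy sketch (illustrative names, not the platform's internal scoring code) computes MAE and the difficulty-normalized MAD/MAE ratio on toy data.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mad_over_mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Difficulty-normalized score: MAD of the targets divided by model MAE.

    A model that always predicts the dataset mean scores exactly 1.0;
    larger values indicate skill beyond that trivial baseline.
    """
    mad = float(np.mean(np.abs(y_true - y_true.mean())))
    return mad / mae(y_true, y_pred)

# Toy band-gap-style targets and predictions (eV); values are illustrative.
y = np.array([0.0, 1.2, 3.4, 0.8, 2.1])
y_hat = np.array([0.1, 1.0, 3.0, 1.1, 2.4])
print(f"MAE = {mae(y, y_hat):.3f} eV, MAD/MAE = {mad_over_mae(y, y_hat):.2f}")
```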

5. Representative Results and Methodological Landscape

JARVIS-Leaderboard curates and presents a wide spectrum of baseline and SOTA results:

  • Band gap (MatBench): CFID MAE ≈ 0.45 eV, ALIGNN ≈ 0.19 eV.
  • Phonon DOS: ALIGNN MAE < 0.086 for 78% of the test set; derived $C_V$ and $S_{vib}$ within 3% of DFT.
  • Superconducting $T_C$: ALIGNN direct MAE ≈ 2.0 K; via the Eliashberg function, MAE ≈ 1.4 K.
  • Force-prediction (Si): MLFFs (ALIGNN-FF, M3GNET) MAE ≈ 0.05 eV/Å; classical ≈ 0.5 eV/Å.
  • Tight-binding total energies: ThreeBodyTB RMSE ≈ 0.05 eV/atom (Wines et al., 2023).

Leaderboard visualizations enable direct, controlled comparison between classical, quantum, ML, and experimental methods—facilitating SOTA identification and highlighting regimes where different paradigms excel.

6. Technical Extensions and Integration with Automated Leaderboards

JARVIS-Leaderboard can draw on methodologies developed for NLP and computer-vision leaderboard automation, leveraging extraction, ranking, and validation at scale:

  • Automated extraction (TDMS-IE): Frameworks for parsing tasks, datasets, metrics, and numeric scores from literature, using learned PDF-to-triple classification, BERT-based models, and self-attention mechanisms, can feed into JARVIS-style microservices for high-throughput leaderboard population (Hou et al., 2019).
  • Comparative table extraction and graph-based ranking: Techniques from performance-improvement graph construction yield robust leaderboards from heterogeneous literature sources, adapted for the materials domain by extending parser vocabularies and handling domain-specific units and multi-level headers. Graph-theoretic rankers (PageRank, Elo, TrueSkill) provide statistically principled rankings from noisy, cross-paper comparisons (Singh et al., 2018).
  • Leaderboard robustness: Adaptive mechanisms such as the Ladder algorithm provide theoretical guarantees against leaderboard leakage and over-submission bias, with both fixed-step and parameter-free variants that can be implemented directly in continuous benchmarking scenarios (Blum et al., 2015); a minimal sketch follows this list.
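To illustrate the robustness mechanism above, here is a minimal sketch of the fixed-step Ladder variant from Blum et al. (2015). The step size eta and the toy loss sequence are illustrative assumptions; the paper's parameter-free variant adapts the step size automatically.

```python
class Ladder:
    """Fixed-step Ladder mechanism (Blum et al., 2015), a sketch.

    Releases a new public score only when a submission improves on the
    best released score by more than a step size eta, and rounds the
    released value to that precision, limiting leaderboard leakage.
    """

    def __init__(self, eta: float = 0.01):
        self.eta = eta
        self.best = float("inf")  # best released loss so far

    def submit(self, empirical_loss: float) -> float:
        """Return the publicly displayed loss for this submission."""
        if empirical_loss < self.best - self.eta:
            # Round to the eta grid before releasing.
            self.best = round(empirical_loss / self.eta) * self.eta
        return self.best

# Usage: repeated near-identical submissions cannot ratchet the score down.
board = Ladder(eta=0.01)
for loss in [0.250, 0.249, 0.241, 0.238, 0.220]:
    print(f"submitted {loss:.3f} -> displayed {board.submit(loss):.3f}")
```

Because near-duplicate resubmissions do not move the displayed score, repeated probing of the test split yields no information beyond the eta precision.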

7. Impact, Current Status, and Outlook

As of the reporting date, JARVIS-Leaderboard hosts 1281 contributions to 274 benchmarks, spanning 152 methods and accumulating over 8.7 million evaluated predictions. Community engagement is operationalized via pull requests, with automated scoring and leaderboard regeneration. The platform's open, extensible design enables accommodation of new data modalities (video, 4D tomography), emerging method classes (e.g., hardware QC), and richer metrics (e.g., uncertainty, compute cost).

Integration with broader scientific workflows (e.g., JARVIS-Tools, AFLOW, Materials Project), as well as educational deployments in workshops and tutorials, positions JARVIS-Leaderboard as an indispensable resource for reproducible, transparent, and systematic materials-method benchmarking (Choudhary et al., 2023, Wines et al., 2023).

Future directions include expansion to mesoscale and multi-physics tasks, tighter coupling with automated leaderboards from other domains, and increased support for uncertainty quantification and interpretability metrics.


References:

  • (Choudhary et al., 2023) JARVIS-Leaderboard: A Large Scale Benchmark of Materials Design Methods
  • (Wines et al., 2023) Recent progress in the JARVIS infrastructure for next-generation data-driven materials design
  • (Hou et al., 2019) Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction
  • (Singh et al., 2018) Automated Early Leaderboard Generation From Comparative Tables
  • (Blum et al., 2015) The Ladder: A Reliable Leaderboard for Machine Learning Competitions