
High-Throughput Theoretical Screening

Updated 17 April 2026
  • High-throughput theoretical screening is a systematic computational approach that rapidly evaluates extensive candidate libraries of compounds, materials, or molecular structures using automated workflows and ML models.
  • It integrates multi-fidelity simulations, stability assessments, and property evaluations to narrow down thousands of candidates to a manageable list for experimental validation.
  • It underpins innovations across energy, catalysis, electronics, and drug discovery by delivering efficient, scalable, and targeted candidate selection.

High-throughput theoretical screening (HTS) is a systematic, computational approach for evaluating large libraries of candidate compounds, materials, or molecular structures to identify those with optimal or novel target properties. Utilizing advanced simulation workflows, machine learning models, statistical algorithms, and automated job management, HTS enables exploration of vast chemical and structural spaces at scales and speeds unachievable by experiment alone. Its applications encompass materials discovery for energy, catalysis, electronics, drug development, and beyond, with the goal of rapidly narrowing down tens of thousands or more possibilities to a tractable list for costly experimental validation.

1. Concept and Motivation

HTS addresses the combinatorial explosion inherent in materials and molecular design. Traditional sequential experimental or computational searches are infeasible for exploring the immense phase spaces associated with multi-component compounds, intercalants, or organic libraries. The key insight is that by combining automated workflow management, rigorous property prediction (e.g., via DFT, DFPT, Boltzmann transport, or ML surrogates), and application-specific filters, one can rapidly down-select from thousands or even millions of hypothetical candidates to a tractable list of experimental targets, with systematic control over stability and functional criteria (Sahni et al., 2019, Hong et al., 2020, Jia et al., 2019, Chen et al., 2024).

In drug discovery and chemical biology, HTS originally referred to experimental screening in multiwell formats. Theoretical HTS replaces or augments this with computational modeling—using first-principles calculations, machine learning, or hybrid workflows for rapid scoring of structural libraries (Gurbych et al., 2020, Shterev et al., 2017, Smucker et al., 18 Feb 2026).

2. General Workflow Components

A canonical theoretical HTS pipeline comprises several stages. The specifics vary by domain but most pipelines include:

  1. Compound Enumeration and Structure Generation: Enumeration of all possible compositions or topologies, e.g., all A–B–C element combinations and site orderings in Half-Heuslers (Sahni et al., 2019), or high-symmetry templates in quaternaries (Hong et al., 2020).
  2. Initial Structural Relaxation and Filtering: Geometry optimization via DFT or force fields, followed by elimination filters based on criteria such as energy-above-hull, bandgap, and symmetry (Qu et al., 2019, Garcia et al., 2019).
  3. Stability Assessment: Quantification of chemical stability (formation enthalpy, ΔE_F), and dynamic/thermal stability (phonons, imaginary modes, entropy corrections). For disordered alloys, ensemble or Boltzmann-weighted averages over representative supercells are used (Garcia et al., 2019).
  4. Property Evaluation: Application-specific calculations (band structures, thermoelectric figure of merit ZT, optical absorption, SHC, electron–phonon coupling, etc.) using hierarchical methods (standard DFT, hybrid functionals, GW corrections, statistical modeling, or ML regression), depending on required accuracy and throughput (Sahni et al., 2019, Qu et al., 2019, Li et al., 16 Sep 2025, Baumann et al., 11 Mar 2026).
  5. Application-Driven Filtering and Ranking: Threshold-based selection on target figures of merit (e.g., ZT > 0.7 for thermoelectrics, bandgap ranges for photovoltaic or transparent conductor applications, SLME > 20% for solar absorbers, λ_Γ > 0.1 for superconductors) (Sahni et al., 2019, Chen et al., 2024).
  6. Output and Curation: Compilation of ranked candidate lists annotated with stability, property, and potential application metadata for downstream validation and synthesis (Sahni et al., 2019, Qu et al., 2019).
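The filter-and-rank pattern running through stages 2–5 can be sketched as a minimal cascade. All property values, thresholds, and candidate records below are illustrative placeholders, not data from the cited studies:

```python
# Minimal sketch of the staged filter-and-rank pattern described above.
# The thresholds mirror the kinds of criteria in stages 2-5 (stability,
# bandgap window, figure of merit) but the numbers are invented.

def screen(candidates, max_e_hull=0.05, gap_window=(1.0, 1.8), min_zt=0.7):
    """Apply eliminative filters in sequence, then rank survivors by ZT."""
    stable = [c for c in candidates if c["e_above_hull"] <= max_e_hull]
    gapped = [c for c in stable if gap_window[0] <= c["band_gap"] <= gap_window[1]]
    hits = [c for c in gapped if c["zt"] > min_zt]
    return sorted(hits, key=lambda c: c["zt"], reverse=True)

library = [
    {"formula": "ABC-1", "e_above_hull": 0.00, "band_gap": 1.3, "zt": 1.1},
    {"formula": "ABC-2", "e_above_hull": 0.12, "band_gap": 1.4, "zt": 1.5},  # unstable
    {"formula": "ABC-3", "e_above_hull": 0.02, "band_gap": 2.5, "zt": 0.9},  # gap too wide
    {"formula": "ABC-4", "e_above_hull": 0.01, "band_gap": 1.1, "zt": 0.8},
]
ranked = screen(library)
print([c["formula"] for c in ranked])  # stable, in-gap candidates, best ZT first
```

Real pipelines attach many more filters and metadata fields, but the structure — cheap eliminative checks first, ranking on the target figure of merit last — is the same.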

3. Methodological Innovations

Multi-Fidelity and Adaptive Workflows

Resource allocation is optimized using multi-fidelity pipelines: low-cost, approximate methods are applied for initial filtering, while high-accuracy (but expensive) methods are reserved for shortlisted candidates. Frameworks for optimizing the trade-off between accuracy and computational cost have been mathematically formalized using metrics such as return-on-computational-investment (ROCI), with pipeline thresholds set adaptively according to performance and computational budget (Woo et al., 2021).
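The staged-cost logic can be made concrete with a small accounting sketch; the per-candidate unit costs (in CPU-hours) and per-stage survival fractions below are invented for illustration:

```python
# Illustrative cost accounting for a three-stage multi-fidelity pipeline.
# Unit costs and survival fractions are invented for the example.

def pipeline_cost(n, stages):
    """stages: list of (cost_per_candidate, survival_fraction) tuples."""
    total = 0.0
    for cost, survive in stages:
        total += n * cost            # pay for every candidate entering the stage
        n = int(n * survive)         # only a fraction advances to the next stage
    return total, n

# cheap filter -> mid-cost refinement -> expensive final characterization
staged_cost, finalists = pipeline_cost(10_000, [(1, 0.10), (50, 0.20), (500, 1.0)])
flat_cost = 10_000 * 500             # highest-fidelity method on everything
print(staged_cost, finalists, flat_cost / staged_cost)
```

Here staging cuts the bill by more than an order of magnitude while still giving every finalist the full high-fidelity treatment, which is the trade-off the ROCI-style metrics quantify.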

Pipeline Example:

Stage     Method    Cost        Purpose
Stage 1   GGA-DFT   Low         Bulk filtering, band structure
Stage 2   HSE06     High        Bandgap correction
Stage 3   G₀W₀      Very high   Final gap, band inversion, SLME

(Adapted from (Sahni et al., 2019))

Combinatorial and Statistical Design

HTS in combinatorial chemistry and drug discovery leverages advanced statistical frameworks:

  • Supersaturated Experimental Designs: Matrix construction techniques (e.g., CRowS) optimize the aliasing structure of pooling designs, maintaining statistical power even with severe well-row constraints, and analyzed via sparse regression (Lasso) (Smucker et al., 2024).
  • Pooling and Group Testing: Classical group-testing theory (e.g., balanced random pools, Dorfman design) is adapted to maximize throughput and identification of active compounds under cost constraints (Smucker et al., 18 Feb 2026).
  • Bayesian Hierarchical Models: Plate-to-plate variation is adaptively shared using nonparametric Dirichlet processes, yielding robust hit probabilities and automatic false discovery rate (FDR) control (Shterev et al., 2017).
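The pooling-plus-sparse-regression idea above can be illustrated with a toy pooled screen: each well pools a random subset of compounds, the readout sums the pooled actives' effects plus noise, and a Lasso fit recovers which compounds are active. A minimal ISTA solver stands in for the Lasso/Bayesian machinery of the cited works; the design, effect sizes, and noise level are all invented:

```python
import numpy as np

# Toy pooled screen: rows of X are wells (random 0/1 pools of compounds),
# y is each well's readout. Three compounds are truly active.
rng = np.random.default_rng(0)
n_wells, n_compounds = 48, 96
X = rng.integers(0, 2, size=(n_wells, n_compounds)).astype(float)
beta_true = np.zeros(n_compounds)
beta_true[[3, 47, 81]] = [2.0, 1.5, 2.5]          # the active compounds
y = X @ beta_true + 0.05 * rng.standard_normal(n_wells)

# Centring decorrelates the 0/1 pooling columns before sparse regression.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def lasso_ista(A, b, lam=1.0, n_iter=3000):
    """Iterative soft-thresholding for min 0.5*||A w - b||^2 + lam*||w||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / Lipschitz constant
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = w - step * A.T @ (A @ w - b)
        w = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
    return w

beta_hat = lasso_ista(Xc, yc)
hits = np.flatnonzero(np.abs(beta_hat) > 0.5)
print(sorted(hits.tolist()))                       # indices flagged as active
```

With far fewer wells than compounds (48 vs. 96), the sparse fit still isolates the three actives, which is exactly the throughput gain pooling designs are built to exploit.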

Proxy and Descriptor-Based Screening

Fast-to-compute proxies are adopted when direct ab initio property calculation is too expensive:

  • Magnetostructural Proxy: Root-mean-square lattice deformation under spin-polarization for mapping magnetocaloric effect strength (Garcia et al., 2019).
  • ML Bond-Length Surrogates: Gradient-boosted tree models for bond-length prediction, rapidly estimating volume change on intercalation (ΔV) for millions of battery materials, enabling up to 8× reduction in full DFT workload (Baumann et al., 11 Mar 2026).
  • Descriptor-Driven Thermoelectric Screening: Analytic descriptors (χ from effective masses and deformation potentials, γ from elastic moduli) correlate with power factor without solving full transport or phonon equations (Jia et al., 2019).
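The surrogate-triage pattern behind these proxies can be sketched in a few lines: fit a cheap model on a small labelled subset (standing in for DFT-computed volume changes), rank the unlabelled pool by prediction, and send only the most promising fraction to full DFT. A linear least-squares model stands in here for the gradient-boosted trees of the cited work, and the features and labels are synthetic:

```python
import numpy as np

# Surrogate-based triage sketch. Features play the role of compositional
# descriptors; dV plays the role of the DFT-computed volume change on
# intercalation. All values are synthetic.
rng = np.random.default_rng(1)
n_pool, n_labelled, n_feat = 5000, 200, 6
features = rng.standard_normal((n_pool, n_feat))
w_true = np.array([1.5, -2.0, 0.5, 0.0, 0.8, -1.2])
dV = features @ w_true + 0.1 * rng.standard_normal(n_pool)

# Fit the cheap surrogate on the small "DFT-labelled" subset only.
train = slice(0, n_labelled)
w_fit, *_ = np.linalg.lstsq(features[train], dV[train], rcond=None)
pred = features @ w_fit

# Keep the eighth of the pool with the lowest predicted |dV| for full DFT,
# mirroring the ~8x reduction in DFT workload reported for the ML surrogate.
shortlist = np.argsort(np.abs(pred))[: n_pool // 8]
print(len(shortlist))
```

The shortlist is an eighth the size of the pool, yet (because the surrogate ranks well) it concentrates the genuinely low-strain candidates that full DFT then verifies.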

4. Application Domains and Case Studies

Selected representative applications demonstrate the breadth of HTS:

  • Renewable Energy Materials: Screening of 8-electron Half-Heusler alloys using a staged GGA/HSE06/G₀W₀ approach for thermoelectric, solar harvesting, and topological properties: out of 960 candidates, 121 are thermodynamically and dynamically stable, with several leading to record ZT and SLME values (Sahni et al., 2019).
  • Thermoelectric Compounds: ∼3,100 real and hypothetical X₂YZM₄ quaternaries screened for ZT, with band and phonon descriptors guiding optimization and uncovering quasi-Dirac/heavy-fermion and phonon-hybridization motifs as drivers of high performance (Hong et al., 2020).
  • Dielectric Materials: Crystal structure prediction (USPEX) + DFT/DFPT screening delivers 33 ternary oxides with ε_poly ≥ 30 and E_g > 3 eV, surpassing industry paradigms such as BaTiO₃ and HfO₂ (Qu et al., 2019).
  • 2D Spintronics: Workflow coupling VASP, Wannier90, and WannierBerri for SHC yields hundreds of high-SHC/topological insulators mapped in the 4486-compound 2Dmatpedia space, with mirror-plane symmetry identified as a key enhancer (Li et al., 16 Sep 2025).
  • Superconductors: Cascaded filters—composition, stability, metallicity, frozen-phonon EPC—deliver a concise set of 23 new/known boride superconductors; full Eliashberg function and T_c predictions confirm and extend existing experimental findings (Chen et al., 2024).
  • Battery Electrode Design: Screening ∼1.17 million transition-metal oxides/fluorides for low volume change on (de)intercalation with ML bond-length surrogates, enabling identification of 287 low-strain candidates verified by DFT (Baumann et al., 11 Mar 2026).
  • Polymer and Organic Materials: MPNNs trained on extended OPV-relevant datasets accelerate prediction of quantum properties for polymers, achieving chemical-accuracy MAEs without explicit 3D structure inputs (John et al., 2018).
  • Drug Discovery and Protein–Ligand Affinity: ML models (CatBoost, GANN, BERT) trained on ≈350,000 protein–ligand pairs predict K_i with MAEs of 0.51–0.60 in log₁₀ K_i, enabling rapid virtual hit identification at structure-free level (Gurbych et al., 2020).

5. Performance, Scalability, and Computational Considerations

HTS success depends on leveraging cost-efficient simulation strategies and scalable algorithms:

  • Job Management: Automated orchestration across HPC clusters, with modular pipelines for structure generation, input preparation, error checking, and post-processing (often via Python tools such as pymatgen, custodian, and CROW).
  • Approximate Methods: Restricting expensive calculations (e.g., GW, HSE06, full Brillouin-zone EPC) to pre-filtered candidates; using ML surrogates, analytical descriptors, or fast proxies to triage the library early (Sahni et al., 2019, Baumann et al., 11 Mar 2026).
  • Statistical Model Scalability: Hierarchical Bayes models and penalized regression (Lasso) scale linearly in number of screened wells or compounds, with careful truncation levels (H, K) for mixture components ensuring computational tractability even for millions of candidates (Shterev et al., 2017, Smucker et al., 2024).
  • Acceleration Examples: ML bond-length surrogates reduce DFT relaxations required by up to 8× (Baumann et al., 11 Mar 2026); SeA hybrid DFT methods yield 8×–80× speedups over conventional EXX for potentials and ML-force field generation (Ko et al., 2022).
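The job-management pattern above — fan candidate jobs out across workers with per-job error capture, so one failed calculation does not stall the campaign — can be sketched with the standard library. In a real pipeline the worker would launch and monitor a DFT code (e.g., via pymatgen/custodian); here it is a stub, and a "!" in the formula stands in for a job that crashes:

```python
from concurrent.futures import ThreadPoolExecutor

def run_candidate(formula):
    """Stub worker; a real one would run and parse a DFT calculation."""
    if "!" in formula:                 # stand-in for a crashed calculation
        raise RuntimeError(f"calculation failed for {formula}")
    return {"formula": formula, "status": "ok"}

def screen_all(formulas, max_workers=4):
    """Submit all jobs, collect successes, and record failures separately."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [(f, pool.submit(run_candidate, f)) for f in formulas]
        for formula, future in futures:
            try:
                results.append(future.result())
            except RuntimeError:
                failures.append(formula)   # a real workflow would log and requeue
    return results, failures

ok, failed = screen_all(["NaCl", "Mg2Si!", "TiO2"])
print(len(ok), failed)
```

The essential design choice is that errors are captured per future rather than allowed to propagate, so the campaign always produces a complete partition of the library into finished and failed jobs.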

6. Statistical, Machine Learning, and Design Aspects

HTS increasingly integrates contemporary statistical learning for both design and analysis:

  • Design Optimization: Supersaturated designs (SSDs) and other factorial/pooling designs are optimized for information content (minimum UE(s²)), balancing throughput and resource constraints (e.g., per-well limits in biological HTS) (Smucker et al., 2024).
  • Hit Identification and FDR Control: Bayesian posterior inference, penalized regression (elastic net, λ-specific Lasso), and cross-validation are standard in modern HTS pipelines for robust hit calling and error rate control (Shterev et al., 2017, Smucker et al., 18 Feb 2026).
  • Active Learning and Transfer: Surrogate models trained on initial DFT or experimental data can efficiently guide successive searches; pretraining on correlated properties accelerates learning curves in chemical property prediction (John et al., 2018).
  • Descriptor Engineering: Domain-informed descriptors (electrical, phononic, geometric, or electronic), either analytic or ML-derived, enable rapid screening where direct computation is prohibitive (Jia et al., 2019, Baumann et al., 11 Mar 2026).
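The active-learning loop mentioned above can be sketched end to end: a surrogate is refit after each batch of "oracle" evaluations (standing in for DFT or experiment), and the next batch is queried where the surrogate predicts the best property. The quadratic oracle, descriptor pool, and batch sizes are all invented for illustration:

```python
import numpy as np

# Greedy batch active learning against a synthetic oracle.
rng = np.random.default_rng(2)
pool = rng.uniform(-2, 2, size=(400, 3))            # candidate descriptors

def oracle(x):
    """Expensive property to maximize (peaked at x = [0.5, 0.5, 0.5])."""
    return -np.sum((x - 0.5) ** 2, axis=-1)

labelled = list(rng.choice(len(pool), 20, replace=False))
for _ in range(5):                                  # five acquisition rounds
    X = pool[labelled]
    y = oracle(X)                                   # "DFT/experiment" on labelled set
    # Cheap surrogate: least squares on quadratic features of the descriptors.
    Phi = np.hstack([X, X ** 2, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    Phi_all = np.hstack([pool, pool ** 2, np.ones((len(pool), 1))])
    pred = Phi_all @ w
    pred[labelled] = -np.inf                        # never re-query a candidate
    batch = np.argsort(pred)[-10:]                  # greedy top-10 acquisition
    labelled.extend(batch.tolist())

best = oracle(pool[labelled]).max()
```

After the loop only 70 of the 400 candidates have been "evaluated", yet the best labelled candidate matches the best in the whole pool, which is the economy active learning aims for.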

7. Prospects, Best Practices, and Limitations

HTS is a foundational methodology in computational discovery, but several important considerations govern its adoption and utility:

  • Accuracy–Throughput Tradeoff: Early screening with approximate models must balance false negatives and computational savings; staged filtering with high-accuracy methods only on the final pool is recommended (Sahni et al., 2019, Woo et al., 2021).
  • Data Domain and Transferability: ML surrogates are limited to the chemical space spanned by their training set. Careful curation and validation are mandatory for reliable extrapolation (Baumann et al., 11 Mar 2026).
  • Uncertainty Modeling: FDR control in hit identification, error propagation across model chains, and uncertainty quantification in ML are crucial for prioritizing expensive follow-up (Shterev et al., 2017, Gurbych et al., 2020).
  • Design Constraints: Plate- or pool-based designs must respect experimental or combinatorial limitations; row-constrained SSDs are essential for large-scale HT biological, chemical, and materials screens (Smucker et al., 2024).
  • Rapid Iteration and Modularity: Modular scripting and integration with established materials informatics platforms (e.g., Materials Project, computational chemistry databases) facilitate workflow extension and ensemble modeling.

High-throughput theoretical screening thus enables accelerated, statistically principled discovery in diverse scientific domains. As computational resources, databases, and ML methodologies advance, HTS pipelines will increasingly underpin next-generation materials, drug, and molecular innovation (Sahni et al., 2019, Hong et al., 2020, Baumann et al., 11 Mar 2026, Gurbych et al., 2020, Smucker et al., 2024).
