PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design (2312.00080v1)
Abstract: Structure-based protein design has attracted increasing interest, with numerous methods introduced in recent years. However, no universally accepted evaluation method has been established: wet-lab validation can be too time-consuming to guide the development of new algorithms, while $\textit{in silico}$ validation with recovery and perplexity metrics is efficient but may not precisely reflect true foldability. To address this gap, we introduce two novel metrics: a refoldability-based metric, which leverages high-accuracy protein structure prediction models as a proxy for wet-lab experiments, and a stability-based metric, which assesses whether models can assign high likelihoods to experimentally stable proteins. We curate datasets from high-quality CATH protein data, high-throughput $\textit{de novo}$ designed proteins, and mega-scale mutagenesis experiments, and in doing so present the $\textbf{PDB-Struct}$ benchmark, which evaluates both recent and previously uncompared protein design methods. Experimental results indicate that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on our benchmark, while ESM-Design and AF-Design fall short on the refoldability metric. We also show that some methods with high sequence recovery do not perform as well on our new benchmark. Our proposed benchmark paves the way for a fair and comprehensive evaluation of protein design methods in the future. Code is available at https://github.com/WANG-CR/PDB-Struct.
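The recovery and perplexity metrics mentioned in the abstract can be illustrated with a minimal sketch. It assumes the conventional definitions (recovery as the fraction of positions where the designed residue matches the native one, perplexity as the exponential of the mean per-residue negative log-likelihood); the function names `sequence_recovery` and `perplexity` are hypothetical and are not taken from the PDB-Struct code base, which may compute these quantities differently.

```python
import math
from typing import Sequence


def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native one.

    Assumes both sequences have the same length (one residue per backbone position).
    """
    if len(designed) != len(native):
        raise ValueError("designed and native sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def perplexity(log_probs: Sequence[float]) -> float:
    """Exponential of the mean negative log-likelihood per residue.

    `log_probs` holds the natural-log probability a model assigns to the
    native residue at each position (an assumed input format).
    """
    if not log_probs:
        raise ValueError("log_probs must be non-empty")
    return math.exp(-sum(log_probs) / len(log_probs))


# Usage: three of four residues match, so recovery is 0.75.
recovery = sequence_recovery("ACDE", "ACDF")

# Usage: a model that is uniform over 20 amino acids has perplexity close to 20.
uniform_ppl = perplexity([math.log(1 / 20)] * 10)
```

Both quantities depend only on sequences and model likelihoods, which is why they are cheap to compute; neither involves refolding the designed sequence, which is the gap the abstract's refoldability metric targets.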