
OpenProteinSet: Training data for structural biology at scale (2308.05326v1)

Published 10 Aug 2023 in q-bio.BM and cs.LG

Abstract: Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.

Summary

  • The paper provides a comprehensive open dataset of over 16 million MSAs and PDB chain alignments, reducing the computational bottleneck in protein structure prediction.
  • The MSAs were generated with JackHMMER and HHblits searches over large sequence databases, and OpenFold trained on the dataset reaches parity with AlphaFold2, as shown by similar CASP15 scores.
  • The open-access resource democratizes bioinformatics research, accelerating advances in protein design, structure prediction, and multimodal AI integration.

Overview of "OpenProteinSet: Training data for structural biology at scale"

The paper "OpenProteinSet: Training data for structural biology at scale" addresses the significant challenge of generating large-scale datasets of Multiple Sequence Alignments (MSAs) required for training advanced machine learning models in structural biology. Authored by Gustaf Ahdritz and colleagues, this work introduces OpenProteinSet, a comprehensive and open-source repository of over 16 million MSAs, structural homologs from the Protein Data Bank (PDB), and AlphaFold2 protein structure predictions.

Background and Motivation

MSAs are a critical component in bioinformatics, particularly for tasks like protein structure prediction, protein design, and the development of protein language models. The advent of sophisticated models like AlphaFold2, which rely on massive quantities of raw MSAs to achieve near-experimental accuracy in structure prediction, underscores the centrality of MSA data. Nevertheless, the computational expense of generating MSAs at such a scale has limited their accessibility to the broader research community.

OpenProteinSet addresses this bottleneck directly by opening access to an extensive dataset of precomputed MSAs. It supports progress in machine learning for proteins by providing training and validation data that were previously unavailable outside a few well-resourced research groups.

Composition and Methodology

OpenProteinSet comprises over 16 million unique MSAs, curated and generated using robust bioinformatic tools and protocols. Key highlights of the dataset include:

  1. PDB Chains: MSAs for all 140,000 unique PDB chains as of April 2022, generated using tools like JackHMMER and HHblits, with searches conducted over large sequence databases like BFD, UniRef90, and MGnify.
  2. Uniclust30 Clusters: MSAs for each cluster within Uniclust30, numbering about 16 million. The dataset also includes a filtered, representative subset of approximately 270,000 clusters, intended for training scenarios similar to those used for AlphaFold2.

Additional components include structural template hits computed against the PDB70 database and AlphaFold2 structure predictions for the filtered Uniclust30 subset. The generation of OpenProteinSet entailed over four million compute-hours, emphasizing the scale and thoroughness of the dataset.
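To make the central data type concrete, here is a minimal Python sketch of what an MSA looks like to a program: a set of equal-length rows in which "-" marks a gap. It parses a toy alignment and computes its depth (number of rows) and a simple per-column conservation score. The aligned-FASTA layout, the function names, and the toy sequences are all assumptions made for illustration; this summary does not describe the file formats used in OpenProteinSet, which may differ.

```python
from collections import Counter


def parse_aligned_fasta(text):
    """Parse aligned FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def column_conservation(records):
    """For each column, the fraction of rows holding the most common non-gap residue."""
    seqs = [seq for _, seq in records]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all rows of an MSA must have the same length")
    scores = []
    for col in range(length):
        residues = [s[col] for s in seqs if s[col] != "-"]
        if not residues:
            scores.append(0.0)  # column made up entirely of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(seqs))
    return scores


# Hypothetical toy alignment, not taken from the dataset.
TOY_MSA = """>query
MKV-LA
>hit1
MKVQLA
>hit2
MRV-LG
"""

records = parse_aligned_fasta(TOY_MSA)
depth = len(records)
conservation = column_conservation(records)
print(depth, [round(c, 2) for c in conservation])
```

Real alignments are far deeper than this toy case, which is why precomputing them once and sharing the result saves so much compute.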

Empirical Evaluation

The utility of OpenProteinSet was demonstrated through its application in training OpenFold, an open-source replication of AlphaFold2. The resulting model trained on OpenProteinSet achieved accuracy parity with the original AlphaFold2 across extensive evaluations. On the CASP15 dataset, OpenFold achieved a mean GDT-TS score of 73.8, which closely matched AlphaFold2’s score of 74.6, reflecting the robustness and reliability of OpenProteinSet as training data.
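For readers unfamiliar with the metric, GDT-TS (often written GDT_TS) averages the percentage of residues whose Cα atoms lie within 1, 2, 4, and 8 Å of their reference positions. The sketch below computes that average from a list of per-residue distances. It is a simplification under stated assumptions: the official procedure searches over many superpositions and keeps the best coverage at each cutoff, whereas this version takes distances from a single fixed superposition as given. The function name and the toy distances are hypothetical and are not drawn from the paper's evaluation.

```python
def gdt_ts(distances):
    """Simplified GDT_TS from per-residue C-alpha distances in angstroms.

    Assumes the model is already superimposed on the reference; the official
    score maximizes coverage over many superpositions for each cutoff.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    thresholds = (1.0, 2.0, 4.0, 8.0)
    n = len(distances)
    # Fraction of residues within each distance cutoff.
    fractions = [sum(d <= t for d in distances) / n for t in thresholds]
    # Average the four fractions and express the result on a 0-100 scale.
    return 100.0 * sum(fractions) / len(thresholds)


# Hypothetical distances for an 8-residue toy model.
toy_distances = [0.5, 0.9, 1.5, 3.0, 5.0, 9.0, 0.3, 2.5]
score = gdt_ts(toy_distances)
print(score)
```

On this 0 to 100 scale, the reported means of 73.8 and 74.6 differ by 0.8 points.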

Implications and Future Directions

The introduction of OpenProteinSet has several notable implications:

  • Enhancement of Bioinformatic Research: By providing accessible, high-quality MSA data, OpenProteinSet empowers a broader spectrum of researchers to engage in cutting-edge protein machine learning projects.
  • Catalyzing Advancements in Protein Language Models: Models like MSA Transformer can now be trained on publicly available datasets of comparable scale and quality to proprietary sets used by leading research entities.
  • Facilitation of Multimodal AI Research: The dataset serves as a valuable resource for integrating protein data into larger multimodal AI models, thereby enriching their knowledge base and functionality.

Conclusion

OpenProteinSet is set to significantly impact numerous aspects of structural biology and bioinformatics by broadening access to high-quality training data for protein-related machine learning tasks. Its comprehensive and meticulously curated nature makes it an indispensable resource, fostering advancements in protein structure prediction, design, and beyond. Future directions may include periodic updates and expansions to the dataset, further augmenting its utility and keeping pace with the rapid growth in known protein sequences. The authors anticipate that OpenProteinSet will catalyze continued innovation in the domain of machine learning for structural biology.
