BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery (2411.10548v3)

Published 15 Nov 2024 in cs.LG and q-bio.BM

Abstract: Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLMs) training on hundreds of graphics processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.

Summary

  • The paper introduces the BioNeMo Framework, a modular suite that doubles training throughput and scales AI models for drug discovery.
  • It employs advanced parallel computing and size-aware batching techniques to optimize GPU and memory utilization during model training.
  • Benchmark tests reveal near-linear scaling on up to 256 NVIDIA A100 GPUs and 96.9% efficiency on H100 systems, demonstrating its robust performance.

Overview of the BioNeMo Framework for AI Model Development in Drug Discovery

The BioNeMo Framework is a suite of open-source tools for AI-driven computational biology and chemistry, aimed specifically at drug discovery. As AI models become integral to high-throughput in-silico drug development, their training increasingly depends on scale and on advanced parallel computing strategies. The BioNeMo Framework addresses this need by enabling efficient model training across large GPU clusters while optimizing both compute and memory resources.

Key Features and Components

The BioNeMo Framework is organized as a modular, high-performance library of independently installable components, referred to as sub-projects. This modularity supports extensibility and eases integration into existing workflows. Central to the framework is bionemo-core, which provides essential interfaces and foundational model-building tools based on PyTorch and PyTorch Lightning. The framework also integrates NVIDIA's NeMo and Megatron libraries to support model scaling and optimization, especially for large-scale BERT-based models in biomolecular applications; a sketch of the kind of Lightning-based training module this design composes follows below.
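To make the modular design concrete, the sketch below shows a minimal Lightning-based masked-language-model training module. The class, batch layout, and trainer settings are illustrative assumptions, not BioNeMo's actual API.

```python
# Hedged sketch: a minimal PyTorch Lightning module for masked-LM
# pre-training. Nothing here is a BioNeMo API; it only illustrates the
# Lightning-based structure the framework builds on.
import lightning.pytorch as pl
import torch

class MaskedLMModule(pl.LightningModule):
    """Minimal masked-LM wrapper around any BERT-style encoder."""

    def __init__(self, encoder: torch.nn.Module, lr: float = 1e-4):
        super().__init__()
        self.encoder = encoder
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # Assumed batch layout: token ids plus MLM labels (-100 = unmasked).
        logits = self.encoder(batch["input_ids"])
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            batch["labels"].view(-1),
            ignore_index=-100,
        )
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# Swapping the encoder or the data loader is a one-line change; this is
# the kind of seam a sub-project layout is designed around.
trainer = pl.Trainer(accelerator="auto", devices="auto", precision="bf16-mixed")
# trainer.fit(MaskedLMModule(my_encoder), train_dataloaders=my_loader)  # placeholders
```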

Through comprehensive use cases, such as pre-training and fine-tuning protein language models (pLMs) like ESM-2, as well as the single-cell model Geneformer, the framework demonstrates its efficiency. Notably, it achieves over twice the training throughput of equivalent PyTorch implementations while maintaining near-linear scaling across up to 256 NVIDIA A100 GPUs.
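A common fine-tuning pattern for pre-trained pLMs, sketched below in plain PyTorch, freezes the encoder and trains a small task head on pooled embeddings. The toy encoder stands in for a real checkpoint (ESM-2 3B, for instance, uses a 2560-dimensional hidden state); none of these names are BioNeMo APIs.

```python
# Hedged sketch of encoder-frozen fine-tuning; the encoder is a toy
# stand-in for a pre-trained pLM checkpoint.
import torch

hidden = 256  # toy size; ESM-2 3B uses a 2560-dim hidden state
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
    num_layers=2,
)
for p in encoder.parameters():
    p.requires_grad = False  # keep the pre-trained weights fixed

# Small trainable head, e.g. a per-protein property regressor.
head = torch.nn.Sequential(
    torch.nn.Linear(hidden, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
opt = torch.optim.AdamW(head.parameters(), lr=3e-4)

x = torch.randn(4, 64, hidden)   # toy batch of embedded sequences
pooled = encoder(x).mean(dim=1)  # mean-pool per-residue states
loss = torch.nn.functional.mse_loss(head(pooled).squeeze(-1), torch.zeros(4))
loss.backward()
opt.step()
```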

Benchmarks and Numerical Results

The performance benchmarks of the BioNeMo Framework are strong: it trains a three-billion-parameter BERT-based pLM on over one trillion tokens in 4.2 days on 256 NVIDIA A100 GPUs. Model FLOPs Utilization (MFU) also improves markedly, reaching 59.2% compared to 40.1% for an equivalent setup built on the standard Accelerate library.
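These numbers can be sanity-checked with the common 6NT FLOP approximation for transformer training (N parameters, T tokens). The approximation slightly overstates the paper's exact accounting but lands near the reported figure:

```python
# Back-of-the-envelope MFU estimate for the reported run, using the
# standard 6*N*T training-FLOP approximation.
params = 3e9            # 3B-parameter BERT-based pLM
tokens = 1e12           # ~1T training tokens
seconds = 4.2 * 86400   # 4.2 days
gpus = 256
peak_bf16 = 312e12      # A100 dense BF16 peak, FLOP/s

per_gpu = 6 * params * tokens / seconds / gpus  # achieved FLOP/s per GPU
print(f"MFU ~ {per_gpu / peak_bf16:.1%}")       # ~62%, near the reported 59.2%
```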

Additionally, in distributed training settings with larger model sizes on H100 GPU systems, the framework sustains 96.9% of extrapolated single-node throughput. These empirical benchmarks underscore the framework's efficiency in handling large-scale data and models, highlighting its capacity for the intensive computational demands typical of drug discovery research.
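Scaling efficiency here follows the usual definition: measured cluster throughput divided by single-node throughput extrapolated linearly to the same node count. A small sketch, with placeholder throughput numbers chosen only to reproduce a 96.9% figure:

```python
# Scaling efficiency = measured throughput / (single-node throughput * nodes).
# The numbers below are placeholders, not measurements from the paper.
def scaling_efficiency(single_node_tps: float, nodes: int, measured_tps: float) -> float:
    return measured_tps / (single_node_tps * nodes)

print(f"{scaling_efficiency(1.0e6, 32, 31.0e6):.1%}")  # hypothetical run: 96.9%
```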

Functionalities and Applications

The framework improves productivity through several functionalities, including advanced data loading and size-aware batching. Components such as BioNeMo-SCDL, a single-cell data loader, and WebDataModule, for high-performance data streaming, illustrate its ability to handle diverse datasets efficiently. The size-aware batching module is particularly noteworthy: it manages GPU memory utilization during graph neural network (GNN) training, addressing the large variation in sample sizes inherent to molecular modeling tasks.
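The core idea of size-aware batching can be sketched in a few lines: greedily pack samples into a batch until an estimated memory cost would exceed a budget. This is the general technique, not BioNeMo's exact implementation.

```python
# Hedged sketch of size-aware batching for variably sized samples,
# e.g. molecular graphs with very different node counts.
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def size_aware_batches(
    samples: Iterable[T],
    cost: Callable[[T], float],  # estimated memory cost of one sample
    budget: float,               # per-batch memory budget
) -> Iterator[List[T]]:
    batch: List[T] = []
    used = 0.0
    for s in samples:
        c = cost(s)
        if batch and used + c > budget:
            yield batch            # budget would be exceeded: flush
            batch, used = [], 0.0
        batch.append(s)
        used += c
    if batch:
        yield batch

# Large and small graphs land in differently sized batches, keeping
# peak GPU memory roughly constant across training steps.
graphs = [{"nodes": n} for n in (10, 500, 20, 30, 480, 15)]
for b in size_aware_batches(graphs, cost=lambda g: g["nodes"], budget=512):
    print([g["nodes"] for g in b])  # [10, 500], [20, 30], [480, 15]
```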

Community Contributions and Future Directions

As an open-source platform, the BioNeMo Framework encourages community contributions that expand its coverage of biomolecular domains. Real-world adoption, such as fine-tuning tools integrated by Dyno Therapeutics and new model implementations by Flagship Pioneering and Relation Therapeutics, demonstrates its growing utility across disciplines. The framework also supports cloud-based deployment, enabling organizations to leverage expansive computational resources such as AWS EC2 for large-scale inference in biopharma research.

Moving forward, the BioNeMo Framework roadmap includes continued API refinement for greater modularity and easier user contribution. The focus will be on evolving framework components to support the training and deployment of next-generation AI models across a broad range of computational biology challenges. Alignment with NVIDIA's broader software ecosystem suggests continuing performance improvements, positioning the BioNeMo Framework as a pivotal asset for AI applications in drug discovery and biological research.