
Bayesian Optimization for Antibody Engineering

Updated 30 September 2025
  • Antibody Bayesian optimization is a probabilistic approach that navigates vast antibody sequence–structure spaces using surrogate models to predict optimal properties.
  • It integrates techniques like protein language models, deep generative models, and multi-objective frameworks to balance affinity, stability, and developability.
  • The method reduces experimental costs and accelerates antibody engineering through iterative, uncertainty-driven candidate selection.

Antibody Bayesian optimization refers to the application of uncertainty-aware, sample-efficient search strategies for improving antibody properties, leveraging probabilistic models to guide the selection and evaluation of candidates across vast discrete sequence–structure spaces. Bayesian optimization (BO) in this context is characterized by the use of surrogate models—often Gaussian processes (GPs) adapted for biological sequence or structure data—to navigate the combinatorially challenging design domains inherent in antibody engineering, including the critical complementarity-determining regions (CDRs). Recent advances extend classical BO frameworks by integrating protein LLMs, structure-informed kernels, hierarchical objectives, generative priors, deep learning surrogates, and hybrid sequence-structure latent representations. These innovations have enabled substantial improvements in the efficiency, quality, and practicality of antibody design and optimization.

1. Bayesian Optimization Frameworks in Antibody Engineering

Bayesian optimization typically addresses tasks where exhaustive search for optimal antibody sequences is computationally infeasible due to the exponential growth of the design space (e.g., $20^L$ possible CDRH3 sequences of length $L$). The standard BO workflow constructs a probabilistic surrogate model, $f \sim \mathcal{GP}(\mu, k)$, trained on observed pairs $(x_i, y_i)$, where $x_i$ denotes an encoded antibody sequence (or structure) and $y_i$ is the measured property (e.g., affinity or stability). The surrogate predicts outcomes and quantifies uncertainty, enabling principled selection of future candidates by maximizing an acquisition function, such as expected improvement or noisy expected hypervolume improvement for multi-objective scenarios (Khan et al., 2022, Park et al., 2022, Gessner et al., 11 Jun 2024, Amin et al., 10 Dec 2024, Ober et al., 29 Sep 2025).
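The surrogate-plus-acquisition loop described above can be sketched in a few lines of numpy. The sequences, measured values, and the RBF kernel on one-hot encodings below are toy placeholders chosen for illustration, not the setup of any cited paper:

```python
import numpy as np
from scipy.stats import norm

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Encode an amino-acid sequence as a flat one-hot vector."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def rbf_kernel(A, B, lengthscale=2.0):
    """Squared-exponential kernel between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-4):
    """Exact GP regression posterior mean and standard deviation."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_test, X_train)
    Kss = rbf_kernel(X_test, X_test)
    K_inv = np.linalg.inv(K)
    mu = Ks @ K_inv @ y_train
    var = np.diag(Kss - Ks @ K_inv @ Ks.T)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, y_best):
    """Expected improvement acquisition for maximization."""
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy BO step: pick the candidate CDR sequence with the highest EI.
train_seqs = ["ARDYW", "ARDFW", "GRDYW"]
train_y = np.array([0.2, 0.5, 0.1])          # e.g. measured affinities
candidates = ["ARDYF", "ARDFF", "WRDYW"]

X_tr = np.stack([one_hot(s) for s in train_seqs])
X_te = np.stack([one_hot(s) for s in candidates])
mu, sigma = gp_posterior(X_tr, train_y, X_te)
ei = expected_improvement(mu, sigma, train_y.max())
next_seq = candidates[int(np.argmax(ei))]
```

In a real loop, `next_seq` would be assayed, appended to the training set, and the surrogate refit before the next round.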

The BO paradigm has been adapted to:

  • Black-box optimization over discrete sequence spaces (AntBO) using categorical kernels such as the transformed overlap kernel, string kernels, and embedding-based kernels.
  • Multi-objective optimization where expression and affinity must be balanced and sometimes hierarchically ordered, as realized in the PropertyDAG framework (Park et al., 2022).
  • Active learning loops that iteratively refine surrogate models and guide simulator or experimental queries (Gessner et al., 11 Jun 2024).
  • Generative model-informed optimization, where priors are constructed from natural clonal family evolution to restrict the search to biologically plausible regions (Amin et al., 10 Dec 2024).
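The categorical-kernel idea behind discrete-sequence BO can be sketched as follows. This is an illustrative form that exponentiates a lengthscale-weighted fraction of matching positions; the exact transformed overlap kernel is defined in the AntBO paper (Khan et al., 2022):

```python
import numpy as np

def transformed_overlap_kernel(s1, s2, lengthscales, variance=1.0):
    """Categorical kernel on aligned sequences: exponentiate the
    lengthscale-weighted fraction of matching positions.
    (Illustrative form, not the paper's exact definition.)"""
    assert len(s1) == len(s2) == len(lengthscales)
    matches = np.array([a == b for a, b in zip(s1, s2)], dtype=float)
    return variance * np.exp((lengthscales * matches).sum() / len(s1))

theta = np.full(5, 1.0)                       # per-position lengthscales
k_same = transformed_overlap_kernel("ARDYW", "ARDYW", theta)
k_diff = transformed_overlap_kernel("ARDYW", "GRDYF", theta)
```

Identical sequences get the maximal similarity, and each mismatched position reduces it smoothly, which is what lets a GP generalize across a discrete sequence space.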

2. Surrogate Models and Input Representations

Surrogate modeling for antibody BO spans multiple designs:

  • Sequence-based models: One-hot encoding, BLOSUM substitution matrices, n-gram “motif” embedding, and protein LLM-derived representations (e.g., mean-pooled ESM-2, AbLang2, Sapiens) (Gessner et al., 11 Jun 2024, Ober et al., 29 Sep 2025).
  • Structure-based models: Encodings of predicted 3D structures (e.g., flattened $C_\alpha$ coordinates from IgFold), sometimes aligned to a parent antibody (Ober et al., 29 Sep 2025).
  • Hybrid approaches: Weighted or concatenated vector combinations of sequence and structure. Composite kernels fuse both sources, e.g., $k(x, x') = \pi\, k_{\text{struct}}(x, x') + (1 - \pi)\, k_{\text{seq}}(x, x')$ (Ober et al., 29 Sep 2025).

These representations are coupled with appropriate kernels: Tanimoto, Matérn-5/2, radial basis function, and string-matching kernels. A major finding is that sequence-only models, particularly when guided with a protein LLM (pLM) soft constraint, can match the late-round performance of structure-based or hybrid methods, though the latter sometimes show early-round data-efficiency gains, especially in stability optimization (Ober et al., 29 Sep 2025).
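A minimal sketch of such a composite kernel, combining a Tanimoto kernel on sequence features with a Matérn-5/2 kernel on structure features. The feature dimensions and random inputs here are arbitrary placeholders standing in for pLM embeddings and flattened coordinates:

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto similarity between non-negative feature vectors (rows)."""
    dot = A @ B.T
    sq_a = (A * A).sum(1)[:, None]
    sq_b = (B * B).sum(1)[None, :]
    return dot / (sq_a + sq_b - dot)

def matern52_kernel(A, B, lengthscale=1.0):
    """Matérn-5/2 kernel on Euclidean distances between rows."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    r = np.sqrt(5.0) * d / lengthscale
    return (1.0 + r + r**2 / 3.0) * np.exp(-r)

def composite_kernel(seq_feats, struct_feats, pi=0.5):
    """Convex combination k = pi * k_struct + (1 - pi) * k_seq."""
    k_seq = tanimoto_kernel(seq_feats, seq_feats)
    k_struct = matern52_kernel(struct_feats, struct_feats)
    return pi * k_struct + (1.0 - pi) * k_seq

rng = np.random.default_rng(0)
seq_feats = rng.random((4, 8))       # placeholder for pLM embeddings
struct_feats = rng.random((4, 6))    # placeholder for C-alpha coordinates
K = composite_kernel(seq_feats, struct_feats, pi=0.3)
```

Since both components are valid kernels with unit diagonal, their convex combination is again a valid kernel, and the weight $\pi$ controls how much the surrogate trusts structure versus sequence.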

Protein LLM-based priors are integrated into BO as soft constraints: the acquisition function is multiplied by the pLM likelihood, $a_{\text{pLM}}(x) = \text{pLM}(x) \cdot a(x)$, favoring natural-like, expressible antibodies (Ober et al., 29 Sep 2025).
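The effect of this soft constraint is easy to see in a small sketch. The log-likelihood values below are hypothetical pLM scores, and shifting by the maximum before exponentiating is a numerical-stability choice made here, not prescribed by the source:

```python
import numpy as np

def plm_weighted_acquisition(acq_values, plm_log_likelihoods):
    """Multiply each candidate's acquisition value by its (relative)
    pLM likelihood: a_pLM(x) = pLM(x) * a(x). Log-likelihoods are
    shifted by their max before exponentiating for stability."""
    ll = np.asarray(plm_log_likelihoods)
    plm = np.exp(ll - ll.max())              # relative likelihoods in (0, 1]
    return np.asarray(acq_values) * plm

acq = np.array([0.9, 0.8, 0.5])              # raw acquisition values
logp = np.array([-30.0, -5.0, -6.0])         # hypothetical pLM log-likelihoods
weighted = plm_weighted_acquisition(acq, logp)
best = int(np.argmax(weighted))
```

Note how the candidate with the highest raw acquisition (index 0) is demoted because the pLM deems it implausible, so the constrained search favors natural-like sequences.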

3. Generative Model Integration and Optimization Strategies

State-of-the-art Bayesian optimization frameworks increasingly exploit deep generative models:

  • CloneBO uses CloneLM, an autoregressive LLM trained on clonal families, providing a probabilistic fitness prior proportional to $\log p(X \mid \text{clone})$. The BO process is structurally informed by evolutionary dynamics, and the optimization leverages a twisted sequential Monte Carlo (SMC) procedure that conditions generative proposals on experimental feedback, achieving efficiency in both in silico and in vitro settings (Amin et al., 10 Dec 2024).
  • LEAD operates in a joint sequence-structure latent space, directly optimizing the latent code via black-box guidance (population sampling and log-likelihood gradient estimation), which is reminiscent of stochastic Bayesian optimization with gradient-free update strategies (Yao et al., 15 Aug 2025).
  • AbFlowNet fuses diffusion models with GFlowNet architectures, assigning rewards (binding energy estimates) to terminal states and propagating these using the Trajectory Balance objective across all diffusion steps, thus enforcing global optimization for both structure recovery and binding (Abir et al., 18 May 2025).
  • Alignment and Energy-based Objectives: AlignAb extends Direct Preference Optimization to Pareto-aligned multi-objective energy-based constraints (low repulsion, high attraction), with online iterative empirical alignment, temperature scaling for diversity, and explicit reward incorporation, yielding high-affinity, nature-like designs (Wen et al., 30 Dec 2024).

4. Multi-objective and Hierarchical Optimization

Multi-objective optimization is central in antibody BO, given the necessity to optimize for multiple developability and functionality criteria (e.g., expression, affinity, stability, specificity). PropertyDAG formalizes objectives as a directed acyclic graph, e.g., Expression → Affinity, with zero-inflated surrogate models that condition measurements on successful upstream properties (Park et al., 2022). Acquisition functions are resampled to reflect hierarchical dependencies, prioritizing candidates that jointly satisfy all criteria. Empirical results show that this improves calibration and sample efficiency.
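The gating idea behind such a hierarchy can be illustrated with a simplified expectation-style score. This is a loose sketch of the Expression → Affinity dependency, not the paper's actual zero-inflated model or resampling procedure:

```python
import numpy as np

def hierarchical_acquisition(p_express, affinity_acq):
    """Illustrative PropertyDAG-style gating: a candidate's affinity
    acquisition only counts in proportion to its predicted probability
    of expressing at all (the upstream node in the DAG)."""
    return np.asarray(p_express) * np.asarray(affinity_acq)

p_express = np.array([0.95, 0.10, 0.80])   # surrogate P(candidate expresses)
aff_acq = np.array([0.4, 0.9, 0.5])        # affinity acquisition given expression
scores = hierarchical_acquisition(p_express, aff_acq)
ranked = np.argsort(-scores)               # best candidates first
```

The candidate with the highest affinity acquisition (index 1) ranks last because it is unlikely to express, which is exactly the behavior the hierarchical ordering is meant to enforce.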

AlignAb employs Pareto-optimal energy alignment, combining explicit energy rewards (Lennard–Jones, Coulombic) across objectives with weighted collective rewards and temperature scaling, achieving improvements in diversity, structural recovery, and binding metrics (Wen et al., 30 Dec 2024).

5. Sequence-structure Co-design and Advanced Representations

Recent frameworks perform sequence-structure co-design to optimize CDRs within antibody frameworks:

  • AntiFold applies inverse folding to jointly generate sequence from fixed structure via geometric GNNs and transformer layers, producing candidates with high sequence recovery and structural fidelity, and assigning low probability to disruptive mutations. Its outputs can be used as acquisition scores in BO loops (Høie et al., 6 May 2024).
  • RADAb implements retrieval-augmented conditional diffusion, integrating structural homologous motif retrieval (MASTER algorithm) with denoising branches for global structure and local sequence context. Although not a BO per se, the probabilistic formulation allows integration with Bayesian acquisition strategies and uncertainty quantification (Wang et al., 19 Oct 2024).
  • RAAD leverages equivariant graph neural networks to co-design antigen-specific CDRs, encoding extensive node, edge, and relational features, and optimizing via iterative augmentation and a specificity-enhancing contrastive constraint. While not strictly Bayesian, avenues for surrogate modeling and acquisition-driven candidate selection are evident (Wu et al., 14 Dec 2024).

6. Computational and Experimental Considerations

Efficiency is a primary concern in antibody BO, as experimental or simulation-based queries (e.g., relative binding free energy (RBFE) calculations) are often expensive. Strategies include:

  • Trust region optimization (AntBO), which restricts candidate proposals to sequences within a bounded Hamming distance from known high-performing designs while enforcing developability constraints (Khan et al., 2022).
  • Dimensionality reduction and embeddings for scalability in GP surrogates; random projections for high-dimensional representations; kernel choices adapted to discrete sequence or structural domains (Gessner et al., 11 Jun 2024).
  • Iterative active learning, where the surrogate is continually updated and candidate decisions guided by exploration–exploitation trade-offs; demonstrated in both simulated and realistic “full-loop” experiments (Gessner et al., 11 Jun 2024).
  • Online iterative learning with temperature scaling (AlignAb), which maintains sample diversity and mitigates mode collapse (Wen et al., 30 Dec 2024).
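The trust-region restriction in the first bullet amounts to a simple Hamming-distance filter around the incumbent design, sketched here with toy sequences (the developability constraints AntBO also enforces are omitted):

```python
def hamming(s1, s2):
    """Number of differing positions between two aligned sequences."""
    return sum(a != b for a, b in zip(s1, s2))

def trust_region_filter(candidates, incumbent, radius):
    """Keep only candidates within a Hamming-distance trust region
    around the best design found so far."""
    return [c for c in candidates if hamming(c, incumbent) <= radius]

incumbent = "ARDYW"
pool = ["ARDYF", "GRDFF", "ARDYW", "WWWWW"]
inside = trust_region_filter(pool, incumbent, radius=1)
```

Restricting proposals this way keeps the surrogate operating in a region where its predictions are trustworthy, at the cost of more conservative exploration; the radius is typically adapted between rounds.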

Table 1: Key Surrogate Models and Their Domains (Ober et al., 29 Sep 2025)

| Model name   | Input domain       | Kernel/Representation |
|--------------|--------------------|-----------------------|
| OneHot-T     | Sequence           | Tanimoto (one-hot)    |
| BLO-T        | Sequence           | Tanimoto (BLOSUM-62)  |
| ESM-M        | Sequence           | Matérn-5/2 (ESM-2)    |
| IgFold-M     | Structure          | Flattened coordinates |
| IgFold-ESM-M | Sequence+Structure | Concatenated vector   |
| IgFold-BLO-T | Sequence+Structure | Kernel sum            |
| Kermut-T     | Sequence+Structure | Weighted kernel sum   |

7. Implications and Future Directions

  • The necessity of explicit structural information in BO is context-dependent: sequence-only models augmented with pLM soft constraints can match structure-based performance, particularly for affinity and late-stage optimization, while structure-based surrogates offer early data efficiency for stability (Ober et al., 29 Sep 2025).
  • Generative-model–driven BO accelerates optimization by leveraging evolutionary priors and active conditioning on empirical data, with practical advantages for therapeutic antibody discovery (Amin et al., 10 Dec 2024).
  • Hierarchical and multi-objective frameworks (PropertyDAG, AlignAb) enhance practical design by embedding experimental, biological, and physical dependencies.
  • The integration of probabilistic generative models, retrieval-augmented inversion, and GFlowNet-diffusion hybrids continues to expand capabilities in sequence-structure co-optimization and property-directed antibody engineering.
  • Open questions remain regarding the optimal fusion of sequence and structure representations, the calibration of acquisition functions under uncertainty, and the in vitro transferability of surrogate-guided rankings.

Antibody Bayesian optimization thus encompasses a rich field of methods, blending kernel-based surrogates, deep generative models, multi-objective heuristics, and evolutionary priors, with empirical advances driving improvements in sample efficiency, property prediction, and practical deployability. The convergence of classical Bayesian algorithms with modern machine learning architectures defines the current frontier in antibody engineering.
