CloneBO: Bayesian Optimization for Antibody Design
- CloneBO is a Bayesian optimization framework that integrates a generative clonal model (CloneLM) and a twisted sequential Monte Carlo (tSMC) strategy to efficiently design antibody variants under constrained experimental budgets.
- It leverages a 377M-parameter autoregressive transformer trained on large-scale immune repertoire data to model clonal family evolution and infer a fitness landscape for antibody evolution.
- The framework employs Thompson sampling for candidate selection and has demonstrated superior performance in both in silico and in vitro studies by producing antibody variants with improved binding and stability.
CloneBO is a Bayesian optimization framework for efficient antibody design, integrating a generative model of clonal family evolution ("CloneLM") trained on large-scale immune repertoire data with a twisted sequential Monte Carlo (tSMC) posterior sampling strategy and a Thompson-sampling design policy. Developed to accelerate the discovery of antibody variants with improved binding and biophysical stability, CloneBO operates under tight experimental budgets, leveraging insights from natural immune system affinity maturation to guide experimental proposals (Amin et al., 2024).
1. Overview and Motivating Problem
The core problem addressed by CloneBO is resource-constrained antibody optimization: starting from an initial sequence (typically weak in binding or stability), only a limited number of wet-lab measurements can be performed on engineered variants. Standard approaches are hampered by the vastness of functional antibody sequence space and the cost of assays. CloneBO instead proposes a data-driven approach that learns the process by which the adaptive immune system evolves clonal families, sets of related antibodies iteratively selected for higher biological fitness (binding and stability). The system integrates a generative model trained on natural clonal-family data (CloneLM) into a Bayesian optimization loop, biasing variant proposals toward mutations likely to succeed in vivo.
2. The Generative Model: CloneLM
CloneLM is an autoregressive transformer model (377M parameters, Mistral architecture) trained separately on human antibody heavy-chain and light-chain clonal families. Each training input serializes a set of at least 25 sequences from a single clonal family.
The training objective is next-token cross-entropy over all tokens of the serialized family, i.e. the negative log-likelihood

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t}),$$

where $x_{1:T}$ is the token sequence of a serialized clonal family.
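As a concrete toy illustration of this objective, the sketch below computes the mean next-token negative log-likelihood for a short token sequence, with random categorical distributions standing in for CloneLM's outputs; all names, shapes, and values here are illustrative, not the paper's implementation.

```python
import numpy as np

def next_token_nll(token_probs: np.ndarray, tokens: np.ndarray) -> float:
    """Mean next-token negative log-likelihood.

    token_probs[t] is the model's predictive distribution over the
    vocabulary before emitting tokens[t]; tokens holds the realized ids.
    """
    picked = token_probs[np.arange(len(tokens)), tokens]
    return float(-np.log(picked).mean())

# Toy stand-in for CloneLM outputs: 5 positions, vocabulary of 4 tokens.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=5)
seq = np.array([0, 3, 1, 2, 2])
loss = next_token_nll(probs, seq)
```

A sanity check on the formula: under a uniform model over a vocabulary of size $V$, the loss is exactly $\log V$ regardless of the sequence.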
Training data comprise approximately 908,000 heavy-chain and 34,000 light-chain clonal families (from OAS, preprocessed with FastBCR), with test perplexities of 1.276 (heavy) and 1.267 (light). The model's martingale property ensures that its predictive distribution over new family members concentrates on the true latent clone as more sequences are observed, schematically

$$p_\theta(s_{N+1} \mid s_1, \dots, s_N) \;\longrightarrow\; p(s_{N+1} \mid c^*) \quad \text{as } N \to \infty,$$

where $c^*$ denotes the latent clone generating the family.
Thus, the log-likelihood function

$$f(s) = \log p_\theta(s \mid s_1, \dots, s_N)$$

can be interpreted as an inferred fitness landscape for antibody evolution.
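To make the fitness-landscape interpretation concrete, here is a minimal sketch in which a Laplace-smoothed bigram model fit to an observed family plays the role of CloneLM's in-context predictive distribution; `family_bigram_logprob` and the toy family are hypothetical stand-ins, not the paper's model.

```python
import math
from collections import Counter

def family_bigram_logprob(seq, family, alpha=1.0):
    """Log-likelihood of seq under a Laplace-smoothed bigram model fit to
    the observed clonal family (a toy stand-in for CloneLM's in-context
    predictive distribution)."""
    alphabet = sorted(set("".join(family)) | set(seq))
    pair_counts = Counter()
    for s in family:
        for a, b in zip(s, s[1:]):
            pair_counts[a, b] += 1
    first_counts = Counter()
    for (a, _), c in pair_counts.items():
        first_counts[a] += c
    logp = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (pair_counts[a, b] + alpha) / (first_counts[a] + alpha * len(alphabet))
        logp += math.log(p)
    return logp

# A candidate sharing the family's conserved "CC" motif scores higher
# than one that breaks it, mirroring the fitness interpretation above.
family = ["ACCA", "ACCG", "ACCT"]
```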
3. Bayesian Optimization Procedure
CloneBO's Bayesian optimization component models each experimental measurement $y_i$ as an affine-noisy observation of latent immunological fitness $f_c(s_i) = \log p_\theta(s_i \mid c)$ for a clone $c$. Assuming an unknown scale $a$ and shift $b$,

$$y_i = a\, f_c(s_i) + b + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2).$$
Uniform priors on $a$ and $b$ enable analytic computation of the marginal likelihood $p(y_{1:n} \mid c)$. The posterior over clones, which determines the fitness landscape, is proportional to

$$p(c \mid y_{1:n}) \propto p(c)\, p(y_{1:n} \mid c),$$

where $p(c)$ is the CloneLM prior over clonal families.
For sequence acquisition, CloneBO employs Thompson sampling: drawing a clone $c$ from the posterior (instantiating a fitness function $f_c$) and proposing the next variant by maximizing $f_c$ via local search over single-site substitutions of the best current candidates. This approach efficiently exploits the modeled fitness landscape for directed exploration.
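A minimal sketch of one such Thompson-sampling design step, assuming the sampled fitness function is available as a plain Python callable (the toy fitness below stands in for $f_c$; `propose_next` and its parameters are illustrative, not from the paper):

```python
import itertools

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def single_site_variants(seq):
    """All sequences exactly one substitution away from seq."""
    for i, aa in itertools.product(range(len(seq)), ALPHABET):
        if aa != seq[i]:
            yield seq[:i] + aa + seq[i + 1:]

def propose_next(fitness, candidates, top_k=3):
    """One Thompson-sampling design step: local search over single-site
    substitutions of the top_k current candidates under a sampled fitness."""
    pool = sorted(candidates, key=fitness, reverse=True)[:top_k]
    variants = {v for s in pool for v in single_site_variants(s)} | set(pool)
    return max(variants, key=fitness)

# Toy fitness standing in for f_c: count of alanines ('A').
toy_fitness = lambda s: s.count("A")
best = propose_next(toy_fitness, ["GGG", "GAG", "GGA"])
```

In the real loop, `fitness` would be the log-likelihood of a clone sampled by tSMC, re-drawn at each design round so that exploration follows posterior uncertainty.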
4. Twisted SMC Posterior Sampling
Sampling from the posterior over clonal families is achieved using a twisted sequential Monte Carlo (tSMC) algorithm. The target distribution over family extensions $s_{N+1:M}$ is

$$\pi(s_{N+1:M}) \propto p_\theta(s_{N+1:M} \mid s_{1:N})\, p(y_{1:n} \mid c),$$

where $c = (s_1, \dots, s_M)$ is the completed clone. Sequences are constructed token-by-token with a twisted proposal density

$$q(x_t \mid x_{<t}) \propto p_\theta(x_t \mid x_{<t})\, \psi_t(x_{1:t}),$$

where the twist $\psi_t$ anticipates the effect of the measurements on tokens not yet generated. At each position $t$, the pseudo-likelihood increment used for weighting takes the standard twisted-SMC form

$$w_t = \frac{p_\theta(x_t \mid x_{<t})\, \psi_t(x_{1:t})}{\psi_{t-1}(x_{1:t-1})\, q(x_t \mid x_{<t})}.$$
Multiple particles are propagated, weighted, and resampled based on effective sample size. After full sequence generation, a final weight correction guarantees consistency with the correct posterior.
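The particle loop can be sketched as follows, using a generic reweight-then-resample step with systematic resampling triggered by a low effective sample size; the twist-specific increment is abstracted into a `log_increment` callable, and all names are illustrative, not the paper's implementation.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: returns ancestor indices."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(weights), positions)

def smc_step(particles, log_w, log_increment, rng, ess_frac=0.5):
    """One SMC step: reweight by the pseudo-likelihood increment, then
    resample when the effective sample size drops below a threshold."""
    log_w = log_w + log_increment(particles)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)                 # effective sample size
    if ess < ess_frac * len(particles):
        idx = systematic_resample(w, rng)
        particles = particles[idx]
        log_w = np.zeros(len(particles))       # weights reset after resampling
    return particles, log_w

# Toy run: scalar "particles"; the increment strongly favors large values,
# collapsing the ESS and triggering a resample.
rng = np.random.default_rng(1)
particles = np.array([0.0, 1.0, 2.0, 3.0])
log_w = np.zeros(4)
particles, log_w = smc_step(particles, log_w, lambda p: 5.0 * p, rng)
```

In CloneBO's setting the particles would be partial token sequences rather than scalars, extended one token per step under the twisted proposal.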
5. Integration of Experimental Data
To maintain computational efficiency, CloneBO conditions only on the measured sequences most likely to co-occur with the clone being generated under the unconditioned CloneLM prior. Each measured pair $(s_i, y_i)$ contributes through the marginal likelihood of the affine-Gaussian noise model,

$$p(y_{1:n} \mid c) = \iint \prod_{i=1}^{n} \mathcal{N}\!\left(y_i;\, a f_c(s_i) + b,\, \sigma^2\right) \mathrm{d}a\, \mathrm{d}b,$$

which is available in closed form under the uniform priors on $a$ and $b$.
Incorporating this information into the clonal-family generative prior yields the "twisted" distribution targeted by tSMC. As the number of samples grows, this approach converges in total variation to the true posterior over clonal families.
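As a sketch of the closed-form marginal likelihood under flat priors on the scale and shift (a standard linear-model identity, not code from the paper), the following scores how well a clone's fitness values explain the measurements; `log_marginal_affine` is a hypothetical name.

```python
import numpy as np

def log_marginal_affine(y, f, sigma=1.0):
    """Log marginal likelihood of measurements y under
    y = a*f + b + N(0, sigma^2), with flat (improper) priors on a and b
    integrated out analytically (standard linear-model result)."""
    n = len(y)
    F = np.column_stack([f, np.ones(n)])          # design matrix [f, 1]
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)  # least-squares fit of (a, b)
    rss = np.sum((y - F @ coef) ** 2)             # residual sum of squares
    _, logdet = np.linalg.slogdet(F.T @ F)
    return (-(n - 2) / 2 * np.log(2 * np.pi * sigma**2)
            - 0.5 * logdet
            - rss / (2 * sigma**2))

# A clone whose fitness ordering tracks the measurements scores higher
# than one whose fitness is unrelated to them.
y = np.array([1.0, 2.0, 3.0, 4.0])
f_consistent = np.array([1.0, 2.0, 3.0, 4.0])
f_inconsistent = np.array([1.0, -1.0, 1.0, -1.0])
```

Because the scale and shift are integrated out, only the shape of the fitness landscape matters, which is what lets CloneBO compare clones without knowing the assay's units or offset.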
6. Empirical Validation: In Silico and In Vitro Studies
CloneBO's efficacy is demonstrated in both simulated and experimental settings. In silico, using a held-out human clone as an oracle, CloneBO rapidly identifies high-fitness mutants even from a small number of measurements, whereas uninformed baselines exhibit limited improvement. On the VHH dataset (binding and melting-temperature tasks with CNN/byte-level oracle predictors; cross-validated Spearman correlations of 0.72 and 0.95 for Tm and binding, respectively), CloneBO outperforms 10 competitor algorithms (Greedy, Sapiens, LaMBO, LaMBO-Ab, Genetic, AdaLead, EvoBO, CMA-ES, Dyna-PPO, CbAS) after 100 generations, with statistically significant margins on both the binding and stability tasks. In the SARS-CoV CDRH3 binding task, CloneBO outperforms DiffAb, including cases where structural information is available, especially when few measurements are available.
In vitro, starting from 1,000 initial measurements, 200 new antibody variants were designed with CloneBO, with LaMBO-Ab as a comparator. Synthesis probability was evaluated using a trained expressibility predictor (AUROC 0.87), with CloneBO's designs being significantly more synthesizable. Empirical binding (bio-layer interferometry) and stability (nanoDSF) assays confirm that the top CloneBO-derived binders and stabilizers significantly outperform those from previous libraries and alternative design procedures, with the best melting temperature exceeding all prior measurements.
7. Hyperparameters, Implementation, and Scalability
CloneBO's main hyperparameters are the clone size, the number of SMC particles per iteration, the size of the conditioning set of top co-occurring measured sequences, and the breadth of the Thompson-sampling local search (single-residue substitutions from the top current candidates). The noise scale $\sigma$ of the affine-Gaussian observation model is fixed to a constant value. CloneLM is trained on 4× A100 GPUs; posterior sampling and the design loop require approximately 10 hours for 100 design rounds on a single A100 GPU. Limiting the conditioning-set size ensures that tSMC cost scales linearly. Potential future expansions include more scalable surrogates for sequence-activity summaries, batch proposal strategies, and multi-objective extensions for simultaneous optimization of binding and stability (Amin et al., 2024).