
CrystalBoltz: Bayesian Protein Structure

Updated 22 May 2026
  • CrystalBoltz is a generative framework that integrates a diffusion-based Boltz-2 prior with experimental X-ray crystallography data via Bayesian inference.
  • It employs differentiable, experiment-guided sampling to lower RMSD and R-factors while achieving a 33× speed-up relative to traditional refinement pipelines.
  • The method automates local refinement of atomic coordinates and B-factors, ensuring compliance with crystallographic quality measures such as R-work and R-free.

CrystalBoltz is a generative framework for end-to-end protein structure determination from X-ray crystallography data, implementing Bayesian inference over atomic coordinates with direct conditioning on measured structure-factor amplitudes. Unlike traditional workflows, which rely on sequential manual refinement and limited integration of experimental data into generative models, CrystalBoltz unites a powerful diffusion-based prior (Boltz-2) with differentiable, experiment-guided sampling and automated refinement. The methodology achieves lower coordinate RMSD, lower R-factors, and significantly reduced turnaround time relative to previous state-of-the-art experimentally guided refinement pipelines (Kim et al., 15 May 2026).

1. Bayesian Formulation for Crystallography

CrystalBoltz frames protein structure determination as Bayesian inference of atomic coordinates $X$ (and isotropic B-factors $B$) given observed structure-factor amplitudes $|F_o|$. The sequence- and space-group-conditioned posterior is

$$p(X \mid a, c, y) \propto p(y \mid X, a, c)\,p(X \mid a)$$

where:

  • $a$ is the amino-acid sequence,
  • $c = (u, \mathcal{G})$ encodes the unit cell and space group,
  • $y = |F_o|$ are the observed amplitudes,
  • $p(X \mid a)$ is the learned Boltz-2 prior,
  • $p(y \mid X, a, c)$ is a differentiable likelihood provided by a structure-factor forward model.

This formalism unifies generative priors with experimental evidence, enabling direct sampling of plausible atomic models consistent with measured diffraction data.
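
Because guided diffusion works with gradients rather than densities, it helps to spell out one consequence of the posterior above. Taking the logarithm and differentiating with respect to $X$ removes the normalizing constant (which does not depend on $X$), so the posterior score splits into a likelihood term and a prior term. This is a standard identity written in the notation of this section, not a formula quoted from the paper:

```latex
\nabla_X \log p(X \mid a, c, y)
  = \nabla_X \log p(y \mid X, a, c) + \nabla_X \log p(X \mid a)
```

The second term is what a score-based prior supplies; the first is what a differentiable forward model makes computable.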

2. Learned Generative Prior: Boltz-2 Diffusion Model

The generative prior, termed Boltz-2, is a sequence-conditioned denoising diffusion model over atomic coordinates, architecturally derived from AlphaFold3’s denoising network and trained on millions of publicly deposited PDB structures. The model learns to predict the score function $\nabla_{X_t} \log p_t(X_t \mid a)$ under a variance-preserving stochastic differential equation:

$$dX_t = -\tfrac{1}{2}\beta(t)\,X_t\,dt + \sqrt{\beta(t)}\,dW_t$$

Key features include:

  • Diffusion/noise schedule with BB1 steps and variance-preserving BB2;
  • Training to remove Gaussian noise from true protein structures, conditioning on the full sequence and optionally MSA or templates;
  • Guaranteed physical plausibility in samples at $t = 0$ due to diffusion-based learning.

The Boltz-2 prior ensures that the output models are not only consistent with general biophysical constraints but also tailored to the specific target sequence.
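
As a rough illustration of what "variance-preserving" noising means in discrete time, the sketch below corrupts a toy coordinate vector with the usual closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$. The linear schedule, its endpoints, the 200-step count, and all function names are assumptions chosen for the example; they are not taken from Boltz-2 or the paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_min=1e-4, beta_max=0.02):
    """Per-step noise variances beta_1..beta_T (illustrative values only)."""
    if num_steps == 1:
        return [beta_min]
    step = (beta_max - beta_min) / (num_steps - 1)
    return [beta_min + i * step for i in range(num_steps)]


def alpha_bar(betas):
    """Cumulative signal fraction: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def noise_coordinates(x0, alpha_bar_t, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


betas = linear_beta_schedule(200)      # hypothetical step count
abar = alpha_bar(betas)
rng = random.Random(0)
x0 = [0.5, -1.2, 0.3]                  # toy flattened coordinates, not a real structure
x_mid = noise_coordinates(x0, abar[99], rng)
```

The squared signal and noise coefficients sum to one at every step, which is why unit-variance data keeps unit variance throughout the forward process.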

3. Posterior Sampling Guided by Experimental Data

Posterior sampling is performed by converting the unconditional reverse SDE into a conditional SDE via addition of the crystallographic likelihood gradient:

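$$dX_t = \left[-\tfrac{1}{2}\beta(t)\,X_t - \beta(t)\left(\nabla_{X_t}\log p_t(X_t \mid a) + \nabla_{X_t}\log p(y \mid X_t, a, c)\right)\right]dt + \sqrt{\beta(t)}\,d\bar{W}_t$$

where $\beta(t)$ is the noise schedule of the forward SDE and $\bar{W}_t$ is a reverse-time Wiener process.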

The intractable likelihood score $\nabla_{X_t} \log p(y \mid X_t, a, c)$ is approximated by evaluating crystallographic losses on the denoiser’s one-step prediction $\hat{X}_0(X_t)$:

  • Heteroscedastic Gaussian loss on normalized amplitudes ($E$-values),
  • Rice distribution loss to account for unknown phases (distinguishing acentric/centric reflections).
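
The first of these losses can be sketched as a per-reflection Gaussian negative log-likelihood in which each reflection carries its own standard deviation. The argument names `e_obs`, `e_calc`, and `sigma` are hypothetical, and how the paper obtains the per-reflection uncertainties is not stated here; the Rice term is not covered by this sketch.

```python
import math


def gaussian_amplitude_nll(e_obs, e_calc, sigma):
    """Mean heteroscedastic Gaussian negative log-likelihood over reflections.

    e_obs  : observed normalized amplitudes
    e_calc : amplitudes calculated from the current model
    sigma  : per-reflection standard deviations (assumed given)
    """
    if not (len(e_obs) == len(e_calc) == len(sigma)) or not e_obs:
        raise ValueError("inputs must be non-empty with one entry per reflection")
    total = 0.0
    for eo, ec, s in zip(e_obs, e_calc, sigma):
        # Squared residual weighted by 1/sigma^2, plus the log-normalizer.
        total += 0.5 * ((eo - ec) / s) ** 2 + math.log(s) + 0.5 * math.log(2.0 * math.pi)
    return total / len(e_obs)
```

With a perfect fit the residual term vanishes and only the normalizer remains, so the loss is bounded below for fixed `sigma`.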

Combined guidance is expressed as:

BB8

Empirically, guidance uses BB9, Fo|F_o|0, and Fo|F_o|1. Each sampling step involves rigid-body alignment of Fo|F_o|2 to a reference in the crystal frame, ensuring correct fractional coordinates during forward likelihood computation. This approach enables end-to-end generation of models that are simultaneously plausible under the learned prior and tightly consistent with measured diffraction data.
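
To make the guidance step concrete, here is a minimal one-dimensional sketch of diffusion posterior sampling in a setting where everything is analytic. It assumes a standard-normal prior on $x_0$ (so the noised marginal stays standard normal and the denoiser is $\hat{x}_0 = \sqrt{\bar\alpha_t}\,x_t$) and a Gaussian measurement $y = x_0 + \text{noise}$. In CrystalBoltz the denoiser is Boltz-2 and the likelihood is the crystallographic loss; the toy model, the `weight` parameter, and all names below are assumptions for illustration.

```python
import math


def denoise(x_t, alpha_bar_t):
    """Posterior mean E[x0 | x_t] for a standard-normal prior under VP noising."""
    return math.sqrt(alpha_bar_t) * x_t


def prior_score(x_t):
    """Score of the noised marginal, which remains N(0, 1) for this toy prior."""
    return -x_t


def dps_guidance(x_t, y, alpha_bar_t, sigma_y):
    """Gradient w.r.t. x_t of log N(y; denoise(x_t), sigma_y^2)."""
    x0_hat = denoise(x_t, alpha_bar_t)
    dx0_dxt = math.sqrt(alpha_bar_t)          # chain rule through the denoiser
    return dx0_dxt * (y - x0_hat) / sigma_y ** 2


def guided_score(x_t, y, alpha_bar_t, sigma_y, weight=1.0):
    """Prior score plus weighted likelihood guidance (weight is hypothetical)."""
    return prior_score(x_t) + weight * dps_guidance(x_t, y, alpha_bar_t, sigma_y)
```

The guidance term vanishes when the one-step prediction already matches the measurement and otherwise pushes $x_t$ in the direction that moves the prediction toward $y$. Evaluating the likelihood at the point estimate ignores the remaining uncertainty in $x_0$ given $x_t$, which is the approximation that makes this approach tractable.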

4. Atomic Coordinate and B-Factor Refinement

Upon completion of diffusion-guided sampling, CrystalBoltz enters a brief, local, high-resolution refinement of both atomic coordinates ($X$) and B-factors ($B$):

Fo|F_o|5

The refinement objective is typically the crystallographic $R$-factor or correlation coefficient (CC):

Fo|F_o|8
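
For reference, the conventional crystallographic R-factor is the sum of absolute differences between observed and scaled calculated amplitudes, divided by the sum of observed amplitudes. The sketch below computes it with a single least-squares scale factor; this is an assumption made for brevity, since the text mentions separate scale and solvent parameters that are not modeled here. R-work and R-free use the same formula on the working and held-out reflection sets, respectively.

```python
def least_squares_scale(f_obs, f_calc):
    """Scale k minimizing sum (f_obs - k * f_calc)^2 over reflections."""
    denom = sum(fc * fc for fc in f_calc)
    if denom == 0.0:
        raise ValueError("calculated amplitudes are all zero")
    return sum(fo * fc for fo, fc in zip(f_obs, f_calc)) / denom


def r_factor(f_obs, f_calc):
    """R = sum | |F_o| - k |F_c| | / sum |F_o| over the given reflections."""
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("need equally sized, non-empty amplitude lists")
    k = least_squares_scale(f_obs, f_calc)
    numerator = sum(abs(fo - k * fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)
```

Calculated amplitudes that differ from the observed ones only by a constant factor give an R-factor of zero, because the scale absorbs the factor.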

The forward model incorporates Debye–Waller factors $\exp(-B \sin^2\theta / \lambda^2)$ for the B-factors, which are initialized from Boltz-2’s pLDDT via a Baek et al. mapping and clamped to a fixed range (in Å$^2$). The refinement runs Adam for 50–100 steps, periodically re-solving scale and solvent parameters. This phase corrects side-chain rotamers and B-factor distributions to reach crystallographic-quality R-factors efficiently.
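
A heavily simplified version of such a forward model is sketched below to show where the Debye–Waller factor enters. It assumes point atoms with a constant scattering factor, space group P1, an orthogonal unit cell, and no bulk-solvent term; none of these simplifications come from the paper, and the function names are hypothetical. It uses the identity $\sin^2\theta/\lambda^2 = 1/(4d^2)$.

```python
import cmath
import math


def inv_d_squared(hkl, cell):
    """1/d^2 for an orthogonal cell with edge lengths (a, b, c) in angstroms."""
    return sum((index / edge) ** 2 for index, edge in zip(hkl, cell))


def structure_factor(hkl, frac_coords, b_factors, cell, scattering=1.0):
    """F(hkl) for point atoms in P1 with isotropic B-factors (toy model)."""
    s2 = inv_d_squared(hkl, cell)
    total = 0j
    for xyz, b in zip(frac_coords, b_factors):
        debye_waller = math.exp(-b * s2 / 4.0)     # exp(-B sin^2(theta) / lambda^2)
        phase = 2.0 * math.pi * sum(h * x for h, x in zip(hkl, xyz))
        total += scattering * debye_waller * cmath.exp(1j * phase)
    return total


def amplitude(hkl, frac_coords, b_factors, cell):
    """|F(hkl)|, the quantity compared against observed amplitudes."""
    return abs(structure_factor(hkl, frac_coords, b_factors, cell))
```

Larger B-factors damp high-resolution reflections more strongly, which is why coordinates and B-factors have to be refined together against the same amplitudes.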

5. Experimental Benchmarks and Computational Performance

CrystalBoltz was evaluated on six single-chain proteins from the PDB (resolutions 1.69–2.20 Å; 164–306 residues): 8DWN, 4NTZ, 7O51, 7SEZ, 7VNX, and 1L63. The system was implemented in PyTorch on NVIDIA RTX A6000 GPUs, using custom code for differentiable structure-factor calculation and integrating the Boltz-2 denoiser from the AlphaFold3 codebase.

A direct comparison with ROCKET [Fadini et al., 2026] reveals:

| Method | Total runtime (min) | Key steps |
|---|---|---|
| ROCKET | ∼376 | 3× MSA optimization, long fine-tune, phenix.refine |
| CrystalBoltz | 11.3 | 10.9 (phase 1) + 0.4 (refine); 200 steps, 50–100 refine steps |

CrystalBoltz realizes a 33.3× speed-up relative to existing experimentally guided pipelines, reducing structure determination from hours to approximately 11 minutes per target.
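
The headline figure follows directly from the two runtimes in the comparison, taking ROCKET's approximate 376 minutes at face value:

```latex
10.9 + 0.4 = 11.3 \ \text{min}, \qquad \frac{376}{11.3} \approx 33.3
```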

6. Quantitative Results and Comparative Evaluation

Performance metrics include all-atom RMSD, C$\alpha$ RMSD, $R_{\text{work}}$, and $R_{\text{free}}$. The table summarizes mean results over the top 3 of 20 samples, shown as improvements over the best prior baseline (baseline → CrystalBoltz):

| PDB | RMSD (Å) ↓ | R-factor ↓ |
|---|---|---|
| 8DWN | 2.20 → 1.32 | 0.382 → 0.337 |
| 4NTZ | 8.77 → 1.30 | 0.554 → 0.483 |
| 7O51 | 1.125 → 0.651 | 0.381 → 0.278 |
| 7SEZ | 2.127 → 1.014 | 0.451 → 0.365 |
| 7VNX | 1.113 → 0.590 | 0.321 → 0.328 |
| 1L63 | 0.940 → 0.661 | 0.344 → 0.309 |

CrystalBoltz achieved the best performance, with statistical significance, on four of six proteins when compared to unguided Boltz-2 and ROCKET. In the table, RMSD decreases for all six targets and the R-factor decreases for five of them; the exception is 7VNX, where it rises slightly (0.321 → 0.328). This demonstrates the method’s effectiveness at directly integrating experimental data into generative structural workflows.
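
For readers unfamiliar with the coordinate metric, RMSD is the square root of the mean squared distance between matched atoms. The sketch below assumes the two structures are already superposed and list the same atoms in the same order; the superposition step itself (for example a Kabsch alignment) is not shown, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-aligned, matched atom lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need equally sized, non-empty coordinate lists")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))
```

Restricting both lists to C$\alpha$ atoms gives the C$\alpha$ RMSD; passing every atom gives the all-atom value.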

7. Limitations and Prospective Extensions

Current limitations include a reliance on rigid-body alignment to a reference from molecular replacement, introducing a dependency that could potentially be obviated by integrating alignment into inference. All experiments to date use single chains in the asymmetric unit, though both the Boltz-2 prior and forward model are inherently chain-agnostic, suggesting straightforward extensibility to complexes and oligomers.

The posterior sampling strategy employs diffusion posterior sampling (DPS) as a proof-of-concept; more advanced techniques (e.g., DAPS, dual-diffusion) could offer improved guidance, especially in high-noise or highly nonlinear regimes. The core experimental-conditioning paradigm—learned prior, differentiable forward model, guided diffusion, and local refinement—could be directly adapted to modalities such as cryo-EM and NMR.

As AI-refined models are deposited into public structure databases, explicit provenance metadata will be critical to support reproducibility and prevent feedback loops in model training.

CrystalBoltz demonstrates a highly integrated approach that unifies powerful data-driven priors with physics-based likelihoods, providing end-to-end sampling and refinement that simultaneously elevates accuracy and efficiency for X-ray crystallographic structure determination (Kim et al., 15 May 2026).
