CrystalBoltz: Bayesian Protein Structure
- CrystalBoltz is a generative framework that integrates a diffusion-based Boltz-2 prior with experimental X-ray crystallography data via Bayesian inference.
- It employs differentiable, experiment-guided sampling to lower RMSD and R-factors while achieving a 33× speed-up relative to traditional refinement pipelines.
- The method automates local refinement of atomic coordinates and B-factors, ensuring compliance with crystallographic quality measures such as R-work and R-free.
CrystalBoltz is a generative framework for end-to-end protein structure determination from X-ray crystallography data, implementing Bayesian inference over atomic coordinates with direct conditioning on measured structure-factor amplitudes. Unlike traditional workflows, which rely on sequential manual refinement and limited integration of experimental data into generative models, CrystalBoltz unites a powerful diffusion-based prior (Boltz-2) with differentiable, experiment-guided sampling and automated refinement. The methodology achieves lower coordinate RMSD, lower R-factors, and significantly reduced turnaround time relative to previous state-of-the-art experimentally guided refinement pipelines (Kim et al., 15 May 2026).
1. Bayesian Formulation for Crystallography
CrystalBoltz frames protein structure determination as Bayesian inference of atomic coordinates $x$ (and isotropic B-factors $B$) given observed structure-factor amplitudes $F_{\mathrm{obs}}$. The sequence- and space-group-conditioned posterior is
$$p(x \mid s, c, F_{\mathrm{obs}}) \propto p_\theta(x \mid s)\, p(F_{\mathrm{obs}} \mid x, c)$$
where:
- $s$ is the amino-acid sequence,
- $c$ encodes unit cell and space group,
- $F_{\mathrm{obs}}$ are observed amplitudes,
- $p_\theta(x \mid s)$ is the learned Boltz-2 prior,
- $p(F_{\mathrm{obs}} \mid x, c)$ is a differentiable likelihood provided by a structure-factor forward model.
This formalism unifies generative priors with experimental evidence, enabling direct sampling of plausible atomic models consistent with measured diffraction data.
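To make the forward model concrete, the sketch below computes structure-factor amplitudes from fractional coordinates. It is a minimal illustration under simplifying assumptions that are not taken from the source: point atoms with unit scattering factors, space group P1 (no symmetry mates), and no bulk solvent or B-factors. The function names are hypothetical.

```python
import cmath
import math

def structure_factor(hkl, frac_coords, scattering=None):
    """F(h) = sum_j f_j * exp(2*pi*i * h . x_j) for point atoms in P1."""
    if scattering is None:
        scattering = [1.0] * len(frac_coords)  # assumption: unit scatterers
    total = 0j
    for f_j, (x, y, z) in zip(scattering, frac_coords):
        phase = 2.0 * math.pi * (hkl[0] * x + hkl[1] * y + hkl[2] * z)
        total += f_j * cmath.exp(1j * phase)
    return total

def amplitudes(reflections, frac_coords, scattering=None):
    """Only |F| is measured in a diffraction experiment; the phase is lost."""
    return [abs(structure_factor(h, frac_coords, scattering)) for h in reflections]

# Two atoms half a cell apart along a: odd h reflections cancel.
atoms = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
reflections = [(1, 0, 0), (2, 0, 0), (0, 1, 0)]
amps = amplitudes(reflections, atoms)
print(amps)
```

Every operation here is smooth in the coordinates, which is what allows a likelihood built on top of it to be differentiated with respect to atomic positions.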
2. Learned Generative Prior: Boltz-2 Diffusion Model
The generative prior, termed Boltz-2, is a sequence-conditioned denoising diffusion model over atomic coordinates, architecturally derived from AlphaFold3’s denoising network and trained on millions of publicly deposited PDB structures. The model learns to predict the score function $\nabla_{x_t} \log p_t(x_t \mid s)$ under a variance-preserving stochastic differential equation:
$$dx_t = -\tfrac{1}{2}\beta(t)\, x_t\, dt + \sqrt{\beta(t)}\, dw_t$$
Key features include:
- Diffusion/noise schedule with $T$ steps and variance-preserving $\beta(t)$;
- Training to remove Gaussian noise from true protein structures, conditioning on the full sequence and optionally MSA or templates;
- Guaranteed physical plausibility in samples at $t = 0$ due to diffusion-based learning.
The Boltz-2 prior ensures that the output models are not only consistent with general biophysical constraints but also tailored to the specific target sequence.
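The forward (noising) half of a variance-preserving diffusion can be sketched in a few lines. This is a generic discrete-time illustration, not Boltz-2's actual schedule: the linear schedule and its endpoints (`beta_min`, `beta_max`) are assumed DDPM-style defaults, and the step count of 200 is borrowed from the runtime table purely for illustration.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_min=1e-4, beta_max=0.02):
    """Assumed linear schedule; the source does not specify the real one."""
    step = (beta_max - beta_min) / (num_steps - 1)
    return [beta_min + i * step for i in range(num_steps)]

def cumulative_alpha(betas):
    """alpha_bar_t = prod_{i <= t} (1 - beta_i)."""
    alpha_bar, out = 1.0, []
    for beta in betas:
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out

def noise_coordinates(x0, alpha_bar_t, rng):
    """x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I)."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    return [signal * x + noise * e for x, e in zip(x0, eps)], eps

betas = linear_beta_schedule(200)
alpha_bars = cumulative_alpha(betas)
rng = random.Random(0)
x0 = [1.5, -0.3, 2.1]  # toy flattened coordinates
x_mid, eps = noise_coordinates(x0, alpha_bars[100], rng)
```

"Variance-preserving" refers to the fact that the squared signal and noise coefficients sum to one at every step, so unit-variance data stays unit-variance throughout. A denoiser is trained to recover `eps` (or equivalently `x0`) from `x_mid`, which is what yields the score estimate.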
3. Posterior Sampling Guided by Experimental Data
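Posterior sampling is performed by converting the unconditional reverse SDE into a conditional SDE via addition of the crystallographic likelihood gradient:
$$dx_t = \left[-\tfrac{1}{2}\beta(t)\, x_t - \beta(t)\left(\nabla_{x_t} \log p_t(x_t \mid s) + \nabla_{x_t} \log p(F_{\mathrm{obs}} \mid x_t, c)\right)\right] dt + \sqrt{\beta(t)}\, d\bar{w}_t$$
Here $\beta(t)$ is the noise schedule and $\bar{w}_t$ is a reverse-time Wiener process. The intractable $\nabla_{x_t} \log p(F_{\mathrm{obs}} \mid x_t, c)$ is approximated by evaluating crystallographic losses on the denoiser’s one-step prediction $\hat{x}_0(x_t)$:
- Heteroscedastic Gaussian loss on normalized amplitudes ($E$-values),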
- Rice distribution loss to account for unknown phases (distinguishing acentric/centric reflections).
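The Rice-type loss in the list above can be illustrated with the textbook amplitude distributions for a calculated structure factor perturbed by Gaussian error. The sketch below is an assumption-laden stand-in, not the paper's loss: it uses a single variance parameter, omits sigma-A style weighting, and all function names are hypothetical. Acentric reflections follow a Rice distribution (complex Gaussian error); centric reflections follow a folded normal (real Gaussian error).

```python
import math

def log_bessel_i0(z):
    """log I0(z) for z >= 0: power series when small, asymptotic form when large."""
    if z < 30.0:
        term, total, k = 1.0, 1.0, 0
        while term > 1e-16 * total:
            k += 1
            term *= (z / (2.0 * k)) ** 2
            total += term
        return math.log(total)
    correction = 1.0 + 1.0 / (8.0 * z) + 9.0 / (128.0 * z * z)
    return z - 0.5 * math.log(2.0 * math.pi * z) + math.log(correction)

def log_cosh(a):
    """Overflow-safe log(cosh(a))."""
    a = abs(a)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def rice_log_likelihood(f_obs, f_calc, variance, centric=False):
    """log p(|F_obs| given |F_calc|) with the unknown phase integrated out (f_obs > 0)."""
    if centric:
        return (0.5 * math.log(2.0 / (math.pi * variance))
                - (f_obs ** 2 + f_calc ** 2) / (2.0 * variance)
                + log_cosh(f_obs * f_calc / variance))
    return (math.log(2.0 * f_obs / variance)
            - (f_obs ** 2 + f_calc ** 2) / variance
            + log_bessel_i0(2.0 * f_obs * f_calc / variance))

def rice_loss(f_obs, f_calc, variance, centric_flags):
    """Negative log-likelihood summed over reflections."""
    return -sum(rice_log_likelihood(o, c, variance, flag)
                for o, c, flag in zip(f_obs, f_calc, centric_flags))

loss = rice_loss([2.1, 0.9], [2.0, 1.0], 1.0, [False, True])
```

Working in log space (`log_bessel_i0`, `log_cosh`) avoids overflow for strong reflections, where the Bessel and cosh terms grow exponentially.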
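Combined guidance is expressed as:
$$\nabla_{x_t} \log p(F_{\mathrm{obs}} \mid x_t, c) \approx -\nabla_{x_t}\left[\lambda_{\mathrm{G}}\, \mathcal{L}_{\mathrm{Gauss}}(\hat{x}_0) + \lambda_{\mathrm{R}}\, \mathcal{L}_{\mathrm{Rice}}(\hat{x}_0)\right]$$
where $\lambda_{\mathrm{G}}$ and $\lambda_{\mathrm{R}}$ weight the Gaussian and Rice losses.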
Empirically, guidance uses 9, 0, and 1. Each sampling step involves rigid-body alignment of the denoised prediction $\hat{x}_0$ to a reference in the crystal frame, ensuring correct fractional coordinates during forward likelihood computation. This approach enables end-to-end generation of models that are simultaneously plausible under the learned prior and tightly consistent with measured diffraction data.
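The guidance mechanism described in this section can be illustrated on a one-dimensional toy problem. Everything below is a stand-in chosen so the math is checkable by hand: the "prior" is a unit Gaussian with an analytic score (in place of the learned denoiser), the "experiment" is a single noisy scalar observation (in place of crystallographic losses), and the gradient through the one-step prediction is taken by finite differences where a PyTorch implementation would presumably use autograd.

```python
import math

PRIOR_MEAN, PRIOR_STD = 0.0, 1.0  # toy stand-in for the learned prior
OBS, OBS_STD = 2.0, 0.5           # toy stand-in for experimental data

def prior_score(x_t, alpha_bar):
    """Analytic score of the noised Gaussian prior at noise level alpha_bar."""
    mean_t = math.sqrt(alpha_bar) * PRIOR_MEAN
    var_t = alpha_bar * PRIOR_STD ** 2 + (1.0 - alpha_bar)
    return -(x_t - mean_t) / var_t

def predict_x0(x_t, alpha_bar):
    """One-step denoised estimate (Tweedie's formula)."""
    return (x_t + (1.0 - alpha_bar) * prior_score(x_t, alpha_bar)) / math.sqrt(alpha_bar)

def data_loss(x0_hat):
    """Negative log-likelihood of the observation, up to a constant."""
    return 0.5 * ((OBS - x0_hat) / OBS_STD) ** 2

def guidance(x_t, alpha_bar, step=1e-5):
    """-d loss(x0_hat(x_t)) / d x_t, differentiated through the denoiser."""
    up = data_loss(predict_x0(x_t + step, alpha_bar))
    down = data_loss(predict_x0(x_t - step, alpha_bar))
    return -(up - down) / (2.0 * step)

def guided_score(x_t, alpha_bar, weight=1.0):
    """Prior score plus the approximate likelihood gradient."""
    return prior_score(x_t, alpha_bar) + weight * guidance(x_t, alpha_bar)

print(guided_score(0.0, 0.5))
```

At `x_t = 0` the prior score vanishes, so the guided score consists entirely of the data term and points toward the observation. The key design point is that the loss is evaluated on the clean estimate `predict_x0(x_t)` rather than on the noisy state, since the likelihood is only meaningful for a denoised structure.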
4. Atomic Coordinate and B-Factor Refinement
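Upon completion of diffusion-guided sampling, CrystalBoltz enters a brief, local, high-resolution refinement of both atomic coordinates ($x$) and B-factors ($B$):
$$(x^\ast, B^\ast) = \arg\min_{x,\, B}\ \mathcal{L}_{\mathrm{refine}}(x, B)$$
The refinement objective $\mathcal{L}_{\mathrm{refine}}$ is typically the crystallographic $R$-factor or correlation coefficient (CC):
$$R = \frac{\sum_{h} \bigl|\, |F_{\mathrm{obs}}(h)| - k\, |F_{\mathrm{calc}}(h; x, B)| \,\bigr|}{\sum_{h} |F_{\mathrm{obs}}(h)|}$$
where $k$ is an overall scale factor.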
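For reference, the crystallographic R-factor and the amplitude correlation coefficient named as refinement objectives can be computed as below. This is a generic sketch using the standard definitions; the least-squares choice of scale factor is an assumption, since the source only says that scale parameters are re-solved periodically.

```python
import math

def least_squares_scale(f_obs, f_calc):
    """Scale k minimizing sum (F_obs - k * F_calc)^2 (an assumed convention)."""
    num = sum(o * c for o, c in zip(f_obs, f_calc))
    den = sum(c * c for c in f_calc)
    return num / den

def r_factor(f_obs, f_calc, scale=None):
    """R = sum | F_obs - k * F_calc | / sum F_obs over amplitude lists."""
    if scale is None:
        scale = least_squares_scale(f_obs, f_calc)
    num = sum(abs(o - scale * c) for o, c in zip(f_obs, f_calc))
    return num / sum(f_obs)

def correlation_coefficient(f_obs, f_calc):
    """Pearson correlation between observed and calculated amplitudes."""
    n = len(f_obs)
    mean_o, mean_c = sum(f_obs) / n, sum(f_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(f_obs, f_calc))
    var_o = sum((o - mean_o) ** 2 for o in f_obs)
    var_c = sum((c - mean_c) ** 2 for c in f_calc)
    return cov / math.sqrt(var_o * var_c)

f_obs = [10.0, 20.0, 30.0]
f_calc = [12.0, 18.0, 30.0]
print(r_factor(f_obs, f_calc, scale=1.0), correlation_coefficient(f_obs, f_calc))
```

R-work and R-free use this same formula on disjoint reflection sets: R-work on the reflections used for fitting, R-free on a held-out set that guards against overfitting.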
The forward model incorporates Debye–Waller factors $\exp(-B \sin^2\theta / \lambda^2)$ for B-factors, initialized from Boltz-2’s pLDDT via a Baek et al. mapping and clamped to a bounded range (in Å$^2$). The refinement uses Adam for 50–100 steps, periodically re-solving scale and solvent parameters. This phase corrects side-chain rotamers and B-factor distributions to reach crystallographic-quality R-factors efficiently.
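A toy version of B-factor refinement shows how the Debye–Waller attenuation and an Adam-style update fit together. The sketch fits a single B-factor to synthetic attenuation targets; it is not the paper's refinement. The clamping bounds (`b_min`, `b_max`), learning rate, and step count are hypothetical values chosen for the toy problem, and Adam is hand-written here only to keep the example dependency-free.

```python
import math

def debye_waller(b_factor, d_spacing):
    """exp(-B * sin^2(theta) / lambda^2), using sin(theta)/lambda = 1/(2d)."""
    s = 1.0 / (2.0 * d_spacing)
    return math.exp(-b_factor * s * s)

def loss_and_grad(b_factor, d_spacings, targets):
    """Squared error between predicted and target attenuation, with dL/dB."""
    loss, grad = 0.0, 0.0
    for d, target in zip(d_spacings, targets):
        s2 = (1.0 / (2.0 * d)) ** 2
        pred = math.exp(-b_factor * s2)
        resid = pred - target
        loss += resid * resid
        grad += 2.0 * resid * (-s2 * pred)
    return loss, grad

def refine_b(b_init, d_spacings, targets, steps=300, lr=0.5,
             b_min=1.0, b_max=200.0):
    """Adam updates on one B-factor, clamped to hypothetical bounds each step."""
    b, m, v = b_init, 0.0, 0.0
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        _, g = loss_and_grad(b, d_spacings, targets)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        b -= lr * m_hat / (math.sqrt(v_hat) + eps)
        b = min(b_max, max(b_min, b))
    return b

d_spacings = [2.0, 2.5, 3.0, 4.0, 5.0]                   # resolution shells in Angstrom
targets = [debye_waller(30.0, d) for d in d_spacings]  # synthetic "truth": B = 30
b_refined = refine_b(10.0, d_spacings, targets)
print(b_refined)
```

Because the attenuation depends on resolution through `1 / (4 d^2)`, high-resolution reflections (small `d`) are the most sensitive to B, which is why B-factor errors show up mainly in the outer resolution shells.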
5. Experimental Benchmarks and Computational Performance
CrystalBoltz was evaluated across six PDB single-chain proteins (resolutions 1.69–2.20 Å; 164–306 residues): 8DWN, 4NTZ, 7O51, 7SEZ, 7VNX, and 1L63. The system was implemented in PyTorch on NVIDIA RTX A6000 GPUs, utilizing custom code for differentiable structure-factor calculation and integrating the Boltz-2 denoiser from the AlphaFold3 codebase.
A direct comparison with ROCKET [Fadini et al., 2026] reveals:
| Method | Total Runtime (min) | Key Steps |
|---|---|---|
| ROCKET | ∼376 | 3× MSA opt, long fine-tune, phenix.refine |
| CrystalBoltz | 11.3 | 10.9 (phase 1, 200 sampling steps) + 0.4 (refinement, 50–100 steps) |
CrystalBoltz realizes a 33.3× speed-up relative to ROCKET (∼376 min versus 11.3 min), reducing structure determination from hours to approximately 11 minutes per target.
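The reported speed-up follows directly from the runtimes in the table, as this short check shows (the variable names are illustrative):

```python
rocket_minutes = 376.0        # approximate ROCKET total from the table
sampling_minutes = 10.9       # CrystalBoltz phase 1
refinement_minutes = 0.4      # CrystalBoltz local refinement

total_minutes = sampling_minutes + refinement_minutes
speedup = rocket_minutes / total_minutes
refinement_share = refinement_minutes / total_minutes

print(f"total = {total_minutes:.1f} min, speed-up = {speedup:.1f}x, "
      f"refinement share = {refinement_share:.1%}")
```

Guided sampling accounts for nearly all of the runtime; the final refinement is under 4% of the total.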
6. Quantitative Results and Comparative Evaluation
Performance metrics include all-atom RMSD, C$\alpha$ RMSD, $R_{\mathrm{work}}$, and $R_{\mathrm{free}}$. The table below summarizes mean results over the top 3 of 20 samples; each entry reads best prior baseline → CrystalBoltz:
| PDB | RMSD (Å) ↓ | $R$-factor ↓ |
|---|---|---|
| 8DWN | 2.20 → 1.32 | 0.382 → 0.337 |
| 4NTZ | 8.77 → 1.30 | 0.554 → 0.483 |
| 7O51 | 1.125 → 0.651 | 0.381 → 0.278 |
| 7SEZ | 2.127 → 1.014 | 0.451 → 0.365 |
| 7VNX | 1.113 → 0.590 | 0.321 → 0.328 |
| 1L63 | 0.940 → 0.661 | 0.344 → 0.309 |
CrystalBoltz achieved statistically significant best performance on four of six proteins compared to unguided Boltz-2 and ROCKET. In the table above, RMSD improves on all six targets and the R-factor improves on five; 7VNX is the exception, with a slightly higher R-factor (0.321 → 0.328). These results support the method’s effectiveness at directly integrating experimental data into generative structural workflows.
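The per-target changes can be tabulated directly from the values above. The snippet only re-expresses the table's numbers as relative changes; it assumes each entry reads "baseline → CrystalBoltz" and treats the second column as a generic R-factor.

```python
# (baseline, CrystalBoltz) pairs copied from the results table.
RESULTS = {
    "8DWN": ((2.20, 1.32), (0.382, 0.337)),
    "4NTZ": ((8.77, 1.30), (0.554, 0.483)),
    "7O51": ((1.125, 0.651), (0.381, 0.278)),
    "7SEZ": ((2.127, 1.014), (0.451, 0.365)),
    "7VNX": ((1.113, 0.590), (0.321, 0.328)),
    "1L63": ((0.940, 0.661), (0.344, 0.309)),
}

def relative_change(before, after):
    """Negative values mean the metric decreased (improved)."""
    return (after - before) / before

rmsd_improved = [pdb for pdb, (rmsd, _) in RESULTS.items() if rmsd[1] < rmsd[0]]
r_improved = [pdb for pdb, (_, r) in RESULTS.items() if r[1] < r[0]]

for pdb, (rmsd, r) in RESULTS.items():
    print(f"{pdb}: RMSD {relative_change(*rmsd):+.1%}, R {relative_change(*r):+.1%}")
```

Tabulated this way, the large RMSD reduction on 4NTZ and the small R-factor regression on 7VNX both stand out.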
7. Limitations and Prospective Extensions
Current limitations include a reliance on rigid-body alignment to a reference from molecular replacement, introducing a dependency that could potentially be obviated by integrating alignment into inference. All experiments to date use single chains in the asymmetric unit, though both the Boltz-2 prior and forward model are inherently chain-agnostic, suggesting straightforward extensibility to complexes and oligomers.
The posterior sampling strategy employs diffusion posterior sampling (DPS) as a proof-of-concept; more advanced techniques (e.g., DAPS, dual-diffusion) could offer improved guidance, especially in high-noise or highly nonlinear regimes. The core experimental-conditioning paradigm—learned prior, differentiable forward model, guided diffusion, and local refinement—could be directly adapted to modalities such as cryo-EM and NMR.
As AI-refined models are deposited into public structure databases, explicit provenance metadata will be critical to support reproducibility and prevent feedback loops in model training.
CrystalBoltz demonstrates a highly integrated approach that unifies powerful data-driven priors with physics-based likelihoods, providing end-to-end sampling and refinement that simultaneously elevates accuracy and efficiency for X-ray crystallographic structure determination (Kim et al., 15 May 2026).