La-Proteina: Atomistic Protein Modeling
- La-Proteina is a generative framework that disentangles explicit backbone representations from latent side-chain details for efficient protein design.
- It uses a partially latent model and flow matching to jointly generate coarse-grained backbones and fine atomistic features.
- The method enables scalable, high-fidelity, and motif-constrained protein designs with state-of-the-art structural validity and co-designability.
La-Proteina is a generative modeling framework for atomistic protein design, built on partially latent flow matching to model the joint distribution over protein sequences and fully atomistic structures (Geffner et al., 13 Jul 2025). Unlike previous approaches, it explicitly separates the modeling of protein backbone geometry from the fine structural and sequence-determining details, enabling high-fidelity, scalable protein generation in both unconditional and motif-constrained settings.
1. Partially Latent Protein Representation
At the core of La-Proteina is a “partially latent” representation in which the protein backbone, specifically the Cₐ atom coordinates, is treated explicitly, while the side-chain details and amino acid identity at each residue are encoded as continuous per-residue latent variables of fixed dimension ($z \in \mathbb{R}^{L \times d}$, with $d$ the per-residue latent dimension and $L$ the sequence length). This resolves the core challenge of modeling side chains whose atom count varies with amino acid type:
- Backbone (Cₐ) coordinates: Serve as the coarse structural scaffold and are handled directly as real-valued vectors.
- Latent variables ($z$): Contain, for each residue, information encoding the complete atomistic specification (side-chain, non-Cₐ atoms) as well as the categorical amino acid type. This sidesteps the need for variable-length handling and allows direct application of continuous generative methods.
- Sequence and side-chain atom recovery: The latent variables are decoded via a neural Variational Autoencoder (VAE) decoder to yield both side-chain atomic coordinates and residue identities.
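The joint generative process is defined as
$$p_{\theta,\phi}(x_{C_\alpha}, z, x_{\neg C_\alpha}, s) = p_\theta(x_{C_\alpha}, z)\; p_\phi(x_{\neg C_\alpha}, s \mid x_{C_\alpha}, z),$$
where $p_\theta(x_{C_\alpha}, z)$ is the flow-matching model over the explicit backbone and latent variables, and $p_\phi(x_{\neg C_\alpha}, s \mid x_{C_\alpha}, z)$ is the learned decoder that produces the remaining heavy-atom coordinates $x_{\neg C_\alpha}$ and the sequence $s$ (Geffner et al., 13 Jul 2025).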
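The benefit of a fixed-size latent per residue can be illustrated with a small sketch. The heavy-atom counts below are standard amino acid chemistry; the latent dimension of 8 and the helper names are illustrative assumptions, not values taken from the text above.

```python
import numpy as np

# Side-chain heavy-atom counts per residue type (one-letter codes).
SIDE_CHAIN_HEAVY_ATOMS = {
    "G": 0, "A": 1, "S": 2, "C": 2, "V": 3, "T": 3, "P": 3, "I": 4, "L": 4,
    "D": 4, "N": 4, "M": 4, "E": 5, "Q": 5, "K": 5, "H": 6, "F": 7, "R": 7,
    "Y": 8, "W": 10,
}

LATENT_DIM = 8  # illustrative assumption, not a value stated in the text


def explicit_atom_counts(sequence):
    """Non-C-alpha heavy atoms per residue: side chain plus backbone N, C, O.

    The terminal OXT atom is ignored for simplicity.
    """
    return [SIDE_CHAIN_HEAVY_ATOMS[aa] + 3 for aa in sequence]


def partially_latent_shapes(sequence, latent_dim=LATENT_DIM):
    """Array shapes of the explicit backbone and the per-residue latents."""
    length = len(sequence)
    ca = np.zeros((length, 3))           # explicit C-alpha coordinates
    z = np.zeros((length, latent_dim))   # fixed-size latent per residue
    return ca.shape, z.shape


seq = "GAWK"
counts = explicit_atom_counts(seq)              # ragged: depends on residue type
ca_shape, z_shape = partially_latent_shapes(seq)
print(counts, ca_shape, z_shape)
```

An explicit all-atom representation has a different number of atoms per residue, and that number is unknown until the sequence is chosen. The partially latent form is a pair of dense arrays whose shapes depend only on the chain length.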
2. Flow Matching in the Latent Space
To model the generative process, La-Proteina leverages a flow-matching framework operating jointly over the explicit backbone and latent variables:
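- Reference distribution: Samples are drawn from a factorized, typically Gaussian, prior $p_0(x_{C_\alpha}, z) = p_0(x_{C_\alpha})\, p_0(z)$.
- Target data distribution: Real protein structures (with their encoded latent variables) define $p_1(x_{C_\alpha}, z)$.
- Linear interpolation: The interpolation between reference and target for flow matching is given by
$$x_{t_x} = (1 - t_x)\, x_0 + t_x\, x_1, \qquad z_{t_z} = (1 - t_z)\, z_0 + t_z\, z_1,$$
where $(x_0, z_0)$ is drawn from the reference distribution, $(x_1, z_1)$ from the target data distribution, and $t_x, t_z \in [0, 1]$. Separate time schedules ($t_x$, $t_z$) are used for backbone and latent variables, sampled from carefully designed distributions for flexibility in modeling local (side chain) and global (backbone) features.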
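- Stochastic sampling: During generation, new protein backbones and latent vectors are sampled via SDEs of the form
$$dx_{t_x} = \left[ v^x_\theta + \beta_x(t_x)\, s^x_\theta \right] dt_x + \sqrt{2\, \beta_x(t_x)\, \eta_x}\; dW^x_{t_x},$$
$$dz_{t_z} = \left[ v^z_\theta + \beta_z(t_z)\, s^z_\theta \right] dt_z + \sqrt{2\, \beta_z(t_z)\, \eta_z}\; dW^z_{t_z},$$
where $v^x_\theta$ and $v^z_\theta$ are the learned velocity fields (each conditioned on the current backbone, latents, and both times), $s^x_\theta$ and $s^z_\theta$ are the corresponding scores, $\beta_x$ and $\beta_z$ are noise schedules, $\eta_x$ and $\eta_z$ are noise-scaling parameters, and $W^x$, $W^z$ are Wiener processes.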
The design incorporates trainable velocity fields for both the backbone and the latent channels, yielding an efficient and flexible process for generating full-atom protein structures (Geffner et al., 13 Jul 2025).
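The linear interpolation with separate backbone and latent times can be sketched as follows. This is a minimal illustration, assuming a standard Gaussian reference distribution and the usual conditional flow-matching regression target (data minus noise); the function name and the choice of passing the times in as arguments are assumptions, since the time-sampling distributions are not specified here.

```python
import numpy as np


def interpolate(x1, z1, t_x, t_z, rng):
    """Build noisy training inputs and velocity targets for both channels.

    x1: (L, 3) C-alpha coordinates of a real protein.
    z1: (L, d) per-residue latents of the same protein.
    t_x, t_z: independent interpolation times in [0, 1].
    """
    x0 = rng.standard_normal(x1.shape)   # backbone noise sample
    z0 = rng.standard_normal(z1.shape)   # latent noise sample
    x_t = (1.0 - t_x) * x0 + t_x * x1
    z_t = (1.0 - t_z) * z0 + t_z * z1
    # Regression targets for the two velocity fields (time derivative of the
    # linear interpolant).
    v_x_target = x1 - x0
    v_z_target = z1 - z0
    return x_t, z_t, v_x_target, v_z_target


rng = np.random.default_rng(0)
x1 = rng.standard_normal((10, 3))
z1 = rng.standard_normal((10, 8))
x_t, z_t, v_x, v_z = interpolate(x1, z1, t_x=0.3, t_z=0.8, rng=rng)
```

Because the two times are independent, a single training example can pair a nearly clean latent with a very noisy backbone, or the reverse, which is what lets the two channels be denoised on different schedules at generation time.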
3. Decoder for Full-Atom Structure and Sequence
The decoding of atomistic detail is handled by a VAE decoder:
- Input: The decoder receives the generated backbone $x_{C_\alpha}$ and the per-residue latent vectors $z$.
- Output: It deterministically recovers all heavy-atom coordinates other than Cₐ (which are already generated) and the corresponding amino acid sequence ($s$). Residue identities are selected by argmax over the categorical outputs, while atom positions are taken as the means of the predicted Gaussians.
- Interpretation: This mechanism allows joint sequence–structure modeling without explicitly mixing continuous and discrete operations in the main generator loop.
This two-stage approach enables accurate recovery of physically plausible, designable protein structures from a reduced and more tractable latent space.
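The deterministic read-out described above can be sketched as a small post-processing step. The array layout here is an assumption made for illustration: the decoder is taken to emit per-residue logits over residue types and Gaussian means for a fixed number of atom slots, with a lookup table saying which slots exist for each residue type. None of these names come from the source.

```python
import numpy as np


def decode_outputs(logits, atom_means, atom_mask_by_type):
    """Turn raw decoder outputs into a sequence and atom coordinates.

    logits:            (L, K) scores over K residue types.
    atom_means:        (L, A, 3) Gaussian means for A atom slots per residue.
    atom_mask_by_type: (K, A) boolean, True where a slot exists for a type.
    """
    seq = np.argmax(logits, axis=-1)        # (L,) most likely residue type
    mask = atom_mask_by_type[seq]           # (L, A) slots valid for that type
    # Keep the predicted mean where the atom exists, NaN elsewhere.
    coords = np.where(mask[..., None], atom_means, np.nan)
    return seq, coords, mask


# Toy example with 2 residue types and 3 atom slots.
mask_table = np.array([[True, False, False],
                       [True, True, True]])
logits = np.array([[2.0, 0.1],
                   [0.0, 3.0]])
means = np.ones((2, 3, 3))
seq, coords, mask = decode_outputs(logits, means, mask_table)
```

Taking the argmax and the Gaussian mean, rather than sampling, means that all randomness in a generated protein comes from the flow-matching stage.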
4. Performance Evaluation and Benchmarks
La-Proteina demonstrates state-of-the-art results across several metrics and benchmarks:
- All-Atom Co-Designability: The percentage of generated sequences whose predicted folded structure closely matches the generated backbone, measured via tools such as ESMFold. La-Proteina achieves superior co-designability compared to prior all-atom and partial-atom methods.
- Structural Validity: Evaluated using MolProbity clash scores, rotamer distributions (e.g., accurate reproduction of tryptophan χ₁ rotamers), and backbone geometry outlier statistics. Generated proteins attain low clash scores and physically realistic local geometric features.
- Designability and Diversity: Designability measures how reliably sequences designed for a generated structure fold back into that structure; diversity measures the number of distinct fold clusters among the samples. Measured using ProteinMPNN and cluster analysis, La-Proteina surpasses contemporaries, including Protpardelle and P(all-atom).
- Scalability: The framework remains performant for protein chain lengths up to 800 residues, a notable advance over previous models that collapse or fail on long sequences.
Performance is systematically demonstrated in comparative tables and figures, with high diversity and physical plausibility across varied length regimes (Geffner et al., 13 Jul 2025).
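As a rough illustration of how a co-designability percentage is computed, the sketch below takes one RMSD value per generated sample (between the generated structure and the structure predicted from the generated sequence) and reports the fraction under a cutoff. The 2 Å default is a commonly used convention and an assumption here; the exact threshold and alignment protocol are not given in this text.

```python
import numpy as np


def codesignability(rmsds, threshold=2.0):
    """Fraction of samples whose refolded structure matches the generated one.

    rmsds: per-sample RMSD in angstroms between the generated structure and
           the structure predicted from the generated sequence.
    threshold: success cutoff in angstroms (assumed value, see lead-in).
    """
    rmsds = np.asarray(rmsds, dtype=float)
    if rmsds.size == 0:
        raise ValueError("need at least one sample")
    return float(np.mean(rmsds < threshold))


rate = codesignability([0.8, 1.5, 2.5, 4.0])
print(f"co-designability: {100 * rate:.0f}%")
```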
5. Atomistic Motif Scaffolding
A central application area is atomistic motif scaffolding—the integration of fixed, functionally essential motifs (active/binding sites) within de novo generated scaffolds:
- Conditioning on atomistic motifs: Unlike backbone-only constraints, La-Proteina accepts motifs specified with full atomic coordinates, or with only tip atoms, and holds them fixed within the generated scaffold.
- Indexed and unindexed tasks: Supports both known motif position (indexed) and unknown (unindexed) motif insertion; the latter requires simultaneously identifying motif placement and generating a compatible scaffold.
- Scaffolding success: Achieves high rates of sub-angstrom motif recovery (motif RMSD < 1 Å) and co-designability for scaffolded sequences in both the indexed and unindexed settings, and in both the all-atom and tip-atom regimes.
- Comparison: Yields marked improvements over existing methods, especially when scaffolding multi-segment motifs and in atomistic quality.
These capabilities enable high-precision scaffold generation for applications in enzyme design, binder engineering, and therapeutic development.
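The sub-angstrom motif recovery criterion can be checked with a standard Kabsch superposition. The sketch below is one way to do it, assuming the motif atoms in the generated scaffold have already been matched one-to-one with the reference motif atoms (as in the indexed setting); the function names are illustrative.

```python
import numpy as np


def kabsch_rmsd(mobile, target):
    """RMSD between two (N, 3) point sets after optimal rigid superposition."""
    p = mobile - mobile.mean(axis=0)
    q = target - target.mean(axis=0)
    h = p.T @ q                              # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rot = p @ rot.T
    return float(np.sqrt(np.mean(np.sum((p_rot - q) ** 2, axis=1))))


def motif_recovered(generated_motif, reference_motif, cutoff=1.0):
    """True if the motif RMSD is below the cutoff (1 angstrom by default)."""
    return kabsch_rmsd(generated_motif, reference_motif) < cutoff
```

For a tip-atom task, the same check would be applied to the tip atoms only. In the unindexed setting the residue correspondence is not given and has to be determined before the RMSD is computed.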
6. Scalability and Robustness
La-Proteina exhibits notable robustness and scalability features:
- Efficient representation: By making atomistic detail latent rather than explicit, the approach avoids combinatorial blowup, maintaining efficiency as sequence length increases.
- Long-chain generation: Robust designability and validity up to 800 residues, supporting applications in large enzyme or structural protein design.
- Consistency across lengths: Empirical metrics for diversity, co-designability, and physical quality show stable trends irrespective of the chain length.
This scalability distinguishes La-Proteina among all-atom protein generators, which frequently encounter performance degradation as sequence length grows.
7. Architectural and Mathematical Framework
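The architecture can be summarized by the following pipeline and factorization:
noise $(x_0, z_0)$ → flow matching → backbone and latents $(x_{C_\alpha}, z)$ → VAE decoder → remaining atoms and sequence $(x_{\neg C_\alpha}, s)$
The joint density factorizes as
$$p_{\theta,\phi}(x_{C_\alpha}, z, x_{\neg C_\alpha}, s) = p_\theta(x_{C_\alpha}, z)\; p_\phi(x_{\neg C_\alpha}, s \mid x_{C_\alpha}, z).$$
The flow matching SDE for the generative process is, for each channel $u \in \{x, z\}$,
$$du_{t_u} = \left[ v^u_\theta + \beta_u(t_u)\, s^u_\theta \right] dt_u + \sqrt{2\, \beta_u(t_u)\, \eta_u}\; dW^u_{t_u},$$
with learned velocity fields $v^u_\theta$, corresponding scores $s^u_\theta$, noise schedules $\beta_u$, and noise-scaling parameters $\eta_u$, allowing for careful control of noise schedules and integration dynamics for both backbone and atomistic details.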
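A minimal Euler-Maruyama sketch of this kind of sampler is given below for a single channel. It assumes an SDE whose drift is the velocity plus a score term scaled by a noise schedule, with matching Brownian noise, and it derives the score from the velocity using the identity that holds for a linear interpolant with a standard Gaussian reference. The schedule `beta(t) = 1 - t`, the value of `eta`, and the closed-form toy velocity field are all illustrative assumptions standing in for a trained network.

```python
import numpy as np


def score_from_velocity(x, v, t):
    """Score of the marginal at time t for x_t = (1 - t) * x_0 + t * x_1
    with x_0 ~ N(0, I): score = (t * v - x) / (1 - t)."""
    return (t * v - x) / (1.0 - t)


def sample_channel(velocity, shape, n_steps=200, eta=0.5, seed=0):
    """Integrate dx = [v + beta * s] dt + sqrt(2 * beta * eta) dW from t=0 to 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)          # start from the Gaussian reference
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                          # last step uses t = 1 - dt
        beta = 1.0 - t                      # hypothetical noise schedule
        v = velocity(x, t)
        s = score_from_velocity(x, v, t)
        noise = rng.standard_normal(shape)
        x = x + (v + beta * s) * dt + np.sqrt(2.0 * beta * eta * dt) * noise
    return x


# Toy velocity field whose target distribution is a single point `mu`;
# a trained network would replace this closed form.
mu = np.full((50, 3), 2.0)
backbone = sample_channel(lambda x, t: (mu - x) / (1.0 - t), mu.shape)
print(np.abs(backbone - mu).max())
```

In the full model the backbone and latent channels would be integrated together, each with its own time variable, schedule, and noise scale, and each velocity field would be conditioned on both channels. The sketch omits that coupling to keep the integration scheme visible.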
La-Proteina defines a new direction in atomistic generative protein modeling with its partially latent architecture, efficient flow matching in a continuous space, and high-fidelity decoder. Its empirical performance sets a benchmark for unconditional and motif-constrained de novo protein design at atomistic resolution, particularly in scenarios requiring protein chain length scalability, structural fidelity, and functional flexibility (Geffner et al., 13 Jul 2025).