
PepEDiff: Zero-Shot Peptide Binder Design

Updated 26 January 2026
  • PepEDiff is a zero-shot peptide binder generation framework that leverages denoising diffusion in a continuous ProtT5-derived latent space.
  • It bypasses intermediate structure prediction, enhancing sequence and structural diversity compared to traditional design pipelines.
  • Benchmark results on the BioLip dataset and case studies like TIGIT highlight its superior binding energy and broader exploration of latent binder space.

PepEDiff is a zero-shot peptide binder generation framework employing denoising diffusion in a continuous protein embedding latent space. Designed for the direct design of peptide binders to target receptor protein pockets, it operates without reliance on intermediate structure prediction, thereby increasing both sequence and structural diversity relative to structure-based design pipelines. PepEDiff leverages a frozen state-of-the-art protein LLM, ProtT5, to construct a high-dimensional latent binder manifold, and employs carefully formulated diffusion-based generative modeling to traverse this manifold well beyond the empirical distribution of previously known binders (Liang et al., 19 Jan 2026).

1. Protein Embedding Model and Latent Space

PepEDiff utilizes ProtT5 (Elnaggar et al., 2021) as a fixed encoder–decoder backbone. For any input protein sequence $S$ of length $L$, the ProtT5 encoder outputs a matrix of per-residue embeddings $z \in \mathbb{R}^{L \times d}$ with embedding dimension $d = 1024$. For peptides, the decoder maps an embedding $x_0 \in \mathbb{R}^{L' \times d}$ (where $L'$ is the peptide length) to amino acid probability distributions. The set of all $x \in \mathbb{R}^{L' \times d}$ corresponding to valid peptide sequences defines the continuous "binder" latent space $X$ over which generative modeling operates.
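As a shape-level illustration of this interface, the following sketch uses random stubs in place of the frozen ProtT5 encoder and decoder (in practice the real pretrained model is loaded; every function here is a hypothetical stand-in, kept only to make the tensor shapes concrete):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
D = 1024  # ProtT5 embedding dimension

def encode_stub(seq, rng):
    """Stand-in for the frozen ProtT5 encoder: sequence -> (L, d) embeddings."""
    return rng.standard_normal((len(seq), D))

def decode_stub(x, rng):
    """Stand-in for the ProtT5 decoder: (L', d) latent -> per-residue
    probabilities over the 20 amino acids, then argmax to a sequence."""
    logits = x @ rng.standard_normal((D, len(AMINO_ACIDS))) / np.sqrt(D)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return "".join(AMINO_ACIDS[i] for i in probs.argmax(axis=-1))

rng = np.random.default_rng(0)
z = encode_stub("MKTAYIAKQRQISFVK", rng)              # receptor: (16, 1024)
pep = decode_stub(rng.standard_normal((15, D)), rng)  # length-15 peptide
```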

This approach separates PepEDiff from structure-conditioned methods by removing dependence on explicit structure prediction and allowing generative exploration throughout a semantic manifold derived from large-scale protein corpus pretraining.

2. Diffusion-Based Generative Modeling in Latent Space

PepEDiff adopts a denoising diffusion probabilistic model (DDPM) for generation in the protein embedding space, following the framework of Ho et al. (2020) and employing a cosine schedule for variance control (Liang et al., 19 Jan 2026).

2.1 Forward and Reverse Processes

  • Forward (Noising) Process $q$: Given initial latent $x_0 \sim$ data, the $T$-step Markov chain is modeled as

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{\alpha_t}\, x_{t-1}, (1 - \alpha_t) I \big),$$

with $\alpha_t$ determined from a cosine schedule.

  • Reverse (Denoising) Process $p_\theta$: The reverse process is parameterized as

$$p_\theta(x_{t-1} \mid x_t, z, m, t) = \mathcal{N}\big( x_{t-1}; \mu_\theta(x_t, z, m, t), \sigma_t^2 I \big),$$

where $\mu_\theta$ is a function of the predicted noise, and $m$ is a binary mask over receptor residues to highlight binding pockets.

  • Loss Function: Training minimizes a weighted combination of mean squared error and cosine similarity losses between the predicted and ground-truth noise vectors,

$$L = 0.9 \cdot L_{\text{MSE}} + 0.1 \cdot L_{\text{cos}},$$

ensuring the model learns both magnitude and angle alignment in the high-dimensional space.
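The noising process and training objective above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the schedule offset `s = 0.008` (the common cosine-schedule default) and the form $L_{\text{cos}} = 1 - \cos(\varepsilon_{\text{pred}}, \varepsilon_{\text{true}})$ are assumptions.

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative schedule alpha_bar(t) for the cosine variance schedule."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def q_sample(x0, t, T, rng):
    """Closed-form forward noising: x_t = sqrt(abar) x_0 + sqrt(1-abar) eps."""
    a_bar = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

def diffusion_loss(eps_pred, eps_true, w_mse=0.9, w_cos=0.1):
    """L = 0.9 * MSE + 0.1 * (1 - cosine similarity) on noise vectors."""
    mse = np.mean((eps_pred - eps_true) ** 2)
    p, q = eps_pred.ravel(), eps_true.ravel()
    cos = (p @ q) / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-8)
    return w_mse * mse + w_cos * (1.0 - cos)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((15, 1024))  # toy peptide latent: length 15, d = 1024
x_t, eps = q_sample(x0, t=500, T=1000, rng=rng)
```

Note that the cosine loss penalizes direction only, so the MSE term remains necessary to constrain the magnitude of the predicted noise.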

3. Reverse Model Architecture

PepEDiff's architecture consists of:

  • Receptor Encoding: The target sequence is encoded with ProtT5. A self-attention block refines global context; a masked self-attention block (using mask $m \in \{0,1\}^L$) focuses attention on pocket residues, resulting in $z_\text{pocket} \in \mathbb{R}^{L \times d}$.
  • Peptide Denoiser: For the noisy peptide latent $x_t$, self-attention models intra-peptide dependencies. Cross-attention connects the peptide to the pocket-focused context $z_\text{pocket}$. Feed-forward layers and timestep/positional embeddings are incorporated as in standard transformer designs.
  • Output: The architecture predicts the noise to be removed at each diffusion step, $\varepsilon_\theta(x_t, z, m, t)$.
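A toy single-head version of the masked self-attention and cross-attention steps is sketched below. This is illustrative only: the actual PepEDiff denoiser uses full transformer blocks with multiple heads, feed-forward layers, normalization, and timestep embeddings, all omitted here, and the random weight matrices stand in for learned parameters.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def make_params(d, dk, rng):
    s = 1.0 / np.sqrt(d)
    return (rng.standard_normal((d, dk)) * s,   # Wq
            rng.standard_normal((d, dk)) * s,   # Wk
            rng.standard_normal((d, dk)) * s,   # Wv
            rng.standard_normal((dk, d)) * s)   # Wo: project back to d

def attention(q_in, kv_in, params, key_mask=None):
    """Single-head scaled dot-product attention; key_mask hides non-pocket keys."""
    Wq, Wk, Wv, Wo = params
    logits = (q_in @ Wq) @ (kv_in @ Wk).T / np.sqrt(Wq.shape[1])
    if key_mask is not None:
        logits = np.where(key_mask[None, :] == 1, logits, -1e9)
    return softmax(logits) @ (kv_in @ Wv) @ Wo

d, dk = 1024, 64
rng = np.random.default_rng(0)
z = rng.standard_normal((120, d))            # ProtT5 receptor embeddings, L = 120
m = np.zeros(120, dtype=int); m[40:60] = 1   # binary pocket mask
x_t = rng.standard_normal((15, d))           # noisy peptide latent, L' = 15

z_pocket = attention(z, z, make_params(d, dk, rng), key_mask=m)  # masked self-attn
h = attention(x_t, z_pocket, make_params(d, dk, rng))            # cross-attention
```

Masking at the key side means every receptor position can still be updated, but only pocket residues contribute context, which matches the stated role of the mask.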

4. Zero-Shot Latent-Space Exploration

PepEDiff explicitly targets regions of latent space corresponding to binders not observed in empirical data ("zero-shot" region). This is operationalized as:

  • Define $X_\text{protein} \supset X_\text{binder} \supset X_\text{peptide}$, with $X_\text{unseen} = X_\text{binder} \setminus X_\text{peptide}$.
  • Perturb latent representations of known peptides: for each peptide embedding $x_i$, generate $x'_i = x_i + \sigma \cdot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I_d)$ and $\sigma$ adaptively increased if peptide decoding fails.
  • Artifacts are filtered by rejecting peptide sequences with >50% identical residues or any contiguous repeat spanning >30% of the length.
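The perturb-and-filter step can be sketched as below. Interpretations are assumptions: "contiguous repeat" is read as a run of a single residue, and the thresholds are applied as strict inequalities.

```python
import numpy as np

def zero_shot_perturb(x, sigma, rng):
    """x'_i = x_i + sigma * eps, with eps ~ N(0, I_d) per latent entry."""
    return x + sigma * rng.standard_normal(x.shape)

def passes_filters(seq):
    """Reject sequences with >50% identical residues, or a contiguous
    single-residue run spanning >30% of the length."""
    L = len(seq)
    if max(seq.count(a) for a in set(seq)) / L > 0.5:
        return False
    run = best = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best / L <= 0.3

rng = np.random.default_rng(0)
x_prime = zero_shot_perturb(rng.standard_normal((15, 1024)), sigma=0.5, rng=rng)
```

A decoded candidate would then be kept only if `passes_filters` returns `True`; on decoding failure, σ would be increased and the perturbation retried.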

This zero-shot protocol enables PepEDiff to generate candidates outside the distribution of observed binders, encouraging exploration of binder-relevant latent regions.

5. Inference and Sampling Procedure

The sampling algorithm comprises:

  1. Encode the receptor and mask to produce $z_\text{pocket}$.
  2. Initialize the peptide latent $x_T \sim \mathcal{N}(0, I)$.
  3. For $t = T$ down to 1:
    • Predict noise $\varepsilon_\text{pred} = \varepsilon_\theta(x_t, z_\text{pocket}, t)$.
    • Compute $\mu_\theta$, then sample $x_{t-1} = \mu_\theta + \sigma_t \varepsilon$.
  4. Decode $x_0$ with the ProtT5 decoder to yield the peptide sequence $\hat{S}$.
  5. Optionally, apply the zero-shot latent perturbations to further augment diversity.
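The loop above can be sketched with a dummy noise predictor standing in for the trained $\varepsilon_\theta$ (and its $z_\text{pocket}$ conditioning) and the standard DDPM posterior variance; the clipping of $\alpha_t$ follows common practice for cosine schedules and is an assumption, and decoding to a sequence is omitted.

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def ddpm_sample(eps_theta, shape, T, rng):
    """Ancestral DDPM sampling: x_T ~ N(0, I), then T denoising steps."""
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        a_bar, a_bar_prev = cosine_alpha_bar(t, T), cosine_alpha_bar(t - 1, T)
        alpha = np.clip(a_bar / a_bar_prev, 1e-3, 1.0)  # avoid degenerate steps
        eps = eps_theta(x, t)
        mu = (x - (1.0 - alpha) / np.sqrt(1.0 - a_bar) * eps) / np.sqrt(alpha)
        sigma = np.sqrt((1.0 - a_bar_prev) / (1.0 - a_bar) * (1.0 - alpha))
        x = mu + sigma * (rng.standard_normal(shape) if t > 1 else 0.0)
    return x

rng = np.random.default_rng(0)
# hypothetical stand-in for eps_theta(x_t, z_pocket, t); a trained network goes here
dummy_eps = lambda x, t: np.zeros_like(x)
x_gen = ddpm_sample(dummy_eps, shape=(15, 1024), T=50, rng=rng)
```

In the full pipeline, `x_gen` would be passed to the ProtT5 decoder to obtain $\hat{S}$.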

6. Benchmarking and Empirical Evaluation

PepEDiff was benchmarked on the BioLip protein–peptide dataset after clustering at 50% sequence identity with MMseqs2, yielding 4,758 training, 546 validation, and 311 test examples. Competing methods included RFdiffusion+ProteinMPNN and DiffPepBuilder.

Metrics used:

| Metric | Description |
|---|---|
| Div_seq | Average pairwise sequence dissimilarity (BLOSUM62 + Needleman–Wunsch) |
| Div_str | Average pairwise 3D structural diversity (1 − TM-score) |
| Div_emb | Average pairwise dissimilarity of mean ProtT5 embeddings (1 − cosine) |
| ΔG | Predicted binding energy (Rosetta) |
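As an example of the embedding-level metric, Div_emb can be computed as the average pairwise $1 - \cos$ between mean-pooled per-residue embeddings (mean pooling over residues is an assumption about the exact pooling used):

```python
import numpy as np

def div_emb(embeddings):
    """Average pairwise (1 - cosine) between mean-pooled embeddings.

    embeddings: list of (L_i, d) per-residue embedding matrices.
    """
    means = np.stack([e.mean(axis=0) for e in embeddings])
    means /= np.linalg.norm(means, axis=1, keepdims=True) + 1e-8
    n = len(means)
    total = sum(1.0 - means[i] @ means[j]
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)
```

A set of identical peptides scores ≈0; dissimilar peptides push the score toward 1 (orthogonal mean embeddings) or beyond.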

Test-set results (mean ± std):

| Method | Div_seq | Div_str | ΔG | Div_emb |
|---|---|---|---|---|
| PepEDiff | 0.67 ± 0.03 | 0.72 ± 0.15 | −78.34 ± 72.82 | 0.41 |
| RFdiffusion+MPNN | 0.56 | 0.45 | −67.99 | 0.27 |
| DiffPepBuilder | 0.44 | 0.54 | −45.51 | 0.21 |

PepEDiff achieved higher sequence, structural, and embedding diversity, as well as more favorable binding energies. Embedding diversity gains were highly statistically significant ($p \approx 10^{-45}$ to $10^{-68}$) (Liang et al., 19 Jan 2026).

7. TIGIT Case Study

The human TIGIT receptor (UniProt Q495A1, PDB 3Q0H) was used as a clinically relevant, challenging benchmark—characterized by a large, flat PPI interface without a druggable pocket.

  • Protocol: Each method generated 100 length-15 peptides; RFdiffusion+MPNN sampled 10 backbones × 10 sequences.
  • Selected metrics (mean ± std):

| Method | Div_seq | Div_str | ΔG |
|---|---|---|---|
| PepEDiff | 0.69 ± 0.09 | 0.80 ± 0.10 | −30.49 ± 6.97 |
| RFdiffusion+MPNN | 0.45 | 0.14 | −28.62 |
| DiffPepBuilder | 0.39 | 0.46 | −20.02 |

Top candidate sequences from each method differed substantially; PepEDiff's LRISSDVHQDAASVH exhibited superior binding energy. Molecular dynamics simulations in GROMACS with umbrella sampling yielded van der Waals interaction energies and binding free energies that support PepEDiff's enhanced binding profile.

8. Strengths, Limitations, and Future Directions

Strengths

  • Structure-free generation: Avoids the compounding errors of backbone → sequence cascades, producing sequence proposals directly in a learned semantic manifold.
  • Zero-shot exploration: Navigates outside empirical peptide space, potentially facilitating true innovation in binder design.
  • High diversity: Delivers superior sequence, structure, and latent diversity, correlating with improved computational binding affinity.

Limitations

  • Dependency on external evaluation: Structural and energetic assessment still relies on third-party predictors (Boltz-2 for structure, Rosetta for energy).
  • Scope: Currently operates on fixed-length, linear peptides without support for cyclic or post-translationally modified sequences.

Prospective Improvements

  • Joint end-to-end training for both sequence and backbone geometry in the embedding space.
  • Incorporating domain-specific constraints, such as solubility or cell-penetration.
  • Application of adaptive noise schedules or alternative score-based diffusion for enhanced sampling efficiency in $X_\text{unseen}$.

PepEDiff constitutes a general, structure-free, and zero-shot capable platform for peptide binder discovery, validated across both standardized benchmarks and challenging case studies (Liang et al., 19 Jan 2026).
