PepEVOLVE: Dynamic Peptide Optimization
- PepEVOLVE is a position-aware, dynamic peptide optimization framework that enhances macrocyclic peptide lead discovery by addressing combinatorial challenges.
- It employs dynamic pretraining strategies, including stochastic masking and CHUCKLES shifting, to improve model generalization and avoid overfitting.
- Integrating a multi-armed bandit router with an evolving reinforcement learning loop using group-relative advantage, PepEVOLVE efficiently navigates multi-objective design spaces.
PepEVOLVE is a position-aware, dynamic peptide optimization framework designed for multi-parameter exploration and optimization of macrocyclic peptides. Addressing the limitations of prior generative approaches, such as the necessity for chemist-specified mutable positions and static optimization protocols, PepEVOLVE employs a novel combination of dynamic pretraining, automatic site selection via a multi-armed bandit router, and evolving optimization through group-relative advantage (GRA) to efficiently navigate the combinatorial and multi-objective challenge of peptide lead discovery (Nguyen et al., 21 Nov 2025).
1. Motivation and Background
The optimization of macrocyclic peptides is challenged by a vast combinatorial design space and nonlinear, multi-parameter objectives (MPOs) encompassing potency, solubility, permeability, and pharmacokinetics. For example, a 12-mer constructed from 4,000 possible monomers yields up to $4000^{12} \approx 10^{43}$ candidates, a scale beyond the reach of enumerate-and-score methods, which are further constrained by vendor libraries (e.g., restricting to the "top 20" monomers per position still yields $20^{12} \approx 4 \times 10^{15}$ sequences). Multi-objective constraints interact nonlinearly, precluding brute-force optimization.
Machine learning-based generative models, including VAEs, GANs, RNNs, diffusion models, transformers, and LLMs, have been applied to propose de novo modifications and, in combination with reinforcement learning (RL), genetic algorithms, or MCTS, to traverse these MPO landscapes. PepINVENT, a key precursor, utilizes CHUCKLES (a SMILES-like tokenizer) and a transformer backbone, but is limited by its static masking (overfitting), chemist-specified edit sites (no automatic discovery), and static input protocols during RL (Nguyen et al., 21 Nov 2025).
PepEVOLVE addresses these deficiencies by introducing (i) dynamic pretraining via stochastic masking and rotational invariance, (ii) an automatic, context-free multi-armed bandit router to discover “where” to edit, and (iii) an evolving RL loop employing GRA to stabilize “how” edits are optimized under MPO constraints.
2. Dynamic Pretraining Strategies
2.1 Dynamic Masking
To circumvent the overfitting inherent in static masking, PepEVOLVE implements dynamic masking in which, for a peptide of length $L$, the number of masked positions $m$ per epoch is drawn as
$$m \sim \mathcal{T}(1, L),$$
with $\mathcal{T}$ a triangular distribution on $\{1, \dots, L\}$ with mode $1$, biasing toward single-site masks. These positions are selected uniformly at random and replaced with a special "?" token. This expands the diversity of the reconstruction task during pretraining.
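A minimal sketch of this sampling step, assuming plain Python over a monomer token list (the helper name and return format are illustrative, not the authors' implementation):

```python
import random

MASK_TOKEN = "?"  # special token substituted at masked positions

def dynamic_mask(monomers):
    """Draw a mask count from a triangular distribution on [1, L] with
    mode 1 (biasing toward single-site masks), then mask uniformly
    chosen positions."""
    L = len(monomers)
    m = max(1, min(L, round(random.triangular(1, L, 1))))
    masked = set(random.sample(range(L), m))
    return [MASK_TOKEN if i in masked else tok
            for i, tok in enumerate(monomers)]

# A fresh mask pattern is drawn every epoch, e.g.:
src = dynamic_mask(["Y", "P", "A", "A", "S", "Y", "R"])
```

Because the mode sits at 1, most epochs present single-site reconstruction tasks while occasionally demanding multi-site edits.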
2.2 CHUCKLES Shifting
To ensure robustness and prevent the model from memorizing absolute token positions, CHUCKLES-shifted augmentations are utilized. For a cyclic peptide with monomers $(a_1, \dots, a_L)$, a random rotation $k \in \{0, \dots, L-1\}$ is applied each epoch:
$$(a_1, \dots, a_L) \mapsto (a_{k+1}, \dots, a_L, a_1, \dots, a_k).$$
This procedure treats all rotationally equivalent configurations as identical, enforcing invariance.
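A one-function sketch of the rotation, under the same assumptions as above:

```python
import random

def chuckles_shift(monomers):
    """Rotate a cyclic peptide's monomer list by a random offset k,
    mapping (a1, ..., aL) to (a_{k+1}, ..., aL, a1, ..., a_k)."""
    k = random.randrange(len(monomers))
    return monomers[k:] + monomers[:k]
```

Resampling the offset every epoch exposes the model to all rotational frames of each macrocycle.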
2.3 Pretraining Objective
The learning objective is the minimization of the negative log-likelihood over the pretraining data:
$$\mathcal{L}_{\text{pre}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \log p_\theta(y \mid x) \right],$$
where $x$ is the masked source and $y$ is the target sequence.
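In PyTorch terms this is standard token-level cross-entropy; the shapes and padding handling below are assumptions for illustration:

```python
import torch.nn.functional as F

def pretrain_loss(logits, target_ids, pad_id):
    """Negative log-likelihood of target tokens given the masked source.
    logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)."""
    return F.cross_entropy(
        logits.transpose(1, 2),  # cross_entropy expects (batch, vocab, seq)
        target_ids,
        ignore_index=pad_id,     # padding positions contribute no loss
    )
```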
3. Automatic Edit Site Selection via Multi-Armed Bandit Router
The router algorithm formalizes each of the $L$ peptide positions as an "arm" in a context-free multi-armed bandit. On each episode, it samples a subset $S \subseteq \{1, \dots, L\}$, $|S| = k$, of positions to be masked and edited. For each subset $S$, the generator proposes $N$ candidates $y_1, \dots, y_N$, each evaluated for scalar reward $r(y_i)$. Rewards per subset are averaged:
$$\bar{r}(S) = \frac{1}{N} \sum_{i=1}^{N} r(y_i).$$
The router's policy is parameterized by logits $\phi$, defining a categorical distribution $\pi_\phi$ over $k$-subsets. Policy-gradient updates use the REINFORCE algorithm with entropy regularization:
$$\nabla_\phi J = \mathbb{E}_{S \sim \pi_\phi} \left[ A(S) \, \nabla_\phi \log \pi_\phi(S) \right] + \beta \, \nabla_\phi H(\pi_\phi),$$
where $A(S) = \bar{r}(S) - b$ is the advantage, $b$ is a moving-average baseline, $\beta$ is an annealed entropy coefficient, and $H$ denotes Shannon entropy. This process concentrates probability on position subsets that yield higher multi-objective rewards.
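A compact sketch of such a router in PyTorch; enumerating the candidate $k$-subsets as discrete arms, and the class layout itself, are illustrative assumptions rather than the paper's implementation:

```python
import torch

class SubsetRouter:
    """Context-free bandit over position subsets, trained with REINFORCE
    plus an entropy bonus and a moving-average baseline."""

    def __init__(self, n_subsets, lr=0.05, baseline_decay=0.9):
        self.logits = torch.zeros(n_subsets, requires_grad=True)
        self.opt = torch.optim.Adam([self.logits], lr=lr)
        self.baseline = 0.0
        self.decay = baseline_decay

    def sample(self):
        return torch.distributions.Categorical(logits=self.logits).sample().item()

    def update(self, subset_idx, mean_reward, entropy_coef):
        dist = torch.distributions.Categorical(logits=self.logits)
        advantage = mean_reward - self.baseline        # A(S) = mean reward - b
        loss = (-advantage * dist.log_prob(torch.tensor(subset_idx))
                - entropy_coef * dist.entropy())       # entropy regularization
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # update the moving-average baseline b
        self.baseline = self.decay * self.baseline + (1 - self.decay) * mean_reward
```

Annealing `entropy_coef` downward over training shifts the router from exploring subsets to exploiting high-reward ones.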
4. Evolving Optimization and Group-Relative Advantage
PepEVOLVE employs an evolving optimization architecture that iteratively refines peptide candidates using group-relative advantage, stabilizing RL updates across heterogeneous seed groups.
Given initial seeds $s_1, \dots, s_M$, the process per iteration is as follows:
- For each seed $s_i$ and context mask, generate $G$ candidates $y_{i,1}, \dots, y_{i,G}$.
- Compute rewards $r_{i,j} = R(y_{i,j})$ under the composite objective.
- Calculate within-group statistics: mean $\mu_i = \frac{1}{G} \sum_j r_{i,j}$ and standard deviation $\sigma_i$.
- Compute the group-relative advantage: $A_{i,j} = (r_{i,j} - \mu_i) / (\sigma_i + \epsilon)$.
- Update the generator via the policy-gradient loss $\mathcal{L}(\theta) = -\frac{1}{MG} \sum_{i,j} A_{i,j} \log p_\theta(y_{i,j} \mid s_i)$.
- Aggregate all generated peptides, rescore, and select the top $M$ as seeds for the next round.
This approach normalizes reward signals within each seed group, preventing high-variance updates from reward scale heterogeneity and promoting improvements relative to each group’s context.
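The per-group normalization and resulting loss can be sketched as follows (PyTorch; tensor shapes are illustrative assumptions):

```python
import torch

def group_relative_advantage(rewards, eps=1e-8):
    """Normalize rewards within each seed's candidate group.
    rewards: (n_seeds, n_candidates) tensor of scalar MPO scores."""
    mu = rewards.mean(dim=1, keepdim=True)    # per-group mean
    sigma = rewards.std(dim=1, keepdim=True)  # per-group std
    return (rewards - mu) / (sigma + eps)     # A_{i,j}

def gra_loss(log_probs, rewards):
    """Policy-gradient loss weighted by group-relative advantage.
    log_probs: per-candidate sequence log-likelihoods, same shape as rewards."""
    advantages = group_relative_advantage(rewards)
    return -(advantages * log_probs).mean()
```

Because each candidate is judged only against siblings generated from the same seed, a hard seed with uniformly low rewards still produces informative gradients.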
5. Benchmarking and Comparative Results
PepEVOLVE was evaluated on a therapeutically relevant Rev-binding macrocycle (RBP) lead, derived from YPAASYR and engineered for head-to-tail cyclization. MPOs included:
- Permeability, weight 3
- Ring-size constraint, weight 1
- Lipophilicity, scored against a target value, weight 1
- SMARTS-based structural alerts, weight 1
The composite score is defined as the weighted geometric mean of the component scores $s_k$ with weights $w_k$:
$$\text{Score} = \left( \prod_k s_k^{w_k} \right)^{1 / \sum_k w_k}.$$
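For concreteness, a small sketch of this aggregation (the component scores below are made-up placeholders):

```python
import math

def composite_score(scores, weights):
    """Weighted geometric mean of per-objective scores in [0, 1]."""
    total = sum(weights)
    return math.prod(s ** (w / total) for s, w in zip(scores, weights))

# permeability (w=3), ring size, lipophilicity, SMARTS alerts (w=1 each)
score = composite_score([0.9, 1.0, 0.8, 1.0], [3, 1, 1, 1])
```

The geometric mean drives the composite toward zero whenever any single objective fails, so candidates cannot fully trade one hard constraint away against the others.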
Key benchmarking metrics and outcomes are summarized:
| Configuration | Mean Score | Best Score | Steps to >0.8 | Unique Peptides >0.9 |
|---|---|---|---|---|
| PepINVENT | ≈0.60 | 0.87 | ≈800 | 0 |
| SS (self, single) | ≈0.82 | 0.95 | ≈150 | 45 |
| SM (self, multi) | ≈0.80 | 0.95 | ≈200 | 40 |
| NS (neighbor, single) | ≈0.79 | 0.93 | ≈180 | 80 |
| NM (neighbor, multi) | ≈0.77 | 0.92 | ≈220 | 70 |
PepEVOLVE achieves higher mean and best scores and converges substantially faster than PepINVENT (score > 0.8 in roughly 150–220 steps versus ≈800). The NS configuration generates the largest set of unique high-scoring peptides, while SS converges fastest. SM balances yield and quality; NM, while trailing the other variants, still outperforms PepINVENT.
Router ablations confirm that the policy reliably learns chemically meaningful sites regardless of reward direction (e.g., high-donor or aromatic positions for hydrogen-bond donor/logP objectives), with position selection adapting under objective inversion.
6. Implementation Specifications
PepEVOLVE utilizes a transformer encoder–decoder of equivalent complexity to PepINVENT (e.g., 12 layers, hidden dimension 512, 8 attention heads). Pretraining uses 900k training and 50k validation peptides of length 6–18, with 30% non-canonical amino acids (NCAAs) and a mix of linear (40%) and macrocyclic (60%) configurations (including head-to-tail, sidechain-to-tail, disulfide cyclization).
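Assuming the stated dimensions map onto a standard PyTorch encoder–decoder (whether "12 layers" means 12 per stack is itself an assumption), a comparably sized model could be instantiated as:

```python
import torch.nn as nn

# Illustrative stand-in for the PepINVENT-scale backbone, not the authors' code
model = nn.Transformer(
    d_model=512,            # hidden dimension
    nhead=8,                # attention heads
    num_encoder_layers=12,
    num_decoder_layers=12,
    batch_first=True,
)
```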
Key hyperparameters:
- Masking: Triangular distribution; dynamic resampling and CHUCKLES shift per epoch
- Router: subset size $|S| = 1$ or $2$, moving-average reward baseline, entropy coefficient annealed from 0.1 to 0.01
- Evolving: a fixed pool of seeds with multiple candidates per seed, 4 context types, 250 steps (1000 scoring calls)
- Compute: Pretraining on 4×A100 GPUs (∼3 days); router and evolving on 2×A100 GPUs (∼24 h per benchmark)
7. Limitations and Prospects
PepEVOLVE’s context-free router currently lacks direct conditioning on sequence or 3D structure, and surrogate objectives such as solubility proxies are omitted. The use of GRA introduces the risk of mode collapse from over-normalization. Future developments may address these issues by incorporating structure-aware routers, integrating 3D predictors into reward functions, enabling finer multi-objective trade-off control, and expanding experiments across broader peptide target sets (Nguyen et al., 21 Nov 2025).
PepEVOLVE eliminates the requirement for static, hand-specified mutation sites and manual input selection, offering a reproducible, efficient approach for lead peptide optimization, especially when edit sites are a priori unknown.