PepEVOLVE: Dynamic Peptide Optimization
- PepEVOLVE is a position-aware, dynamic peptide optimization framework that enhances macrocyclic peptide lead discovery by addressing combinatorial challenges.
- It employs dynamic pretraining strategies, including stochastic masking and CHUCKLES shifting, to improve model generalization and avoid overfitting.
- Integrating a multi-armed bandit router with an evolving reinforcement learning loop using group-relative advantage, PepEVOLVE efficiently navigates multi-objective design spaces.
PepEVOLVE is a position-aware, dynamic peptide optimization framework designed for multi-parameter exploration and optimization of macrocyclic peptides. Addressing the limitations of prior generative approaches, such as the necessity for chemist-specified mutable positions and static optimization protocols, PepEVOLVE employs a novel combination of dynamic pretraining, automatic site selection via a multi-armed bandit router, and evolving optimization through group-relative advantage (GRA) to efficiently navigate the combinatorial and multi-objective challenge of peptide lead discovery (Nguyen et al., 21 Nov 2025).
1. Motivation and Background
The optimization of macrocyclic peptides is challenged by a vast combinatorial design space and nonlinear, multi-parameter objectives (MPOs) encompassing potency, solubility, permeability, and pharmacokinetics. For example, a 12-mer constructed from 4,000 possible monomers yields up to $4000^{12} \approx 10^{43}$ candidates, a scale beyond the reach of enumerate-and-score methods, which are further constrained by vendor libraries (e.g., restricting to the "top 20" monomers per position still yields $20^{12} \approx 4 \times 10^{15}$ sequences). Multi-objective constraints interact nonlinearly, precluding brute-force optimization.
Machine learning-based generative models, including VAEs, GANs, RNNs, diffusion models, transformers, and LLMs, have been applied to propose de novo modifications and, in combination with reinforcement learning (RL), genetic algorithms, or MCTS, to traverse these MPO landscapes. PepINVENT, a key precursor, utilizes CHUCKLES (a SMILES-like tokenizer) and a transformer backbone, but is limited by its static masking (overfitting), chemist-specified edit sites (no automatic discovery), and static input protocols during RL (Nguyen et al., 21 Nov 2025).
PepEVOLVE addresses these deficiencies by introducing (i) dynamic pretraining via stochastic masking and rotational invariance, (ii) an automatic, context-free multi-armed bandit router to discover “where” to edit, and (iii) an evolving RL loop employing GRA to stabilize “how” edits are optimized under MPO constraints.
2. Dynamic Pretraining Strategies
2.1 Dynamic Masking
To circumvent the overfitting inherent in static masking, PepEVOLVE implements dynamic masking in which, for a peptide of length $L$, the number of masked positions $m$ per epoch is drawn as
$$m \sim \mathcal{T}(1, L),$$
with $\mathcal{T}$ a triangular distribution on $\{1, \dots, L\}$ with mode $1$, biasing toward single-site masks. These positions are selected uniformly at random and replaced with a special "?" token. This expands the diversity of the reconstruction task during pretraining.
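A minimal sketch of this sampling step, assuming plain Python over a monomer token list (the helper name and return format are illustrative, not the authors' implementation):

```python
import random

MASK_TOKEN = "?"  # special token substituted at masked positions

def dynamic_mask(monomers):
    """Draw a mask count from a triangular distribution on [1, L] with
    mode 1 (biasing toward single-site masks), then mask uniformly
    chosen positions."""
    L = len(monomers)
    m = max(1, min(L, round(random.triangular(1, L, 1))))
    masked = set(random.sample(range(L), m))
    return [MASK_TOKEN if i in masked else tok
            for i, tok in enumerate(monomers)]

# A fresh mask pattern is drawn every epoch, e.g.:
src = dynamic_mask(["Y", "P", "A", "A", "S", "Y", "R"])
```

Because the mode sits at 1, most epochs present single-site reconstruction tasks while occasionally demanding multi-site edits.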
2.2 CHUCKLES Shifting
To ensure robustness and prevent the model from memorizing absolute token positions, CHUCKLES-shifted augmentations are utilized. For a cyclic peptide with monomers $(a_1, \dots, a_L)$, a random rotation $k \in \{0, \dots, L-1\}$ is applied each epoch:
$$(a_1, \dots, a_L) \mapsto (a_{k+1}, \dots, a_L, a_1, \dots, a_k).$$
This procedure treats all rotationally equivalent configurations as identical, enforcing invariance.
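A one-function sketch of the rotation, under the same assumptions as above:

```python
import random

def chuckles_shift(monomers):
    """Rotate a cyclic peptide's monomer list by a random offset k,
    mapping (a1, ..., aL) to (a_{k+1}, ..., aL, a1, ..., a_k)."""
    k = random.randrange(len(monomers))
    return monomers[k:] + monomers[:k]
```

Resampling the offset every epoch exposes the model to all rotational frames of each macrocycle.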
2.3 Pretraining Objective
The learning objective is the minimization of the negative log-likelihood over the pretraining data:
$$\mathcal{L}_{\text{pre}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \log p_\theta(y \mid x) \right],$$
where $x$ is the masked source and $y$ is the target sequence.
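In PyTorch terms this is standard token-level cross-entropy; the shapes and padding handling below are assumptions for illustration:

```python
import torch.nn.functional as F

def pretrain_loss(logits, target_ids, pad_id):
    """Negative log-likelihood of target tokens given the masked source.
    logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)."""
    return F.cross_entropy(
        logits.transpose(1, 2),  # cross_entropy expects (batch, vocab, seq)
        target_ids,
        ignore_index=pad_id,     # padding positions contribute no loss
    )
```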
3. Automatic Edit Site Selection via Multi-Armed Bandit Router
The router algorithm formalizes each of the $L$ peptide positions as an "arm" in a context-free multi-armed bandit. On each episode, it samples a subset $S \subseteq \{1, \dots, L\}$, $|S| = k$, of positions to be masked and edited. For each subset $S$, the generator proposes $N$ candidates $y_1, \dots, y_N$, each evaluated for scalar reward $r(y_i)$. Rewards per subset are averaged:
$$\bar{r}(S) = \frac{1}{N} \sum_{i=1}^{N} r(y_i).$$
The router's policy is parameterized by logits $\phi$, defining a categorical distribution $\pi_\phi$ over $k$-subsets. Policy-gradient updates use the REINFORCE algorithm with entropy regularization:
$$\nabla_\phi J = \mathbb{E}_{S \sim \pi_\phi} \left[ A(S) \, \nabla_\phi \log \pi_\phi(S) \right] + \beta \, \nabla_\phi H(\pi_\phi),$$
where $A(S) = \bar{r}(S) - b$ is the advantage, $b$ is a moving-average baseline, $\beta$ is an annealed entropy coefficient, and $H$ denotes Shannon entropy. This process concentrates probability on position subsets that yield higher multi-objective rewards.
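A compact sketch of such a router in PyTorch; enumerating the candidate $k$-subsets as discrete arms, and the class layout itself, are illustrative assumptions rather than the paper's implementation:

```python
import torch

class SubsetRouter:
    """Context-free bandit over position subsets, trained with REINFORCE
    plus an entropy bonus and a moving-average baseline."""

    def __init__(self, n_subsets, lr=0.05, baseline_decay=0.9):
        self.logits = torch.zeros(n_subsets, requires_grad=True)
        self.opt = torch.optim.Adam([self.logits], lr=lr)
        self.baseline = 0.0
        self.decay = baseline_decay

    def sample(self):
        return torch.distributions.Categorical(logits=self.logits).sample().item()

    def update(self, subset_idx, mean_reward, entropy_coef):
        dist = torch.distributions.Categorical(logits=self.logits)
        advantage = mean_reward - self.baseline        # A(S) = mean reward - b
        loss = (-advantage * dist.log_prob(torch.tensor(subset_idx))
                - entropy_coef * dist.entropy())       # entropy regularization
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # update the moving-average baseline b
        self.baseline = self.decay * self.baseline + (1 - self.decay) * mean_reward
```

Annealing `entropy_coef` downward over training shifts the router from exploring subsets to exploiting high-reward ones.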
4. Evolving Optimization and Group-Relative Advantage
PepEVOLVE employs an evolving optimization architecture that iteratively refines peptide candidates using group-relative advantage, stabilizing RL updates across heterogeneous seed groups.
Given initial seeds $s_1, \dots, s_M$, the process per iteration is as follows:
- For each seed $s_i$ and context mask, generate $G$ candidates $y_{i,1}, \dots, y_{i,G}$.
- Compute rewards $r_{i,j} = R(y_{i,j})$ under the composite objective.
- Calculate within-group statistics: mean $\mu_i = \frac{1}{G} \sum_j r_{i,j}$ and standard deviation $\sigma_i$.
- Compute the group-relative advantage: $A_{i,j} = (r_{i,j} - \mu_i) / (\sigma_i + \epsilon)$.
- Update the generator via the policy-gradient loss $\mathcal{L}(\theta) = -\frac{1}{MG} \sum_{i,j} A_{i,j} \log p_\theta(y_{i,j} \mid s_i)$.
- Aggregate all generated peptides, rescore, and select the top $M$ as seeds for the next round.
This approach normalizes reward signals within each seed group, preventing high-variance updates from reward scale heterogeneity and promoting improvements relative to each group’s context.
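The per-group normalization and resulting loss can be sketched as follows (PyTorch; tensor shapes are illustrative assumptions):

```python
import torch

def group_relative_advantage(rewards, eps=1e-8):
    """Normalize rewards within each seed's candidate group.
    rewards: (n_seeds, n_candidates) tensor of scalar MPO scores."""
    mu = rewards.mean(dim=1, keepdim=True)    # per-group mean
    sigma = rewards.std(dim=1, keepdim=True)  # per-group std
    return (rewards - mu) / (sigma + eps)     # A_{i,j}

def gra_loss(log_probs, rewards):
    """Policy-gradient loss weighted by group-relative advantage.
    log_probs: per-candidate sequence log-likelihoods, same shape as rewards."""
    advantages = group_relative_advantage(rewards)
    return -(advantages * log_probs).mean()
```

Because each candidate is judged only against siblings generated from the same seed, a hard seed with uniformly low rewards still produces informative gradients.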
5. Benchmarking and Comparative Results
PepEVOLVE was evaluated on a therapeutically relevant Rev-binding macrocycle (RBP) lead, derived from YPAASYR and engineered for head-to-tail cyclization. MPOs included:
- Permeability, weight 3
- Ring-size constraint, weight 1
- Lipophilicity, scored against a target value, weight 1
- SMARTS-based structural alerts, weight 1
The composite score is defined as the weighted geometric mean of the component scores $s_k$ with weights $w_k$:
$$\text{Score} = \left( \prod_k s_k^{w_k} \right)^{1 / \sum_k w_k}.$$
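For concreteness, a small sketch of this aggregation (the component scores below are made-up placeholders):

```python
import math

def composite_score(scores, weights):
    """Weighted geometric mean of per-objective scores in [0, 1]."""
    total = sum(weights)
    return math.prod(s ** (w / total) for s, w in zip(scores, weights))

# permeability (w=3), ring size, lipophilicity, SMARTS alerts (w=1 each)
score = composite_score([0.9, 1.0, 0.8, 1.0], [3, 1, 1, 1])
```

The geometric mean drives the composite toward zero whenever any single objective fails, so candidates cannot fully trade one hard constraint away against the others.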
Key benchmarking metrics and outcomes are summarized:
| Configuration | Mean Score | Best Score | Steps to >0.8 | Unique Peptides >0.9 |
|---|---|---|---|---|
| PepINVENT | ≈0.60 | 0.87 | ≈800 | 0 |
| SS (self, single) | ≈0.82 | 0.95 | ≈150 | 45 |
| SM (self, multi) | ≈0.80 | 0.95 | ≈200 | 40 |
| NS (neighbor, single) | ≈0.79 | 0.93 | ≈180 | 80 |
| NM (neighbor, multi) | ≈0.77 | 0.92 | ≈220 | 70 |
PepEVOLVE achieves higher mean and best scores and converges substantially faster than PepINVENT (score > 0.8 in roughly 150–220 steps versus ≈800). The NS configuration generates the largest set of unique high-scoring peptides, while SS converges fastest. SM balances yield and quality; NM, while trailing the other variants, still outperforms PepINVENT.
Router ablations confirm that the policy reliably learns chemically meaningful sites regardless of reward direction (e.g., high-donor or aromatic positions for hydrogen-bond donor/logP objectives), with position selection adapting under objective inversion.
6. Implementation Specifications
PepEVOLVE utilizes a transformer encoder–decoder of equivalent complexity to PepINVENT (e.g., 12 layers, hidden dimension 512, 8 attention heads). Pretraining uses 900k training and 50k validation peptides of length 6–18, with 30% non-canonical amino acids (NCAAs) and a mix of linear (40%) and macrocyclic (60%) configurations (including head-to-tail, sidechain-to-tail, disulfide cyclization).
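Assuming the stated dimensions map onto a standard PyTorch encoder–decoder (whether "12 layers" means 12 per stack is itself an assumption), a comparably sized model could be instantiated as:

```python
import torch.nn as nn

# Illustrative stand-in for the PepINVENT-scale backbone, not the authors' code
model = nn.Transformer(
    d_model=512,            # hidden dimension
    nhead=8,                # attention heads
    num_encoder_layers=12,
    num_decoder_layers=12,
    batch_first=True,
)
```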
Key hyperparameters:
- Masking: Triangular distribution; dynamic resampling and CHUCKLES shift per epoch
- Router: subset size $|S| = 1$ or $2$, moving-average reward baseline, entropy coefficient annealed from 0.1 to 0.01
- Evolving: a fixed pool of seeds with multiple candidates per seed, 4 context types, 250 steps (1000 scoring calls)
- Compute: Pretraining on 4×A100 GPUs (∼3 days); router and evolving on 2×A100 GPUs (∼24 h per benchmark)
7. Limitations and Prospects
PepEVOLVE’s context-free router currently lacks direct conditioning on sequence or 3D structure, and surrogate objectives such as solubility proxies are omitted. The use of GRA introduces the risk of mode collapse from over-normalization. Future developments may address these issues by incorporating structure-aware routers, integrating 3D predictors into reward functions, enabling finer multi-objective trade-off control, and expanding experiments across broader peptide target sets (Nguyen et al., 21 Nov 2025).
PepEVOLVE eliminates the requirement for static, hand-specified mutation sites and manual input selection, offering a reproducible, efficient approach for lead peptide optimization, especially when edit sites are a priori unknown.