Richelieu: Self-Evolving LLM Agents

Updated 22 April 2026

Richelieu is a self-evolving LLM-based agent framework that automates the discovery, evaluation, and deployment of novel model architectures, optimizers, and reward mechanisms.
It employs a dual-loop (offline and online) design with co-evolving agent personas and robust safety controls to scale experimentation and drive significant performance gains.
The framework increases experiment throughput by 10–100× and reduces human effort, achieving measurable improvements in core metrics on production systems.

Richelieu is a self-evolving agentic framework that leverages LLMs as autonomous, expert machine learning engineers. It orchestrates a fully end-to-end automated loop for discovering, evaluating, and deploying novel model architectures, optimizers, and reward designs. Richelieu is instantiated in production-scale systems such as YouTube recommendation and in a variety of experimental open-ended agent domains. It is characterized by dual-loop LLM-driven agents, closed-loop co-evolution of multiple agentic roles, and rigorous safety and infrastructure controls suitable for deployment in critical, high-throughput environments (Wang et al., 10 Feb 2026, Sun et al., 16 Oct 2025).

1. System Architecture and Agentic Roles

Richelieu implements a dual-loop architecture tailored for autonomous model optimization:

Inner Loop (Offline Agent): Functions on a high-frequency timescale (hours), exploring the configuration space by generating and evaluating candidate changes using proxy metrics and logs. It operates as a hypothesis generator, specializing in optimizer, architecture, and reward-persona modes. Output candidates are ranked by significance and serialized to a model repository.
Outer Loop (Online Agent): Operates on a slower cadence (days to weeks), validating Inner Loop survivors through live A/B tests on production traffic and logging “north-star” business metrics (e.g., user watch time, retention) (Wang et al., 10 Feb 2026).

Agents interface via a shared Experiment Journal that aggregates historical configurations, metrics, and results—enabling deep contextual reasoning, statistical significance tracking, and avoidance of redundant exploration. The framework enforces safety via linter personas, drift detection, static checks, and traffic guardrails.
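As a rough illustration of how a shared journal can prevent redundant exploration, the sketch below fingerprints each configuration and refuses duplicates. It is not the framework's implementation: the class names, the entry schema, and the `top_k` helper are hypothetical, and an in-memory dictionary stands in for whatever store the production system uses.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class JournalEntry:
    config: dict        # proposed configuration (hypothetical schema)
    metrics: dict       # offline or online metrics recorded for it
    significant: bool   # whether the result passed a significance test


@dataclass
class ExperimentJournal:
    entries: dict = field(default_factory=dict)

    @staticmethod
    def _key(config: dict) -> str:
        # Canonical JSON so that key order does not change the fingerprint.
        blob = json.dumps(config, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def seen(self, config: dict) -> bool:
        return self._key(config) in self.entries

    def record(self, entry: JournalEntry) -> bool:
        """Store the entry; return False if the config was already tried."""
        key = self._key(entry.config)
        if key in self.entries:
            return False
        self.entries[key] = entry
        return True

    def top_k(self, metric: str, k: int) -> list:
        """Best k significant entries by a metric (higher is better)."""
        ranked = [e for e in self.entries.values()
                  if e.significant and metric in e.metrics]
        return sorted(ranked, key=lambda e: e.metrics[metric], reverse=True)[:k]
```

An agent could call `seen` before launching a job and pass `top_k` results into its prompt as context; both uses are assumptions about how such a journal might be consumed.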

Schematic Overview

Experiment Journal ─> Offline Agent ─> Model Repo ─> Online Agent ─> Experiment Journal
                           |                                    ^
                           |---(Tool calls, offline metrics)----|

This agentic decomposition enables Richelieu to scale search to hundreds of experiments per week, vastly outpacing traditional human workflows (Wang et al., 10 Feb 2026).

2. Autonomous Optimization: Algorithms and Objectives

Richelieu's optimization core is formalized as a bi-level objective:

Lower level: Model parameters $\theta$ are optimized for a given meta-configuration $\Phi$ to minimize a proxy loss,

$\theta^*(\Phi) = \arg\min_\theta \mathcal{L}_\text{proxy}(\mathcal{D}; \theta, \Phi)$

Upper level: Selects the meta-configuration $\Phi$ that maximizes the expected live online reward $M(\theta^*(\Phi))$ subject to a cost constraint, where $G(\Phi)$ denotes the cost of a configuration and $C$ the available budget,

$\Phi^* = \arg\max_\Phi \mathbb{E}[M(\theta^*(\Phi))] \quad \text{s.t.} \quad G(\Phi) \leq C$
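The two levels can be made concrete with a toy instance. In the sketch below, every function is a hypothetical stand-in: the proxy loss is a ridge-style quadratic with a closed-form minimizer, `online_metric` plays the role of $M$, `cost` plays the role of $G$, and the expectation is replaced by a single deterministic evaluation. It only illustrates the nesting of the inner argmin inside the constrained outer argmax.

```python
def inner_fit(phi, data):
    """Lower level: closed-form minimizer of a ridge-style proxy loss.

    proxy loss = sum((theta - x)**2 for x in data) + phi["l2"] * theta**2
    Setting the derivative to zero gives theta = sum(data) / (n + l2).
    """
    return sum(data) / (len(data) + phi["l2"])


def online_metric(theta, target=1.0):
    """Stand-in for M: higher when theta is close to an unseen target."""
    return -(theta - target) ** 2


def cost(phi):
    """Stand-in for G: hypothetical cost of a configuration."""
    return phi["cost"]


def outer_select(candidates, data, budget):
    """Upper level: best feasible phi by the online metric, or None."""
    feasible = [p for p in candidates if cost(p) <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda p: online_metric(inner_fit(p, data)))
```

With a tight budget the selected configuration can differ from the unconstrained optimum, which is the trade-off the constraint $G(\Phi) \leq C$ expresses.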

Key agentic personas perform targeted exploration:

Optimizer Persona: Mutates learning rates, optimizers (e.g., RMSprop), and other hyperparameters to minimize offline proxy loss.
Architecture Persona: Generates code changes introducing gating (GLU), normalization, and novel layouts, evaluated on offline proxy loss.
Reward Persona: Proposes new reward proxies mined from engagement logs (e.g., correlation with active engagement or video quality), advancing candidates correlated with long-term metrics (Wang et al., 10 Feb 2026).

LLMs drive a Think–Code–Verify microcycle per candidate: synthesizing code/config diffs, running syntactic checks, launching training or SQL analysis jobs, and ranking candidates on offline metrics. Only statistically significant improvements are promoted.
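The verification and promotion step of this microcycle might look like the following sketch. The source does not specify which significance test is used; a one-sided Welch-style statistic with a normal approximation is assumed here purely for illustration, and `syntax_ok` and `evaluate` are hypothetical callables supplied by the caller.

```python
from statistics import NormalDist, mean, variance


def significant_improvement(baseline, candidate, alpha=0.05):
    """One-sided test that candidate loss is lower than baseline loss.

    Assumption: a Welch-style statistic with a normal approximation,
    which is only reasonable for moderately large samples.
    """
    se = (variance(baseline) / len(baseline)
          + variance(candidate) / len(candidate)) ** 0.5
    if se == 0:
        return mean(candidate) < mean(baseline)
    z = (mean(baseline) - mean(candidate)) / se
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value < alpha


def think_code_verify(candidates, baseline_losses, syntax_ok, evaluate):
    """Return candidates that pass checks and beat the baseline, best first."""
    promoted = []
    for cand in candidates:
        if not syntax_ok(cand):      # cheap static check before any job runs
            continue
        losses = evaluate(cand)      # offline evaluation (list of loss samples)
        if significant_improvement(baseline_losses, losses):
            promoted.append((mean(losses), cand))
    return [cand for _, cand in sorted(promoted, key=lambda t: t[0])]
```

Running the static check first keeps malformed candidates from consuming training or analysis resources.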

3. Closed-Loop Multi-Role Co-Evolution

Extensions of the Richelieu paradigm—sometimes called "Agentic Self-Learning (ASL)" or "Multi-Agent Evolve"—decompose the agent into modular, co-evolving roles sharing a parameter backbone:

Prompt Generator: Builds a curriculum of progressively harder or diverse tasks.
Policy Model: Attempts to solve tasks, generating answers or outputs.
Reward Model (GRM/Judge): Acts as a generative judge, verifying correctness and providing scalar rewards; crucially, the GRM is co-evolved with the policy to prevent reward hacking (Sun et al., 16 Oct 2025, Chen et al., 27 Oct 2025).

Co-evolution proceeds as a virtuous cycle: prompt generator proposes, policy acts, GRM verifies, and all three roles update in response. This closed loop delivers continual performance improvements—even in zero-labeled-data conditions—and outpaces static or rule-based reward settings. Successful Richelieu-style agents leverage:

Entropy-based rewards to ensure the curriculum’s adaptive difficulty,
Continual calibration of generative rewards to sustain progress and robustness,
Synthetic data scale-up (tens of thousands of tasks/candidates), which is critical for downstream accuracy (Sun et al., 16 Oct 2025).
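One plausible reading of an entropy-based curriculum reward is sketched below: the prompt generator is rewarded by the binary entropy of the policy's empirical success rate on a task, which peaks when the task is solved about half the time. This formulation is an assumption for illustration, not the reward defined in the cited work, and `policy` and `judge` are hypothetical callables. Parameter updates for the three roles are omitted.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) outcome; 0 at p=0 or 1, 1 at p=0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def generator_reward(judge_scores):
    """Reward a task by how uncertain the policy's success on it is."""
    success_rate = sum(judge_scores) / len(judge_scores)
    return binary_entropy(success_rate)


def coevolution_step(tasks, policy, judge, samples=4):
    """One propose -> attempt -> verify pass, returning per-task signals."""
    signals = []
    for task in tasks:
        # The judge returns 1 for a correct answer and 0 otherwise.
        scores = [judge(task, policy(task)) for _ in range(samples)]
        signals.append({
            "task": task,
            "policy_reward": sum(scores) / len(scores),
            "generator_reward": generator_reward(scores),
        })
    return signals
```

Under this assumption, tasks the policy always solves or always fails earn the generator no reward, which pushes the curriculum toward the edge of the policy's current ability.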

4. Empirical Performance and Experimental Validation

Quantitative evaluation of Richelieu demonstrates:

Production Impact: At YouTube scale, agent-discovered changes (e.g., RMSprop adoption, gating layers, multi-objective reward synthesis) yielded statistically significant improvements of +0.03% to +0.14% in core business metrics—exceeding improvement rates from traditional engineering cycles (Wang et al., 10 Feb 2026).
Ablation Studies: Removing the expert MLE persona, downgrading the LLM size, randomizing the history, or reducing top-k memory each degrades solution quality; full, sorted context and larger LLMs consistently perform best.
Acceleration: Richelieu increases experiment throughput by 10–100× with zero human labor, cutting idea-to-deployment latency from days to hours.

In reinforcement-style self-evolution settings, co-evolving GRMs in the loop drive continued accuracy gains, while static verifiers or rule-based rewards yield rapid plateaus and susceptibility to adversarial task generation (Sun et al., 16 Oct 2025, Chen et al., 27 Oct 2025).

5. Infrastructure, Safety, and Scalability

The Richelieu production instantiation employs:

Kubernetes-managed orchestration of LLM inference, training jobs, SQL analytics, and experimental workflows.
Experiment Journal: A centralized analogue of lifelong memory, implemented as a BigQuery/SQL repository for all diffs, metrics, and logs.
Modular orchestration (e.g., dynamic mixture-of-experts gating, meta-cognition engines), which further generalizes Richelieu to lifelong multi-domain agent workloads by optimizing token and compute resources and enabling just-in-time expert hiring/eviction so that context size and latency remain manageable (Sampath et al., 10 Jan 2026).

Safety is maintained by embedding syntax/schema enforcement, training drift detection, compliance checks, and explicit early stopping on violation of guardrail metrics (e.g., no metric regresses by >1%).
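A guardrail check of the kind described (stop if any metric regresses by more than 1%) could be sketched as follows. The metric names, the relative-change definition, and the `lower_is_better` parameter are illustrative assumptions; the source does not describe how regressions are computed.

```python
def guardrail_violations(control, treatment, max_regression=0.01,
                         lower_is_better=()):
    """Return names of metrics whose relative regression exceeds the limit."""
    violations = []
    for name, base in control.items():
        if name not in treatment or base == 0:
            # Relative change is undefined here; a real system would
            # likely treat this as an error rather than skip it.
            continue
        change = (treatment[name] - base) / abs(base)
        if name in lower_is_better:
            change = -change      # flip sign so negative always means worse
        if change < -max_regression:
            violations.append(name)
    return violations


def should_stop(control, treatment, **kwargs):
    """Early-stop signal: True if any guardrail metric is violated."""
    return bool(guardrail_violations(control, treatment, **kwargs))
```

For example, with a control retention of 0.50 and a treatment retention of 0.49, the relative change is -2%, so `should_stop` returns `True` under the default 1% limit.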

A summary comparison of throughput, human effort, and safety guardrails:

Agent Mode	Offline Throughput (experiments/week)	Human Effort (hours/experiment)	Safety Guardrails
Human ML engineer	1–10	1–10	Manual review
Richelieu (full-loop)	~100	0	Automated, LLM-enforced

6. Limitations and Future Directions

Observed limitations include:

Cold-Start Fragility: An empty Experiment Journal yields only basic, "textbook" proposals; initial human or random seeding is needed.
Safety Tuning Dilemma: Overly conservative guardrails inhibit exploration, while overly lenient ones risk deployment harm.
Inference Cost: High-quality reasoning depends on large LLMs, increasing inference latency and monetary cost. For critical experimentation, Pro-class Gemini models outperform smaller variants by +0.2–0.4 standard deviations in regression targets (Wang et al., 10 Feb 2026).

Research frontiers and extensions involve:

Cross-surface meta-learning by sharing journals/personas to bootstrap novel product lines.
Automated, LLM-driven proposal of dynamic safety thresholds based on real-time traffic variance.
Mixed-initiative agents integrating human strategic directions (e.g., "focus on cold-start retention") with autonomous loops.
Multi-agent co-evolution frameworks, where complementary LLM agents handle fairness, causal auditing, or ethical compliance in the same dual-loop infrastructure.
Integration with world-model lookahead (co-evolving environment simulators for imagination and sample efficiency) and “native” meta-evolution agents that operate reward-free at inference time via self-generated world knowledge (Zhang et al., 20 Apr 2026, Fang et al., 23 Apr 2025).

7. Position Relative to Broader LLM Agency

The Richelieu paradigm sits at the intersection of closed-loop agentic optimization, tool-using LLM augmentation, and open-ended curriculum evolution:

It extends prompt-driven task loops and chain-of-thought frameworks by externalizing memory and summary, tracking hypothesis history, and leveraging role/persona switching (Nachkov et al., 16 Oct 2025).
Compared with ReAct-style open-ended agents, Richelieu enforces much more rigorous safety checking, structured memory evolution, and explicit experimental selection (Nachkov et al., 16 Oct 2025, Guan et al., 2024).
In multi-agent and embodied contexts, Richelieu inspires further advancements combining individual learning, team-level communication evolution, and dynamic knowledge distillation (Li et al., 8 Jun 2025).

Taken together, Richelieu and its descendants exemplify how LLM-based agents can achieve autonomous, continual, and scalable self-improvement—surpassing human- and rule-driven baselines across a spectrum of application domains (Wang et al., 10 Feb 2026, Sun et al., 16 Oct 2025, Zhang et al., 20 Apr 2026).