Proposer–Solver Co-Evolution Dynamics
- Proposer–solver co-evolution dynamics is a framework where a generative proposer creates tasks and an evaluative solver iteratively refines solutions, leading to self-improving curricula.
- It employs advanced reward shaping and regularization techniques to maintain stability, task diversity, and adaptive difficulty across iterative training cycles.
- This approach underpins advances in self-supervised learning, meta-learning, and algorithmic optimization, demonstrating measurable gains in reasoning and search tasks.
Proposer–solver co-evolution dynamics refers to a class of frameworks, analytical models, and training algorithms in which two (or more) agents, one in a generative (proposer) role and one in an evaluative or adversarial role (solver, verifier, judge, etc.), are iteratively trained or evolved to improve synergistically or competitively. The core principle is the closed-loop interaction between the proposer, which generates candidate queries, tasks, or fitness functions, and the solver, which attempts to answer, solve, or maximize under these proposals. This structure gives rise to self-improving curricula, adaptive objective discovery, rich diversity in both generated queries and solution strategies, and, in many cases, a mathematically analyzable form of emergent equilibrium or periodicity. These dynamics underpin advances in self-supervised training of large multimodal models (LMMs) and large language models (LLMs), meta-learning, evolutionary game theory, and algorithmic optimization.
1. Formalization of Proposer–Solver Co-Evolution
The canonical setting comprises a "proposer" agent (or population) $\pi_P$, or equivalent, which generates tasks, questions, or objective functions conditioned on available input (e.g., images, prior history, knowledge base). The "solver" agent, $\pi_S$, attempts to satisfy the proposals by producing answers, reasoning trajectories, or optimized solutions. Both agents are typically parameterized by either evolvable populations (as in evolutionary systems) or neural networks with continuous parameters (as in modern deep RL and LLM frameworks).
Key components and formal elements include:
- Generative Policy $\pi_P(q \mid x)$: proposes a task $q$ (e.g., a question $q$ given an image $x$).
- Solving Policy $\pi_S(y \mid x, q)$: produces a response $y$ (e.g., an answer $y$ given $(x, q)$).
- Reward Assignment: Rewards are typically task-adaptive, difficulty-sensitive, and may be continuous (as in EvoLMM (Thawakar et al., 20 Nov 2025)) or binary (as in adversarial setups).
- Bidirectional Feedback: The performance of the solver on proposed tasks feeds back to adjust the difficulty and distribution of new proposals, while the hard and easy regions explored by the proposals directly shape solver learning.
- KL/Penalty Regularization: Used to regulate the magnitude of policy updates and maintain stability.
This paradigm is instantiated in purely competitive (zero-sum), commensalistic, or mutualistic forms, with highly varied reward structures depending on the application domain (Thawakar et al., 20 Nov 2025, Piliouras et al., 2017, Sipper et al., 2022, Chen et al., 27 Oct 2025).
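The abstractions above can be collected into a minimal interface sketch. The Python below is illustrative only; the names (`Proposer`, `Solver`, `CoEvolutionStep`) and the reward callables are assumptions introduced for exposition, not APIs from any of the cited frameworks:

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol

class Proposer(Protocol):
    def sample(self, x) -> str: ...                 # propose a task q conditioned on input x
    def update(self, reward: float) -> None: ...    # policy-gradient or evolutionary update

class Solver(Protocol):
    def sample(self, x, q: str) -> str: ...               # attempt an answer y to task q
    def update(self, rewards: List[float]) -> None: ...   # update from per-sample rewards

@dataclass
class CoEvolutionStep:
    proposer_reward: Callable[[List[str]], float]     # e.g., band-pass on answer entropy
    solver_reward: Callable[[str, List[str]], float]  # e.g., internal agreement or correctness

    def run(self, proposer: Proposer, solver: Solver, x, n_samples: int = 8) -> None:
        q = proposer.sample(x)                                   # proposer generates a task
        ys = [solver.sample(x, q) for _ in range(n_samples)]     # solver attempts it N times
        solver.update([self.solver_reward(y, ys) for y in ys])   # feedback to the solver
        proposer.update(self.proposer_reward(ys))                # bidirectional feedback to the proposer
```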
2. Iterative Training Loop and Algorithmic Implementations
The iterative structure common to proposer–solver co-evolution involves sampling a batch of proposals, evaluating solver responses, and updating both agents via policy-gradient or evolutionary mechanisms. Below is a schematic structure, as exemplified in EvoLMM (Thawakar et al., 20 Nov 2025):
- Step 1: Proposer generates a task $q \sim \pi_P(\cdot \mid x)$ (e.g., a question $q$ given an image $x$).
- Step 2: Solver samples multiple candidate solutions $y_1, \dots, y_N \sim \pi_S(\cdot \mid x, q)$ to the task.
- Step 3: Compute the empirical answer distribution $\hat{p}$, its entropy $H(\hat{p})$, and/or a direct match with ground truth if available.
- Step 4: Compute rewards for both solver (internal consistency, correctness, diversity) and proposer (band-pass on entropy, difficulty, or diversity).
- Step 5: Update parameters using REINFORCE, DPO, PPO, or evolutionary selection, with KL regularization for stability.
Pseudocode for EvoLMM is:
```python
for step in range(MAX_STEPS):
    x = sample_image()                                   # unlabeled input image
    q = proposer.sample(x)                               # proposer generates a question
    y_list = [solver.sample(x, q) for _ in range(N)]     # solver draws N candidate answers
    p = empirical_dist(y_list)                           # empirical answer distribution
    H = entropy(p)                                       # answer entropy (difficulty proxy)
    r_sol = [agreement_score(y, p) for y in y_list]      # solver reward: internal agreement
    r_prop = bandpass_reward(H)                          # proposer reward: band-pass on entropy
    solver.update(r_sol)                                 # per-step solver update
    if step % K == 0:
        proposer.update(r_prop)                          # slower, periodic proposer update
```
Variations include band-pass reward shaping, curriculum adaptation via Gaussian utilities, alternating update steps, replay and history buffers for diversity preservation, and role separation via in-model role instructions (MAE (Chen et al., 27 Oct 2025)). In adversarial cases (PasoDoble (Zhang et al., 14 Nov 2025)), both online and offline update regimes are deployed to further stabilize training.
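The helpers `bandpass_reward` and `agreement_score` in the pseudocode above are left abstract. A minimal sketch of plausible implementations is given below, assuming a Gaussian band-pass around a target entropy and a soft-powered agreement score; the exact functional forms and hyperparameters in EvoLMM may differ:

```python
import math
from collections import Counter
from typing import Dict, List

def empirical_dist(answers: List[str]) -> Dict[str, float]:
    """Empirical distribution over the N sampled answers."""
    counts = Counter(answers)
    return {a: c / len(answers) for a, c in counts.items()}

def entropy(p: Dict[str, float]) -> float:
    """Shannon entropy of the answer distribution (used as a difficulty proxy)."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def bandpass_reward(h: float, h_target: float = 1.0, width: float = 0.5) -> float:
    """Gaussian band-pass: peaks when answer entropy is near the target (illustrative form)."""
    return math.exp(-((h - h_target) ** 2) / (2 * width ** 2))

def agreement_score(y: str, p: Dict[str, float], alpha: float = 0.5) -> float:
    """Soft-powered agreement: reward an answer by the dampened probability of its own cluster."""
    return p.get(y, 0.0) ** alpha
```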
3. Mathematical Foundations and Reward Design
The mathematical underpinnings of proposer–solver co-evolution span dynamical systems, game theory, and RL/objective optimization. Reward shaping is essential and highly problem-dependent:
- Entropy-Based Rewards: EvoLMM explicitly rewards the proposer for questions whose answer entropy $H(\hat{p})$ lies near a preset target $H^\star$, a band-pass shaping that supports a moving "band" of optimal difficulty (Thawakar et al., 20 Nov 2025).
- Continuous Agreement/Consistency: Solver rewards are computed as soft-powered answer probabilities, e.g., of the form $\hat{p}(y_i)^{\alpha}$ with a dampening exponent $\alpha \in (0, 1)$, optionally penalized by output length.
- Preference-Based Optimization: Socratic-Zero uses preference pairs and DPO for the solver, together with a Gaussian curriculum utility over the observed solver success rate $s(q)$ that peaks at intermediate difficulty, so task selection adapts to the solver's current ability (Wang et al., 29 Sep 2025).
- Zero-Sum and Diversity Coupling: In adversarial settings, the proposer maximizes task difficulty but is penalized for trivial/unsolvable tasks or duplicate proposals. Diversity is often enforced via Jaccard or Euclidean distances in token or parameter space (Zhang et al., 14 Nov 2025, Sipper et al., 2022).
- KL Regularization: Per-token KL divergence to a frozen reference model is adaptively weighted to maintain solution diversity while preventing catastrophic drift (Thawakar et al., 20 Nov 2025, Chen et al., 27 Oct 2025); a hedged sketch of such a penalty appears at the end of this section.
In evolutionary models (e.g., SAFE (Sipper et al., 2022)), proposers are objective functions evolving by genotypic novelty, while solvers are candidate strategies evaluated against the best available fitness metric.
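To make the KL-regularization term concrete, the sketch below adds a per-token KL penalty against a frozen reference model to a policy-gradient loss, with a simple proportional controller for the adaptive weight. The PyTorch usage, controller rule, and hyperparameters (`target_kl`, `rate`) are assumptions for illustration, not the exact schedules used in the cited works:

```python
import torch
import torch.nn.functional as F

def per_token_kl(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """KL(policy || reference) per token, from logits of shape [batch, seq, vocab]."""
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    return (logp.exp() * (logp - logq)).sum(dim=-1)   # shape [batch, seq]

def kl_regularized_loss(pg_loss: torch.Tensor,
                        policy_logits: torch.Tensor,
                        ref_logits: torch.Tensor,
                        kl_coef: float) -> torch.Tensor:
    """Policy-gradient loss plus a weighted mean per-token KL penalty."""
    kl = per_token_kl(policy_logits, ref_logits).mean()
    return pg_loss + kl_coef * kl

def adapt_kl_coef(kl_coef: float, observed_kl: float,
                  target_kl: float = 0.05, rate: float = 1.5) -> float:
    """Proportional controller: raise the coefficient when drift exceeds the target, else relax it."""
    return kl_coef * rate if observed_kl > target_kl else kl_coef / rate
```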
4. Dynamical Behavior and Stability Properties
Proposer–solver co-evolution can yield a range of complex dynamical behaviors, tightly linked to the feedback structure and reward design:
- Periodic Dynamics: In analytically tractable multi-agent evolutionary games, the system admits a Hamiltonian structure and supports periodic orbits from almost every initial condition (failing only on a measure-zero set), enforcing genetic or strategy diversity (Piliouras et al., 2017); a toy simulation of this cyclic behavior is sketched after this list.
- Curriculum Emergence: Across LLM and agentic systems, the curriculum naturally adapts when the proposer only receives positive reward for tasks at the current solver's frontier of difficulty. The distribution of tasks progresses from easy to hard, yielding monotonic solver improvement followed by stabilization (Thawakar et al., 20 Nov 2025, Yue et al., 11 Jan 2026, Chen et al., 27 Oct 2025).
- Phase Transitions and Bifurcation: In networked strategic games (e.g., co-evolving multi-person ultimatum), transition between favoritism and group fairness is controlled by the frequency of network rewiring. A bifurcation occurs at a critical threshold of the partner-switching rate (Takesue et al., 2016).
- Ablation Evidence: Freezing the proposer or omitting diversity/validity rewards causes early stagnation, plateaued solver performance, or in pathological cases, collapse into trivial/random outputs (Zhang et al., 14 Nov 2025, Chen et al., 27 Oct 2025, Thawakar et al., 20 Nov 2025).
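The cyclic behavior in the first bullet can be reproduced in a toy setting. The sketch below runs discretized two-population replicator dynamics on a zero-sum matching-pennies game, where the strategy mixtures orbit the interior equilibrium instead of converging; this is a simplified illustration, not the multi-population team setting analyzed in (Piliouras et al., 2017):

```python
import numpy as np

# Payoff matrix of a zero-sum game (matching pennies); the column player receives -A.
A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])

def replicator_step(x: np.ndarray, y: np.ndarray, dt: float = 0.01):
    """One Euler step of two-population replicator dynamics."""
    fx = A @ y            # fitness of row strategies against the current column mixture
    fy = -A.T @ x         # fitness of column strategies (zero-sum: negated payoffs)
    x = x + dt * x * (fx - x @ fx)
    y = y + dt * y * (fy - y @ fy)
    return x / x.sum(), y / y.sum()

x = np.array([0.8, 0.2])   # initial row-population mixture
y = np.array([0.3, 0.7])   # initial column-population mixture
for t in range(20000):
    x, y = replicator_step(x, y)
    if t % 5000 == 0:
        # Under the continuous-time dynamics the quantity sum_i log x_i + sum_j log y_j is conserved
        # (a Hamiltonian-like invariant); small drift here is Euler discretization error, and the
        # mixtures cycle around the uniform equilibrium rather than converging to it.
        print(t, x.round(3), y.round(3), (np.log(x).sum() + np.log(y).sum()).round(3))
```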
5. Applications and Empirical Performance
Proposer–solver co-evolution frameworks have been successfully applied across diverse domains:
- Unsupervised LMM and LLM Training: EvoLMM demonstrates consistent 3% performance gains on multimodal math reasoning benchmarks without any annotated data, relying purely on closed-loop proposer–solver self-evolution (Thawakar et al., 20 Nov 2025).
- Self-Improving Reasoning Agents: The MAE and Dual-Play (PasoDoble) frameworks show 4.5–8% absolute improvements on QA and math tasks, driven by the dynamic generation and resolution of questions by co-evolved agents (Chen et al., 27 Oct 2025, Zhang et al., 14 Nov 2025).
- Search Agents & Self-Play: Search Self-Play and Dr. Zero achieve strong uniform gains on multi-hop and agentic QA benchmarks, using proposers to generate verified, difficult-search tasks and solvers to adapt via GRPO/HRPO (Lu et al., 21 Oct 2025, Yue et al., 11 Jan 2026).
- Evolutionary Robotics and Meta-Objective Discovery: In SAFE, the co-evolution of solution controllers and objective functions yields robust navigation policies on deceptive maze problems, outperforming standard evolutionary and novelty approaches (Sipper et al., 2022); a simplified sketch of this commensalistic loop follows this list.
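As a concrete illustration of the SAFE-style commensalistic pattern, the sketch below evolves solutions against the best score over a population of objective weight vectors, while the objectives themselves are selected purely for genotypic novelty. The representations and operators (`mutate`, `novelty`, linear objectives) are simplifications introduced here, not SAFE's actual encoding:

```python
import random
from typing import List, Tuple

Genome = List[float]  # both solutions and objective weight-vectors are real vectors here

def mutate(g: Genome, sigma: float = 0.1) -> Genome:
    return [x + random.gauss(0.0, sigma) for x in g]

def score(solution: Genome, objective: Genome) -> float:
    """Objective = weight vector; fitness is a weighted sum of the solution's components."""
    return sum(w * x for w, x in zip(objective, solution))

def novelty(g: Genome, population: List[Genome]) -> float:
    """Mean Euclidean distance to the rest of the objective population."""
    dists = [sum((a - b) ** 2 for a, b in zip(g, h)) ** 0.5 for h in population if h is not g]
    return sum(dists) / max(len(dists), 1)

def safe_style_generation(solutions: List[Genome],
                          objectives: List[Genome]) -> Tuple[List[Genome], List[Genome]]:
    # Solver side: each solution is credited with its best score over all current objectives.
    sol_fitness = [max(score(s, o) for o in objectives) for s in solutions]
    best = [s for _, s in sorted(zip(sol_fitness, solutions), key=lambda t: -t[0])]
    solutions = [mutate(s) for s in best[: len(best) // 2] * 2]   # truncation selection + mutation
    # Proposer side: objectives evolve by genotypic novelty only (commensalism:
    # their selection does not depend on how well the solutions performed).
    obj_novelty = [novelty(o, objectives) for o in objectives]
    best_obj = [o for _, o in sorted(zip(obj_novelty, objectives), key=lambda t: -t[0])]
    objectives = [mutate(o) for o in best_obj[: len(best_obj) // 2] * 2]
    return solutions, objectives
```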
A summary of exemplary frameworks is given below:
| Framework | Proposer Role | Solver Role | Reward Structure |
|---|---|---|---|
| EvoLMM | Image-grounded Q generation | N-shot answering, maximize consistency | Band-pass (entropy) + internal agreement |
| MAE | Question generation (LLM) | Answering, leverage Judge feedback | Difficulty, quality, format filtering |
| Socratic-Zero | Teacher/Generator, refine q | DPO-based reasoning update | Curriculum-utility, preference feedback |
| SAFE | Objective-function evolution | Solution/genotype evolution | Genotype/novelty, maximum per-solution scoring |
| PasoDoble | Adversarial QA generation | Multi-try answering | Difficulty/diversity rewards, validity clipping |
| Search Self-Play | Deep search query proposal | Search-based multi-hop answer | Correctness, RAG-verification filtering |
6. Theoretical Analysis and Equilibrium Outcomes
Theoretical analyses of proposer–solver co-evolution underline several generic equilibrium phenomena:
- Hamiltonian Conservation: In team-zero-sum evolutionary settings, a conserved quantity (Hamiltonian) ensures bounded, recurrent cycles in phenotype space, supporting endogenous diversity without mutation or drift (Piliouras et al., 2017).
- Correlated Equilibrium Emergence: Time-averaged play across periodic or quasi-periodic cycles converges to coarse correlated equilibria, attaining the team-game value and optimal population mixtures at the species level (Piliouras et al., 2017).
- Commensalistic and Adversarial Regimes: Depending on the feedback coupling, emergent behaviors range from arms-race adversarial escalation and curriculum-adaptive improvement to static commensalistic benefit (as in SAFE, where objective functions evolve by novelty independently of solver performance (Sipper et al., 2022)).
- Stability via Continuous Rewards and KL Regularization: Continuous soft rewards and adaptive regularization are both empirically and analytically critical for avoiding mode collapse, trivialization, or reward hacking (Thawakar et al., 20 Nov 2025, Chen et al., 27 Oct 2025, Zhang et al., 14 Nov 2025).
7. Limitations, Challenges, and Open Directions
While proposer–solver co-evolutionary methods offer major advances in data-free and unsupervised improvement of large models and strategic systems, several open challenges persist:
- Reward Hacking and Trivialization: Poorly shaped or poorly filtered rewards induce degenerate proposer or solver strategies (e.g., unsolvable or trivial questions).
- Stability and Scalability: Empirical stability is sensitive to reward clipping, schedule of agent updates, and regularization constants; theoretical convergence guarantees remain limited in deep RL and language domains.
- Generalization and Diversity: Generative diversity must be enforced to avoid collapse onto narrow task or solution distributions.
- Transfer and Meta-Learning: Generalizing these dynamics across domains, modalities, and agent architectures, and quantifying transferability, remain open areas for future research.
Empirical evidence across multiple benchmarks, ablation studies, and dynamical analyses confirms that carefully coupled proposer–solver systems, using continuous, bidirectional, band-pass-shaped feedback, yield robust, domain-agnostic frameworks for self-improving models and strategic agents (Thawakar et al., 20 Nov 2025, Chen et al., 27 Oct 2025, Zhang et al., 14 Nov 2025, Lu et al., 21 Oct 2025, Wang et al., 29 Sep 2025, Sipper et al., 2022, Piliouras et al., 2017).