Proposer–Solver Loop in Adaptive Systems
- The proposer–solver loop is an iterative interaction framework where a proposer generates candidate tasks and a solver refines solutions using adaptive feedback.
- It enables automated curriculum discovery and scalable self-improvement with applications in large language models, blockchain protocols, and optimization systems.
- Empirical results demonstrate enhanced task difficulty calibration and improved solution accuracy, validating its efficacy in multimodal reasoning and decentralized systems.
A proposer–solver loop is an iterative interaction paradigm central to a diverse set of research domains, including LLMs, multimodal self-improving systems, combinatorial optimization, and decentralized block production in blockchain protocols. This mechanism deploys two distinct roles: the Proposer, responsible for generating candidate queries, tasks, or solutions, and the Solver, which attempts to resolve these proposals through reasoning, search, or optimization processes. The loop is closed via mutual feedback, producing rewards or difficulty signals that guide both agents towards the frontier of performance, often in a fully unsupervised or self-play manner. The architecture is foundational to automated curriculum discovery, scalable self-improvement, and adaptive orchestration in agentic systems.
1. Formal Structure and Roles
At its core, the proposer–solver loop splits an underlying computational model or system into two agent views:
- Proposer ($\pi_{\theta}$): Generates candidate tasks, queries, questions, or model components grounded in raw input spaces (images, text, topic specifications, etc.). For multimodal reasoning (as in EvoLMM (Thawakar et al., 20 Nov 2025)), the proposer outputs visually grounded math questions from images; in LLM self-play frameworks (PasoDoble (Zhang et al., 14 Nov 2025), Self-Questioning LLMs (Chen et al., 5 Aug 2025)), it generates challenging queries or code snippets.
- Solver ($\pi_{\phi}$): Attempts to solve, reason over, or optimize the proposer's outputs, returning answers, solution trajectories, or code completions. The solver's response consistency, correctness, or coverage typically serves as the feedback signal.
In reinforcement learning (RL) or adversarial setups, both agents are parameterized (often as adapters atop a frozen base model) and updated via policy-gradient methods with intrinsic, structure-driven rewards.
Key mathematical definitions (as in EvoLMM):
- Empirical answer distribution over $N$ solver samples: $\hat{p}(a \mid q, I) = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[a_i = a]$
- Proposer policy: $\pi_{\theta}(q \mid I)$, mapping a raw input $I$ to a candidate question $q$
- Solver policy: $\pi_{\phi}(a \mid q, I)$, mapping a question (and input) to an answer or reasoning trajectory
2. Algorithmic Loop and Learning Procedures
The iterative loop can be formalized as:
- Sample a raw input (image $I$, topic $t$, ground-truth $y$, etc.)
- Proposer outputs a candidate question or task $q \sim \pi_{\theta}(\cdot \mid I)$.
- Solver samples $N$ trajectories or answers $a_1, \dots, a_N \sim \pi_{\phi}(\cdot \mid q, I)$.
- Evaluate internal consistency (e.g., the answer entropy $H(\hat{p})$), majority voting, or external verification (unit tests, search-engine RAG).
- Compute continuous or discrete rewards for both agents:
- Solver reward: Based on answer correctness, agreement, brevity, or code unit test pass rates.
- Proposer reward: Often an entropy-based band-pass (EvoLMM), majority-vote window (SQLM), inverse solver success rate (PasoDoble). Rewards peak at the “edge of competence”—neither trivial nor impossible queries.
- Update policies via REINFORCE, PPO, or specialized curriculum RL with KL regularization, baselines, and adaptive controller mechanisms.
As a concrete example, EvoLMM leverages a frozen Qwen2.5-VL backbone with LoRA adapters for proposer and solver, each updated via self-generated rewards. Hyperparameters include the number of solver samples per question, the center and width of the Gaussian band used for the proposer reward, a softening temperature for the continuous solver reward, and a KL-controller step size (Thawakar et al., 20 Nov 2025). A schematic sketch of one such iteration follows.
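The snippet below is a minimal, self-contained sketch of one iteration of such a loop, not the EvoLMM implementation: `propose`, `solve`, `update_proposer`, and `update_solver` are hypothetical hooks standing in for LLM sampling and policy-gradient updates, and the entropy band-pass and majority-vote rewards follow the generic shapes discussed in the next section.

```python
import math
from collections import Counter
from typing import Callable, List


def run_iteration(
    propose: Callable[[str], str],                        # raw input -> candidate question
    solve: Callable[[str], str],                          # question -> one sampled answer
    update_proposer: Callable[[str, str, float], None],   # (input, question, reward) -> policy-gradient step
    update_solver: Callable[[str, str, float], None],     # (question, answer, reward) -> policy-gradient step
    raw_input: str,
    n_samples: int = 8,
    band_center: float = 0.5,
    band_width: float = 0.2,
) -> None:
    """One schematic proposer-solver iteration with an entropy band-pass
    proposer reward and a majority-vote solver reward."""
    question = propose(raw_input)
    answers: List[str] = [solve(question) for _ in range(n_samples)]

    # Empirical answer distribution and its normalized entropy (difficulty signal).
    counts = Counter(answers)
    probs = [c / n_samples for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    norm_entropy = entropy / math.log(n_samples) if n_samples > 1 else 0.0

    # Proposer reward: Gaussian band-pass peaking at intermediate solver uncertainty,
    # so neither trivial (entropy ~ 0) nor chaotic (entropy ~ 1) questions are rewarded.
    r_proposer = math.exp(-((norm_entropy - band_center) ** 2) / (2 * band_width ** 2))

    # Solver reward: agreement with the majority-vote answer (self-consistency proxy).
    majority_answer, _ = counts.most_common(1)[0]
    for answer in answers:
        update_solver(question, answer, 1.0 if answer == majority_answer else 0.0)

    update_proposer(raw_input, question, r_proposer)
```

In a full implementation, the update hooks would apply REINFORCE or PPO steps with baselines and KL regularization, as described above.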
3. Reward Shaping and Curriculum Dynamics
Reward architectures in proposer–solver loops are engineered for automated curriculum discovery and agent co-evolution:
- Band-pass rewards: Entropy-driven Gaussian rewards peak when solver uncertainty is neither near 0 (trivial) nor maximal (chaotic), incentivizing proposers to synthesize “just-hard-enough” tasks (Thawakar et al., 20 Nov 2025).
- Majority-vote windows: Proposer rewards assigned only when the solver’s responses are neither unanimous nor completely disagreeing, pushing for intermediate difficulty (Chen et al., 5 Aug 2025).
- Adversarial/inverse accuracy: In dual-play, the proposer’s reward is inversely proportional to solver accuracy, with diversity regularization to avoid repetitive collapse (Zhang et al., 14 Nov 2025).
- External tool feedback: In search self-play, proposer queries must be verifiable via all search results; the proposer reward is $1$ minus the solver's success rate, driving adversarial escalation of query complexity (Lu et al., 21 Oct 2025). RAG-based verification is crucial to prevent reward hacking.
- Optimization feedback: In interactive optimization, the human proposer iteratively refines the mathematical model based on solver outputs and constraint feedback, supporting explicit direct manipulation and gallery archiving (Liu et al., 2020).
The curriculum emerges dynamically: as the solver’s competence increases, the proposer must escalate task difficulty to retain non-trivial reward, resulting in continual agentic frontier advancement (Yue et al., 11 Jan 2026, Chen et al., 27 Oct 2025).
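For illustration, the majority-vote window and inverse-accuracy shapes above can be written as small standalone functions. The thresholds here are hypothetical, and the published variants layer validity clips and diversity regularizers on top of these raw signals.

```python
from collections import Counter
from typing import List


def majority_vote_window_reward(answers: List[str],
                                low: float = 0.3,
                                high: float = 0.8) -> float:
    """Proposer reward is granted only when the majority-vote share of the
    solver's sampled answers sits in an intermediate band: neither unanimous
    agreement (too easy) nor uniform disagreement (too hard or ill-posed)."""
    majority_share = Counter(answers).most_common(1)[0][1] / len(answers)
    return 1.0 if low <= majority_share <= high else 0.0


def inverse_accuracy_reward(solver_correct: List[bool]) -> float:
    """Adversarial proposer reward: grows as the solver's success rate falls,
    pushing the proposer toward the solver's edge of competence."""
    accuracy = sum(solver_correct) / len(solver_correct)
    return 1.0 - accuracy
```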
4. Applications Across Domains
LLM Self-Play and Co-evolution
Proposer–solver loops are foundational to recent unsupervised self-play RL methods that enable LLMs to improve reasoning and generalization without curated datasets:
- EvoLMM: Closed-loop unsupervised multimodal reasoning with continuous self-rewarding and entropy calibration (Qwen2.5-VL backbone). Delivers 3% absolute gains on ChartQA, MathVista, and MathVision benchmarks solely from raw images (Thawakar et al., 20 Nov 2025).
- PasoDoble: Dual-play adversarial training with clipped rewards and diversity constraints; offline decoupling for stability; up to 600 RL updates, outperforming label-dependent RLVR baselines (Zhang et al., 14 Nov 2025).
- Self-Questioning LLMs: Asymmetric self-play for mathematical and programmatic reasoning, utilizing majority-vote and code unit tests for unsupervised validation (Chen et al., 5 Aug 2025).
Search Agents and RAG-Driven Verification
Frameworks such as Dr. Zero and SSP build automated curricula via iterative search-query generation and solver verification:
- Dr. Zero: Data-free self-evolution in open-domain search, optimizing question hop groups and minimizing compute via HRPO; adversarial curriculum learning with group-level baselines (Yue et al., 11 Jan 2026).
- Search Self-Play: Co-evolution of search queries and document retrieval, RAG-verification, and competitive reward shaping (Lu et al., 21 Oct 2025).
Decentralized Optimization: Ethereum PBS
In blockchain consensus, proposer–solver loops underpin block production auctions. Validators (proposers) select execution payloads from competitive builders (solvers), with economic incentives tuned for MEV extraction, decentralization, and censorship resistance (Koegler, 22 Jun 2025, Heimbach et al., 2023).
- Formal loop: Builders construct execution payloads that maximize block value and submit bids to proposers; the proposer selects the highest bid, and payouts are enforced by a committee or burn auction (Koegler, 22 Jun 2025).
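A minimal proposer-side view of this auction step, using illustrative data structures rather than the actual Ethereum or MEV-Boost types, might look like:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bid:
    builder_id: str
    payload_commitment: str   # hash committing the builder to an execution payload
    value_wei: int            # payment offered to the proposer for inclusion


def select_winning_bid(bids: List[Bid]) -> Optional[Bid]:
    """Proposer-side step of the PBS auction: choose the highest-value bid.
    Payout enforcement (committee attestation or burn auction) is outside this sketch."""
    return max(bids, key=lambda b: b.value_wei, default=None)
```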
Physics-Informed ML and Code Synthesis
Solver-in-the-loop is crucial in domains demanding semantic or physical correctness:
- Turbulence closure: Neural network closures exposed to a differentiable ODE solver, gradients propagated across many time-steps, trajectory-based loss for non-Gaussian statistics (Freitas et al., 2024).
- Logic programming: ASP solver in-the-loop for logic puzzle encoding, using solver feedback to filter and fine-tune LLM-proposed partial programs; best-of-$N$ sampling for robust inference (a generic sketch follows this list) (Schrader et al., 18 Dec 2025).
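Both cases share the same control flow: propose, verify externally, retry. The generic sketch below abstracts the verifier (an ASP solver, unit tests, or a physics check) as a callback; `generate_candidates` is a hypothetical LLM sampling hook, and the retry loop is a simplified stand-in for regeneration with backtracking.

```python
from typing import Callable, List, Optional


def solver_in_the_loop(
    generate_candidates: Callable[[str, int], List[str]],  # (prompt, n) -> candidate programs
    verify: Callable[[str], bool],                          # external solver/verifier check
    prompt: str,
    n_candidates: int = 16,
    max_rounds: int = 3,
) -> Optional[str]:
    """Best-of-N generation filtered by an external verifier; a failed round
    triggers regeneration, and the first passing candidate is returned."""
    for _ in range(max_rounds):
        for candidate in generate_candidates(prompt, n_candidates):
            if verify(candidate):
                return candidate
    return None
```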
5. Empirical Results and Implementation Characteristics
Across published implementations, proposer–solver loops yield consistent gains in target metrics:
- Multimodal math-reasoning: EvoLMM achieves roughly $2$–$3\%$ absolute improvement over strong baselines using only raw images and self-generated queries (Thawakar et al., 20 Nov 2025).
- LLM Reasoning: SQLM boosts three-digit multiplication accuracy from $0.791$ to $0.948$ after 100 RL steps, with additional gains on algebra and Codeforces-style programming tasks (Chen et al., 5 Aug 2025); PasoDoble attains strong advances with buffer-based training and diversity regularization (Zhang et al., 14 Nov 2025).
- Search Agents: Dr. Zero matches or surpasses supervised baselines on 5/7 QA tasks, HRPO yields $0.326$ EM vs. $0.320$ for GRPO, and achieves significant compute reductions (Yue et al., 11 Jan 2026); SSP shows $8$–$11$ point boost over fixed-opponent setups (Lu et al., 21 Oct 2025).
- Blockchain: PBS substantially increases the median block value, but builder and relay markets remain concentrated (relay HHI up to $0.40$), exposing tradeoffs in centralization and censorship (Heimbach et al., 2023).
- Turbulence modeling: Solver-in-the-loop closures reproduce high-order flatness, recover scaling exponents with low MSE, and yield unbiased energy-flux statistics (Freitas et al., 2024).
- Logic code synthesis: ASP solver-in-the-loop substantially boosts exact-match rates over the greedy decoding baseline when combined with best-of-$N$ sampling and regeneration with backtracking (Schrader et al., 18 Dec 2025).
6. Challenges, Controversies, and Stability Mechanisms
The efficacy of the proposer–solver loop depends on careful reward engineering, anti-hacking measures, and stability-promoting algorithmic choices:
- Reward hacking avoidance: Validity reward clips (PasoDoble), format checks (Dr. Zero, MAE), and RAG-verification (SSP) are enforced to prevent the proposer from exploiting ambiguous or unanswerable queries.
- Curriculum collapse: Diversity rewards, entropy band-pass filters, and buffer eviction of stale questions prevent degeneration into trivial or repetitive tasks (a toy sketch of buffer eviction follows this list).
- Training stability: Offline decoupling of agent updates (PasoDoble), group-level baselines (Dr. Zero), and frequency-scheduled proposer updates (SQLM) improve reward monotonicity and reduce gradient variance.
- Centralization and censorship: Protocol-level issues in builder concentration and relay trust remain unresolved in Ethereum PBS, as institutionalized via formal centralization and censorship metrics (Heimbach et al., 2023). Burn auctions and committee smoothing have been proposed as mitigations (Koegler, 22 Jun 2025).
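As a toy illustration of one such mechanism, a question buffer that evicts items the solver has saturated might look like the following; the thresholds and bookkeeping are illustrative and not taken from any of the cited systems.

```python
from typing import Dict, List


class QuestionBuffer:
    """Keeps proposer questions that are still informative for training and
    evicts those the solver has effectively mastered (a toy eviction rule)."""

    def __init__(self, evict_above: float = 0.9, min_attempts: int = 4):
        self.evict_above = evict_above
        self.min_attempts = min_attempts
        self.items: List[Dict] = []   # each entry: {"q": str, "attempts": int, "solved": int}

    def add(self, question: str) -> None:
        self.items.append({"q": question, "attempts": 0, "solved": 0})

    def record(self, question: str, solved: bool) -> None:
        # Update statistics, then drop questions whose solve rate has saturated.
        for item in self.items:
            if item["q"] == question:
                item["attempts"] += 1
                item["solved"] += int(solved)
        self.items = [
            it for it in self.items
            if it["attempts"] < self.min_attempts
            or it["solved"] / it["attempts"] < self.evict_above
        ]
```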
7. Future Directions and Extensions
The proposer–solver loop paradigm is generalizable across data modalities, agent architectures, and dynamic environments:
- Scaling to higher-dimensional, continuous domains (e.g., full Navier–Stokes LES (Freitas et al., 2024)).
- Integration with external automated verifiers (e.g., ASP solvers, physics engines, interpreters) for semantic code generation and combinatorial reasoning (Schrader et al., 18 Dec 2025).
- Protocol-level enshrinement of decentralized block assembly and MEV redistribution in blockchain consensus mechanisms (Koegler, 22 Jun 2025).
- Automated curriculum discovery for human–algorithm hybrid teams in interactive optimization (Liu et al., 2020).
A plausible implication is that proposer–solver loops will increasingly serve as a universal mechanism for autonomous self-improvement and co-evolution in agentic systems, provided that robust reward shaping and adversarial stability protocols are maintained.