Self-play with Execution Feedback
- The paper introduces a self-play paradigm where agents generate tasks and refine their strategies based on real execution outcomes.
- It employs a proposer-executor architecture that creates adaptive curricula, enhancing sample efficiency and policy alignment.
- The approach enables autonomous learning without manual supervision across domains such as reinforcement learning, large language models, and code generation.
Self-play with execution feedback refers to a class of machine learning methodologies in which an agent (or a collection of symmetrically-related roles) interacts with itself or its environment to generate data, propose and solve tasks, and iteratively improve by integrating the explicit outcomes (execution feedback) of its own actions. This paradigm is instantiated in various forms across reinforcement learning (RL), large language model (LLM) alignment, program synthesis, tool use, code generation, and dialogue systems. In all cases, execution feedback—derived from actual environmental outcomes or programmatic validation—serves as a grounding mechanism for autonomous learning, curriculum creation, verification, or correction, obviating the need for human-curated labels or reward models.
1. Core Principles and Architectures
The foundational architecture of self-play with execution feedback is characterized by agent roles (sometimes called “Alice” and “Bob” (Sukhbaatar et al., 2017), proposer/solver (Zhao et al., 6 May 2025), or generator/discriminator pairs) that alternate between proposing tasks and attempting to solve them. Execution feedback emerges from the environment or an explicit evaluation component (e.g., a code executor, test harness, or critic model), providing binary, scalar, or programmatically-derived validation signals.
A prototypical example is the asymmetric self-play paradigm (Sukhbaatar et al., 2017), where Alice proposes a task by executing actions to reach a state, and Bob is then required to undo (reversible environment) or repeat (resettable environment) Alice’s trajectory. Bob’s task performance (execution time or success) generates feedback that determines internal rewards for both agents, closing the self-improvement loop.
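A minimal sketch of the reward assignment in this loop, following the step-count rewards described for asymmetric self-play; the rollout helpers, the episode budget, and the scale parameter gamma are illustrative assumptions rather than code from the paper.

```python
def asymmetric_self_play_episode(env, alice, bob, gamma=0.01, t_max=80):
    """One Alice/Bob episode: Alice sets a task by acting, Bob must undo
    (or repeat) it; rewards couple both agents to Bob's execution outcome."""
    env.reset()
    # Alice acts until she emits STOP, implicitly defining the target state.
    t_alice, target_state = alice.rollout(env, max_steps=t_max)    # assumed helper

    # Bob gets the remaining step budget to reach Alice's target state.
    t_bob, reached = bob.rollout(env, goal=target_state,
                                 max_steps=t_max - t_alice)        # assumed helper
    if not reached:
        t_bob = t_max - t_alice  # failure is charged the full remaining budget

    # Execution feedback as internal reward: Bob is penalized for time spent;
    # Alice is rewarded only when her task costs Bob more than it cost her.
    r_bob = -gamma * t_bob
    r_alice = gamma * max(0, t_bob - t_alice)
    return r_alice, r_bob
```

The self-balancing property follows from this pairing: tasks that are too easy give Alice no reward, while tasks Bob cannot finish are bounded by the episode budget, so Alice is pushed toward the frontier of Bob's current ability.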
This architecture generalizes to multi-agent RL games (Charlesworth, 2018), curriculum-inducing LLM protocols (Dong et al., 19 Jun 2024), negotiation setups with LLM critics (Fu et al., 2023), tool learning (Qiao et al., 2023), iterative code generation (Yang et al., 2023, Peng et al., 18 Nov 2024), and more, unifying disparate domains under the self-play with feedback mechanism.
2. Execution Feedback as a Learning Signal
Execution feedback is leveraged as an explicit outcome signal, enabling the agent to refine its actions or outputs without requiring manually-annotated supervision.
- In RL and task proposals: The success, failure, or efficiency of task execution is fed back as a reward, guiding curriculum growth or policy optimization (Sukhbaatar et al., 2017, Zhao et al., 6 May 2025).
- In program synthesis and code generation: Generated programs are executed against gold or augmented test cases; pass/fail signals gate data acceptance via rejection sampling or score optimization (Yang et al., 2023, Peng et al., 18 Nov 2024, Chen et al., 22 Jan 2025); see the sketch at the end of this section.
- In preference learning and alignment: Execution feedback takes the form of trajectory “win rates” (proportion of head-to-head victories) (Swamy et al., 8 Jan 2024), or validation of tool use correctness (TRICE (Qiao et al., 2023)).
- In negotiation and dialogue: Feedback is either an explicit deal outcome, a third-party critic’s suggestions, or an execution outcome in programmatic state tracking (Fu et al., 2023, Coca et al., 21 Aug 2025).
- In reinforcement fine-tuning: The outcome of multi-turn decision processes, such as code self-repair (Gehring et al., 2 Oct 2024), directly conditions subsequent policy rollouts.
This feedback can be binary (success/failure), scalar (normalized reward or execution time), or structured (verbal error messages (Li et al., 15 Sep 2024), unit test outcomes, execution traces (Chen et al., 22 Jan 2025)), making the paradigm adaptable across application contexts.
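To make the code-generation case concrete, the sketch below runs a candidate program against its unit tests in a subprocess and returns a pass/fail flag plus the captured error trace, which can then gate rejection sampling; the function names, timeout, and test format are illustrative assumptions.

```python
import subprocess
import sys
import tempfile

def execution_feedback(program: str, test_code: str, timeout: float = 5.0):
    """Execute a candidate program together with its unit tests and return
    (passed, error_trace) as the execution-feedback signal."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout)
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"

def filter_by_execution(samples, test_code):
    """Rejection sampling: keep only generations whose execution succeeds."""
    return [s for s in samples if execution_feedback(s, test_code)[0]]
```

The same interface covers the binary, scalar, and structured cases above: the boolean gives the binary signal, runtime or test counts give a scalar, and the error trace is the structured feedback used for self-repair prompts.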
3. Curriculum Induction and Automatic Difficulty Adjustment
Self-play with execution feedback inherently induces adaptive curricula. Task proposals become progressively more complex as the agent’s capabilities increase, due to the reward structure and feedback loop:
- The proposer (e.g., Alice) is incentivized to generate tasks that are achievable yet challenging for the executor (e.g., Bob), as measured by the difference in execution steps or solution time (Sukhbaatar et al., 2017); see the sketch after this list.
- Curriculum generation does not require handcrafting; the complexity of encountered tasks increases in step with the agent’s own competence. This has been empirically shown to improve sample efficiency, reduce the amount of externally supervised training required, and enable transfer to more complex downstream tasks in both RL and code domains (Sukhbaatar et al., 2017, Dong et al., 19 Jun 2024).
- In absolute self-curriculum settings, such as "Absolute Zero" (Zhao et al., 6 May 2025), the agent simultaneously evolves both the task distribution and the solving strategy, with execution feedback validating novelty, non-triviality, and correctness.
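One simple way to make this difficulty adjustment explicit (the asymmetric self-play papers obtain it implicitly from the reward structure) is to track the solver's execution success rate and keep it in a target band; the proposer/solver/executor interfaces, the band, and the step size below are illustrative assumptions.

```python
def curriculum_step(proposer, solver, executor, difficulty,
                    target_band=(0.3, 0.7), step=0.1, batch=32):
    """Propose a batch of tasks at the current difficulty, measure the solver's
    execution success rate, and nudge difficulty so tasks stay challenging
    but achievable, with no hand-crafted curriculum."""
    tasks = [proposer.propose(difficulty) for _ in range(batch)]    # assumed API
    successes = [executor.run(solver.solve(t), t) for t in tasks]   # assumed API
    success_rate = sum(successes) / batch

    low, high = target_band
    if success_rate > high:                  # tasks too easy: raise difficulty
        difficulty += step
    elif success_rate < low:                 # tasks too hard: lower difficulty
        difficulty = max(0.0, difficulty - step)
    return difficulty, success_rate
```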
4. Algorithmic and Theoretical Foundations
Multiple algorithmic instantiations use execution feedback for robust self-play:
- Asymmetric self-play: Sequential task proposal/execution, with reward functions directly coupled to step counts or success rates (Sukhbaatar et al., 2017).
- Actor-critic and PPO-based RL: Proximal Policy Optimization (PPO) is commonly used, with execution feedback defining the reward at each episode or dialogue turn (Charlesworth, 2018, Gehring et al., 2 Oct 2024).
- Direct Preference Optimization (DPO): Execution feedback is used to form preference pairs (passing/failing outputs) to directly optimize policy difference margins, especially in LLM fine-tuning (Dong et al., 19 Jun 2024, Zhai et al., 25 Mar 2025).
- No-regret and minimax games: In preference-based RL, self-play is formulated as a zero-sum game over trajectory-level preferences, with convergence to minimax-winner policies via online optimization (Swamy et al., 8 Jan 2024).
- Regularization and stability: Self-play feedback can result in unstable policy drift; mitigations include KL regularization with a base model, geometric mixtures of policies, and fictitious play against the history of earlier iterates (Alami et al., 4 Apr 2024).
Representative reward and update formulas from these works include the asymmetric self-play rewards

$$R_B = -\gamma\, t_B, \qquad R_A = \gamma \max(0,\, t_B - t_A),$$

where $t_A$ and $t_B$ are Alice's and Bob's step counts and $\gamma$ scales the internal reward (Sukhbaatar et al., 2017); the KL-regularized self-play objective

$$\max_{\pi}\; \mathbb{E}_{\pi}[r] - \beta\, \mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)$$

(Alami et al., 4 Apr 2024); and the DPO loss over execution-validated preference pairs

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$

where $y_w$ is a completion that passes execution or verification and $y_l$ one that fails (Dong et al., 19 Jun 2024, Zhai et al., 25 Mar 2025).
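A minimal sketch of wiring execution feedback into the DPO loss above: passing and failing completions are paired per prompt and the margin is computed from policy and reference log-probabilities. The model interface and the batching are illustrative assumptions, not a specific paper's pipeline.

```python
import torch
import torch.nn.functional as F

def build_preference_pairs(prompt, candidates, passed):
    """Pair execution-validated completions: each (winner, loser) pair has a
    winner that passed execution and a loser that failed."""
    winners = [c for c, ok in zip(candidates, passed) if ok]
    losers = [c for c, ok in zip(candidates, passed) if not ok]
    return [(prompt, w, l) for w in winners for l in losers]

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss over batched sequence log-probabilities of winners/losers
    under the trained policy and the frozen reference policy."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Dummy log-probabilities (shape [batch]) just to show the call.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-8.0]))
```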
5. Applications and Empirical Innovations
Self-play with execution feedback has led to significant improvements across diverse domains.
- Unsupervised RL Curriculum: Asymmetric self-play has been shown to reduce sample requirements and reach higher eventual rewards in environments such as Mazebase, StarCraft, and continuous control problems (Sukhbaatar et al., 2017).
- Instruction-following LLMs: The AutoIF framework leverages code verification and unit test execution to filter and generate high-quality, self-supervised instruction-following data for SFT and DPO, enabling strong self-alignment and distillation (Dong et al., 19 Jun 2024).
- Tool learning: TRICE’s RLEF method enables LLMs to learn not only which tool to use but also when to forgo tool use, guided by multi-stage execution feedback (Qiao et al., 2023).
- Code generation and optimization: Execution feedback is central to program repair, performance optimization (PerfCodeGen; Peng et al., 18 Nov 2024), interactive debugging (InterCode; Yang et al., 2023), thought-level reasoning correction (RethinkMCTS; Li et al., 15 Sep 2024), and in-execution debugging (Chen et al., 22 Jan 2025).
- Dialogue State Tracking: PyTOD employs execution-aware code generation and validation to achieve cross-turn state tracking improvements on task-oriented dialogue benchmarks (Coca et al., 21 Aug 2025).
- Preference RL and alignment without reward models: Policy improvement via direct execution-based comparison avoids the instability and compounding errors of reward model-based RLHF, offering robustness to preference noise and intransitivity (Swamy et al., 8 Jan 2024); a sketch of the win-rate signal follows this list.
- Zero-data, fully autonomous curricula: The Absolute Zero Reasoner self-evolves both problems and solutions, validated only through code execution, achieving strong out-of-distribution and mathematical reasoning performance (Zhao et al., 6 May 2025).
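A minimal sketch of that win-rate signal: a sampled trajectory is scored by its head-to-head record against other samples from the same policy. The policy sampler and the compare oracle (e.g., an execution-based or preference-based head-to-head check) are illustrative assumptions.

```python
def win_rate_reward(policy, prompt, compare, n_opponents=8):
    """Self-play preference signal: score one sampled trajectory by its
    win rate against fresh samples from the same policy."""
    candidate = policy.sample(prompt)                              # assumed API
    opponents = [policy.sample(prompt) for _ in range(n_opponents)]
    wins = sum(compare(candidate, opponent, prompt)                # 1 if win, 0 if loss
               for opponent in opponents)
    return candidate, wins / n_opponents                           # reward in [0, 1]
```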
Summary tables:
| Domain | Self-Play Role Structure | Execution Feedback Type |
|---|---|---|
| RL/Exploration | Proposer/Executor | Reward (steps/time/success) |
| LLM alignment | Generator/Critic (DPO) | Pass/fail on tests, win rates |
| Code synthesis | Generator/Self-debugger | Pass/fail/error trace |
| Tool use | SFT + RLEF | Output correctness, API use |
| Dialogue | Parser/Execution Supervisor | State update validation |

| Algorithmic Aspect | Execution Feedback Utilization |
|---|---|
| Curriculum learning | Task difficulty scaling based on execution |
| Policy optimization | Gradient update via execution outcome |
| Regularization | KL to base model, fictitious play history |
| Code/test ranking | Dual-critic scoring on execution outcomes |
| Program repair/self-debugging | In-execution trace/prompt for fix |
6. Limitations, Challenges, and Future Outlook
Several open technical challenges remain:
- Feedback Quality and Bias: Self-generated tests or verification routines may be incorrect or biased, leading to false negatives/positives and learning drift (Chen et al., 22 Jan 2025, Wang et al., 18 Dec 2024). Incorporating in-execution traces and cross-validation with external references can mitigate but not eliminate such biases.
- Credit Assignment: Trajectory-level signals complicate local update attribution; averaging rewards or applying potential-based shaping (recalled in the formula after this list) is prevalent but may dilute the signal (Swamy et al., 8 Jan 2024).
- Nonstationarity and Instability: When an agent plays itself, regularization against a base model or fictitious averaging is necessary to avoid policy collapse or runaway drift (Alami et al., 4 Apr 2024).
- Scalability to Complex Domains: Extensions to multi-language, multi-tool, or unbounded dialogue domains challenge current execution feedback estimation and necessitate further research in automated verification, compositional tool use, or richer simulated environments (Yang et al., 2023, Wang et al., 18 Dec 2024, Coca et al., 21 Aug 2025).
- Autonomy and Open-Endedness: Self-evolving curricula risk diverging from human-aligned utility or correct semantics unless appropriately grounded by robust execution environments (Zhao et al., 6 May 2025).
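For the credit-assignment point above, the classical potential-based shaping form is recalled here for reference; it is standard background rather than a contribution of the cited papers.

```latex
% Potential-based reward shaping: for any potential function \Phi over states,
% the shaped reward
%   r'(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)
% leaves the set of optimal policies unchanged, which is why it is a common
% remedy for trajectory-level credit assignment.
\[
  r'(s, a, s') \;=\; r(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s)
\]
```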
A plausible implication is that future research will focus on improving automated validation, leveraging more expressive intermediate feedback (e.g., runtime traces, semantic interpreters), and integrating multi-agent or adversarial feedback for robust, scalable, and safe autonomous agent learning.
7. Significance and Emerging Directions
Self-play with execution feedback represents a unifying paradigm for scalable, self-improving AI systems. By grounding learning in explicit, verifiable outcomes produced by the agent’s own actions or code, the method provides a robust foundation for curriculum generation, policy alignment, and autonomous skill acquisition. Its instantiations span from RL environment exploration (Sukhbaatar et al., 2017), LLM self-alignment and negotiation (Dong et al., 19 Jun 2024, Fu et al., 2023), tool learning (Qiao et al., 2023), code optimization (Peng et al., 18 Nov 2024, Yang et al., 2023), to unsupervised task proposal and solving (Zhao et al., 6 May 2025). The paradigm’s trajectory is shaped by advances in programmatic validation, multi-agent self-improvement, and cross-domain transfer, with execution feedback providing the central, objective learning signal underpinning robust autonomy.