Guided Asymmetric Self-Play (GASP)

Updated 3 July 2026
  • Guided Asymmetric Self-Play (GASP) is a post-training paradigm for LLMs that uses a teacher–student loop anchored by challenging real-data tasks.
  • It dynamically generates simpler lemmas and harder lifts with specific pass rate thresholds to progressively enhance the student's capabilities.
  • Empirical results in program synthesis, such as improved pass rates on LiveCodeBench, validate GASP's effectiveness over traditional self-play methods.

Guided Asymmetric Self-Play (GASP) is a post-training paradigm for LLMs and other learning agents, in which a synthetic teacher–student curriculum is adversarially constructed yet firmly anchored by challenging real-data benchmarks. Designed to overcome major shortcomings of earlier goal-agnostic asymmetric self-play strategies, GASP enables continual capability enhancement by aligning synthetic exploration with the unsolved frontier of real-world tasks. The method has been primarily developed and validated in the context of program synthesis, exhibiting marked improvements, such as increased pass rates on LiveCodeBench, relative to both unguided self-play and standard real-data reinforcement learning.

1. Background: Asymmetric Self-Play and Curriculum Emergence

Asymmetric Self-Play (ASP) introduces a framework in which two agents, typically instances of the same network ("Alice" and "Bob"), co-operatively explore an environment by alternately acting as task proposer and task solver. The asymmetric structure arises from distinct reward signals: the proposer is incentivized to generate tasks just beyond the solver’s current capability, while the solver seeks efficient completion. This dynamic leads to the automatic emergence of a curriculum, with difficulty progressing to match the evolving capabilities of the agent. ASP has been formalized for Markov Decision Processes (MDPs) with policies $\pi_A$ (Alice) and $\pi_B$ (Bob), each mapping observational contexts to action distributions, and an internal reward structure designed to maximize task informativeness and learning efficiency (Sukhbaatar et al., 2017).
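As a concrete illustration of such an internal reward structure, the sketch below follows the time-based rewards described by Sukhbaatar et al. (2017): Bob is penalized for the time he takes, and Alice is rewarded when Bob needs more time than she did. The function name and the value of the scaling factor `gamma` are illustrative choices, not taken from the text above.

```python
from typing import Tuple


def asp_rewards(t_alice: int, t_bob: int, gamma: float = 0.01) -> Tuple[float, float]:
    """Return (alice_reward, bob_reward) for one self-play episode.

    t_alice: steps Alice spent setting the task.
    t_bob:   steps Bob spent completing it.
    gamma:   scaling factor (the default here is an arbitrary placeholder).
    """
    # Bob is rewarded for finishing quickly.
    bob_reward = -gamma * t_bob
    # Alice is rewarded when the task takes Bob longer than it took her,
    # which pushes her toward tasks just beyond Bob's current ability.
    alice_reward = gamma * max(0, t_bob - t_alice)
    return alice_reward, bob_reward
```

Because Alice's reward is clipped at zero, she gains nothing from tasks Bob solves faster than she set them, while the cost of her own steps discourages needlessly long task setups.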

2. Motivation and Limitations of Unguided Asymmetric Self-Play

Standard asymmetric self-play procedures, such as "Absolute Zero" (AZR), generate synthetic tasks at the edge of the student’s competence by relying on student performance feedback alone. However, unguided self-play is fundamentally goal-agnostic: the teacher has no incentive to propose tasks that are relevant to downstream benchmarks. Empirically, this leads to synthetic data that can be either trivial or off-distribution relative to real tasks. Without a real-data grounding mechanism, progress on synthetic tasks does not necessarily translate into improved performance on held-out real-world benchmarks, resulting in either stagnation or curriculum drift (Jana et al., 16 Mar 2026).

3. Formal Structure of Guided Asymmetric Self-Play

GASP extends ASP by integrating a persistent "goalpost" set $G$ of unsolved, challenging real-data tasks. On each curriculum iteration:

  • The teacher $T$ samples a goalpost $h \in G$.
  • $T$ generates a simplification ("lemma" $\ell_0$) using a stochastic mapping $f_{\text{easy}}: G \to \mathcal{Q}$, targeting an intermediate difficulty band defined by the student’s pass rate ($0.3 \leq p_S(\ell_0) \leq 0.7$).
  • Conditioned on $\ell_0$, $T$ produces a more difficult "lift" via a second stochastic mapping, targeting a stricter pass-rate band.
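The band test in the steps above can be sketched as follows. This is a minimal illustration, assuming the student's pass rate is estimated as the fraction of verified successes over a handful of sampled attempts; the function names are hypothetical, and only the 0.3 to 0.7 lemma band comes from the text.

```python
from typing import Sequence


def empirical_pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of student attempts that passed the verifier."""
    if not outcomes:
        raise ValueError("need at least one attempt to estimate a pass rate")
    return sum(outcomes) / len(outcomes)


def in_band(pass_rate: float, low: float = 0.3, high: float = 0.7) -> bool:
    """True when the task falls inside the target difficulty band.

    The defaults are the lemma band quoted in the text; a lift would be
    checked against its own, stricter band.
    """
    return low <= pass_rate <= high
```

For example, a lemma solved in 2 of 4 sampled attempts has an estimated pass rate of 0.5 and would be kept, whereas one solved in every attempt would be rejected as too easy.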

Rewards are shaped by functions that are sharply peaked at the midpoint of each pass-rate band, so that the teacher is rewarded most for tasks that are neither trivial nor impossible.
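One way to picture a reward that peaks at the midpoint of a pass-rate band is a simple triangular bump. The shape below is a hypothetical stand-in chosen for readability, not the reward function of the original work; only the idea of a peak at the band midpoint and the 0.3 to 0.7 lemma band are taken from the text.

```python
def peaked_reward(pass_rate: float, low: float = 0.3, high: float = 0.7) -> float:
    """Triangular reward: 1.0 at the band midpoint, falling linearly to 0.0
    at the band edges and staying at 0.0 outside the band.

    This shape is an assumption made for illustration only.
    """
    midpoint = (low + high) / 2.0
    half_width = (high - low) / 2.0
    return max(0.0, 1.0 - abs(pass_rate - midpoint) / half_width)
```

With the default band, a task the student solves half of the time earns the full reward, a task solved 40% of the time earns about half of it, and tasks that are always or never solved earn nothing.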

Accepted tasks are filtered for novelty to avoid mode collapse. The training set for the student is the union of accepted lemmas and lifts. This loop ensures that synthetic tasks are always motivated by authentic, unsolved real-data difficulties rather than arbitrary exploration (Jana et al., 16 Mar 2026).

4. Curriculum Grounding, Progression, and Diversity

Goalpost grounding is implemented by selecting $G$ as the set of real-data coding problems unsolved by several strong baselines, including RLVR and prior self-play checkpoints. This set typically comprises approximately 25% of the overall benchmark (146 problems on LiveCodeBench). Using pass@100 measurements, $G$ is iteratively updated to reflect the current challenge frontier. Curriculum progression is managed by:

  • Moving from lemma-generation (easier, partially solvable) to lift-generation (harder, near the goalpost).
  • Dynamically updating target difficulty bands as the student's performance improves, closing the gap to the ultimate goalpost over successive iterations.
  • Enforcing diversity by rejecting tasks with cosine similarity exceeding 0.95 to any prior element in a rolling buffer.
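The diversity rule in the last item can be sketched as a rejection filter over a rolling buffer. This is a minimal illustration under stated assumptions: tasks are assumed to be already embedded as numeric vectors (the text does not say how), and the class name, the buffer size, and the choice to store only accepted tasks are hypothetical. The 0.95 threshold is the one quoted above.

```python
import math
from collections import deque
from typing import Deque, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class NoveltyFilter:
    """Rejects tasks that are too similar to anything in a rolling buffer."""

    def __init__(self, threshold: float = 0.95, buffer_size: int = 1024) -> None:
        self.threshold = threshold
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.buffer: Deque[List[float]] = deque(maxlen=buffer_size)

    def accept(self, embedding: Sequence[float]) -> bool:
        """Return True and remember the task if it is novel, else False."""
        for previous in self.buffer:
            if cosine_similarity(embedding, previous) > self.threshold:
                return False
        self.buffer.append(list(embedding))
        return True
```

Because the buffer is bounded, a task that resembles a very old one can be accepted again once the old entry has been evicted, which is the behavior a rolling window implies.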

This approach maintains curriculum relevance and avoids overfitting or stagnation on narrow subregions of the task space (Jana et al., 16 Mar 2026).

5. Algorithmic Details and Implementation

GASP's loop is implemented via three explicit phases per iteration (lemma generation, lift generation, and student update), operating with batched sampling. The original work formalizes the process in pseudocode.

Novelty-based rejection sampling is critical for preserving curriculum diversity and avoiding training instability. All teacher and student updates use standard RLVR (Reinforcement Learning with Verifiable Rewards) policy optimization (Jana et al., 16 Mar 2026).
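The three phases might be sketched as below. This is not the pseudocode of the original work: the teacher, student, and novelty check are passed in as plain callables with hypothetical names, tasks are represented as strings, the batch size is one, and the RLVR updates are left out. The lift band has no default because its limits are not given in the text, and attempting a lift only after its lemma is accepted is an assumption.

```python
import random
from typing import Callable, List, Sequence, Tuple

Task = str  # hypothetical stand-in for a full task specification
Band = Tuple[float, float]


def estimate_pass_rate(solve: Callable[[Task], bool], task: Task, attempts: int) -> float:
    """Fraction of `attempts` student tries that pass the verifier."""
    return sum(solve(task) for _ in range(attempts)) / attempts


def gasp_iteration(
    goalposts: Sequence[Task],
    propose_lemma: Callable[[Task], Task],
    propose_lift: Callable[[Task, Task], Task],
    solve: Callable[[Task], bool],
    is_novel: Callable[[Task], bool],
    rng: random.Random,
    *,
    lift_band: Band,
    lemma_band: Band = (0.3, 0.7),
    attempts: int = 8,
) -> List[Task]:
    """Run one curriculum iteration and return the tasks accepted for training."""
    accepted: List[Task] = []
    goalpost = rng.choice(goalposts)

    # Phase 1: lemma generation (a simplification of the goalpost).
    lemma = propose_lemma(goalpost)
    p_lemma = estimate_pass_rate(solve, lemma, attempts)
    if not (lemma_band[0] <= p_lemma <= lemma_band[1] and is_novel(lemma)):
        return accepted
    accepted.append(lemma)

    # Phase 2: lift generation (a harder task conditioned on the lemma).
    # Assumption: a lift is only attempted once its lemma has been accepted.
    lift = propose_lift(goalpost, lemma)
    p_lift = estimate_pass_rate(solve, lift, attempts)
    if lift_band[0] <= p_lift <= lift_band[1] and is_novel(lift):
        accepted.append(lift)

    # Phase 3: the student (and teacher) policy update would consume
    # `accepted` here; it is omitted from this sketch.
    return accepted
```

The novelty check is evaluated only for tasks that already fall inside their difficulty band, so a stateful filter does not record tasks that were rejected for being too easy or too hard.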

6. Empirical Performance and Ablation Findings

GASP has been empirically validated on the LiveCodeBench v5 coding benchmark. Main findings:

| Method | pass@20 (Eval split) | Goalpost solves (of 146) |
| --- | --- | --- |
| Qwen2.5-Coder-7B base | 29.68% | n/a |
| AZR (unguided self-play) | 31.15% | 0 |
| RLVR on real data | 33.10% | 0 |
| GASP (no real data) | 33.69% (±0.28) | 11 |
| GASP + real-data RL | 34.46% (±0.34) | 10 |
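For readers unfamiliar with the pass@k metric reported above, the sketch below shows the unbiased estimator that is widely used for code benchmarks (introduced with the Codex evaluation by Chen et al., 2021). The text does not state which estimator was used for these results, so treating it as the one applied here is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total samples drawn for the problem.
    c: number of those samples that passed all tests.
    k: number of samples the metric allows (k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is the mean of this quantity over all problems. A problem with no passing sample among its draws scores 0.0, which matches the notion of an unsolved goalpost.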

Ablation studies show that removing novelty-based rejection increases solution variance and training instability, though some seeds achieve higher numbers of goalpost solves (14/146, but with 4× iterations and less stability). Collapsing the curriculum to a single "one-step hard" stage underperforms the full two-stage pipeline, both in overall pass rates and goalpost solves. Removing input/output axis difficulty control leads to more variable results and fewer uniquely solved goalposts (Jana et al., 16 Mar 2026).

7. Connections to Adversarial Curriculum Learning and Extensions

GASP shares conceptual foundations with Heterogeneous Adversarial Play (HAP) frameworks, which model teacher–student dynamics as a minimax game integrating explicit entropy and minimum-probability regularizers to stabilize and diversify the task curriculum. Key mechanisms from HAP that are directly relevant to GASP include:

  • The use of entropy regularization for exploration;
  • Lower-bound constraints on task-sampling probabilities to prevent forgetting of previously learned skills;
  • Warm-up protocols to avoid cold starts when initial student performance is at chance.

A plausible implication is that integrating these formal regularization, warm-up, and bidirectional feedback mechanisms may further improve the stability and efficiency of GASP curricula. HAP also demonstrates that meta-learning and online adaptation of the teacher policy using explicit gradient-based minimax objectives can extend the applicability of GASP-like guided curriculum learning (Xu et al., 21 Oct 2025).
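To make the two regularizers mentioned above concrete, the sketch below shows one generic way to impose a minimum sampling probability on a task distribution, together with the Shannon entropy that an entropy bonus would reward. It is not HAP's formulation, which is not reproduced in the text: mixing with a uniform floor is a common construction chosen here for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def apply_probability_floor(weights: Sequence[float], floor: float) -> List[float]:
    """Turn non-negative weights into a distribution in which every task
    has probability of at least `floor`."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0.0:
        raise ValueError("need at least one positive weight")
    if not 0.0 <= floor * n <= 1.0:
        raise ValueError("floor * number_of_tasks must lie in [0, 1]")
    # Reserve `floor` for each task, then share the rest proportionally.
    remaining = 1.0 - floor * n
    return [floor + remaining * w / total for w in weights]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

Applying a floor of 0.05 to a teacher that samples only one of four tasks yields probabilities of 0.85, 0.05, 0.05, and 0.05, so previously learned tasks keep being revisited and the entropy of the sampling distribution rises above zero.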

8. Significance and Outlook

The essential contribution of GASP is the introduction of a minimal, targeted inductive bias: dynamic alignment of synthetic self-play exploration with a persistent but evolving frontier of unsolved benchmark tasks. This yields synthetic data that is informative and relevant, and that drives non-trivial capability improvements, including on tasks that remain unsolved by both unguided self-play and conventional real-data RL. Its staged, grounded curriculum mechanism and novelty-centric sampling regime distinguish GASP as a robust approach for scalable, data-efficient, and benchmark-aligned post-training of coding LLMs and other complex learning agents (Jana et al., 16 Mar 2026).

