Dynamic goalpost updating in Guided Asymmetric Self-Play (GASP)

Determine how to update the set of hard goalpost questions that guides the teacher in Guided Asymmetric Self-Play (GASP) for code generation once the student reaches a goalpost question, so that the guidance remains meaningful as the model improves.

Background

GASP steers asymmetric self-play using a fixed set of hard real-data goalpost questions drawn from the LiveCodeBench training split (pre-2024.08). These goalposts are chosen to be unsolved by standard RLVR training and by an AZR checkpoint, and they guide the teacher to generate lemma–lift curricula that expand the student’s capabilities.
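The selection criterion described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Question` record, the `select_goalposts` function, the sets of baseline-solved question ids, and the reading of "pre-2024.08" as a release date before 2024-08-01 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one benchmark question."""
    qid: str
    release_date: str  # ISO "YYYY-MM-DD", so string comparison orders by date


def select_goalposts(questions, solved_by_rlvr, solved_by_azr, cutoff="2024-08-01"):
    """Keep training-split questions that neither baseline solves.

    `solved_by_rlvr` and `solved_by_azr` are sets of question ids that the
    respective baseline solved (assumed to be computed elsewhere).
    The cutoff date is an assumed reading of "pre-2024.08".
    """
    return [
        q for q in questions
        if q.release_date < cutoff
        and q.qid not in solved_by_rlvr
        and q.qid not in solved_by_azr
    ]
```

Under these assumptions, a question enters the goalpost set only if it belongs to the training split and both baselines fail on it.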

As training progresses, some goalpost questions become solvable, which raises the question of how the guiding set should evolve. The paper explicitly identifies what to do once a goalpost is reached as an open question and leaves dynamic goalpost updating to future work, with the aim of keeping guidance relevant as the model improves.
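To make the open question concrete, here is one possible update policy, sketched under stated assumptions. The paper does not propose this or any other policy. The `GoalpostPool` class, the sliding-window solve-rate criterion for "reached", the `window` and `threshold` parameters, and the reserve queue of replacement questions are all hypothetical.

```python
from collections import deque


class GoalpostPool:
    """Hypothetical pool that retires reached goalposts and refills from a reserve."""

    def __init__(self, active, reserve, window=4, threshold=0.5):
        self.active = list(active)      # goalposts currently guiding the teacher
        self.reserve = deque(reserve)   # harder candidates waiting to be promoted
        self.window = window            # number of recent attempts to consider
        self.threshold = threshold      # solve rate at which a goalpost counts as reached
        self.history = {g: deque(maxlen=window) for g in self.active}
        self.retired = []

    def record(self, goalpost, solved):
        """Log one student attempt on an active goalpost."""
        self.history[goalpost].append(bool(solved))

    def is_reached(self, goalpost):
        """Reached means a full window of attempts with solve rate >= threshold."""
        h = self.history[goalpost]
        return len(h) == self.window and sum(h) / self.window >= self.threshold

    def update(self):
        """Retire reached goalposts and promote one reserve question per retirement."""
        for g in list(self.active):  # iterate over a copy while mutating
            if self.is_reached(g):
                self.active.remove(g)
                del self.history[g]
                self.retired.append(g)
                if self.reserve:
                    new = self.reserve.popleft()
                    self.active.append(new)
                    self.history[new] = deque(maxlen=self.window)
        return self.active
```

A training loop would call `record` after each student attempt on a goalpost and `update` at a chosen interval. Requiring a full window before retirement is a design choice intended to avoid retiring a goalpost after a single lucky solve; whether that, or replacement from a fixed reserve, is the right policy is exactly what the paper leaves open.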

References

Finally, our framework uses a fixed set of goalposts. An open question is what should happen once a goalpost is reached: ideally, the goalpost set should be updated over time so that guidance remains meaningful as the model improves. We leave dynamic goalpost updating to future work.

GASP: Guided Asymmetric Self-Play For Coding LLMs (2603.15957 - Jana et al., 16 Mar 2026) in Discussion: Limitations