Design a dual-play LLM training framework that sustains improvement without external labels

Develop a dual-play adversarial training framework for large language models that sustains performance improvement across training iterations without relying on large volumes of externally labeled supervision.

Background

The paper motivates reducing reliance on externally supervised data in Reinforcement Learning with Verifiable Rewards (RLVR) and explores adversarial training paradigms for LLMs. Prior dual-model work such as R-Zero trains its two models separately rather than adversarially, and its gains quickly plateau, highlighting the need for a framework that supports continued co-evolution.

The authors position PasoDoble as a step toward this goal by adversarially training a Proposer and a Solver without supervision during training, but they explicitly identify the general design of a dual-play framework that sustains long-term iterative gains without large supervised datasets as an open challenge.
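The paper does not give pseudocode for this loop, but the co-evolution idea can be illustrated with a toy simulation: a Proposer is nudged toward problems near the Solver's frontier (neither trivially easy nor unsolvable), while the Solver improves only when trained in that frontier band against a verifiable reward. All class names, update rules, and thresholds below are illustrative assumptions, not PasoDoble's actual algorithm; real Proposer/Solver updates would be RL policy-gradient steps on an LLM, not the integer heuristics used here.

```python
import random

def verify(problem, answer):
    """Verifiable reward: check the Solver's answer against ground truth."""
    a, b = problem
    return answer == a + b

class Proposer:
    """Toy Proposer whose 'policy' is the operand magnitude of addition problems."""
    def __init__(self, difficulty=1):
        self.difficulty = difficulty

    def propose(self, rng):
        lo, hi = 10 ** (self.difficulty - 1), 10 ** self.difficulty
        return rng.randrange(lo, hi), rng.randrange(lo, hi)

    def update(self, solver_success_rate):
        # Adversarial signal (assumed heuristic): raise difficulty when the
        # Solver is too strong, lower it when problems become unsolvable.
        if solver_success_rate > 0.8:
            self.difficulty += 1
        elif solver_success_rate < 0.2 and self.difficulty > 1:
            self.difficulty -= 1

class Solver:
    """Toy Solver whose 'skill' caps the operand size it adds reliably."""
    def __init__(self, skill=1):
        self.skill = skill

    def solve(self, problem, rng):
        a, b = problem
        if max(a, b) < 10 ** self.skill:
            return a + b                 # within skill: always correct
        if rng.random() < 0.3:           # beyond skill: occasionally correct
            return a + b
        return a + b - 1                 # systematic error

    def update(self, success_rate):
        # Assumed learning rule: the Solver improves only when trained
        # near its frontier, i.e. on problems it sometimes solves.
        if 0.2 <= success_rate <= 0.8:
            self.skill += 1

def dual_play(rounds=5, batch=50, seed=0):
    """Run the dual-play loop; returns (difficulty, skill, success_rate) per round."""
    rng = random.Random(seed)
    proposer, solver = Proposer(), Solver()
    history = []
    for _ in range(rounds):
        problems = [proposer.propose(rng) for _ in range(batch)]
        successes = sum(verify(p, solver.solve(p, rng)) for p in problems)
        rate = successes / batch
        history.append((proposer.difficulty, solver.skill, rate))
        proposer.update(rate)
        solver.update(rate)
    return history
```

Under these assumptions the two agents co-evolve: each time the Solver masters the current difficulty, the Proposer escalates, re-creating a frontier band in which the Solver can improve again. This is the dynamic that, per the paper, separately trained dual-model setups like R-Zero fail to sustain.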

References

Thus, designing a dual-play framework for training LLMs that enables sustained improvement across training iterations without requiring large amounts of supervised training data remains an open challenge.

Better LLM Reasoning via Dual-Play (2511.11881 - Zhang et al., 14 Nov 2025) in Section 1 (Introduction)