
Design a dual-play LLM training framework that sustains improvement without external labels

Develop a dual-play adversarial training framework for large language models that achieves sustained performance improvement across training iterations without relying on large volumes of externally supplied labeled data.


Background

The paper motivates reducing reliance on externally supervised data in Reinforcement Learning with Verifiable Rewards (RLVR) and explores adversarial training paradigms for LLMs. Prior dual-model work such as R-Zero trains two models separately rather than adversarially and quickly plateaus, highlighting the need for a framework that supports continued co-evolution.

The authors position PasoDoble as a step toward this goal by adversarially training a Proposer and a Solver without supervision during training, but they explicitly identify the general design of a dual-play framework that sustains long-term iterative gains without large supervised datasets as an open challenge.
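The paper does not specify the training loop here, so the toy Python sketch below is purely illustrative of the dual-play idea: a Proposer generates problems with verifiable answers and a Solver attempts them, with each side updated from the other's behavior rather than from external labels. The Proposer and Solver roles come from PasoDoble, but the arithmetic task, the success-rate thresholds, and the update rules are hypothetical stand-ins for RL updates on actual LLMs.

```python
import random

class Proposer:
    """Proposes addition problems; its 'policy' is just a difficulty level."""

    def __init__(self):
        self.difficulty = 1  # number of digits per operand

    def propose(self):
        hi = 10 ** self.difficulty - 1
        a, b = random.randint(0, hi), random.randint(0, hi)
        return (a, b), a + b  # problem plus verifiable ground-truth answer

    def update(self, success_rate):
        # Adversarial pressure (illustrative thresholds): raise difficulty
        # when the Solver is comfortable, back off when problems become
        # unsolvable, keeping proposals near the Solver's frontier.
        if success_rate > 0.8:
            self.difficulty += 1
        elif success_rate < 0.2 and self.difficulty > 1:
            self.difficulty -= 1

class Solver:
    """Solves problems; success probability falls off beyond its skill."""

    def __init__(self):
        self.skill = 1

    def solve(self, problem):
        a, b = problem
        digits = len(str(max(a, b)))
        # Succeeds with probability skill/digits: reliable at or below its
        # skill level, increasingly shaky beyond it.
        if random.random() < self.skill / digits:
            return a + b
        return a + b + random.randint(1, 9)  # a wrong answer

    def update(self, success_rate):
        # Practice on hard-but-solvable problems raises skill (toy rule).
        if 0.2 <= success_rate <= 0.8:
            self.skill += 1

proposer, solver = Proposer(), Solver()
for it in range(10):
    outcomes = []
    for _ in range(200):
        problem, answer = proposer.propose()
        outcomes.append(solver.solve(problem) == answer)  # verifiable reward
    rate = sum(outcomes) / len(outcomes)
    solver.update(rate)
    proposer.update(rate)
    print(f"iter {it}: difficulty={proposer.difficulty} "
          f"skill={solver.skill} success={rate:.2f}")
```

In this sketch the co-evolution pressure comes from rewarding the Proposer for problems that are neither trivial nor impossible for the current Solver; the open challenge stated above is designing rewards and update dynamics for real LLMs so that this loop keeps producing gains over many iterations instead of plateauing or collapsing.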

References

Thus, designing a dual-play framework for training LLMs that enables sustained improvement across training iterations without requiring large amounts of supervised training data remains an open challenge.

Better LLM Reasoning via Dual-Play (arXiv:2511.11881, Zhang et al., 14 Nov 2025), Section 1 (Introduction)