Principled Demonstration Construction and Selection for CBRL

Develop principled methods for constructing and selecting the few-shot demonstrations that Context Bootstrapped Reinforcement Learning (CBRL) prepends to training prompts, so that demonstrations align better with training instances and performance improves across tasks; investigate learned retrieval mechanisms to automate this process.

Background

Context Bootstrapped Reinforcement Learning (CBRL) addresses exploration inefficiency in Reinforcement Learning from Verifiable Rewards by stochastically injecting few-shot demonstrations into training prompts with an annealed probability schedule. This scaffolding boosts early exploration and is phased out over training to encourage autonomous performance.
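The injection step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the linear schedule, the default probabilities, the number of shots `k`, and the function names `injection_probability` and `build_prompt` are all assumptions introduced here, since the source only states that the probability is annealed.

```python
import random


def injection_probability(step, total_steps, p_start=0.5, p_end=0.0):
    """Assumed linear annealing from p_start (step 0) to p_end (final step)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def build_prompt(task_prompt, demonstrations, step, total_steps,
                 k=2, p_start=0.5, p_end=0.0, rng=None):
    """With the annealed probability, prepend up to k sampled demonstrations."""
    rng = rng or random.Random()
    p = injection_probability(step, total_steps, p_start, p_end)
    if demonstrations and rng.random() < p:
        shots = rng.sample(demonstrations, min(k, len(demonstrations)))
        return "\n\n".join(shots + [task_prompt])
    # No injection: the model sees the bare task prompt.
    return task_prompt
```

Under these assumptions, demonstrations appear in about half of the prompts at the start of training and in none by the final step, so the policy must eventually solve tasks without scaffolding. Here the demonstrations are sampled uniformly at random, which is exactly the choice that a principled selection method would replace.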

The paper reports consistent gains across multiple tasks and models, but also observes task-dependent variation, including cases where performance decreases, which suggests that the match between demonstrations and task structure matters. The authors note that systematically constructing and selecting demonstrations remains an open problem, and suggest that learned retrieval mechanisms may improve alignment between examples and training instances.
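To make the idea of instance-aligned selection concrete, here is a hedged sketch of a similarity-based selector. It is not a method from the paper: it uses plain token-overlap (Jaccard) similarity as a stand-in scoring function, and the names `tokenize`, `jaccard`, and `select_demonstrations` are hypothetical. A learned retriever would, under this framing, replace the `score` argument with a trained relevance model.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a placeholder for a learned scorer."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_demonstrations(instance, pool, k=2, score=jaccard):
    """Return the k demonstrations in the pool that score highest against the instance."""
    ranked = sorted(pool, key=lambda demo: score(instance, demo), reverse=True)
    return ranked[:k]


pool = ["Solve 2 + 3", "Translate hello to French", "Solve 7 + 8"]
selected = select_demonstrations("Solve 4 + 5", pool, k=2)
```

In this toy pool the two arithmetic demonstrations outrank the translation one for an arithmetic instance. Surface overlap is a weak proxy for task structure, which is one reason the open problem points toward learned retrieval rather than fixed heuristics.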

References

Second, developing principled methods for constructing and selecting demonstrations remains an open challenge; learned retrieval mechanisms could automate this process while improving alignment between examples and training instances.

Context Bootstrapped Reinforcement Learning (2603.18953 - Agashe et al., 19 Mar 2026) in Section: Future Work