Extend theory and algorithms to noisy, non-separable settings
Extend the theoretical results for policy-gradient-based post-training of autoregressive linear models, currently proved under a sequence-level margin separability assumption, to settings with noisy and non-separable responses. In those settings, develop post-training algorithms that are optimal (e.g., in accuracy and query/sample complexity) for both outcome and process reward models.
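To make the setting concrete, here is a minimal sketch of policy-gradient (REINFORCE) post-training with a sequence-level outcome reward on a toy autoregressive model. Everything in it is illustrative and not the paper's construction: the linear policy `W @ ctx`, the vocabulary size, and the reward (1 if a designated "good" token appears anywhere in the sequence) are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical, not the paper's exact model): an autoregressive
# policy whose next-token logits are a linear function W @ ctx of fixed,
# normalized context features; it samples T tokens from a V-word vocabulary.
V, D, T = 5, 8, 4
ctx = rng.normal(size=D)
ctx /= np.linalg.norm(ctx)

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def outcome_reward(tokens):
    """Sequence-level (outcome) reward: 1 if the 'good' token 0 ever appears.
    An illustrative stand-in for a trained outcome reward model."""
    return float(0 in tokens)

def reinforce_step(W, lr=0.1):
    """One REINFORCE update: sample a full sequence, score it with the
    outcome reward R, and step along R * grad log pi(sequence)."""
    grad = np.zeros_like(W)
    tokens = []
    for _ in range(T):
        p = softmax(W @ ctx)
        tok = rng.choice(V, p=p)
        tokens.append(tok)
        onehot = np.zeros(V)
        onehot[tok] = 1.0
        grad += np.outer(onehot - p, ctx)  # d/dW of log pi(tok | ctx)
    return W + lr * outcome_reward(tokens) * grad

W = np.zeros((V, D))  # start from a uniform base policy
for _ in range(200):
    W = reinforce_step(W)

p_final = softmax(W @ ctx)
# Post-training should shift probability mass toward the rewarded token.
print(p_final[0])
```

A process reward model would instead score each intermediate step, replacing the single sequence-level `R` with per-token credit inside the loop; the noisy, non-separable regime the open question targets would correspond to a stochastic or inconsistent `outcome_reward`.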
References
Another open question is to extend the results of this paper to settings with noisy and non-separable responses, and develop optimal post-training algorithms accordingly.
— Post-Training with Policy Gradients: Optimality and the Base Model Barrier
(2603.06957 - Mousavi-Hosseini et al., 7 Mar 2026) in Section 7 (Conclusion)