Extend theory and algorithms to noisy, non-separable settings

Extend the theoretical guarantees for policy-gradient post-training of autoregressive linear models, currently established under a sequence-level margin (separability) assumption, to settings with noisy and non-separable responses, and design post-training algorithms that are optimal in those settings (e.g., in terms of accuracy and query/sample complexity) under both outcome and process reward models.

Background

The paper establishes sample- and query-complexity guarantees for policy-gradient post-training of autoregressive linear models under a separability (margin) assumption, analyzes base-model-dependent barriers with outcome rewards, and shows how process rewards can mitigate these barriers. The results also include minimax lower bounds demonstrating tightness in the separable setting.
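For concreteness, a sequence-level margin separability condition of the kind assumed there can be written as follows (an illustrative formulation with assumed notation, not necessarily the paper's exact statement): there is a bounded parameter that scores every preferred response above every alternative by a fixed margin,

\[
\exists\, \theta^\star,\ \|\theta^\star\|_2 \le 1:\qquad
\langle \theta^\star,\ \phi(x, y^{+}) - \phi(x, y^{-}) \rangle \;\ge\; \gamma
\quad \text{for all prompts } x \text{ and responses } y^{-} \neq y^{+},
\]

where \(\phi\) denotes sequence-level features of a prompt-response pair and \(\gamma > 0\) is the margin. The extension sought here would allow some response pairs to violate this inequality (non-separability), or rewards and preference labels to be corrupted with some probability (noise).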

However, the analysis assumes accurate process rewards and separability of responses. The authors explicitly raise the need to move beyond these assumptions by handling noisy and non-separable responses and by designing optimal post-training algorithms for such settings.
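As a point of reference for the algorithmic side of the question, the sketch below (Python, with a toy tabular task; all names and the setup are assumptions made for illustration, not the paper's algorithm) shows REINFORCE-style policy-gradient post-training of an autoregressive softmax-linear policy under a process reward (a score per step) versus an outcome reward (one scalar for the full response). The noisy, non-separable extension asked for above would correspond to corrupting these rewards or to target behavior that no linear scorer fits with a positive margin.

import numpy as np

rng = np.random.default_rng(0)
V, T = 5, 4                           # vocabulary size, response length
D = V * T                             # feature dimension (tabular position/token features)
PHI = np.eye(D).reshape(T, V, D)      # phi(t, token): indicators, so logits are linear in theta
TARGET = [1, 3, 2, 0]                 # the single "correct" response preferred by the reward

def sample_response(theta):
    # Sample a length-T response token by token from the softmax-linear policy.
    tokens = []
    for t in range(T):
        logits = PHI[t] @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(V, p=probs)))
    return tokens

def outcome_reward(tokens):
    # Sequence-level reward: 1 only if the entire response is correct.
    return float(tokens == TARGET)

def process_reward(tokens):
    # Per-step reward: credit every position that matches the target.
    return [float(tok == tgt) for tok, tgt in zip(tokens, TARGET)]

def policy_gradient_step(theta, lr=0.2, use_process_reward=True):
    tokens = sample_response(theta)
    step_rewards = process_reward(tokens)
    grad = np.zeros_like(theta)
    for t, tok in enumerate(tokens):
        # Credit assignment: return-to-go under process rewards, terminal scalar otherwise.
        ret = sum(step_rewards[t:]) if use_process_reward else outcome_reward(tokens)
        logits = PHI[t] @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # REINFORCE: grad log pi(tok) = phi(t, tok) - E_pi[phi(t, .)] for a softmax-linear policy.
        grad += ret * (PHI[t, tok] - probs @ PHI[t])
    return theta + lr * grad

theta = np.zeros(D)
for _ in range(2000):
    theta = policy_gradient_step(theta, use_process_reward=True)
greedy = [int(np.argmax(PHI[t] @ theta)) for t in range(T)]
print("greedy response:", greedy, "target:", TARGET)

Setting use_process_reward=False makes the reward signal sparse (nonzero only when the sampled response matches exactly), loosely mirroring the contrast between outcome and process rewards discussed above; adding label noise or an unrepresentable target would put the sketch in the noisy, non-separable regime the open problem targets.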

References

Another open question is to extend the results of this paper to settings with noisy and non-separable responses, and develop optimal post-training algorithms accordingly.

Post-Training with Policy Gradients: Optimality and the Base Model Barrier (2603.06957 - Mousavi-Hosseini et al., 7 Mar 2026), Section 7 (Conclusion)