Detailed analysis of RLVR’s mechanism for selecting optimal reasoning patterns in non-convex LLM policies
Develop a detailed analysis of how Reinforcement Learning with Verifiable Rewards (RLVR) enables large language models with non-convex autoregressive policies to find and select optimal reasoning patterns for a given question. The analysis should provide training-dynamics and convergence guarantees in the general LLM setting, beyond the simplified tabular policy parameterization.
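As a point of contrast with the general non-convex setting the question targets, the simplified tabular case that existing analyses cover can be sketched directly. The toy below (a hypothetical illustration, not the paper's setup) trains a tabular softmax policy over a few candidate "reasoning patterns" with REINFORCE on a verifiable 0/1 reward; the per-pattern success rates are invented for illustration. Probability mass concentrates on the pattern with the highest verified success rate, which is the selection mechanism whose general-LLM analogue remains open.

```python
# Hypothetical tabular sketch of RLVR-style reasoning-pattern selection.
# K patterns, each with an assumed (invented) probability of producing a
# verifiably correct answer; REINFORCE on the 0/1 reward concentrates the
# softmax policy on the highest-success pattern.
import numpy as np

rng = np.random.default_rng(0)
K = 4
success_rate = np.array([0.9, 0.5, 0.3, 0.1])  # assumed per-pattern accuracy
logits = np.zeros(K)   # tabular policy parameters
lr = 1.0
batch = 32

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(500):
    p = softmax(logits)
    acts = rng.choice(K, size=batch, p=p)                 # sample patterns
    rews = (rng.random(batch) < success_rate[acts]).astype(float)  # verify
    # REINFORCE: grad of log pi(a) w.r.t. logits is one_hot(a) - p
    grad = (rews[:, None] * (np.eye(K)[acts] - p)).mean(axis=0)
    logits += lr * grad

print(softmax(logits))  # mass concentrates on pattern 0
```

In expectation these updates follow replicator-style dynamics, increasing each pattern's logit in proportion to its success rate minus the policy's average success rate; the open question is how this picture extends when the policy is a non-convex autoregressive LLM rather than a table.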
References
However, due to the non-convexity of the policy, a more detailed analysis of how RLVR helps the model find optimal reasoning patterns remains unclear.
— On the Mechanism of Reasoning Pattern Selection in Reinforcement Learning for Language Models
(arXiv:2506.04695, Chen et al., 5 Jun 2025), Section 5.2 (Theoretical Explanation for Empirical Findings)