ROVER: Random Policy Valuation for Reasoning

Updated 30 September 2025
  • The paper introduces ROVER, a novel reinforcement learning framework that leverages uniform random policy Q-value estimation to bypass iterative updates for deterministic math tasks.
  • It employs a softmax sampling mechanism over Q-values to promote exploration and maintain diversity by generating multiple valid reasoning paths.
  • Empirical results show ROVER improves pass@1 by +8.2 and pass@256 by +16.8 on math benchmarks while significantly boosting diversity compared to conventional methods.

Random Policy Valuation for Diverse Reasoning (ROVER) is a reinforcement learning (RL) framework designed to improve the reasoning abilities of LLMs, particularly in mathematical problem solving. ROVER departs from traditional RL training paradigms by forgoing the standard generalized policy iteration and instead capitalizes on the structural properties of math reasoning tasks (deterministic, finite-horizon, tree-structured MDPs with binary rewards) to enable simple yet effective policy valuation through uniform random policy Q-function estimation. This design enables ROVER to preserve diversity in generated reasoning trajectories while streamlining both training and inference, resulting in improvements over more complex conventional techniques.

1. Theoretical Foundations and Main Principle

ROVER is derived from the insight that in specialized deterministic tree-structured MDPs with binary terminal rewards—properties satisfied by LLM-based mathematical reasoning tasks—the optimal action policy can be obtained by evaluating the Q-values of a single fixed uniform random policy, bypassing the need for iterative policy evaluation and improvement steps characteristic of approaches such as PPO and GRPO (He et al., 29 Sep 2025).

Let $\pi_{u}(a|s) = 1/|\mathcal{A}|$ denote the uniform random policy over action space $\mathcal{A}$. The associated Q-function satisfies the simplified Bellman equation (with deterministic transitions and $\gamma = 1$):

$$Q^{\pi_{u}}(s, a) = r(s, a) + \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} Q^{\pi_{u}}(s', a'),$$

where $s'$ denotes the deterministic successor state reached from $(s, a)$.

The paper provides a formal proof (Theorem 1) that in this regime, a greedy policy $\pi_{\text{greedy}}(s) = \arg\max_{a} Q^{\pi_{u}}(s, a)$ achieves optimality. To preserve diversity and mitigate mode collapse, ROVER instead samples from a softmax over the uniform-policy Q-values:

$$\pi_{s}(a|s) = \frac{\exp\left(Q^{\pi_{u}}(s,a)/\rho\right)}{\sum_{a'} \exp\left(Q^{\pi_{u}}(s,a')/\rho\right)}$$

Here, $\rho$ is a temperature hyperparameter governing the exploitation–exploration trade-off. The effect is to encourage exploration of multiple valid reasoning paths and avoid collapse to a single, deterministic strategy.
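
To make the construction concrete, the following is a minimal sketch on a hypothetical toy tree MDP; the tree layout, rewards, and temperature value are illustrative assumptions rather than anything from the paper. It evaluates $Q^{\pi_u}$ with the recursion above and then samples an action from the temperature-$\rho$ softmax:

```python
import math
import random

# Sketch: uniform-random-policy Q-values on a toy deterministic tree MDP with
# binary terminal rewards, followed by temperature-controlled softmax selection.
# The tree below is illustrative only; it is not taken from the paper.

# Each node maps an action to either a child node or a terminal reward (0 or 1).
TREE = {
    "root":  {"a": "left", "b": "right"},
    "left":  {"a": 1, "b": 0},   # one correct continuation, one dead end
    "right": {"a": 0, "b": 0},   # no correct continuation on this branch
}

def q_uniform(state, action):
    """Q^{pi_u}(s, a) via the simplified Bellman equation (gamma = 1): the
    terminal reward if the transition ends the episode, otherwise the mean
    uniform-policy Q-value at the successor state."""
    nxt = TREE[state][action]
    if isinstance(nxt, int):                     # terminal: binary reward
        return float(nxt)
    actions = TREE[nxt]
    return sum(q_uniform(nxt, a) for a in actions) / len(actions)

def softmax_sample(state, rho=0.3):
    """Sample an action from softmax(Q^{pi_u}(s, .) / rho)."""
    actions = list(TREE[state])
    scaled = [q_uniform(state, a) / rho for a in actions]
    m = max(scaled)
    weights = [math.exp(q - m) for q in scaled]  # numerically stabilized softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(actions, weights=probs, k=1)[0], dict(zip(actions, probs))

if __name__ == "__main__":
    print({a: q_uniform("root", a) for a in TREE["root"]})   # {'a': 0.5, 'b': 0.0}
    print(softmax_sample("root", rho=0.3))                   # 'a' favored, 'b' still possible
```

In this toy example, even though the uniform policy itself is far from optimal, the branch containing a successful terminal receives a strictly higher $Q^{\pi_u}$, so greedy (or low-temperature softmax) selection recovers the correct path, while a moderate $\rho$ keeps the alternative branch reachable.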

2. Algorithmic Formulation and Implementation

ROVER's workflow is minimalist yet systematic:

  1. Trajectory Generation: Freeze a copy of the base LLM as the reference policy. For each task prompt, generate multiple reasoning trajectories (completions) with the current policy.
  2. Q-Value Computation: For each state–action pair $(s_t, a_t)$ in a trajectory, the Q-value is estimated as:

$$Q(s_t, a_t) = \rho \left[\log \pi_{\theta}(a_t|s_t) - \log \pi_{\theta^{\text{old}}}(a_t|s_t)\right]$$

where $\pi_{\theta}$ is the current policy and $\pi_{\theta^{\text{old}}}$ is the frozen reference policy.

  3. Uniform Evaluation: Approximate the uniform random policy evaluation by averaging the next-step Q-values over the token vocabulary $\mathcal{V}$ (the action space):

$$Q' = \frac{1}{|\mathcal{V}|} \sum_{a_{t+1} \in \mathcal{V}} Q(s_{t+1}, a_{t+1})$$

  4. Reward Centering: Define a centered reward for each response by subtracting the mean reward of the $n$ responses sampled for the same prompt:

$$\tilde{r}(x, y) = r(x, y) - \frac{1}{n} \sum_{i=1}^{n} r(x, y^{(i)})$$

This reduction in reward variance serves to stabilize learning.

  5. Loss and Update: Minimize the mean squared error between the computed Q-values and their targets (centered reward plus $Q'$):

$$\text{Loss} = \frac{1}{T} \sum_{t=1}^{T} \left(Q(s_t, a_t) - \left[\tilde{r}(x, y) + Q'\right]\right)^2$$

Parameters are updated via gradient descent (e.g., AdamW).

  6. Sampling: At each decision point, the next token is sampled according to the softmax policy over Q-values, maintaining trajectory diversity.

This procedure completely circumvents iterative value/policy updates and the associated heuristics (e.g., importance sampling, KL constraints, ratio clipping), resulting in significantly simplified and robust code.
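
For concreteness, the following is a minimal PyTorch sketch of one update following steps 2–5 above. It assumes a HuggingFace-style causal LM interface (`policy(input_ids).logits`); the `rover_step` helper, tensor shapes, and the omission of padding and terminal-step handling are illustrative simplifications, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def rover_step(policy, ref_policy, input_ids, prompt_len, rewards, rho=0.1):
    """One ROVER-style update for a group of n sampled responses to one prompt.

    Assumed shapes (illustrative, not from the paper's released code):
      input_ids: (n, T) prompt + completion token ids
      rewards:   (n,)   binary terminal reward per response (float tensor)
    """
    # Step 4: centered reward -- subtract the group mean.
    centered = rewards - rewards.mean()

    # Token-level log-probabilities under the current and frozen reference policies.
    logp = F.log_softmax(policy(input_ids).logits, dim=-1)              # (n, T, V)
    with torch.no_grad():
        ref_logp = F.log_softmax(ref_policy(input_ids).logits, dim=-1)  # (n, T, V)

    # Step 2: Q(s_t, a_t) = rho * [log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t)]
    # for the token actually generated at each position.
    tgt = input_ids[:, 1:].unsqueeze(-1)                                # (n, T-1, 1)
    q_taken = rho * (logp[:, :-1].gather(-1, tgt)
                     - ref_logp[:, :-1].gather(-1, tgt)).squeeze(-1)    # (n, T-1)

    # Step 3: uniform evaluation -- Q' is the mean over the vocabulary of the
    # next-step Q-values, i.e. the mean log-ratio at position t+1.
    q_prime = (rho * (logp - ref_logp).mean(dim=-1))[:, 1:]             # (n, T-1)

    # Step 5: MSE between Q(s_t, a_t) and the target r~ + Q', on completion
    # tokens only; the target is treated as a fixed regression target.
    target = (centered.unsqueeze(1) + q_prime).detach()
    mask = torch.zeros_like(q_taken)
    mask[:, prompt_len - 1:] = 1.0                                      # skip prompt positions
    return ((q_taken - target) ** 2 * mask).sum() / mask.sum()
```

Note that no importance-sampling ratios, clipping, or explicit KL penalty appear anywhere: the update is a plain regression of the current log-ratio onto a one-step bootstrapped target, which is what keeps the implementation small.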

3. Empirical Results and Diversity Metrics

ROVER demonstrates superior empirical performance on mathematical reasoning benchmarks, including AIME24, AIME25, HMMT25, and GPQA-diamond (He et al., 29 Sep 2025). Key quantitative results documented in the paper include:

  • Quality: On the Qwen3-8B-Base model, ROVER improves pass@1 by +8.2 and pass@256 by +16.8 compared to strong baselines such as GRPO.
  • Diversity: ROVER achieves a relative increase of +17.6% on reasoning-diversity metrics, including the number of distinct solution strategies and the policy entropy maintained during training.

The maintained diversity is directly attributed to the sampling mechanism from the softmax over uniform-policy Q-values, which enables the system to generate and explore multiple potential solution paths for the same problem instance.

4. Theoretical Guarantees and Analysis

ROVER is theoretically justified by the following findings:

  • Optimality with Uniform Policy Evaluation: In deterministic, tree-structured MDPs with binary rewards, the Q-function of the uniform random policy encodes sufficient information to construct the optimal policy via greedy (or softmax-based) action selection. This extends to a guarantee that as the softmax temperature $\rho \rightarrow 0$, the sampling policy converges to the optimal greedy solution (Theorem 2).
  • Diversity–Optimality Tradeoff: The softmax sampling formulation provides an explicit tradeoff: reducing $\rho$ increases exploitation, while a higher $\rho$ preserves more exploration and diversity. Lower bounds on the value function $V^{\pi_s}$ in terms of $\rho$ are established, quantifying policy performance and diversity.

The theoretical analysis underscores that ROVER's random policy valuation, despite the uniform random policy itself being “sub-optimal” in reward expectation, offers optimal guidance for action selection in this domain.
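
The effect of $\rho$ is easy to see numerically. The snippet below applies the softmax policy to hypothetical Q-values for three candidate continuations (the numbers and temperatures are illustrative assumptions) and reports how the policy's entropy shrinks as $\rho$ decreases:

```python
import math

def softmax(qs, rho):
    """Softmax over Q-values at temperature rho (numerically stabilized)."""
    m = max(q / rho for q in qs)
    weights = [math.exp(q / rho - m) for q in qs]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical Q-values for three candidate continuations (illustrative only).
qs = [0.9, 0.7, 0.1]
for rho in (0.05, 0.2, 1.0):
    probs = softmax(qs, rho)
    print(f"rho={rho:<4}: probs={[round(p, 3) for p in probs]}, entropy={entropy(probs):.3f}")
# Small rho concentrates mass on the argmax (exploitation); larger rho flattens
# the distribution, preserving entropy and hence diversity of sampled paths.
```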

5. Practical Implications and Applications

ROVER's architectural simplicity and theoretical grounding have implications across multiple axes:

  • Training Stability: By dispensing with iterative policy improvement and avoiding reward-signal hacking, ROVER converges stably and resists the training collapses observed with PPO/GRPO on math tasks.
  • Diversity Preservation: The softmax sampling of Q-values is an intrinsic diversity-promoting mechanism, beneficial for domains where solution pluralism (multiple correct reasoning paths) is critical.
  • Generalization: While currently tailored for deterministic and binary reward settings (e.g., mathematical reasoning), the framework's principles suggest avenues for extension to less-structured domains, though further research is necessary for cases with stochastic transitions or multi-valued rewards.
  • Resource Efficiency: Simpler code and fewer training heuristics lead to easier scaling, maintenance, and interpretability, lowering the barrier for adoption in production and research settings.

6. Limitations and Future Research Directions

The foundational assumptions of ROVER constrain its domain of applicability:

  • Deterministic, Tree-Structured MDPs: The theoretical results rely on the absence of stochastic state transitions and intermediate rewards. Direct application outside this regime would require substantive adaptation.
  • Reward Feedback Granularity: Binary terminal rewards are well-suited for math and logic tasks but may be limiting in settings where reward shaping is necessary.
  • Extension to Other Domains: Application to dialogue, multimodal reasoning, or subjective tasks remains unstudied and would likely require hybridization with more general RL strategies or problem restructuring.

Future research directions outlined include algorithmic refinements for tasks with large action spaces and long reasoning horizons, approximate uniform policy evaluation for scaling, and generalizations to more complex, stochastic RL settings.

7. Relationship to Broader Literature on Diverse Reasoning

ROVER is situated within an emerging trend of RL methods seeking to maximize both solution quality and diversity in LLM reasoning:

  • It connects to structured sampling approaches (e.g., conjugate policies maximizing pairwise KL divergence (Cohen et al., 2019)), principled objective formulations balancing diversity and quality (Ghasemi et al., 2020), and explicit diversity regularization in value function estimation (Yu et al., 9 Apr 2024).
  • ROVER is distinguished by its radical simplification—eschewing generalized policy iteration in favor of a one-shot random policy valuation—while outperforming or matching considerably more elaborate frameworks with less implementation complexity.

This paradigm delivers a strong empirical and theoretical foundation for further work in reward-driven, diversity-preserving reasoning frameworks for large-scale LLMs.
