Rationale-Aware Reward (RARE) in RL
- RARE is a reinforcement learning objective that rewards both the correctness and the uniqueness of reasoning strategies to prevent exploration collapse.
- It leverages an LLM-based judge to cluster solution rollouts by high-level strategy and weights rewards inversely to cluster size.
- Experimental results show improved diversity-sensitive metrics and sustained exploratory behavior across benchmarks without sacrificing pass@1 performance.
Rationale-Aware Reward (RARE) is a reinforcement learning (RL) objective designed to mitigate exploration collapse in LLMs fine-tuned for complex problem solving. Exploration collapse refers to the undesirable phenomenon where a policy over-concentrates on a narrow set of dominant reasoning patterns, typically optimizing for pass@1 but sacrificing solution diversity and pass@k for larger $k$. RARE addresses this by explicitly rewarding not only correctness but also the uniqueness of the high-level reasoning strategies in model-generated solutions. This is operationalized by leveraging an LLM-based judge to cluster rollouts by the underlying solution strategy, then reweighting RL rewards inversely with cluster size, thus amplifying incentives for correct but rare approaches (Hu et al., 13 Jan 2026).
1. Rollout-Level Reinforcement Learning Objective
RARE extends group-based policy-gradient methods such as GRPO to operate at the level of solution rollouts rather than individual tokens. For each problem $q$ in a training batch, $G$ candidate solutions $o_1, \dots, o_G$ (each including both chain-of-thought and answer) are generated. Each rollout $o_i$ is scored via a task-specific verifier to yield a reward $r_i$. Standard group-normalized advantages are first computed:
$$\hat{A}_i = \frac{r_i - \mu_q}{\sigma_q},$$
where $\mu_q$ and $\sigma_q$ are the sample mean and standard deviation of the rewards over the $G$ rollouts for problem $q$. RARE then introduces an inverse-cluster-weighted scheme: if rollout $i$ falls in a strategy cluster of size $n_i$, it receives a weight
$$w_i = \left(\frac{1}{n_i}\right)^{\alpha},$$
with $\alpha$ as a hyperparameter modulating the penalty for redundancy. The final RARE advantage is defined as
$$A_i^{\mathrm{RARE}} = w_i\,\hat{A}_i.$$
The RL objective becomes
$$\mathcal{J}_{\mathrm{RARE}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i(\theta)\,A_i^{\mathrm{RARE}},\ \mathrm{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_i^{\mathrm{RARE}}\Big)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big],$$
where $\rho_i(\theta) = \pi_\theta(o_i \mid q)/\pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ is the rollout-level importance ratio. For $\alpha = 0$, standard group-normalized RL (no uniqueness bonus) is recovered; higher $\alpha$ increases the reward for rare strategies.
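A minimal numerical sketch of this computation, assuming NumPy and the weight form $w_i = (1/n_i)^{\alpha}$ given above (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def rare_advantages(rewards, cluster_ids, alpha=0.5, eps=1e-8):
    """Inverse-cluster-weighted advantages for one group of G rollouts of a single problem."""
    rewards = np.asarray(rewards, dtype=float)
    cluster_ids = np.asarray(cluster_ids)

    # Standard group-normalized advantage: (r_i - mean) / std over the G rollouts
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Cluster size n_i for each rollout, then uniqueness weight w_i = (1 / n_i) ** alpha
    sizes = np.array([(cluster_ids == c).sum() for c in cluster_ids])
    weights = (1.0 / sizes) ** alpha

    # Final RARE advantage A_i = w_i * adv_i; alpha = 0 recovers plain group-normalized RL
    return weights * adv

# Example: rollouts 0 and 1 share a strategy, rollout 2 is unique and correct, rollout 3 is incorrect.
print(rare_advantages(rewards=[1.0, 1.0, 1.0, 0.0], cluster_ids=[0, 0, 1, 2], alpha=0.5))
```

In this toy group, the unique correct rollout (cluster of size 1) receives a larger advantage than the two correct-but-redundant rollouts, which is exactly the incentive RARE is designed to create.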
2. Clustering Rollouts by Solution Strategy
RARE's critical innovation is the use of an external, large LLM as a judge to partition rollouts for the same problem into clusters based on high-level solution strategies rather than surface features. The process consists of:
- Presenting all rollout traces to the judge with prompts engineered to elicit grouping by underlying approach (e.g., "factorization," "symmetry argument"), explicitly disregarding superficial differences.
- The judge outputs a structured mapping (e.g., JSON or Python list) assigning rollouts to cluster indices.
- Each rollout's cluster size is determined by the cardinality of its assigned cluster.
This approach enables robust estimation of the frequency of each reasoning pattern, so redundancy can be quantitatively discouraged.
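As a concrete illustration, here is a hedged sketch of such a judge call, assuming an OpenAI-compatible chat API; the prompt wording, the `judge-32b` model name, and the JSON output contract are assumptions for illustration rather than the paper's exact setup:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving the judge model

JUDGE_PROMPT = (
    "You are given {n} solutions to the same problem. Group them by underlying "
    "high-level strategy (e.g., factorization, symmetry argument), ignoring "
    "superficial differences in wording, notation, or ordering. Return only a "
    "JSON list of {n} integers, where the i-th integer is the cluster index of "
    "solution i.\n\n{solutions}"
)

def cluster_rollouts(rollouts, judge_model="judge-32b"):
    """Ask the judge LLM for a strategy-cluster index per rollout; return indices and cluster sizes."""
    solutions = "\n\n".join(f"Solution {i}:\n{text}" for i, text in enumerate(rollouts))
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(n=len(rollouts), solutions=solutions)}],
    )
    cluster_ids = json.loads(reply.choices[0].message.content)
    sizes = [cluster_ids.count(c) for c in cluster_ids]  # n_i = size of rollout i's cluster
    return cluster_ids, sizes
```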
3. Training Pipeline and Algorithmic Details
The RARE training procedure can be summarized as follows (a minimal code sketch follows the list):
- For each problem $q$ in the batch, sample $G$ rollouts with temperature $\tau$.
- Score each rollout with a domain-specific verifier, yielding rewards $r_i$.
- Compute group-normalized advantages $\hat{A}_i$.
- Cluster rollouts via the LLM judge, extract cluster assignments, and compute cluster sizes $n_i$.
- Set uniqueness weights $w_i = (1/n_i)^{\alpha}$.
- Compute advantages $A_i^{\mathrm{RARE}} = w_i\,\hat{A}_i$.
- Take a policy-gradient step on the RARE objective, using KL/clip regularization to ensure stability.
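A minimal sketch of one such update, reusing the hypothetical `rare_advantages` and `cluster_rollouts` helpers sketched above and assuming a `policy` object with `sample`/`grpo_step` methods and a `verify` callable; all names are illustrative, not the authors' implementation:

```python
def rare_training_step(policy, problems, verify, G=16, temperature=1.0, alpha=0.5):
    """One illustrative RARE update over a batch of problems."""
    batch_rollouts, batch_advantages = [], []
    for q in problems:
        # 1. Sample G rollouts per problem at the chosen temperature.
        rollouts = [policy.sample(q, temperature=temperature) for _ in range(G)]
        # 2. Score each rollout with the domain-specific verifier (reward r_i).
        rewards = [verify(q, o) for o in rollouts]
        # 3-4. Cluster rollouts by strategy via the LLM judge and get cluster assignments.
        cluster_ids, _ = cluster_rollouts(rollouts)
        # 5-6. Inverse-cluster-weighted advantages A_i = w_i * (group-normalized advantage).
        batch_advantages.append(rare_advantages(rewards, cluster_ids, alpha=alpha))
        batch_rollouts.append(rollouts)
    # 7. Policy-gradient step on the RARE objective with clip/KL regularization.
    policy.grpo_step(problems, batch_rollouts, batch_advantages)
```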
What distinguishes RARE is the integration of rollout-level clustering into the gradient signal, which reshapes the RL landscape to prioritize correct-but-uncommon strategies.
4. Experimental Results and Empirical Behavior
Empirical evaluation demonstrates the efficacy of RARE across mathematics (AIME 2024/25, HLE), physics (OlympiadBench), and medicine (MedCaseReasoning) benchmarks using Qwen2.5-7B, OLMo-3-7B, and Qwen3-8B as policy models (with 32B judge models).
Key findings include:
- Substantial improvements in diversity-sensitive metrics: For Qwen2.5-7B at $k = 64$, RARE increases AIME AUC@64 by 0.044 (0.116→0.160) and HLE by 0.026 (0.112→0.138). At larger $k$ (e.g., 128), gains persist and even broaden. Larger backbones (e.g., Qwen3-8B) exhibit similar trends, with RARE outperforming existing diversity-aware RL objectives (DAPO, Forking-Token).
- No pass@1 degradation: Pass@1 remains uncompromised throughout, while pass@k for larger $k$ improves.
- Sustained exploration: Token-level entropy remains higher and more stable with RARE than with standard group-policy RL, indicating successful prevention of mode collapse.
- Enhanced strategy coverage: On a curated set of 20 AIME problems with several canonical solution strategies each, RARE increases cover@32 (fraction of distinct strategies found) and in certain cases recovers rare but human-relevant solutions, e.g., "Symmedian Similarity" and flow-based approaches.
5. Hyperparameterization and Theoretical Properties
The key tunable parameter in RARE is $\alpha$, determining the steepness of the penalty for frequently repeated strategies. $\alpha = 0$ yields the standard, non-diversity-aware RL regime, while values approaching $1$ strongly favor rare strategies. Typically, $\alpha$ is tuned on a held-out validation set for optimal balance between diversity and sample efficiency.
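For intuition, the weight assigned to a rollout whose strategy is shared by, say, eight of the group's rollouts follows directly from $w_i = (1/n_i)^{\alpha}$ (a quick numeric check, not a reported result):

```python
# Weight of a rollout in a cluster of size n_i = 8 under different alpha settings.
for alpha in (0.0, 0.25, 0.5, 1.0):
    print(alpha, round((1 / 8) ** alpha, 3))  # 1.0, 0.595, 0.354, 0.125
```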
Theoretical properties of RARE include:
- Direct modulation of policy entropy via the advantage term, targeting global solution diversity rather than merely local token-level variance.
- Recovery of the standard group-policy objective as $\alpha \to 0$.
6. Limitations and Prospects for Future Research
Limitations currently inherent to RARE include:
- Inference overhead: Each training batch necessitates an external LLM judge call for clustering, introducing computational and latency costs.
- Local rarity estimation: Only intra-problem, per-batch uniqueness is considered; global or historical novelty is not tracked.
- Dependence on judge accuracy: Success hinges on reliable prompting and the ability of the judge model to consistently cluster on underlying strategy rather than superficial similarity.
Proposed future directions include:
- Developing embedding-based or lightweight (possibly self-supervised) clustering in place of LLM judging.
- Persisting diversity signals across batches and training epochs to encourage global rather than batch-local exploration.
- Joint optimization for diversity at both local and global scales.
- Exploring judge-free or contrastive representations for strategy clustering.
7. Relation to Broader Rationale-Aware and Diversity-Seeking RL Approaches
RARE belongs to a family of rationale-aware, diversity-seeking RL frameworks and is distinguished by its explicit use of solution-rationale clustering to shape the reward landscape. While related approaches such as DAPO, Forking-Token, and rationale-enhanced inference-time decoders like RED (Yamaguchi et al., 10 Jul 2025) target diversity at the token or inference level, RARE is unique in explicitly incentivizing global rollout-level diversity during RL fine-tuning. Unlike models that learn to generate or exploit rationales in reward modeling (e.g., GRAM-R (Wang et al., 2 Sep 2025)), RARE integrates automated rationale grouping directly into its policy-gradient signal. This positions RARE as a principled framework for direct rollout-level diversity control in RL-fine-tuned LLMs, with demonstrated impact on solution coverage and sustained exploratory behavior (Hu et al., 13 Jan 2026).