Rationale-Aware Reward (RARE) in RL

Updated 14 January 2026
  • RARE is a reinforcement learning objective that rewards both the correctness and the uniqueness of reasoning strategies to prevent exploration collapse.
  • It leverages an LLM-based judge to cluster solution rollouts by high-level strategy and weights rewards inversely to cluster size.
  • Experimental results show improved diversity-sensitive metrics and sustained exploratory behavior across benchmarks without sacrificing pass@1 performance.

Rationale-Aware Reward (RARE) is a reinforcement learning (RL) objective formalism designed to mitigate exploration collapse in LLMs fine-tuned for complex problem solving. Exploration collapse refers to the undesirable phenomenon where a policy over-concentrates on a narrow set of dominant reasoning patterns, typically optimizing for pass@1 but sacrificing solution diversity and pass@$k$ for larger $k$. RARE addresses this by explicitly rewarding not only correctness but also the uniqueness of the high-level reasoning strategies in model-generated solutions. This is operationalized by leveraging an LLM-based judge to cluster rollouts by the underlying solution strategy, then reweighting RL rewards inversely with cluster size, thus amplifying incentives for correct but rare approaches (Hu et al., 13 Jan 2026).

1. Rollout-Level Reinforcement Learning Objective

RARE extends group-based policy-gradient methods such as GRPO to operate at the level of solution rollouts rather than individual tokens. For each problem $m$ in a training batch, $K$ candidate solutions $p_{m,k}$ (including both chain-of-thought and answer) are generated. Each is scored via a task-specific verifier to yield a reward $r_{m,k}$. Standard group-normalized advantages are first computed:

$$z_{m,k} = \frac{r_{m,k} - \mu_m}{\sigma_m + \varepsilon}$$

where $\mu_m$ and $\sigma_m$ are the sample mean and standard deviation of the rewards over the $K$ rollouts for problem $m$. RARE then introduces an inverse-cluster-weighted scheme: if rollout $p_{m,k}$ falls in a strategy cluster of size $f_{m,k}$, it receives a weight

$$w_{m,k} = \frac{1}{(f_{m,k})^{\alpha}}$$

with $\alpha \in [0,1]$ as a hyperparameter modulating the penalty for redundancy. The final RARE advantage is defined as

$$\mathrm{adv}_{m,k} = w_{m,k}\, z_{m,k}$$

The RL objective becomes

$$J(\theta) = \mathbb{E}_{m,k}\Bigl[\, \mathrm{adv}_{m,k} \log \pi_\theta(p_{m,k} \mid m) \,\Bigr] + \text{(KL/clip regularization)}$$

For $\alpha = 0$, standard group-policy RL (no uniqueness bonus) is recovered; higher $\alpha$ increases the reward for rare strategies.
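
As a concrete illustration of the advantage computation above, the following NumPy sketch (ours, not the authors' code; the choice $\alpha = 0.5$ is arbitrary) normalizes verifier rewards within a group of $K$ rollouts and applies the inverse-cluster weighting:

```python
import numpy as np

def rare_advantages(rewards, clusters, alpha=0.5, eps=1e-6):
    """Inverse-cluster-weighted group advantages for one problem's K rollouts."""
    rewards = np.asarray(rewards, dtype=float)
    clusters = np.asarray(clusters)

    # Group-normalized advantage z_{m,k} = (r_{m,k} - mu_m) / (sigma_m + eps)
    z = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Cluster size f_{m,k}: number of rollouts sharing this rollout's strategy
    _, inverse, counts = np.unique(clusters, return_inverse=True, return_counts=True)
    f = counts[inverse]

    # Uniqueness weight w_{m,k} = 1 / f_{m,k}^alpha, then adv_{m,k} = w * z
    w = 1.0 / (f ** alpha)
    return w * z

# Toy example: six rollouts scored by a 0/1 verifier, two strategy clusters.
adv = rare_advantages(rewards=[1, 1, 1, 0, 1, 0],
                      clusters=[0, 0, 0, 0, 1, 1],
                      alpha=0.5)
print(adv)
```

In this toy example the correct rollout belonging to the smaller strategy cluster receives a larger advantage than the equally correct rollouts in the dominant cluster, which is precisely the incentive RARE targets.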

2. Clustering Rollouts by Solution Strategy

RARE's critical innovation is the use of an external, large LLM as a judge $J$ to partition rollouts for the same problem into clusters based on high-level solution strategies rather than surface features. The process consists of:

  • Presenting all $K$ rollout traces to the judge with prompts engineered to elicit grouping by underlying approach (e.g., "factorization," "symmetry argument"), explicitly disregarding superficial differences.
  • The judge outputs a structured mapping (e.g., JSON or Python list) assigning rollouts to cluster indices.
  • Each rollout's cluster size $f_{m,k}$ is determined by the cardinality of its assigned cluster.

This approach enables robust estimation of the frequency of each reasoning pattern, so redundancy can be quantitatively discouraged.
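
The paper's exact judge prompt and output schema are not reproduced here; the sketch below shows one plausible way to implement the clustering call, with `call_judge` standing in for any LLM inference function (API or local) that returns the judge's raw text.

```python
import json

def cluster_rollouts(problem: str, rollouts: list[str], call_judge) -> list[int]:
    """Ask the judge to group rollouts by high-level strategy; return c(k) per rollout."""
    numbered = "\n\n".join(f"Solution {k + 1}:\n{r}" for k, r in enumerate(rollouts))
    prompt = (
        "Group the following solutions to the same problem by their high-level "
        "strategy (e.g., factorization, symmetry argument), ignoring superficial "
        "differences in wording, notation, or ordering of steps.\n\n"
        f"Problem:\n{problem}\n\n{numbered}\n\n"
        "Answer with a JSON list of integers, one cluster index per solution, "
        "in the same order as the solutions."
    )
    raw = call_judge(prompt)                           # judge's raw text output
    assignments = [int(c) for c in json.loads(raw)]    # e.g. [0, 0, 1, 0, 2, 1]
    assert len(assignments) == len(rollouts)
    return assignments

def cluster_sizes(assignments: list[int]) -> list[int]:
    """Cluster size f_{m,k} for each rollout, given its cluster assignment."""
    return [assignments.count(c) for c in assignments]
```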

3. Training Pipeline and Algorithmic Details

The RARE training procedure can be summarized as follows:

  1. For each problem $m$ in batch $\mathcal{B}$, sample $K$ rollouts $p_{m,1}, \dots, p_{m,K}$ with temperature $T = 1.0$.
  2. Score each rollout with a domain-specific verifier, yielding $r_{m,k}$.
  3. Compute group-normalized advantages $z_{m,k}$.
  4. Cluster rollouts via the LLM judge $J$, extract cluster assignments $c(k)$, and compute $f_{m,k}$.
  5. Set uniqueness weights $w_{m,k} = 1/f_{m,k}^{\alpha}$.
  6. Compute advantages $\mathrm{adv}_{m,k} = w_{m,k} z_{m,k}$.
  7. Take a policy-gradient step on the RARE objective, using KL/clip regularization to ensure stability.

What distinguishes RARE is the integration of rollout-level clustering into the gradient signal, which reshapes the RL landscape to prioritize correct but uncommon strategies.
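
The steps above can be tied together in a single per-batch routine. The sketch below is illustrative rather than the authors' implementation: all components (sampler, verifier, judge, policy log-probability) are injected as callables, and the KL/clip regularization and optimizer step are left to the surrounding RL framework.

```python
import numpy as np

def rare_step(batch, K, sample, verify, cluster, logprob, alpha=0.5, eps=1e-6):
    """One RARE batch (steps 1-7 above); returns the surrogate loss to minimize.

    sample(problem, K, temperature) -> list of K rollout strings
    verify(problem, rollout)        -> scalar reward r_{m,k}
    cluster(problem, rollouts)      -> list of K strategy-cluster indices c(k)
    logprob(problem, rollout)       -> log pi_theta(rollout | problem)
    """
    losses = []
    for problem in batch:
        rollouts = sample(problem, K, temperature=1.0)                 # step 1
        r = np.array([verify(problem, p) for p in rollouts], float)    # step 2
        z = (r - r.mean()) / (r.std() + eps)                           # step 3
        c = cluster(problem, rollouts)                                 # step 4
        f = np.array([c.count(ci) for ci in c], float)
        w = 1.0 / (f ** alpha)                                         # step 5
        adv = w * z                                                    # step 6
        # Step 7: policy-gradient surrogate; KL/clip terms are omitted here.
        for a, p in zip(adv, rollouts):
            losses.append(-a * logprob(problem, p))
    return sum(losses) / len(losses)
```

In practice, `logprob` would return a differentiable quantity from the policy model so that the returned surrogate loss can be backpropagated, with KL or clipping terms added as in GRPO-style training.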

4. Experimental Results and Empirical Behavior

Empirical evaluation demonstrates the efficacy of RARE across mathematics (AIME 2024/25, HLE), physics (OlympiadBench), and medicine (MedCaseReasoning) benchmarks using Qwen2.5-7B, OLMo-3-7B, and Qwen3-8B as policy models (with 32B judge models).

Key findings include:

  • Substantial improvements in diversity-sensitive metrics: For Qwen2.5-7B at $K = 64$, RARE increases AIME AUC@64 by 0.044 (0.116→0.160) and HLE by 0.026 (0.112→0.138). At larger $K$ (e.g., 128), gains persist and even broaden. Larger backbones (e.g., Qwen3-8B) exhibit similar trends, with RARE outperforming existing diversity-aware RL objectives (DAPO, Forking-Token).
  • No pass@1 degradation: Pass@1 remains uncompromised throughout, while pass@$k$ for $k \geq 32$ improves.
  • Sustained exploration: Token-level entropy remains higher and more stable with RARE than with standard group-policy RL, indicating successful prevention of mode collapse.
  • Enhanced strategy coverage: On a curated set of 20 AIME problems with several canonical solution strategies each, RARE increases cover@32 (fraction of distinct strategies found) and in certain cases recovers rare but human-relevant solutions, e.g., "Symmedian Similarity" and flow-based approaches.

5. Hyperparameterization and Theoretical Properties

The key tunable parameter in RARE is $\alpha$, determining the steepness of the penalty for frequently repeated strategies. $\alpha = 0$ yields the standard, non-diversity-aware RL regime, while values approaching $1$ strongly favor rare strategies. Typically, $\alpha$ is tuned on a held-out validation set for optimal balance between diversity and sample efficiency.

Theoretical properties of RARE include:

  • Direct modulation of policy entropy via the advantage term, targeting global solution diversity rather than merely local token-level variance.
  • Recovery of standard group-policy objectives in the $\alpha \to 0$ limit.
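
A small numeric illustration (ours, not from the paper) makes the role of $\alpha$ concrete by tabulating the uniqueness weight $w = 1/f^{\alpha}$ for a few cluster sizes:

```python
# Numeric illustration of the uniqueness weight w = 1 / f**alpha.
for alpha in (0.0, 0.5, 1.0):
    weights = {f: round(1.0 / f ** alpha, 3) for f in (1, 2, 4, 8)}
    print(f"alpha={alpha}: {weights}")
# alpha=0.0 gives every rollout weight 1 (standard group-policy RL);
# alpha=1.0 down-weights a rollout in a cluster of 8 to 1/8.
```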

6. Limitations and Prospects for Future Research

Limitations currently inherent to RARE include:

  • Inference overhead: Each training batch necessitates an external LLM judge call for clustering, introducing computational and latency costs.
  • Local rarity estimation: Only intra-problem, per-batch uniqueness is considered; global or historical novelty is not tracked.
  • Dependence on judge accuracy: Success hinges on reliable prompting and the ability of the judge model to consistently cluster on underlying strategy rather than superficial similarity.

Proposed future directions include:

  • Developing embedding-based or lightweight (possibly self-supervised) clustering in place of LLM judging.
  • Persisting diversity signals across batches and training epochs to encourage global rather than batch-local exploration.
  • Joint optimization for diversity at both local and global scales.
  • Exploring judge-free or contrastive representations for strategy clustering.

7. Relation to Broader Rationale-Aware and Diversity-Seeking RL Approaches

RARE belongs to a family of rationale-aware, diversity-seeking RL frameworks and is distinguished by its explicit use of solution-rationale clustering to shape the reward landscape. While related approaches such as DAPO, Forking-Token, and rationale-enhanced inference-time decoders like RED (Yamaguchi et al., 10 Jul 2025) target diversity at the token or inference level, RARE is unique in directly incentivizing global rollout-level diversity during RL fine-tuning. Unlike models that learn to generate or exploit rationales in reward modeling (e.g., GRAM-R$^2$ (Wang et al., 2 Sep 2025)), RARE integrates automated rationale grouping directly into its policy-gradient signal. This positions RARE as a principled framework for direct rollout-level diversity control in RL-fine-tuned LLMs, with demonstrated impact on solution coverage and sustained exploratory behavior (Hu et al., 13 Jan 2026).
