Reward Balanced Search (REBASE)

Updated 27 February 2026

Reward Balanced Search (REBASE) is a framework that integrates learned reward models into candidate generation, balancing reasoning depth, computational cost, and solution diversity.
It employs methods akin to Monte Carlo Tree Search and dual-phase decomposition, combining softmax-based node expansion and UCB selection for efficient exploration and exploitation.
Empirical results demonstrate significant gains, such as 70.8% accuracy on MATH-OAI and improved computational efficiency compared to standard chain-of-thought and beam search approaches.

Reward Balanced Search (REBASE) designates a class of inference, optimization, and training algorithms that integrate learned reward models into strategic candidate generation or search for improved accuracy, interpretability, or efficiency. In all instantiations, model proposals or expansions are evaluated and selected based on reward signals, enabling explicit trade-offs between reasoning depth, computational cost, and the preservation of expert-like or diverse solutions. REBASE has been advanced in mathematical reasoning for LLMs, search relevance modeling, and code generation, featuring variants such as reward-guided tree search, mode-balanced RL, and dual-phase (plan-execute) decomposition (Jiang et al., 2024, Cui et al., 29 Sep 2025, Zhang et al., 10 Feb 2026, AbdElhameed et al., 2024). The following presents the conceptual foundation, main algorithms, empirical outcomes, and ongoing challenges in the development of REBASE methods.

1. Algorithmic Foundations

REBASE frameworks orchestrate inference or training as a search over solution or reasoning trees, balancing policy-driven expansion against reward evaluation. The canonical structure, as in STILL-1, interleaves four principal operations reminiscent of Monte Carlo Tree Search (MCTS): selection, expansion, simulation (rollout), and backpropagation. At each expansion, an LLM policy proposes $k$ new reasoning steps; a reward model then scores completions, and the scores are propagated upward to guide further search (Jiang et al., 2024). In more compute-efficient variants for inference-time reasoning, node expansion probability is proportional to a softmax over reward scores $R(n)$, with per-expansion budgets and stochastic sampling to trade off exploitation and exploration (AbdElhameed et al., 2024, Cui et al., 29 Sep 2025).
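The softmax-proportional expansion described above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function name `allocate_expansions`, the `temperature` parameter, and the largest-remainder rounding (used here in place of stochastic sampling so the result is reproducible) are all assumptions.

```python
import math

def allocate_expansions(rewards, budget, temperature=1.0):
    """Split an integer expansion budget across nodes in proportion to softmax(reward / T)."""
    if not rewards or budget <= 0:
        return [0] * len(rewards)
    peak = max(rewards)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp((r - peak) / temperature) for r in rewards]
    total = sum(weights)
    shares = [budget * w / total for w in weights]
    counts = [math.floor(s) for s in shares]
    # Hand leftover expansions to the nodes with the largest fractional remainders
    # (an assumed deterministic stand-in for sampling from the softmax).
    leftover = budget - sum(counts)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

counts = allocate_expansions([0.9, 0.5, 0.1], budget=10)
```

Lowering `temperature` concentrates the budget on the highest-reward nodes (more exploitation), while raising it spreads expansions more evenly (more exploration).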

In hybrid mode-balanced optimization frameworks, REBASE further unifies forward-KL (SFT or maximum-likelihood, mode-covering) and reverse-KL (RL or reward-seeking, mode-seeking) objectives in a single weighted training loss. This balance is crucial for preventing collapse onto high-reward but incorrect or non-diverse solutions (Zhang et al., 10 Feb 2026).
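One plausible way to write such a weighted loss is shown below. The convex-combination form, the weight symbol $\lambda \in [0,1]$, the advantage $A(x,y)$, and the data distribution $\mathcal{D}$ are notational assumptions made for illustration; the cited work may parameterize the objective differently.

$\mathcal{L}(\theta) = \lambda\,\mathcal{L}_{\mathrm{SFT}}(\theta) + (1-\lambda)\,\mathcal{L}_{\mathrm{RL}}(\theta), \qquad \mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\log \pi_\theta(y|x)\right], \qquad \mathcal{L}_{\mathrm{RL}}(\theta) = -\mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot|x)}\left[A(x,y)\right]$

Under this reading, the SFT term is the maximum-likelihood (forward-KL, mode-covering) component and the RL term is the reward-seeking (mode-seeking) component.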

2. Reward Model Construction and Integration

The reward model $\mathcal{R}$ is typically a fine-tuned LLM that either scores completed solutions or assigns correctness probabilities to intermediate states. For outcome-supervised settings, the model produces probabilities (e.g., “Yes”/“No”) for chain-of-thought correctness, yielding scores such as

$\mathrm{Score}(\tau) = \frac{e^{p_Y}}{e^{p_Y} + e^{p_N}}$

for a chain $\tau$, where $p_Y$ and $p_N$ are the model's outputs for “Yes” and “No” (Jiang et al., 2024). In dual-phase REBASE, separate reward models are trained for plan selection and execution evaluation: $R(x) = P(+|x) = \frac{\exp(\ell_+(x))}{\exp(\ell_+(x)) + \exp(\ell_-(x))}$ where $\ell_{\pm}(x)$ are logits for positive/negative outcomes (Cui et al., 29 Sep 2025).
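Both scores above are two-way softmaxes, which reduce to a sigmoid of the logit difference. The sketch below computes that quantity in a numerically stable way; the function name `positive_probability` is hypothetical, and how the two logits are read from a reward model is left out.

```python
import math

def positive_probability(logit_pos, logit_neg):
    """Two-way softmax over positive/negative logits, i.e. sigmoid(logit_pos - logit_neg)."""
    diff = logit_pos - logit_neg
    # Branch on the sign so exp() never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

score = positive_probability(2.0, 0.0)
```

Working with the difference avoids overflow when either logit is large in magnitude, which a direct evaluation of the exponentials would not.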

Active learning—mining hard positives and negatives for iterative reward model refinement—substantially strengthens these models. In some frameworks, reward estimation at intermediate steps mitigates the risk of early-pruning valid, partially correct solutions (Jiang et al., 2024, Cui et al., 29 Sep 2025).
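A minimal sketch of the mining step, assuming each candidate already has a reward score and a ground-truth correctness label. The function name `mine_hard_examples` and the `low`/`high` cut-offs are illustrative assumptions, not values from the cited work.

```python
def mine_hard_examples(scored, low=0.3, high=0.7):
    """scored: list of (example_id, reward_score, is_correct) tuples.

    Hard positives are correct solutions the reward model scores low;
    hard negatives are incorrect solutions it scores high.
    """
    hard_positives = [eid for eid, s, ok in scored if ok and s < low]
    hard_negatives = [eid for eid, s, ok in scored if not ok and s > high]
    return hard_positives, hard_negatives

batch = [("a", 0.95, True), ("b", 0.10, True), ("c", 0.85, False), ("d", 0.20, False)]
hard_pos, hard_neg = mine_hard_examples(batch)
```

The mined examples would then be added to the reward model's training set for the next refinement round.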

Policy and reward integration operates in both inference (as the scoring mechanism for tree/beam selection) and training (as the RL advantage signal). Some implementations combine policy logits, reward scores, and an exploration bonus in a single utility for node ranking: $U(s_c) = \alpha \log \pi(s_c|\text{prefix}) + (1-\alpha)V(s_c) + c\sqrt{\frac{\ln N(\text{parent})}{1+N(s_c)}}$ (Jiang et al., 2024).
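The utility above translates directly into code. In this sketch the default values of `alpha` and `c` are placeholders, and clamping the parent visit count to at least 1 is an added safeguard (so the logarithm is defined) rather than part of the formula.

```python
import math

def node_utility(log_prob, value, parent_visits, child_visits, alpha=0.5, c=1.0):
    """U = alpha * log pi + (1 - alpha) * V + c * sqrt(ln N(parent) / (1 + N(child)))."""
    exploit = alpha * log_prob + (1.0 - alpha) * value
    # Clamp to 1 so ln(N(parent)) is defined before the parent is visited (assumption).
    explore = c * math.sqrt(math.log(max(parent_visits, 1)) / (1 + child_visits))
    return exploit + explore

fresh = node_utility(log_prob=-1.0, value=0.5, parent_visits=20, child_visits=0)
visited = node_utility(log_prob=-1.0, value=0.5, parent_visits=20, child_visits=9)
```

With equal policy and value terms, the less-visited child receives the larger utility, which is the intended effect of the exploration bonus.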

3. Selection Criteria and Resource Control

Two principal selection schemes are used:

Local UCB: At each node, expansion maximizes a UCB-type objective incorporating average reward and visitation statistics:

$\mathrm{UCB}(s_j) = V(s_j) + c\sqrt{\frac{\ln N(s_t)}{1 + N(s_j)}}$

Global Thresholding: Leaf nodes are ranked against the mean $\mu$ and standard deviation $\sigma$ of their value estimates; those exceeding a threshold set from $\mu$ and $\sigma$ are expanded, and the threshold setting directly controls exploration vs. exploitation (Jiang et al., 2024).
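The two schemes can be contrasted in a short sketch. The UCB function follows the formula given above; the threshold form `mu + k * sigma` and the parameter name `k` are assumptions chosen for illustration, since the exact expression is not recoverable here.

```python
import math
import statistics

def select_local_ucb(children, parent_visits, c=1.0):
    """children: list of (name, value, visits). Returns the name with the highest UCB score."""
    def ucb(child):
        _, value, visits = child
        return value + c * math.sqrt(math.log(max(parent_visits, 1)) / (1 + visits))
    return max(children, key=ucb)[0]

def select_global_threshold(leaves, k=0.0):
    """leaves: list of (name, value). Keeps leaves whose value exceeds mu + k * sigma (assumed form)."""
    values = [v for _, v in leaves]
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation over all leaves
    return [name for name, v in leaves if v > mu + k * sigma]

children = [("a", 0.6, 10), ("b", 0.5, 0)]
picked = select_local_ucb(children, parent_visits=11)
kept = select_global_threshold([("x", 0.9), ("y", 0.4), ("z", 0.2)], k=0.0)
```

Local UCB picks one child per node, whereas the global rule filters the whole frontier at once; under the assumed form, a smaller `k` admits more leaves and therefore explores more broadly.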

Compute-efficient REBASE variants introduce dynamic or static expansion budgets (e.g., number of allowed expansions, FLOP or wall-clock constraints) (AbdElhameed et al., 2024). Dual-phase REBASE adapts budget allocation during planning and execution, using reward thresholds to perform early stopping on easy steps and reallocating samples toward harder ones (Cui et al., 29 Sep 2025). Such budget adaptation is crucial for test-time efficiency and enables practical deployment in latency-sensitive applications.
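The budget reallocation idea can be sketched as below, assuming a per-step base allowance and a carry-over of unused samples. The function name, the `score_candidate` callable, and the `threshold` and `base` values are hypothetical stand-ins for a real sampler and reward model.

```python
def run_with_adaptive_budget(steps, score_candidate, total_budget, threshold=0.9, base=3):
    """steps: list of step ids; score_candidate(step, i) -> reward of the i-th sample.

    Each step gets a base allowance plus any samples saved on earlier steps;
    sampling stops early once a candidate's reward reaches the threshold.
    """
    remaining = total_budget
    carry = 0
    usage = {}
    for step in steps:
        allowance = min(base + carry, remaining)
        used = 0
        best = float("-inf")
        while used < allowance:
            best = max(best, score_candidate(step, used))
            used += 1
            if best >= threshold:
                break  # early stop: this step is already solved well enough
        carry = allowance - used  # unused samples roll over to the next step
        remaining -= used
        usage[step] = (used, best)
    return usage

# Toy rewards standing in for a real reward model (illustrative data only).
rewards = {"easy": [0.95, 0.5, 0.5], "hard": [0.2, 0.3, 0.4, 0.92, 0.1]}
usage = run_with_adaptive_budget(["easy", "hard"], lambda s, i: rewards[s][i], total_budget=10)
```

In this toy run the easy step stops after one sample, and the two saved samples let the hard step reach its fourth candidate, which it could not have done with the base allowance alone.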

4. Empirical Results and Trade-offs

Performance is consistently benchmarked on mathematical reasoning (GSM8K, MATH-OAI, OlympiadBench) and code generation (MBPP+, HumanEval+) datasets. For STILL-1, REBASE achieves substantial accuracy improvements over standard Chain-of-Thought (CoT) baselines—e.g., 70.8% on MATH-OAI with LLaMA-3.1-8B compared to 48.2% for CoT, and robust gains across several reasoning-heavy benchmarks (Jiang et al., 2024).

Efficiency-focused REBASE implementations demonstrate Pareto-optimal trade-offs: on GSM8K, REBASE attains 10.94% accuracy at 2.35T FLOPs in 8.47s (width=3), using far fewer resources than baseline models or more exhaustive tree search strategies (e.g., Quiet-STaR reaches 32.03% accuracy but at 12× the latency and compute) (AbdElhameed et al., 2024). Dual-phase approaches further improve sample efficiency and accuracy, outperforming standard beam search by 8–12% absolute at fixed token budgets and improving both math reasoning and code generation accuracy (Cui et al., 29 Sep 2025).

However, increased pruning aggressiveness or low tree/beam width may lead to shallow search and missed valid solutions, while over-reliance on reward may yield reward hacking or mode collapse.

5. Optimization, Regularization, and Mode Collapse

REBASE's “Mode-Balanced Optimization” addresses the recurrent issue whereby RL-trained LLMs, minimizing reverse-KL, favor high-probability modes at the expense of solution diversity (mode collapse). Integrating an auxiliary SFT loss (forward-KL) regularizes the policy, ensuring coverage of long-tail rules and preventing collapse onto shortcut or degenerate patterns (Zhang et al., 10 Feb 2026). In multi-stage curricula, the RL/SFT weighting is scheduled to shift emphasis from SFT-based coverage early in training to RL precision as learning progresses.
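A scheduled weighting of this kind might look like the sketch below. The linear shape, the endpoints, and the names `sft_weight` and `mode_balanced_loss` are assumptions; the source only states that emphasis moves from SFT toward RL over training.

```python
def sft_weight(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal the SFT weight from `start` to `end` (an assumed schedule shape)."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac

def mode_balanced_loss(sft_loss, rl_loss, weight):
    """Convex combination of the SFT and RL losses; `weight` is the SFT share."""
    return weight * sft_loss + (1.0 - weight) * rl_loss

schedule = [sft_weight(s, 11) for s in range(11)]
blended = mode_balanced_loss(sft_loss=2.0, rl_loss=4.0, weight=0.25)
```

Early steps are dominated by the SFT term (coverage) and late steps by the RL term (precision); a staged curriculum could replace the linear ramp with piecewise-constant weights.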

The empirical effect is stabilization of policy entropy, preservation of ranking quality (Pair-ACC), and improved sample efficiency relative to vanilla RL approaches. Ablation demonstrates that dropping either the SFT auxiliary term or the curriculum induces rapid degradation in diversity and ranking performance.

6. Extensions, Practical Recommendations, and Open Challenges

REBASE frameworks generalize beyond mathematical problem-solving to search relevance ranking, code synthesis, and multi-agent planning. Advanced forms incorporate dual-phase (plan-exec) decomposition, early stopping via reward thresholds, and hybridization with explicit rationale-generation mechanisms. Key recommendations include training reward models on both final answers and intermediate chains, dynamically adjusting tree width or allocation based on estimated step difficulty, and integrating hidden representations between chain generation and reward scoring modules (AbdElhameed et al., 2024, Cui et al., 29 Sep 2025).

Persisting challenges include reward model calibration, tuning of budget-related hyperparameters, and the non-trivial interplay between reward-guided search breadth, depth, and final accuracy. Empirical work also suggests that naively integrating reward-based pruning with deep rationale-generation may degrade performance, indicating a need for architectures that reconcile deep reasoning with compute efficiency (AbdElhameed et al., 2024). Theoretical development of scaling laws linking inference budget, reward signal-to-noise, and accuracy remains an open direction.

References:

(Jiang et al., 2024, Zhang et al., 10 Feb 2026, AbdElhameed et al., 2024, Cui et al., 29 Sep 2025)