Bi-directional Reward Models (BiRM)

Updated 14 June 2026
  • BiRM is a neural framework that computes rewards using both historical correctness and future outcome estimates to guide complex reasoning.
  • It integrates backward scores with forward-looking value estimates, distinguishing promising partial solutions from dead-ends.
  • Variants like BiPRM and TinyRM demonstrate improved efficiency and accuracy over traditional unidirectional reward models in LLM applications.

A bi-directional reward model (BiRM) is a neural process supervision framework in which reward assignment to intermediate or full trajectories is computed using signals from both the past (backward; historical reasoning correctness) and the future (forward; estimated probability of eventual solution correctness). Unlike traditional unidirectional reward models, BiRMs exploit the full trajectory context or combine stepwise correctness with prospective value estimates, resulting in improved reward fidelity for complex reasoning tasks in LLMs. This bi-directional paradigm underlies several architectural and algorithmic advances across process reward modeling for mathematical reasoning, stepwise evaluation, and test-time preference selection (Zhang et al., 3 Aug 2025, Chen et al., 6 Mar 2025, Pan, 14 Jul 2025).

1. Theoretical Motivations and Foundations

Conventional process reward models (PRMs) operate in a left-to-right (L2R) fashion, assigning each reasoning step $s_t$ a reward $r_t^{\mathrm{L2R}}$ using only the input question $q$ and the prior steps $s_{<t}$. This unidirectionality yields a “myopic” property: $\frac{\partial r_t^{\mathrm{L2R}}}{\partial s_{t+k}} = 0 \quad \forall k > 0$, so errors or knowledge from future steps cannot revise earlier reward assignments. As a result, PRMs perform well at early error localization but cannot verify, correct, or contextualize earlier steps based on later developments, leading to reduced discriminative power for partial solutions unlikely to succeed.

The motivation for bi-directionality draws on optimal search strategies such as A*: $f(n) = g(n) + h(n)$, where $g(n)$ is the incurred (backward) cost and $h(n)$ is a heuristic (forward) estimate of the cost to the goal. BiRM analogues combine a backward score $g(s_t)$, aggregating the correctness of steps up to $t$, with a forward-looking score $h(s_t)$, estimating the probability that continuations from the current prefix will reach a correct final answer. This fusion enables models to distinguish “dead-ends” from genuinely promising partial solutions, mitigating early over-confidence and late under-confidence in classical evaluators (Chen et al., 6 Mar 2025).

2. Model Formulations and Architectural Variants

Several bi-directional process reward model families have emerged:

2.1. Explicit Value-Augmented BiRM (A*-Style)

BiRM (Chen et al., 6 Mar 2025) attaches two reward heads to a shared encoder:

  • A “reward head” predicting the per-step correctness probability $r_t$,
  • A “value head” estimating $h(s_t)$, the probability that the partial trajectory ending at step $s_t$ leads to a correct answer.

The stepwise score is

$f(s_t) = g(s_t) + \beta\, h(s_t)$

where $g(s_t)$ is often the average or product over prior step rewards, and $h(s_t)$ is the estimated forward value. The weight $\beta$ is tuned per backbone and dataset.
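The A*-style combination described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `beta` weight, and the `mode` switch between averaging and multiplying step rewards are all assumptions introduced here.

```python
import math
from typing import Sequence


def backward_score(step_rewards: Sequence[float], mode: str = "mean") -> float:
    """Aggregate per-step correctness probabilities into one backward score."""
    if not step_rewards:
        raise ValueError("need at least one step reward")
    if mode == "mean":
        return sum(step_rewards) / len(step_rewards)
    if mode == "prod":
        return math.prod(step_rewards)
    raise ValueError(f"unknown aggregation mode: {mode}")


def birm_score(step_rewards: Sequence[float], forward_value: float,
               beta: float = 1.0, mode: str = "mean") -> float:
    """A*-style score: backward aggregate plus a weighted forward value.

    `beta` is a hypothetical name for the tuned fusion weight.
    """
    return backward_score(step_rewards, mode) + beta * forward_value


# Two prefixes whose steps look equally correct so far, but whose
# estimated chances of reaching a correct answer differ.
dead_end = birm_score([0.9, 0.9], forward_value=0.1)
promising = birm_score([0.9, 0.9], forward_value=0.8)
print(promising > dead_end)  # True
```

A purely backward-looking scorer would rank the two prefixes equally; the forward term is what separates them.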

2.2. Bidirectional Prompt Reversal (BiPRM)

The Bidirectional Process Reward Model (BiPRM) (Zhang et al., 3 Aug 2025) introduces dual evaluation by feeding both the canonical L2R trajectory and its reversed right-to-left (R2L) version as separate prompts to the same reward model, sharing all parameters and adding no wall-clock latency when the two passes run in parallel. The R2L context for step $s_t$ includes all subsequent steps $s_{>t}$.

The fused per-step BiPRM score averages the two directional rewards: $r_t^{\mathrm{Bi}} = \tfrac{1}{2}\left(r_t^{\mathrm{L2R}} + r_t^{\mathrm{R2L}}\right)$, where the R2L stream employs only prompt reordering. Trajectory-level scoring aggregates these via operators such as min, max, average, or product.
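As a rough sketch of the two-stream evaluation, the snippet below scores each step once with its preceding steps as context (L2R) and once with its subsequent steps in reversed order (R2L), then averages the two. The `scorer` callable, its signature, and the toy scoring rule are hypothetical stand-ins for a real reward model; simple averaging is assumed as the fusion rule.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: (question, context_steps, step) -> reward in [0, 1]
Scorer = Callable[[str, List[str], str], float]


def biprm_step_scores(question: str, steps: Sequence[str], scorer: Scorer) -> List[float]:
    """Average the L2R and R2L rewards of every step."""
    fused = []
    for t, step in enumerate(steps):
        r_l2r = scorer(question, list(steps[:t]), step)                  # earlier steps
        r_r2l = scorer(question, list(reversed(steps[t + 1:])), step)   # later steps, reversed
        fused.append(0.5 * (r_l2r + r_r2l))
    return fused


AGGREGATORS = {
    "min": min,
    "max": max,
    "mean": lambda xs: sum(xs) / len(xs),
    "prod": math.prod,
}


def trajectory_score(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step scores into one trajectory-level score."""
    return AGGREGATORS[how](step_scores)


def toy_scorer(question: str, context: List[str], step: str) -> float:
    """Toy rule: a step scores low if it, or anything in its context, is flagged wrong."""
    return 0.2 if any("wrong" in s for s in context + [step]) else 0.9


steps = ["set up equation", "divide by x (wrong)", "state answer"]
fused = biprm_step_scores("toy question", steps, toy_scorer)
print(fused)
print(trajectory_score(fused, "min"))
```

In this toy run the first step receives a high L2R reward but a low R2L reward, because the later error is visible only in the reversed context; a left-to-right scorer alone would never revise it.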

2.3. Bidirectional Masked LM Rewarding

TinyRM (Pan, 14 Jul 2025) exemplifies a bidirectional approach in preference/reward modeling using masked language model (MLM) encoders. Full instruction–response pairs are presented bidirectionally, allowing the encoder to leverage the global sequence context when predicting preference or safety tokens at a mask location. This results in improved reasoning and safety benchmarking, with strong performance despite using significantly fewer parameters than decoder-based models.

3. Training Objectives, Data, and Calibration

Bidirectional reward models require supervision for both stepwise correctness and future outcome probability:

  • Backward reward labels: per-step binary labels $y_t \in \{0, 1\}$ derived from reference solutions (e.g., “is this step correct?”).
  • Forward/value labels: for each partial prefix, value labels $v_t$ are computed via Monte Carlo rollouts, outcome propagation, or fraction of correct completions.
  • Joint training objective (BiRM (Chen et al., 6 Mar 2025)), a weighted sum of the two heads' losses:

$\mathcal{L}_{\mathrm{BiRM}} = \mathcal{L}_{\mathrm{reward}} + \lambda\, \mathcal{L}_{\mathrm{value}}$

where $\lambda$ balances the importance of value modeling and stepwise correctness.
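The supervision described above can be illustrated with a small, framework-free sketch. It estimates a soft value label as the fraction of sampled completions that end correctly, and combines a stepwise loss with a value loss through a single weight. The choice of binary cross-entropy for the reward head and squared error for the value head is an assumption made here for illustration, as are all function names and the `lam` parameter; the source does not specify them.

```python
import math
from itertools import cycle
from typing import Callable, List, Sequence


def mc_value_label(prefix: List[str],
                   rollout: Callable[[List[str]], List[str]],
                   is_correct: Callable[[List[str]], bool],
                   n_rollouts: int = 8) -> float:
    """Soft value label: fraction of sampled completions that end correctly."""
    hits = sum(1 for _ in range(n_rollouts) if is_correct(rollout(prefix)))
    return hits / n_rollouts


def bce(p: float, y: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction (assumed reward-head loss)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def joint_loss(reward_preds: Sequence[float], reward_labels: Sequence[float],
               value_preds: Sequence[float], value_labels: Sequence[float],
               lam: float = 1.0) -> float:
    """Stepwise loss plus lam times value loss (squared error assumed)."""
    l_reward = sum(bce(p, y) for p, y in zip(reward_preds, reward_labels)) / len(reward_preds)
    l_value = sum((v - y) ** 2 for v, y in zip(value_preds, value_labels)) / len(value_preds)
    return l_reward + lam * l_value


# Toy rollout policy that deterministically cycles through final answers.
outcomes = cycle(["42", "41", "41", "42"])
label = mc_value_label(
    ["step 1"],
    rollout=lambda prefix: prefix + [next(outcomes)],
    is_correct=lambda traj: traj[-1] == "42",
    n_rollouts=4,
)
print(label)  # 0.5
```

In a real pipeline the rollout function would sample completions from the policy model, and the number of rollouts trades label noise against annotation cost.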

For BiPRM (Zhang et al., 3 Aug 2025), both L2R and R2L streams are used in training, and rewards are fused accordingly. Loss terms match those of the underlying PRM objectives (BCE, MSE, Q-value ranking, etc.).

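TinyRM (Pan, 14 Jul 2025) is trained with a cloze-style MLM loss, freezing bottom layers and fine-tuning only top layers plus low-rank adapters (using Weight-Decomposed Low-Rank Adaptation, DoRA), optimizing the cross-entropy of the target token $y$ at the mask position: $\mathcal{L}_{\mathrm{MLM}} = -\log p_\theta(y \mid x_{\mathrm{masked}})$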
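A cloze-style loss of this kind reduces to a cross-entropy over the logits at the mask position. The sketch below is an assumption-laden illustration: it normalizes over a small set of candidate label tokens rather than the full vocabulary, and the token names and the `cloze_loss` function are hypothetical.

```python
import math
from typing import Dict


def cloze_loss(mask_logits: Dict[str, float], target: str) -> float:
    """Negative log-probability of the target token at the mask position.

    `mask_logits` maps candidate label tokens to raw logits. Restricting the
    softmax to these candidates is a simplification made for illustration.
    """
    m = max(mask_logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in mask_logits.values()))
    return log_z - mask_logits[target]


# Prompt of the form "<instruction> <response> This response is [MASK]."
uninformed = cloze_loss({"good": 0.0, "bad": 0.0}, target="good")
confident = cloze_loss({"good": 2.0, "bad": 0.0}, target="good")
print(confident < uninformed)  # True
```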

Annotation of forward value labels via “soft” Monte Carlo rollouts provides superior future success estimation.

4. Inference Algorithms and Practical Deployment

4.1. Trajectory Selection

For best-of-$N$ sampling, all candidates are scored using BiRM or BiPRM, typically by computing the bi-directional score at the final step of each trajectory. The trajectory with the highest score is selected.
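Selection among sampled candidates amounts to an argmax over final-step scores. A minimal sketch, assuming a hypothetical `score_final` callable that wraps whichever reward model is in use (the toy score table below is made up for illustration):

```python
from typing import Callable, List, Sequence

Trajectory = List[str]


def best_of_n(candidates: Sequence[Trajectory],
              score_final: Callable[[Trajectory], float]) -> Trajectory:
    """Return the candidate whose final-step score is highest."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    # max() keeps the first candidate in case of ties.
    return max(candidates, key=score_final)


# Made-up final-step scores keyed by each trajectory's last step.
toy_scores = {"x=2": 1.6, "x=3": 0.9, "x=-2": 1.1}
candidates = [["expand", "x=3"], ["factor", "x=2"], ["guess", "x=-2"]]
best = best_of_n(candidates, lambda traj: toy_scores[traj[-1]])
print(best)  # ['factor', 'x=2']
```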

In beam search, stepwise BiRM scores re-rank beam expansions at each stage, using current backward rewards and forward value predictions to guide exploration (see Algorithm 1 in (Chen et al., 6 Mar 2025)). BiPRM maintains parallel evaluation using both forward and reversed prompts per step.
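The re-ranking loop can be sketched generically as follows. This is not Algorithm 1 from the cited paper, only a plain beam search in which a scoring callable decides which prefixes survive; `expand`, `score_prefix`, `is_final`, and the toy reward values are all hypothetical.

```python
from typing import Callable, List

Prefix = List[str]


def beam_search(expand: Callable[[Prefix], List[str]],
                score_prefix: Callable[[Prefix], float],
                is_final: Callable[[Prefix], bool],
                beam_width: int = 2,
                max_steps: int = 8) -> Prefix:
    """Keep the `beam_width` best-scoring prefixes after every expansion."""
    beams: List[Prefix] = [[]]
    for _ in range(max_steps):
        candidates: List[Prefix] = []
        for prefix in beams:
            if prefix and is_final(prefix):
                candidates.append(prefix)  # finished prefixes are carried forward
                continue
            candidates.extend(prefix + [step] for step in expand(prefix))
        if not candidates:
            break
        candidates.sort(key=score_prefix, reverse=True)
        beams = candidates[:beam_width]
        if all(is_final(b) for b in beams):
            break
    return beams[0]


# Toy problem: two reasoning steps, then an answer step.
STEP_REWARD = {"good": 0.9, "bad": 0.3, "answer": 0.9}


def toy_expand(prefix: Prefix) -> List[str]:
    return ["good", "bad"] if len(prefix) < 2 else ["answer"]


def toy_score(prefix: Prefix) -> float:
    backward = sum(STEP_REWARD[s] for s in prefix) / len(prefix)  # mean step reward
    forward = 0.1 if "bad" in prefix else 0.8                     # stand-in value estimate
    return backward + forward


result = beam_search(toy_expand, toy_score, lambda p: p[-1] == "answer")
print(result)  # ['good', 'good', 'answer']
```

Because the forward term drops sharply once a bad step enters the prefix, branches containing an error fall to the bottom of the ranking even when their later steps look locally correct.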

4.2. Efficiency and Scalability

BiPRM (Zhang et al., 3 Aug 2025) entails zero extra parameters (the same reward model is used for both L2R and R2L) and, if parallelized, incurs no wall-clock latency penalty. If executed serially, only the stepwise scoring cost is doubled, which is minor relative to full LLM sampling. TinyRM (Pan, 14 Jul 2025) achieves further efficiency through masked LM encoders and low-rank adaptation, yielding sub-millisecond latency on modern accelerators even at over 400M parameters.

5. Empirical Performance and Quantitative Evaluation

BiRM and BiPRM consistently surpass unidirectional PRM baselines in mathematical reasoning, process supervision, and preference modeling.

5.1. Process Supervision and Mathematical Reasoning

Model/Setting | Dataset | Best-of-N | Baseline | BiRM/BiPRM | Absolute Gain | Relative Gain
BiPRM (Rho-1B, BCE, MM-13B) | MATH500 | BoN@128 | 25.08 | 33.08 | +8.00 | +31.9%
BiPRM (Qwen2.5-1.5B, BCE, MM) | MATH500 | BoN@128 | 24.20 | 36.60 | +12.40 | +47.1%
BiRM (Qwen2.5-7B) | Gaokao23 | BoN@512 | 47.3 | 50.4 | +3.1 | not reported
BiRM (Qwen2.5-7B) | MATH-500 | BoN@512 | 58.4 | 63.4 | +5.0 | not reported

BiPRM outperforms L2R PRM in all 54 configurations (backbones × objectives × sampling policies × datasets) (Zhang et al., 3 Aug 2025). BiRM achieves up to a 5.0 percentage point improvement over outcome reward models (ORMs) and outperforms PRMs in both sampling and beam search regimes (Chen et al., 6 Mar 2025).

5.2. Bidirectional Preference Modeling

On RewardBench, TinyRM (400M parameters) matches or exceeds 70B decoder models for reasoning (91.2% vs 90.6%) and achieves 89.3% accuracy on safety, while drastically reducing inference compute (Pan, 14 Jul 2025).

Model | Reasoning | Safety | Chat | Overall
ModernBERT-Large Specialist | 91.2 | 89.3 | 78.8 | 86.4
Llama3-SteerLM-RM (70B) | 90.6 | 92.8 | 89.7 | 91.0

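This suggests that bidirectional masked LM architectures are especially effective at leveraging global context for domain-specialized reward modeling such as reasoning, although the 70B decoder model retains a clear lead on Safety (92.8 vs 89.3), Chat (89.7 vs 78.8), and Overall (91.0 vs 86.4).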

6. Analysis, Limitations, and Prospective Directions

BiRMs and BiPRMs are strictly “plug-and-play” with existing PRM architectures and support various loss functions, including classification, regression, and ranking. Main advantages include improved early and late-step error localization, robustness to scale, and strong discrimination in “dead-end” vs “promising” partial reasoning trajectories.

Limitations include:

  • Current evaluations are limited to mathematical and stepwise reasoning; generalization to open-ended dialog, code generation with stateful/irreversible transitions, or long-range dependencies remains unverified.
  • Simple averaging (e.g., $\tfrac{1}{2}\left(r_t^{\mathrm{L2R}} + r_t^{\mathrm{R2L}}\right)$) for fusion may be suboptimal; learnable or attention-based fusion could provide further improvements.
  • R2L/reversed prompt strategies assume solution trajectories are self-contained and fully reversible.

Future research directions include hybrid reward models incorporating human feedback for step-local and global coherence, training with learnable fusion weights, and extending bidirectionality to non-monotonic or multi-pass reward ordering strategies (Zhang et al., 3 Aug 2025, Chen et al., 6 Mar 2025).

BiRMs connect closely to classical planning heuristics (A*-style fusion of incurred and estimated future cost) and to masked language model preference frameworks. The emergence of efficient, low-parameter bidirectional models (e.g., TinyRM) that are competitive with much larger decoder-based reward models on reasoning shows that bidirectional context modeling is well suited to both process and preference evaluation (Pan, 14 Jul 2025). A plausible implication is that future reward modeling for LLMs will increasingly rely on bidirectional or non-myopic architectures, especially in domains requiring global consistency and strong error correction.
