Bi-directional Reward Models (BiRM)
- BiRM is a neural framework that computes rewards using both historical correctness and future outcome estimates to guide complex reasoning.
- It integrates backward scores with forward-looking value estimates, distinguishing promising partial solutions from dead-ends.
- Variants like BiPRM and TinyRM demonstrate improved efficiency and accuracy over traditional unidirectional reward models in LLM applications.
A bi-directional reward model (BiRM) is a neural process supervision framework in which reward assignment to intermediate or full trajectories is computed using signals from both the past (backward; historical reasoning correctness) and the future (forward; estimated probability of eventual solution correctness). Unlike traditional unidirectional reward models, BiRMs exploit the full trajectory context or combine stepwise correctness with prospective value estimates, resulting in improved reward fidelity for complex reasoning tasks in LLMs. This bi-directional paradigm underlies several architectural and algorithmic advances across process reward modeling for mathematical reasoning, stepwise evaluation, and test-time preference selection (Zhang et al., 3 Aug 2025, Chen et al., 6 Mar 2025, Pan, 14 Jul 2025).
1. Theoretical Motivations and Foundations
Conventional process reward models (PRMs) operate in a left-to-right (L2R) fashion, assigning each reasoning step $s_t$ a reward using only the input question $q$ and the prior steps $s_1, \dots, s_{t-1}$. This unidirectionality makes them “myopic”: errors or information in later steps cannot revise earlier reward assignments. As a result, PRMs perform well at early error localization but cannot verify, correct, or contextualize earlier steps based on later developments, which reduces their discriminative power for partial solutions that are unlikely to succeed.
The motivation for bi-directionality draws on optimal search strategies such as A*, which ranks a node $n$ by $f(n) = g(n) + h(n)$, where $g(n)$ is the incurred (backward) cost and $h(n)$ is a heuristic (forward) estimate of the cost to the goal. BiRM analogues combine a backward score $g$, aggregating the correctness of steps up to the current step $t$, with a forward-looking score $h$, estimating the probability that continuations from the current prefix will reach a correct final answer. This fusion enables models to distinguish “dead-ends” from genuinely promising partial solutions, mitigating early over-confidence and late under-confidence in classical evaluators (Chen et al., 6 Mar 2025).
2. Model Formulations and Architectural Variants
Several bi-directional process reward model families have emerged:
2.1. Explicit Value-Augmented BiRM (A*-Style)
BiRM (Chen et al., 6 Mar 2025) attaches two reward heads to a shared encoder:
- A “reward head” $R_\phi$ predicting the per-step correctness probability $\hat{r}_t$,
- A “value head” $V_\psi$ estimating the probability that a partial trajectory $s_{1:t}$ leads to a correct answer, $\hat{v}_t$.
The stepwise score is
$$f(s_{1:t}) = g(s_{1:t}) + \beta \, h(s_{1:t})$$
where $g(s_{1:t})$ is often the average or product over prior step rewards, and $h(s_{1:t})$ is the estimated forward value. $\beta$ is tuned per backbone and dataset.
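The combination of a backward aggregate and a weighted forward value can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `mode` switch, and the example numbers are assumptions made here, and the reward and value inputs stand in for the outputs of the two heads.

```python
from math import prod
from typing import Sequence


def backward_score(step_rewards: Sequence[float], mode: str = "mean") -> float:
    """Aggregate per-step correctness probabilities into a backward score."""
    if not step_rewards:
        raise ValueError("need at least one step reward")
    if mode == "mean":
        return sum(step_rewards) / len(step_rewards)
    if mode == "prod":
        return prod(step_rewards)
    raise ValueError(f"unknown mode: {mode}")


def birm_score(step_rewards: Sequence[float], forward_value: float,
               beta: float = 1.0, mode: str = "mean") -> float:
    """Backward aggregate plus beta times the forward value estimate."""
    return backward_score(step_rewards, mode) + beta * forward_value


# Two prefixes with identical step rewards but different outlooks
# (hypothetical numbers chosen only for illustration).
rewards = [0.9, 0.8, 0.7]
promising = birm_score(rewards, forward_value=0.75, beta=0.5)
dead_end = birm_score(rewards, forward_value=0.05, beta=0.5)
print(promising > dead_end)
```

A purely backward score would rate the two prefixes identically; the forward term is what separates them.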
2.2. Bidirectional Prompt Reversal (BiPRM)
The Bidirectional Process Reward Model (BiPRM) (Zhang et al., 3 Aug 2025) introduces dual evaluation by feeding both the canonical L2R trajectory and its reversed right-to-left (R2L) version as separate prompts to the same reward model $R_\theta$, sharing all parameters and adding no latency when the two passes run in parallel. The R2L context for step $s_t$ includes all subsequent steps $s_{t+1}, \dots, s_T$.
The fused per-step BiPRM score is $r_t = \tfrac{1}{2}\left(r_t^{\text{L2R}} + r_t^{\text{R2L}}\right)$, where the R2L stream differs from the L2R stream only in prompt ordering. Trajectory-level scoring aggregates these per-step scores via operators such as min, max, average, or product.
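The two-stream evaluation and the trajectory-level aggregation can be sketched as follows. This is a hedged illustration: the `StepScorer` signature, the exact way the R2L context is built (later steps in reversed order), and the keyword-based `toy_scorer` are assumptions made for the example, not details taken from the paper.

```python
from math import prod
from typing import Callable, List, Sequence

# Hypothetical scorer signature: (question, context_steps, target_step) -> probability.
StepScorer = Callable[[str, Sequence[str], str], float]

AGGREGATORS = {
    "min": min,
    "max": max,
    "mean": lambda xs: sum(xs) / len(xs),
    "prod": prod,
}


def biprm_step_scores(question: str, steps: Sequence[str],
                      scorer: StepScorer) -> List[float]:
    """Average the L2R and R2L scores of every step, using one shared scorer."""
    fused = []
    for t, step in enumerate(steps):
        l2r_context = list(steps[:t])                 # earlier steps, original order
        r2l_context = list(reversed(steps[t + 1:]))   # later steps, reversed (assumed)
        l2r = scorer(question, l2r_context, step)
        r2l = scorer(question, r2l_context, step)
        fused.append(0.5 * (l2r + r2l))
    return fused


def trajectory_score(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step scores into one trajectory-level score."""
    return AGGREGATORS[how](step_scores)


def toy_scorer(question: str, context: Sequence[str], step: str) -> float:
    """Stand-in for a reward model: distrusts anything near a 'wrong' step."""
    tainted = "wrong" in step or any("wrong" in s for s in context)
    return 0.2 if tainted else 0.9


fused = biprm_step_scores("q", ["step a", "step b", "wrong step c"], toy_scorer)
print(fused, trajectory_score(fused, "min"))
```

In this toy run the first two steps look fine from the left but are pulled down by the R2L stream, which sees the faulty final step.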
2.3. Bidirectional Masked Language Model Rewarding
TinyRM (Pan, 14 Jul 2025) exemplifies a bidirectional approach in preference/reward modeling using masked language model (MLM) encoders. Full instruction–response pairs are presented bidirectionally, allowing the encoder to leverage the global sequence context when predicting preference or safety tokens at a mask location. This yields strong results on reasoning and safety benchmarks despite using significantly fewer parameters than decoder-based models.
3. Training Objectives, Data, and Calibration
Bidirectional reward models require supervision for both stepwise correctness and future outcome probability:
- Backward reward labels: per-step binary labels $y_t \in \{0, 1\}$ derived from reference solutions (e.g., “is this step correct?”).
- Forward/value labels: for each partial prefix, value labels $v_t \in [0, 1]$ are computed via Monte Carlo rollouts, outcome propagation, or the fraction of correct completions.
- Joint training objective (BiRM (Chen et al., 6 Mar 2025)):
$$\mathcal{L}_{\text{BiRM}} = \mathcal{L}_{\text{reward}} + c \, \mathcal{L}_{\text{value}}$$
where $c$ balances the importance of value modeling and stepwise correctness.
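A weighted sum of a stepwise-correctness loss and a value loss might be computed as in the sketch below. The specific loss types are assumptions for illustration (binary cross-entropy for the reward head, mean squared error for the value head); the source does not state which losses BiRM uses, and the function names and the weight `c` are likewise hypothetical.

```python
from math import log
from typing import Sequence


def bce(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * log(p) + (1 - y) * log(1.0 - p))
    return total / len(probs)


def mse(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted and target values."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def joint_loss(reward_probs: Sequence[float], step_labels: Sequence[int],
               value_preds: Sequence[float], value_labels: Sequence[float],
               c: float = 1.0) -> float:
    """Stepwise-correctness loss plus c times the value loss (assumed loss types)."""
    return bce(reward_probs, step_labels) + c * mse(value_preds, value_labels)


# One training prefix with three steps (hypothetical numbers).
loss = joint_loss(
    reward_probs=[0.9, 0.8, 0.3], step_labels=[1, 1, 0],
    value_preds=[0.7, 0.6, 0.2], value_labels=[0.75, 0.5, 0.0],
    c=0.5,
)
print(round(loss, 4))
```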
For BiPRM (Zhang et al., 3 Aug 2025), both L2R and R2L streams are used in training, and rewards are fused accordingly. Loss terms match those of the underlying PRM objectives (BCE, MSE, Q-value ranking, etc.).
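TinyRM (Pan, 14 Jul 2025) is trained with a cloze-style MLM loss, freezing the bottom layers and fine-tuning only the top layers plus low-rank adapters (using Weight-Decomposed Low-Rank Adaptation, DoRA), optimizing the cross-entropy of the label token $y$ at the masked position given the rest of the sequence $x$: $\mathcal{L}_{\text{MLM}} = -\log p_\theta(y \mid x)$.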
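A cloze-style loss of this kind reduces to a cross-entropy over the vocabulary entries at the mask position. The sketch below shows that computation on a toy two-token label vocabulary; the token names and logit values are hypothetical, and the function is a generic illustration rather than TinyRM's training code.

```python
from math import exp, log
from typing import Mapping


def masked_token_loss(mask_logits: Mapping[str, float], target: str) -> float:
    """Negative log-probability of the target token at the masked position."""
    peak = max(mask_logits.values())
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_z = peak + log(sum(exp(v - peak) for v in mask_logits.values()))
    return log_z - mask_logits[target]


# Hypothetical logits over two label tokens at the mask location.
logits = {"good": 2.0, "bad": 0.0}
print(masked_token_loss(logits, "good") < masked_token_loss(logits, "bad"))
```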
Annotation of forward value labels via “soft” Monte Carlo rollouts provides superior future success estimation.
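One common way to obtain such labels is to sample several completions from a prefix and record how many end in a correct answer. The sketch below assumes a caller-supplied `rollout_is_correct` callback (hypothetical, standing in for "sample a completion with the policy and check its final answer"). The "hard" variant, which labels a prefix 1 if any rollout succeeds, is included as a familiar convention for contrast and is not taken from the source.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical callback: completes the prefix once and reports correctness.
Rollout = Callable[[Sequence[str], random.Random], bool]


def mc_value_label(prefix: Sequence[str], rollout_is_correct: Rollout,
                   num_rollouts: int = 8, soft: bool = True,
                   rng: Optional[random.Random] = None) -> float:
    """Estimate the forward value label of a prefix from Monte Carlo rollouts."""
    if rng is None:
        rng = random.Random(0)
    hits = sum(bool(rollout_is_correct(prefix, rng)) for _ in range(num_rollouts))
    if soft:
        return hits / num_rollouts      # fraction of correct completions
    return 1.0 if hits > 0 else 0.0     # hard label: any success counts


def coin_flip_policy(prefix: Sequence[str], rng: random.Random) -> bool:
    """Toy stand-in for a policy whose completions are right half of the time."""
    return rng.random() < 0.5


label = mc_value_label(["step 1", "step 2"], coin_flip_policy, num_rollouts=16)
print(0.0 <= label <= 1.0)
```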
4. Inference Algorithms and Practical Deployment
4.1. Trajectory Selection
For best-of-$N$ sampling, all candidates are scored using BiRM or BiPRM policies, typically by computing the model's score $f$ for the final step of each trajectory. The trajectory maximizing $f$ is selected.
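Selection among sampled candidates then amounts to an argmax over final-step scores. The sketch below assumes each candidate already carries its per-step rewards and a final forward value; the `Candidate` type, the mean-based backward aggregate, and the numbers are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    answer: str
    step_rewards: Sequence[float]  # per-step correctness probabilities
    final_value: float             # forward value estimate at the last step


def final_score(cand: Candidate, beta: float = 1.0) -> float:
    """Mean step reward plus beta times the forward value at the final step."""
    backward = sum(cand.step_rewards) / len(cand.step_rewards)
    return backward + beta * cand.final_value


def best_of_n(candidates: Sequence[Candidate], beta: float = 1.0) -> Candidate:
    """Return the candidate with the highest final-step score."""
    return max(candidates, key=lambda c: final_score(c, beta))


pool = [
    Candidate("A", [0.9, 0.9], final_value=0.2),
    Candidate("B", [0.7, 0.8], final_value=0.9),
]
print(best_of_n(pool, beta=1.0).answer, best_of_n(pool, beta=0.0).answer)
```

With `beta=0.0` the selection falls back to a purely backward criterion, which makes the effect of the forward term easy to inspect.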
In beam search, stepwise BiRM scores re-rank beam expansions at each stage, using current backward rewards and forward value predictions to guide exploration (see Algorithm 1 in (Chen et al., 6 Mar 2025)). BiPRM maintains parallel evaluation using both forward and reversed prompts per step.
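The re-ranking loop can be sketched generically. This is not Algorithm 1 from the cited paper: `expand` (a policy proposing next steps) and `score` (a reward model scoring a prefix) are hypothetical callbacks, and the sketch ignores early termination of finished trajectories for brevity.

```python
from typing import Callable, List, Sequence, Tuple

Prefix = Tuple[str, ...]


def beam_search(expand: Callable[[Prefix], Sequence[str]],
                score: Callable[[Prefix], float],
                beam_width: int, max_steps: int,
                start: Prefix = ()) -> List[Prefix]:
    """Keep the beam_width highest-scoring prefixes after each expansion."""
    beam: List[Prefix] = [start]
    for _ in range(max_steps):
        candidates = [prefix + (step,) for prefix in beam for step in expand(prefix)]
        if not candidates:
            break
        # Re-rank all expansions by the reward model's score, best first.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return beam


def toy_expand(prefix: Prefix) -> Sequence[str]:
    """Stand-in policy: always proposes the same two next steps."""
    return ["a", "b"]


def toy_score(prefix: Prefix) -> float:
    """Stand-in reward model: prefers prefixes containing more 'a' steps."""
    return float(prefix.count("a"))


result = beam_search(toy_expand, toy_score, beam_width=2, max_steps=3)
print(result[0])
```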
4.2. Efficiency and Scalability
BiPRM (Zhang et al., 3 Aug 2025) entails zero extra parameters (the same reward model is used for both L2R and R2L) and, if parallelized, incurs no wall-clock latency penalty. Only stepwise inference costs are doubled if executed serially, which is minor relative to full LLM sampling. TinyRM (Pan, 14 Jul 2025) achieves further efficiency through masked LM encoders and low-rank adaptation, yielding sub-millisecond latency on modern accelerators even at over 400M parameters.
5. Empirical Performance and Quantitative Evaluation
BiRM and BiPRM consistently surpass unidirectional PRM baselines in mathematical reasoning, process supervision, and preference modeling.
5.1. Process Supervision and Mathematical Reasoning
| Model/Setting | Dataset | Sampling budget | Baseline | BiRM/BiPRM | Absolute Gain | Relative Gain |
|---|---|---|---|---|---|---|
| BiPRM (Rho-1B, BCE, MM-13B) | MATH-500 | BoN@128 | 25.08 | 33.08 | +8.00 | +31.9% |
| BiPRM (Qwen2.5-1.5B, BCE, MM) | MATH-500 | BoN@128 | 24.20 | 36.60 | +12.40 | +47.1% |
| BiRM (Qwen2.5-7B) | Gaokao23 | BoN@512 | 47.3 | 50.4 | +3.1 | — |
| BiRM (Qwen2.5-7B) | MATH-500 | BoN@512 | 58.4 | 63.4 | +5.0 | — |
BiPRM outperforms L2R PRM in all 54 configurations (backbones × objectives × sampling policies × datasets) (Zhang et al., 3 Aug 2025). BiRM achieves up to a 5.0 percentage point improvement over outcome reward models (ORMs) and outperforms PRMs in both sampling and beam search regimes (Chen et al., 6 Mar 2025).
5.2. Bidirectional Preference Modeling
On RewardBench, TinyRM (400M parameters) matches or exceeds 70B decoder models for reasoning (91.2% vs 90.6%) and achieves 89.3% accuracy on safety, while drastically reducing inference compute (Pan, 14 Jul 2025).
| Model | Reasoning | Safety | Chat | Overall |
|---|---|---|---|---|
| ModernBERT-Large Specialist | 91.2 | 89.3 | 78.8 | 86.4 |
| Llama3-SteerLM-RM (70B) | 90.6 | 92.8 | 89.7 | 91.0 |
This suggests that bidirectional masked LM architectures can leverage global context effectively for domain-specialized reward modeling such as reasoning, although the 70B decoder baseline in the table above remains ahead on safety, chat, and overall score.
6. Analysis, Limitations, and Prospective Directions
BiRMs and BiPRMs are “plug-and-play” with existing PRM architectures and support various loss functions, including classification, regression, and ranking. Their main advantages include improved early and late-step error localization, robustness to scale, and strong discrimination between “dead-end” and “promising” partial reasoning trajectories.
Limitations include:
- Current evaluations are limited to mathematical and stepwise reasoning; generalization to open-ended dialog, code generation with stateful/irreversible transitions, or long-range dependencies remains unverified.
- Simple averaging (e.g., $\tfrac{1}{2}\left(r_t^{\text{L2R}} + r_t^{\text{R2L}}\right)$) for fusion may be suboptimal; learnable or attention-based fusion could provide further improvements.
- R2L/reversed prompt strategies assume solution trajectories are self-contained and fully reversible.
Future research directions include hybrid reward models incorporating human feedback for step-local and global coherence, training with learnable fusion weights, and extending bidirectionality to non-monotonic or multi-pass reward ordering strategies (Zhang et al., 3 Aug 2025, Chen et al., 6 Mar 2025).
7. Related Methods and Broader Context
BiRMs connect closely to classical planning heuristics (A*-style fusion of incurred and estimated future cost) and to masked language model preference frameworks. The emergence of efficient, low-parameter bidirectional models (e.g., TinyRM) that are competitive with much larger decoder-based reward models demonstrates the suitability of bidirectional context modeling for both process and preference evaluation (Pan, 14 Jul 2025). A plausible implication is that future reward modeling for LLMs will increasingly rely on bidirectional or non-myopic architectures, especially in domains requiring global consistency and strong error correction.