PRM-BAS: Guided Beam Annealing for Multimodal Reasoning
- The paper demonstrates that PRM-BAS employs process reward models to dynamically guide an adaptive beam search, leading to state-of-the-art performance with minimal computational overhead.
- PRM-BAS uses an annealing strategy that reduces the beam size as confidence in candidate steps increases, efficiently managing resource allocation during multi-hop reasoning.
- Empirical evaluations across eight benchmarks show that PRM-BAS consistently improves accuracy by 3–11 percentage points over traditional search methods.
PRM-BAS (Process Reward Model–Guided Beam Annealing Search) is a lightweight search and inference strategy designed to enhance the step-wise reasoning performance of Multimodal LLMs (MLLMs) by dynamically guiding search with Process Reward Models (PRMs). PRM-BAS addresses key limitations in prior reasoning systems by introducing dense, step-wise reward estimation and adaptive, resource-efficient search dynamics, achieving state-of-the-art accuracy with minimal computational overhead across a broad set of benchmarks (Hu et al., 14 Apr 2025).
1. Motivation and Core Principles
Traditional Outcome Reward Models (ORMs) in MLLM reasoning evaluate only complete solutions after inference, resulting in delayed feedback and challenging credit assignment for intermediate steps. PRMs, by contrast, provide densely distributed, step-level supervision, making them well-suited for guiding complex, multi-hop reasoning. However, integrating PRMs into practical search procedures has proven nontrivial; prior approaches such as Best-of-N sampling, Monte Carlo Tree Search (MCTS), and fixed-beam search are generally either computationally expensive or insufficiently granular for real-world multimodal reasoning scenarios.
PRM-BAS's main insight is to adaptively allocate search resources as a function of the state-wise confidence in PRM estimates. Early in the reasoning chain, contextual information is scarce and PRM guidance is correspondingly noisy; larger beams are thus maintained to cover the broad space of possibilities. As more context accumulates in later reasoning steps and PRM estimates become more reliable, the search beam is progressively "annealed"—that is, the number of tracked candidates is systematically reduced. This approach realizes superior performance-efficiency trade-offs by focusing search effort where it is most needed.
2. Formal Algorithmic Structure
The PRM-BAS algorithm operates on an MDP in which each state $s_t = (I, q, a_{1:t-1})$ comprises the input image $I$, the textual question $q$, and the partial answer $a_{1:t-1}$. The agent selects an action $a_t$, corresponding to a token segment of fixed length, and transitions to state $s_{t+1} = (I, q, a_{1:t})$.
The PRM estimates the probability that a given state-action pair $(s_t, a_t)$ will lead to a correct final answer. At each step $t$:
- The beam size is set via a linear annealing rule: $B_t = \max(B_0 - \alpha t,\ B_{\min})$, with $B_0$ the initial beam size, $\alpha$ the annealing rate, and $B_{\min}$ a minimum beam threshold.
- Each hypothesis in the current beam expands by sampling $K$ candidate actions using the policy $\pi$.
- Candidate continuations are scored using the PRM, aggregated into auxiliary node sets, and pruned to the top $B_t$ by cumulative PRM score.
- This process is repeated over $T$ reasoning steps, and the candidate with the highest accumulated PRM-estimated score is selected as the final answer.
This adaptive beam scheduling is a distinguishing feature, directly exploiting the state-dependent variance in PRM predictions.
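The search loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose` and `score` are hypothetical stand-ins for the policy model and the PRM, the cumulative score is assumed to be a plain sum of step scores, and the default values of `b0`, `alpha`, `b_min`, and `k` are arbitrary.

```python
from typing import Callable, List, Tuple

# A hypothesis is (reasoning steps so far, cumulative PRM score).
Hypothesis = Tuple[List[str], float]


def beam_size(t: int, b0: int, alpha: float, b_min: int) -> int:
    """Linear annealing: max(b0 - alpha * t, b_min), truncated to an integer."""
    return max(int(b0 - alpha * t), b_min)


def prm_bas(
    propose: Callable[[List[str], int], List[str]],  # hypothetical policy stub
    score: Callable[[List[str], str], float],        # hypothetical PRM stub
    num_steps: int,
    b0: int = 4,
    alpha: float = 1.0,
    b_min: int = 1,
    k: int = 3,
) -> Hypothesis:
    beam: List[Hypothesis] = [([], 0.0)]
    for t in range(num_steps):
        candidates: List[Hypothesis] = []
        for steps, total in beam:
            # Expand each hypothesis with up to k sampled actions.
            for action in propose(steps, k):
                # Assumption: cumulative score is the sum of step scores.
                candidates.append((steps + [action], total + score(steps, action)))
        # Keep the top-B_t candidates; B_t shrinks as t grows.
        candidates.sort(key=lambda h: h[1], reverse=True)
        beam = candidates[: beam_size(t, b0, alpha, b_min)]
    return max(beam, key=lambda h: h[1])


# Toy stand-ins so the sketch runs without any model.
def toy_propose(steps: List[str], k: int) -> List[str]:
    return ["a", "b", "c"][:k]


def toy_score(steps: List[str], action: str) -> float:
    return {"a": 0.9, "b": 0.5, "c": 0.1}[action]


best_steps, best_total = prm_bas(toy_propose, toy_score, num_steps=3)
```

With these toy stubs the beam narrows from 4 to 3 to 2 slots over three steps. A real system would replace the stubs with sampling from the MLLM and a forward pass of the PRM.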
3. Process Reward Model Objectives
The PRM is trained to provide both absolute and relative assessments of candidate step quality. Specifically:
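- Value loss: A binary cross-entropy loss between the PRM's scalar predictions $v_t^{(i)}$ and rollout-estimated soft rewards $r_t^{(i)}$, averaged over all $T$ steps and $C$ candidates.
$$\mathcal{L}_{\text{value}} = -\frac{1}{TC} \sum_{t=1}^{T} \sum_{i=1}^{C} \left[ r_t^{(i)} \log v_t^{(i)} + \left(1 - r_t^{(i)}\right) \log\left(1 - v_t^{(i)}\right) \right]$$
- Ranking loss: For step $t$, and candidate pairs $(i, j)$ with reward differences exceeding a margin $\delta$, a pairwise log-sigmoid loss encourages correct ranking.
$$\mathcal{L}_{\text{rank}} = -\frac{1}{|\mathcal{P}|} \sum_{(t, i, j) \in \mathcal{P}} \log \sigma\left(v_t^{(i)} - v_t^{(j)}\right), \qquad \mathcal{P} = \left\{ (t, i, j) : r_t^{(i)} - r_t^{(j)} > \delta \right\}$$
- Combined loss: The aggregate training objective is
$$\mathcal{L} = \mathcal{L}_{\text{value}} + \lambda\, \mathcal{L}_{\text{rank}}$$
with $\lambda$ weighting the ranking term.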
Empirical ablations demonstrate that omission of process supervision, rank loss, or soft labels substantially reduces performance under PRM-BAS.
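To make the two loss terms concrete, the following Python sketch computes them for the candidates at a single step. It is an illustration under assumptions rather than the paper's training code: the averaging over qualifying pairs, the use of raw prediction differences inside the sigmoid, and the `margin` and `lam` arguments are all choices made here for clarity.

```python
import math
from typing import Sequence


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def value_loss(preds: Sequence[float], soft_rewards: Sequence[float],
               eps: float = 1e-7) -> float:
    """Binary cross-entropy against soft (fractional) reward labels."""
    total = 0.0
    for v, r in zip(preds, soft_rewards):
        v = min(max(v, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(r * math.log(v) + (1.0 - r) * math.log(1.0 - v))
    return total / len(preds)


def ranking_loss(preds: Sequence[float], soft_rewards: Sequence[float],
                 margin: float) -> float:
    """Pairwise log-sigmoid loss over pairs whose reward gap exceeds margin."""
    losses = []
    for i in range(len(preds)):
        for j in range(len(preds)):
            if soft_rewards[i] - soft_rewards[j] > margin:
                losses.append(-math.log(_sigmoid(preds[i] - preds[j])))
    # Assumption: average over qualifying pairs; zero if there are none.
    return sum(losses) / len(losses) if losses else 0.0


def prm_loss(preds: Sequence[float], soft_rewards: Sequence[float],
             margin: float, lam: float) -> float:
    """Value term plus a weighted ranking term."""
    return value_loss(preds, soft_rewards) + lam * ranking_loss(
        preds, soft_rewards, margin)


# Three candidates at one step: PRM predictions and rollout soft rewards.
example = prm_loss([0.8, 0.4, 0.3], [1.0, 0.5, 0.0], margin=0.3, lam=0.5)
```

The ranking term only fires on pairs that the rollouts separate clearly, so near-ties in the soft rewards do not push the PRM toward an arbitrary ordering.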
4. Data Collection and PRM Training Pipeline
PRM-BAS leverages the PRM-BAS-300k dataset, curated for dense process-supervised multimodal reasoning:
- Composition: Approximately 300,000 Q-A pairs are sourced from MathV360K (math word problems and diagrams), M3CoT (multi-step chain-of-thought), and ChartQA (chart-based reasoning). Multiple-choice and true/false items are largely excluded to avoid label noise.
- Rollout-based step sampling: At each intermediate state $s_t$, $C$ candidate actions are sampled. For each, $M$ full rollouts are performed, and the soft reward $r_t^{(i)}$ is assigned as the mean correctness over completions.
- Adaptive rollout allocation: On a 5k pilot subset, chains are grouped by length and reward variance. Larger rollout counts $M$ are used for high-variance early steps and reduced later.
- Class balancing: Each training mini-batch ensures the ratio of positive to negative actions does not exceed 3:1.
- Training: The PRM, initialized from Qwen2-VL-7B with an MLP reward head, is fine-tuned for two epochs using AdamW (batch size 1024, ZeRO) on the PRM-BAS-300k dataset.
This unified construction and training framework is central to PRM-BAS's efficacy in providing reliable intermediate rewards.
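The rollout-based labeling step can be illustrated with a short Python sketch. It assumes exact string match as the correctness check and uses a hypothetical `complete` callable in place of sampling a full rollout from the policy; neither detail is specified in the description above.

```python
from typing import Callable, Dict, List


def soft_reward(completions: List[str], gold: str) -> float:
    """Mean correctness over rollouts (assumption: exact match after strip)."""
    if not completions:
        raise ValueError("at least one rollout is required")
    correct = sum(c.strip() == gold.strip() for c in completions)
    return correct / len(completions)


def label_candidates(
    candidates: List[str],
    complete: Callable[[str, int], str],  # hypothetical rollout stub
    gold: str,
    num_rollouts: int,
) -> Dict[str, float]:
    """Assign each candidate action a soft reward from its rollouts."""
    labels: Dict[str, float] = {}
    for action in candidates:
        completions = [complete(action, m) for m in range(num_rollouts)]
        labels[action] = soft_reward(completions, gold)
    return labels


# Deterministic toy rollout: "good" always reaches the answer,
# "risky" reaches it on every other rollout, "bad" never does.
def toy_complete(action: str, m: int) -> str:
    if action == "good" or (action == "risky" and m % 2 == 0):
        return "42"
    return "0"


labels = label_candidates(["good", "risky", "bad"], toy_complete,
                          gold="42", num_rollouts=4)
```

A soft label such as 0.5 for the "risky" step carries more information than a hard 0/1 label, which is consistent with the ablation finding that soft labels matter.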
5. Benchmarks and Empirical Performance
PRM-BAS is evaluated across eight challenging multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse VO, DynaMath, M3CoT, ChartQA, LogicVista, and ScienceQA. Two main policy models are assessed: Qwen2-VL-7B and Qwen2.5-VL-7B, with the PRM trained exclusively on Qwen2-VL-7B rollouts.
Comparisons with single-shot, Best-of-N (BoN), and step-level BoN search demonstrate the following:
| Policy Model | Vanilla (%) | +PRM-BAS (%) | Absolute Gain (pp) |
|---|---|---|---|
| Qwen2-VL-7B | 51.2 | 59.6 | +8.4 |
| Qwen2.5-VL-7B | 59.6 | 64.2 | +4.6 |
PRM-BAS consistently outperforms other open-source MLLMs and achieves parity with some closed-source models (e.g., certain GPT-4o/Claude variants). Token-budget comparison (TTS curves) shows PRM-BAS is more efficient than both BoN and step-BoN. Ablation studies confirm that both value and rank terms, and the use of soft labels, are critical for optimal performance.
6. Analysis and Limitations
The adaptive resource allocation of beam annealing yields up to an 8 percentage point improvement without increasing the token budget relative to BoN 8. Most early-step gains are achieved without incurring significant later-step cost. PRM-BAS generalizes across architectures and scales, with observed performance gains of +3–11 percentage points on previously unseen models, including 2B and 3B models and InternVL at 2B and 8B. However, models with architectural or style dissimilarities (such as InternVL) exhibit reduced improvements.
Policy dependence is identified as a key limitation: PRM performance is optimal when the rollout policy used for PRM data matches the inference policy. This suggests benefits from collecting multi-policy data for broader generalization. Current experiments are limited to policy models of up to 8B parameters, and evaluations on larger scales remain open for future investigation.
7. Summary and Significance
PRM-BAS constitutes a plug-and-play wrapper for inference that integrates:
- PRM-based scoring of intermediate reasoning steps,
- Rollout-based soft label generation for dense reward supervision,
- A dual-objective PRM loss incorporating both value and ranking constraints,
- An adaptively annealed beam search that modulates search breadth in concert with PRM reliability.
This approach delivers robust, state-of-the-art multimodal reasoning capabilities with minimal computational overhead and demonstrates strong generalization and flexibility across model architectures and datasets (Hu et al., 14 Apr 2025).