DreamPRM-Code: Modular Code Generation
- DreamPRM-Code is a coding-focused Process Reward Model that leverages function-level (Chain-of-Function) decomposition to structure LLM-generated programs.
- It employs test-time scaling by evaluating multiple candidate solutions with stepwise rewards, achieving 80.9% pass@1 on LiveCodeBench.
- The model uses bi-level meta-learning to iteratively correct noisy labels, enhancing reward model performance for reliable code synthesis.
DreamPRM-Code is a coding-focused Process Reward Model (PRM) that advances code generation for LLMs by combining a function-level step decomposition (Chain-of-Function prompting) with a meta-learning-driven label correction mechanism for robust stepwise reward modeling. Designed for test-time scaling, DreamPRM-Code enables fine-grained evaluation and selection among LLM-generated program candidates, showing state-of-the-art performance on LiveCodeBench (Zhang et al., 17 Dec 2025).
1. Chain-of-Function Prompting for Code Decomposition
Existing PRM techniques in mathematical reasoning decompose solutions into textual steps (Chain-of-Thought); code lacks such natural step granularity. DreamPRM-Code addresses this by introducing Chain-of-Function (CoF) prompting, which forces modular, top-down program construction: the LLM is prompted to write a high-level main() function (with a docstring outlining the algorithm), followed by top-level helper functions (build_graph(), dijkstra(), etc.), each with a descriptive docstring. Nested function definitions are explicitly discouraged.
Upon generation, the model output is parsed into a sequence of code blocks $(s_1, \ldots, s_K)$, where $s_1$ is the main function and the subsequent $s_2, \ldots, s_K$ are helpers. Each partial program $s_{1:k}$ becomes an intermediate "state" for reward assignment, rendering every function a distinct reasoning step for the PRM.
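This parsing step can be sketched with Python's `ast` module; the CoF-style example program and its helper names (`build_graph`, `dijkstra`) are illustrative, not taken from the paper:

```python
import ast
import textwrap

# A toy CoF-style generation: main() first, then flat top-level helpers.
COF_OUTPUT = textwrap.dedent('''
    def main(edges, n, src):
        """Build the graph, then run Dijkstra from src."""
        graph = build_graph(edges, n)
        return dijkstra(graph, src)

    def build_graph(edges, n):
        """Adjacency list from (u, v, w) triples."""
        graph = [[] for _ in range(n)]
        for u, v, w in edges:
            graph[u].append((v, w))
        return graph

    def dijkstra(graph, src):
        """Shortest-path distances from src (body elided in this sketch)."""
        ...
''')

def split_into_steps(source: str) -> list[str]:
    """Parse a CoF generation into one code block per top-level function."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, ast.FunctionDef)]

steps = split_into_steps(COF_OUTPUT)
# steps[0] is main(); the rest are helpers, one PRM "state" boundary each.
```

Because CoF discourages nested definitions, iterating over the module body (rather than walking the full tree) is enough to recover one block per function.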
2. Process Reward Model Formalization and Test-Time Scaling
DreamPRM-Code trains a stepwise reward model $r_\theta$ that assigns a reward $r_\theta(s_{1:k})$ to each intermediate code state in a trajectory $\tau = (s_1, \ldots, s_K)$. The aggregate trajectory reward is the average of the stepwise rewards,

$$R(\tau) = \frac{1}{K} \sum_{k=1}^{K} r_\theta(s_{1:k}).$$

At test time, the base LLM produces $N$ candidate solution trajectories $\tau_1, \ldots, \tau_N$ via Chain-of-Function prompting. The model computes $R(\tau_i)$ for each and selects

$$\tau^* = \arg\max_{1 \le i \le N} R(\tau_i)$$

as the final chosen solution. This "test-time scaling" leverages process-level credit assignment to select superior code among LLM outputs.
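The selection rule is best-of-N with an averaged stepwise score. A minimal sketch, with an invented stand-in scorer in place of the trained PRM:

```python
def trajectory_reward(step_rewards):
    """Aggregate stepwise PRM rewards by averaging over the K steps."""
    return sum(step_rewards) / len(step_rewards)

def select_best(candidates, prm_score):
    """Best-of-N: score each candidate's steps, pick the argmax trajectory."""
    return max(candidates,
               key=lambda steps: trajectory_reward([prm_score(s) for s in steps]))

def toy_prm(step):
    # Stand-in for the learned PRM: reward a step for carrying a docstring.
    return 1.0 if '"""' in step else 0.0

cand_a = ['def main():\n    """plan"""', 'def helper():\n    pass']         # avg 0.5
cand_b = ['def main():\n    """plan"""', 'def helper():\n    """sort"""']   # avg 1.0
best = select_best([cand_a, cand_b], toy_prm)
```

Swapping `toy_prm` for a real model call leaves the selection logic unchanged, which is what makes the scheme a pure test-time addition on top of any base LLM.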
3. Meta-Learning-Based Label Correction via Bi-Level Optimization
Reward model training suffers from label noise due to Monte Carlo (MC) sampling: intermediate labels are binary values indicating whether a partial program state eventually leads to a passing final solution, and they are unreliable because the rollout trajectories are stochastic. DreamPRM-Code introduces a meta-learning-based label correction scheme that treats the noisy labels as tunable parameters. The training procedure collects two datasets:
- $\mathcal{D}_{\text{train}}$: noisy intermediate states and their MC-derived labels $\tilde{y}$
- $\mathcal{D}_{\text{meta}}$: full-program states with clean unit-test-derived labels $y$
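To make the noise source concrete, here is a toy sketch of MC labeling: a partial state's label is 1 if any of a few sampled rollouts completes into a program that passes the tests. The rollout simulator and its pass probabilities are invented for illustration:

```python
import random

def mc_label(pass_prob, n_rollouts=8, rng=None):
    """Monte Carlo step label: 1 if any sampled rollout passes the tests.

    pass_prob stands in for rolling out an LLM completion of the partial
    program and running the unit tests on the result.
    """
    rng = rng or random.Random(0)
    return int(any(rng.random() < pass_prob for _ in range(n_rollouts)))

# A promising partial state can still be labeled 0 (and a weak one can get
# lucky): with few rollouts the binary label is a noisy estimate of whether
# the state can lead to a correct program.
labels_good = [mc_label(0.40, rng=random.Random(i)) for i in range(100)]
labels_weak = [mc_label(0.05, rng=random.Random(i)) for i in range(100)]
```

This per-state flakiness is exactly the noise that the bi-level correction scheme below is designed to absorb.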
The model solves a bi-level optimization problem:
- Lower-level: learn PRM parameters $\theta$ to minimize the binary cross-entropy (BCE) loss over the noisy intermediate labels:

$$\theta^*(\tilde{y}) = \arg\min_{\theta} \sum_{(s, \tilde{y}_k) \in \mathcal{D}_{\text{train}}} \mathrm{BCE}\big(r_\theta(s), \tilde{y}_k\big)$$

- Upper-level: adjust the intermediate labels $\tilde{y}$ so the resulting PRM yields optimal performance on the clean meta dataset:

$$\min_{\tilde{y}} \sum_{(s, y) \in \mathcal{D}_{\text{meta}}} \mathrm{BCE}\big(r_{\theta^*(\tilde{y})}(s), y\big)$$
Here, the noisy labels $\tilde{y}$ are iteratively updated to minimize the outer loss, backpropagating through the implicit dependence of $\theta^*$ on $\tilde{y}$. In practice, one-step unrolled alternating optimization is used per batch, jointly improving label quality and PRM generalization.
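The one-step unrolled scheme can be sketched on a toy stand-in: a logistic model replaces the PRM so that the meta-gradient through the unroll has a closed form. All data, sizes, and step sizes here are synthetic illustrations, not the paper's setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)

# Stand-in for D_train: states with 30%-flipped labels, mimicking MC noise.
X_train = rng.normal(size=(64, d))
y_clean = (X_train @ w_true > 0).astype(float)
flip = rng.random(64) < 0.3
y_noisy = np.where(flip, 1.0 - y_clean, y_clean)

# Stand-in for D_meta: states with clean unit-test-derived labels.
X_meta = rng.normal(size=(32, d))
y_meta = (X_meta @ w_true > 0).astype(float)

w = np.zeros(d)                  # "PRM" parameters (logistic model)
labels = y_noisy.copy()          # noisy labels, treated as tunable
alpha, meta_lr = 0.5, 1.0        # inner and outer step sizes

for _ in range(200):
    # Lower level: one gradient step on the current (corrected) labels.
    p = sigmoid(X_train @ w)
    grad_w = X_train.T @ (p - labels) / len(labels)
    w_new = w - alpha * grad_w

    # Upper level: meta-gradient of the clean loss w.r.t. the labels,
    # through the one-step unroll (d w_new / d labels_i = alpha * x_i / n).
    p_meta = sigmoid(X_meta @ w_new)
    grad_theta = X_meta.T @ (p_meta - y_meta) / len(y_meta)
    grad_labels = alpha * (X_train @ grad_theta) / len(labels)

    labels = np.clip(labels - meta_lr * grad_labels, 0.0, 1.0)
    w = w_new

meta_loss = bce(sigmoid(X_meta @ w), y_meta)
```

The key design choice mirrored here is that the labels themselves receive gradient updates (clipped to $[0,1]$), while the model only ever trains against whatever the current labels say.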
4. Training and Inference Pipeline
Training Procedure:
- Chain-of-Function trajectories and MC labels are generated from Qwen3-Coder-30B-A3B.
- The PRM is trained as a binary classifier (replacing the LM head of a Qwen-2.5-Coder-3B backbone) over stepwise states, optimized with Adam (learning rate: 1e-4, weight decay: 1e-2, batch size: 8) and meta-learning rate 1e-2.
- The meta label correction loop runs per batch, alternating between PRM updates and label refinement.
Inference (Test-Time Scaling):
- For each coding problem, O4-mini-high is prompted with CoF to produce $N$ candidate code solutions.
- Each sample is decomposed into function-level steps; stepwise rewards are computed and averaged to obtain $R(\tau_i)$.
- The solution with maximal $R(\tau_i)$ is selected.
5. Empirical Evaluation on LiveCodeBench
DreamPRM-Code was evaluated on LiveCodeBench under a strict temporal split:
- Training set: 601 problems published before 2024-08-01.
- Test set: 131 problems published after 2025-02-01.
Baseline comparisons included unguided LLMs (O4-mini-high, Gemini-2.5, DeepSeek-R1, O3), test-time scaling with Outcome Reward Models (ORMs), and an ablation "PRM-CoF" omitting label correction. Results are summarized below:
| Method | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| Gemini-2.5 | 82.1 | 52.5 | – | 72.5 |
| O3 | 71.8 | 57.4 | – | 71.8 |
| DeepSeek-R1 | 99.7 | 77.7 | 47.2 | 68.7 |
| O4-mini-high | 89.7 | 57.4 | – | 77.1 |
| ORM (O4-mini-high) | 89.7 | 62.3 | – | 79.4 |
| PRM-CoF | 92.3 | 62.3 | – | 80.2 |
| DreamPRM-Code | 92.3 | 63.9 | – | 80.9 |
DreamPRM-Code achieves 80.9% pass@1 on the test set, outperforming O4-mini-high by 3.8 percentage points and all ablations, demonstrating the benefits of function-level decomposition and meta label correction for reliable end-to-end code generation.
6. Algorithmic Workflow
The combined PRM and meta-correction training is as follows:
```python
for epoch in range(num_epochs):
    for batch in training_data:
        # Generate CoF trajectories and MC labels
        S = [s1, ..., sK]                # Chain-of-Function steps
        Y_noisy = [y1, ..., yK]          # MC step labels
        y_meta = final_unit_test_label   # for the full program

        # Form batches
        X = union_of_steps; Y = union_of_labels
        X_meta, Y_meta = CoF_final_states_and_test_results

        # Meta label correction loop:
        # 1. PRM parameter update (gradient step on noisy labels)
        # 2. Label update (one-step unroll via meta-gradient)
```
7. Context within PRMs and Future Directions
DreamPRM-Code extends the general PRM paradigm, moving beyond previously studied domains (e.g., mathematical reasoning, multimodal reasoning (Cao et al., 26 May 2025)) by introducing a modular approach aligned with software engineering practices (function decomposition) and rigorous handling of label noise through bi-level meta-learning. The clear separation of code into interpretable intermediate states, combined with systematic label denoising tethered to unit tests, positions DreamPRM-Code as a template for robust, test-time scalable reward modeling in program synthesis. This approach suggests a broader utility of function-level reasoning and meta-labeling strategies for other domains lacking well-defined intermediate annotation schemes.
The methodology demonstrates that step-level granularity and noise correction are synergistic in maximizing LLM code generation quality, supporting adoption in future PRM frameworks and laying groundwork for further exploration of process supervision in structured domains (Zhang et al., 17 Dec 2025).