DreamPRM-Code: Modular Code Generation
- DreamPRM-Code is a coding-focused Process Reward Model that leverages function-level (Chain-of-Function) decomposition to structure LLM-generated programs.
- It employs test-time scaling by evaluating multiple candidate solutions with stepwise rewards, achieving 80.9% pass@1 on LiveCodeBench.
- The model uses bi-level meta-learning to iteratively correct noisy labels, enhancing reward model performance for reliable code synthesis.
DreamPRM-Code is a coding-focused Process Reward Model (PRM) that advances code generation for LLMs by combining a function-level step decomposition (Chain-of-Function prompting) with a meta-learning-driven label correction mechanism for robust stepwise reward modeling. Designed for test-time scaling, DreamPRM-Code enables fine-grained evaluation and selection among LLM-generated program candidates, showing state-of-the-art performance on LiveCodeBench (Zhang et al., 17 Dec 2025).
1. Chain-of-Function Prompting for Code Decomposition
Existing PRM techniques in mathematical reasoning decompose solutions into textual steps (Chain-of-Thought); code lacks such natural step granularity. DreamPRM-Code addresses this by introducing Chain-of-Function (CoF) prompting, which forces modular, top-down program construction: the LLM is prompted to write a high-level main() function (with a docstring outlining the algorithm), followed by top-level helper functions (build_graph(), dijkstra(), etc.), each with a descriptive docstring. Nested function definitions are explicitly discouraged.
Upon generation, the model output is parsed into a sequence of code blocks $(s_1, \ldots, s_K)$, where $s_1$ is the main function and the subsequent $s_2, \ldots, s_K$ are helpers. Each partial program $s_{1:k}$ becomes an intermediate "state" for reward assignment, rendering every function a distinct reasoning step for the PRM.
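This parsing step can be sketched with Python's `ast` module; the CoF-style example program and its helper names (`build_graph`, `dijkstra`) are illustrative, not taken from the paper:

```python
import ast
import textwrap

# A toy CoF-style generation: main() first, then flat top-level helpers.
COF_OUTPUT = textwrap.dedent('''
    def main(edges, n, src):
        """Build the graph, then run Dijkstra from src."""
        graph = build_graph(edges, n)
        return dijkstra(graph, src)

    def build_graph(edges, n):
        """Adjacency list from (u, v, w) triples."""
        graph = [[] for _ in range(n)]
        for u, v, w in edges:
            graph[u].append((v, w))
        return graph

    def dijkstra(graph, src):
        """Shortest-path distances from src (body elided in this sketch)."""
        ...
''')

def split_into_steps(source: str) -> list[str]:
    """Parse a CoF generation into one code block per top-level function."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, ast.FunctionDef)]

steps = split_into_steps(COF_OUTPUT)
# steps[0] is main(); the rest are helpers, one PRM "state" boundary each.
```

Because CoF discourages nested definitions, iterating over the module body (rather than walking the full tree) is enough to recover one block per function.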
2. Process Reward Model Formalization and Test-Time Scaling
DreamPRM-Code trains a stepwise reward model $r_\theta$ that assigns a reward $r_\theta(s_{1:k})$ to each intermediate code state in a trajectory $\tau = (s_1, \ldots, s_K)$. The aggregate trajectory reward is the average of the stepwise rewards,

$$R(\tau) = \frac{1}{K} \sum_{k=1}^{K} r_\theta(s_{1:k}).$$

At test time, the base LLM produces $N$ candidate solution trajectories $\tau_1, \ldots, \tau_N$ via Chain-of-Function prompting. The model computes $R(\tau_i)$ for each and selects

$$\tau^* = \arg\max_{1 \le i \le N} R(\tau_i)$$

as the final chosen solution. This "test-time scaling" leverages process-level credit assignment to select superior code among LLM outputs.
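The selection rule is best-of-N with an averaged stepwise score. A minimal sketch, with an invented stand-in scorer in place of the trained PRM:

```python
def trajectory_reward(step_rewards):
    """Aggregate stepwise PRM rewards by averaging over the K steps."""
    return sum(step_rewards) / len(step_rewards)

def select_best(candidates, prm_score):
    """Best-of-N: score each candidate's steps, pick the argmax trajectory."""
    return max(candidates,
               key=lambda steps: trajectory_reward([prm_score(s) for s in steps]))

def toy_prm(step):
    # Stand-in for the learned PRM: reward a step for carrying a docstring.
    return 1.0 if '"""' in step else 0.0

cand_a = ['def main():\n    """plan"""', 'def helper():\n    pass']         # avg 0.5
cand_b = ['def main():\n    """plan"""', 'def helper():\n    """sort"""']   # avg 1.0
best = select_best([cand_a, cand_b], toy_prm)
```

Swapping `toy_prm` for a real model call leaves the selection logic unchanged, which is what makes the scheme a pure test-time addition on top of any base LLM.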
3. Meta-Learning-Based Label Correction via Bi-Level Optimization
Reward model training suffers from label noise due to Monte Carlo (MC) sampling: intermediate labels are binary values indicating whether a partial program state eventually leads to a passing final solution, and they are unreliable because the rollout trajectories are stochastic. DreamPRM-Code introduces a meta-learning-based label correction scheme that treats the noisy labels as tunable parameters. The training procedure collects two datasets:
- $\mathcal{D}_{\text{train}}$: noisy intermediate states and their MC-derived labels $\tilde{y}$
- $\mathcal{D}_{\text{meta}}$: full-program states with clean unit-test-derived labels $y$
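To make the noise source concrete, here is a toy sketch of MC labeling: a partial state's label is 1 if any of a few sampled rollouts completes into a program that passes the tests. The rollout simulator and its pass probabilities are invented for illustration:

```python
import random

def mc_label(pass_prob, n_rollouts=8, rng=None):
    """Monte Carlo step label: 1 if any sampled rollout passes the tests.

    pass_prob stands in for rolling out an LLM completion of the partial
    program and running the unit tests on the result.
    """
    rng = rng or random.Random(0)
    return int(any(rng.random() < pass_prob for _ in range(n_rollouts)))

# A promising partial state can still be labeled 0 (and a weak one can get
# lucky): with few rollouts the binary label is a noisy estimate of whether
# the state can lead to a correct program.
labels_good = [mc_label(0.40, rng=random.Random(i)) for i in range(100)]
labels_weak = [mc_label(0.05, rng=random.Random(i)) for i in range(100)]
```

This per-state flakiness is exactly the noise that the bi-level correction scheme below is designed to absorb.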
The model solves a bi-level optimization problem:
- Lower-level: learn PRM parameters $\theta$ to minimize the binary cross-entropy (BCE) loss over the noisy intermediate labels:

$$\theta^*(\tilde{y}) = \arg\min_{\theta} \sum_{(s, \tilde{y}_k) \in \mathcal{D}_{\text{train}}} \mathrm{BCE}\big(r_\theta(s), \tilde{y}_k\big)$$

- Upper-level: adjust the intermediate labels $\tilde{y}$ so the resulting PRM yields optimal performance on the clean meta dataset:

$$\min_{\tilde{y}} \sum_{(s, y) \in \mathcal{D}_{\text{meta}}} \mathrm{BCE}\big(r_{\theta^*(\tilde{y})}(s), y\big)$$
Here, the noisy labels $\tilde{y}$ are iteratively updated to minimize the outer loss, backpropagating through the implicit dependence of $\theta^*$ on $\tilde{y}$. In practice, one-step unrolled alternating optimization is used per batch, jointly improving label quality and PRM generalization.
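The one-step unrolled scheme can be sketched on a toy stand-in: a logistic model replaces the PRM so that the meta-gradient through the unroll has a closed form. All data, sizes, and step sizes here are synthetic illustrations, not the paper's setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)

# Stand-in for D_train: states with 30%-flipped labels, mimicking MC noise.
X_train = rng.normal(size=(64, d))
y_clean = (X_train @ w_true > 0).astype(float)
flip = rng.random(64) < 0.3
y_noisy = np.where(flip, 1.0 - y_clean, y_clean)

# Stand-in for D_meta: states with clean unit-test-derived labels.
X_meta = rng.normal(size=(32, d))
y_meta = (X_meta @ w_true > 0).astype(float)

w = np.zeros(d)                  # "PRM" parameters (logistic model)
labels = y_noisy.copy()          # noisy labels, treated as tunable
alpha, meta_lr = 0.5, 1.0        # inner and outer step sizes

for _ in range(200):
    # Lower level: one gradient step on the current (corrected) labels.
    p = sigmoid(X_train @ w)
    grad_w = X_train.T @ (p - labels) / len(labels)
    w_new = w - alpha * grad_w

    # Upper level: meta-gradient of the clean loss w.r.t. the labels,
    # through the one-step unroll (d w_new / d labels_i = alpha * x_i / n).
    p_meta = sigmoid(X_meta @ w_new)
    grad_theta = X_meta.T @ (p_meta - y_meta) / len(y_meta)
    grad_labels = alpha * (X_train @ grad_theta) / len(labels)

    labels = np.clip(labels - meta_lr * grad_labels, 0.0, 1.0)
    w = w_new

meta_loss = bce(sigmoid(X_meta @ w), y_meta)
```

The key design choice mirrored here is that the labels themselves receive gradient updates (clipped to $[0,1]$), while the model only ever trains against whatever the current labels say.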
4. Training and Inference Pipeline
Training Procedure:
- Chain-of-Function trajectories and MC labels are generated from Qwen3-Coder-30B-A3B.
- The PRM is trained as a binary classifier (replacing the LM head of a Qwen-2.5-Coder-3B backbone) over stepwise states, optimized with Adam (learning rate: 1e-4, weight decay: 1e-2, batch size: 8) and meta-learning rate 1e-2.
- The meta label correction loop runs per batch, alternating between PRM updates and label refinement.
Inference (Test-Time Scaling):
- For each coding problem, O4-mini-high is prompted with CoF to produce $N$ candidate code solutions.
- Each sample is decomposed into function-level steps; stepwise rewards are computed and averaged to obtain $R(\tau_i)$.
- The solution with maximal $R(\tau_i)$ is selected.
5. Empirical Evaluation on LiveCodeBench
DreamPRM-Code was evaluated on LiveCodeBench under a strict temporal split:
- Training set: 601 problems published before 2024-08-01.
- Test set: 131 problems published after 2025-02-01.
Baseline comparisons included unguided LLMs (O4-mini-high, Gemini-2.5, DeepSeek-R1, O3), test-time scaling with Outcome Reward Models (ORMs), and an ablation "PRM-CoF" omitting label correction. Results are summarized below:
| Method | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| Gemini-2.5 | 82.1 | 52.5 | – | 72.5 |
| O3 | 71.8 | 57.4 | – | 71.8 |
| DeepSeek-R1 | 99.7 | 77.7 | 47.2 | 68.7 |
| O4-mini-high | 89.7 | 57.4 | – | 77.1 |
| ORM (O4-mini-high) | 89.7 | 62.3 | – | 79.4 |
| PRM-CoF | 92.3 | 62.3 | – | 80.2 |
| DreamPRM-Code | 92.3 | 63.9 | – | 80.9 |
DreamPRM-Code achieves 80.9% pass@1 on the test set, outperforming O4-mini-high by 3.8 percentage points and all ablations, demonstrating the benefits of function-level decomposition and meta label correction for reliable end-to-end code generation.
6. Algorithmic Workflow
The combined PRM and meta-correction training is as follows:
```python
for epoch in range(num_epochs):
    for batch in training_data:
        # Generate CoF trajectories and MC labels
        S = [s1, ..., sK]                # Chain-of-Function steps
        Y_noisy = [y1, ..., yK]          # MC step labels
        y_meta = final_unit_test_label   # for the full program

        # Form batches
        X = union_of_steps; Y = union_of_labels
        X_meta, Y_meta = CoF_final_states_and_test_results

        # Meta label correction loop:
        # 1. PRM parameter update (gradient step on noisy labels)
        # 2. Label update (one-step unroll via meta-gradient)
```
7. Context within PRMs and Future Directions
DreamPRM-Code extends the general PRM paradigm, moving beyond previously studied domains (e.g., mathematical reasoning, multimodal reasoning (Cao et al., 26 May 2025)) by introducing a modular approach aligned with software engineering practices (function decomposition) and rigorous handling of label noise through bi-level meta-learning. The clear separation of code into interpretable intermediate states, combined with systematic label denoising tethered to unit tests, positions DreamPRM-Code as a template for robust, test-time scalable reward modeling in program synthesis. This approach suggests a broader utility of function-level reasoning and meta-labeling strategies for other domains lacking well-defined intermediate annotation schemes.
The methodology demonstrates that step-level granularity and noise correction are synergistic in maximizing LLM code generation quality, supporting adoption in future PRM frameworks and laying groundwork for further exploration of process supervision in structured domains (Zhang et al., 17 Dec 2025).