DreamPRM-Code: Modular Code Generation

Updated 24 December 2025
  • DreamPRM-Code is a coding-focused Process Reward Model that leverages function-level (Chain-of-Function) decomposition to structure LLM-generated programs.
  • It employs test-time scaling by evaluating multiple candidate solutions with stepwise rewards, achieving 80.9% pass@1 on LiveCodeBench.
  • The model uses bi-level meta-learning to iteratively correct noisy labels, enhancing reward model performance for reliable code synthesis.

DreamPRM-Code is a coding-focused Process Reward Model (PRM) that advances code generation for LLMs by combining a function-level step decomposition (Chain-of-Function prompting) with a meta-learning-driven label correction mechanism for robust stepwise reward modeling. Designed for test-time scaling, DreamPRM-Code enables fine-grained evaluation and selection among LLM-generated program candidates, showing state-of-the-art performance on LiveCodeBench (Zhang et al., 17 Dec 2025).

1. Chain-of-Function Prompting for Code Decomposition

Existing PRM techniques in mathematical reasoning decompose solutions into textual steps (Chain-of-Thought); code lacks such natural step granularity. DreamPRM-Code addresses this by introducing Chain-of-Function (CoF) prompting, a strategy that forces modular, top-down program construction: the LLM is prompted to write a high-level main() function (with a docstring outlining the algorithm), followed by top-level helper functions (build_graph(), dijkstra(), etc.), each with a descriptive docstring. Nested function definitions are explicitly discouraged.

Upon generation, the model output is parsed into a sequence of code blocks $S = [s_1, s_2, \dots, s_K]$, where $s_1$ is the main function and subsequent $s_k$ are helpers. Each $s_t$ becomes an intermediate "state" for reward assignment, rendering every function a distinct reasoning step for the PRM.
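As an illustration, this function-level parsing can be sketched with Python's standard ast module. The helper split_into_functions and the toy program below are hypothetical stand-ins, not the paper's actual parser; they assume the LLM followed the CoF prompt (top-level defs only, main() first):

```python
import ast
import textwrap

def split_into_functions(source: str) -> list[str]:
    """Split a CoF-style program into top-level function blocks (states s_1..s_K).

    Hypothetical helper: assumes the output obeys the CoF prompt, i.e. only
    top-level function definitions, with main() appearing first.
    """
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, ast.FunctionDef)]

program = textwrap.dedent('''
    def main():
        """Read the graph, run dijkstra, print the distance."""
        g = build_graph()
        print(dijkstra(g, 0))

    def build_graph():
        """Build an adjacency list (stub)."""
        return {0: []}

    def dijkstra(g, src):
        """Shortest path from src (stub)."""
        return 0
''')

steps = split_into_functions(program)   # S = [s_1, s_2, s_3]
print(len(steps))                       # 3 states: main, build_graph, dijkstra
```

Each returned block is then scored independently by the PRM, so the decomposition directly defines the reward model's step granularity.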

2. Process Reward Model Formalization and Test-Time Scaling

DreamPRM-Code trains a stepwise reward model $f_\theta(s_t) \in \mathbb{R}$ that assigns a reward to each intermediate code state $s_t$ in a trajectory. The aggregate trajectory reward is

$$R(S;\theta) = \frac{1}{K} \sum_{t=1}^{K} f_\theta(s_t).$$

At test time, the base LLM $\pi$ produces $N$ candidate solution trajectories $\{S^{(i)}\}_{i=1}^N$ via Chain-of-Function prompting. The model computes $R(S^{(i)}; \theta)$ for each and selects

$$i^* = \arg\max_i R(S^{(i)}; \theta)$$

as the final chosen solution. This "test-time scaling" leverages process-level credit assignment to select superior code among LLM outputs.
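A minimal sketch of this best-of-N selection follows; the placeholder reward values stand in for $f_\theta$ outputs, and trajectory_reward/select_best are illustrative names, not the paper's API:

```python
def trajectory_reward(step_rewards):
    """R(S; theta): mean of the per-step rewards f_theta(s_t)."""
    return sum(step_rewards) / len(step_rewards)

def select_best(candidates):
    """Pick i* = argmax_i R(S^(i); theta) over the N candidate trajectories.

    `candidates` maps a candidate index to its list of stepwise rewards;
    the values below are placeholders, not real PRM outputs.
    """
    return max(candidates, key=lambda i: trajectory_reward(candidates[i]))

# N = 3 hypothetical candidates with K_i = 3, 2, 4 steps respectively
cands = {0: [0.9, 0.2, 0.4], 1: [0.8, 0.7], 2: [0.6, 0.6, 0.6, 0.6]}
best = select_best(cands)
print(best)  # candidate 1 wins: mean reward 0.75 beats 0.5 and 0.6
```

Note that averaging (rather than summing) keeps trajectories with different step counts $K_i$ comparable.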

3. Meta-Learning-Based Label Correction via Bi-Level Optimization

Reward model training suffers from label noise due to Monte Carlo (MC) sampling: intermediate labels $Y$ are binary values indicating whether a partial program state $s_t$ eventually leads to a passing final solution, but they are unreliable due to stochastic rollout trajectories. DreamPRM-Code introduces a meta-learning-based label correction scheme that treats the noisy labels $Y$ as tunable parameters. The training procedure collects two datasets:

  • $(X, Y)$: Noisy intermediate states and their MC-derived labels
  • $(X_{\text{meta}}, Y_{\text{meta}})$: Full-program states with clean unit-test-derived labels

The model solves a bi-level optimization problem:

  • Lower-level: Learn PRM parameters $\theta$ to minimize the binary cross-entropy loss over noisy intermediate labels:

$$\theta^*(Y) = \arg\min_\theta \frac{1}{|X|} \sum_{(s, y)\in(X, Y)} \ell(f_\theta(s), y)$$

  • Upper-level: Adjust intermediate labels $Y$ so the resulting PRM yields optimal performance on the clean meta dataset:

$$Y^* = \arg\min_Y \frac{1}{|X_{\text{meta}}|} \sum_{(s_{\text{meta}}, y_{\text{meta}})\in(X_{\text{meta}}, Y_{\text{meta}})} \ell(f_{\theta^*(Y)}(s_{\text{meta}}), y_{\text{meta}})$$

Here, the noisy labels $Y$ are iteratively updated to minimize the outer loss, backpropagating through the implicit dependence of $\theta^*(Y)$ on $Y$. In practice, one-step unrolled alternating optimization is used per batch, jointly improving label quality and PRM generalization.
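The one-step unrolled scheme can be illustrated on a toy 1-D logistic reward model. This is a minimal sketch of the bi-level mechanics only, assuming scalar features and hand-derived gradients; all names and hyperparameters are illustrative, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inner_step(theta, x, y, lr=0.1):
    """One lower-level gradient step on BCE over (possibly noisy) labels y.
    With p = sigmoid(theta * x), the BCE gradient is mean((p - y) * x)."""
    p = sigmoid(theta * x)
    return theta - lr * np.mean((p - y) * x)

def label_meta_grad(theta, x, y, x_meta, y_meta, lr=0.1):
    """Gradient of the clean meta loss w.r.t. the noisy labels y, backpropagated
    through one unrolled inner step theta'(y), where d theta'/d y_i = lr * x_i / n."""
    theta_prime = inner_step(theta, x, y, lr)
    p_meta = sigmoid(theta_prime * x_meta)
    g_theta = np.mean((p_meta - y_meta) * x_meta)   # dL_meta / d theta'
    return g_theta * (lr * x / x.size)              # chain rule: dL_meta / d y_i

# Toy data: scalar "state features" x with noisy MC labels, plus a clean meta set.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = (x > 0).astype(float)
y[0] = 1.0 - y[0]                                   # flip one label: simulated MC noise
x_meta = rng.normal(size=8)
y_meta = (x_meta > 0).astype(float)

theta, meta_lr = 0.0, 5.0
for _ in range(50):                                 # alternating one-step-unrolled updates
    y = np.clip(y - meta_lr * label_meta_grad(theta, x, y, x_meta, y_meta), 0.0, 1.0)
    theta = inner_step(theta, x, y)
# y[0] drifts back up toward its clean value of 1
```

Because the unrolled inner step is differentiable in $Y$, the label gradient is available in closed form for this toy model; at scale the same quantity would be obtained by automatic differentiation through the unrolled update.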

4. Training and Inference Pipeline

Training Procedure:

  • Chain-of-Function trajectories and MC labels are generated from Qwen3-Coder-30B-A3B.
  • The PRM $f_\theta$ is trained as a binary classifier (replacing the LM head of a Qwen-2.5-Coder-3B backbone) over stepwise states, optimized with Adam (learning rate 1e-4, weight decay 1e-2, batch size 8) and a meta learning rate of 1e-2.
  • The meta label correction loop runs per batch, alternating between PRM updates and label refinement.

Inference (Test-Time Scaling):

  • For each coding problem, O4-mini-high is prompted with CoF to produce $N=4$ candidate code solutions.
  • Each sample is decomposed into $K_i$ steps. Stepwise rewards $f_\theta(s_t)$ are computed and averaged to obtain $R(S; \theta)$.
  • The solution with maximal $R$ is selected.

5. Empirical Evaluation on LiveCodeBench

DreamPRM-Code was evaluated on LiveCodeBench under a strict temporal split:

  • Training set: 601 problems published before 2024-08-01.
  • Test set: 131 problems published after 2025-02-01.

Baseline comparisons included unguided LLMs (O4-mini-high, Gemini-2.5, DeepSeek-R1, O3), test-time scaling with Outcome Reward Models (ORMs), and an ablation "PRM-CoF" omitting label correction. Results are summarized below:

| Method | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| Gemini-2.5 | 82.1 | 52.5 | – | 72.5 |
| O3 | 71.8 | 57.4 | – | 71.8 |
| DeepSeek-R1 | 99.7 | 77.7 | 47.2 | 68.7 |
| O4-mini-high | 89.7 | 57.4 | – | 77.1 |
| ORM (O4-mini-high) | 89.7 | 62.3 | – | 79.4 |
| PRM-CoF | 92.3 | 62.3 | – | 80.2 |
| DreamPRM-Code | 92.3 | 63.9 | – | 80.9 |

DreamPRM-Code achieves 80.9% pass@1 on the test set, outperforming O4-mini-high by 3.8 percentage points and all ablations, demonstrating the benefits of function-level decomposition and meta label correction for reliable end-to-end code generation.

6. Algorithmic Workflow

The combined PRM and meta-correction training is as follows:

for epoch in range(num_epochs):
    for batch in training_data:
        # Generate CoF trajectories and Monte Carlo step labels
        S = sample_cof_trajectory(batch)         # Chain-of-Function steps [s_1, ..., s_K]
        Y_noisy = mc_rollout_labels(S)           # noisy MC step labels
        y_meta = run_unit_tests(S)               # clean label for the full program

        # Form the noisy batch and the clean meta batch
        X, Y = collect_steps(S), Y_noisy
        X_meta, Y_meta = final_state(S), y_meta

        # Meta label correction (one-step unrolled, per batch):
        theta = prm_gradient_step(theta, X, Y)            # 1. lower-level PRM update
        Y = label_meta_update(Y, theta, X_meta, Y_meta)   # 2. upper-level label update via meta-gradient

This process ensures the learned $f_\theta$ is robust to noisy intermediate supervision through explicit meta-level guidance from high-fidelity end-to-end test outcomes.

7. Context within PRMs and Future Directions

DreamPRM-Code extends the general PRM paradigm, moving beyond previously studied domains (e.g., mathematical reasoning, multimodal reasoning (Cao et al., 26 May 2025)) by introducing a modular approach aligned with software engineering practices (function decomposition) and rigorous handling of label noise through bi-level meta-learning. The clear separation of code into interpretable intermediate states, combined with systematic label denoising tethered to unit tests, positions DreamPRM-Code as a template for robust, test-time scalable reward modeling in program synthesis. This approach suggests a broader utility of function-level reasoning and meta-labeling strategies for other domains lacking well-defined intermediate annotation schemes.

The methodology demonstrates that step-level granularity and noise correction are synergistic in maximizing LLM code generation quality, supporting adoption in future PRM frameworks and laying groundwork for further exploration of process supervision in structured domains (Zhang et al., 17 Dec 2025).
