Dense Policy: Fine-Grained Reward Paradigms

Updated 7 June 2026

Dense Policy is a paradigm that replaces sparse, trajectory-level rewards with fine-grained, per-step or per-token signals for precise credit assignment.
It has demonstrated significant performance gains: VeRPO, for example, raises code generation pass@1 from 47.87% to 62.12%.
The approach supports rapid inference and robust training across domains like reinforcement learning, large language model fine-tuning, and robotic manipulation.

A dense policy is a policy optimization paradigm, originally formulated in both reinforcement learning and imitation learning contexts, that uses rich per-step or per-token reward or credit signals to enable fine-grained, data-efficient learning, in contrast to traditional sparse or outcome-based reward structures. Dense policy frameworks have been formalized and adopted across domains including code generation, LLM post-training, and robotic manipulation. Their core advantages are granular feedback during training, rapid and robust policy improvement, and better sample efficiency with less degeneracy in the learning signal.

1. Core Concepts and Variants of Dense Policy

The central notion is to replace coarse-grained, trajectory-level (“sparse”) rewards with finely resolved, often per-action (“dense”) reward structures. This improves credit assignment and mitigates problems such as vanishing gradients, high-variance updates, and the inability to distinguish early pivotal states or actions.

Distinct instantiations of dense policy include:

Per-token or per-step dense rewards: Each micro-action, token, or timestep is assigned an explicit or implicit reward, enabling gradient signals to propagate at fine temporal resolution within episodes, as in VeRPO (Wang et al., 7 Jan 2026), FIPO (Ma et al., 20 Mar 2026), and Dense-Path REINFORCE (Li et al., 2 Oct 2025).
Multi-stage, coarse-to-fine autoregressive sequence generation: Dense Policy in robot manipulation implements sequence “densification,” interpolating keyframes in a logarithmic-time, bidirectionally autoregressive process for highly efficient action generation (Su et al., 17 Mar 2025, Su et al., 19 Sep 2025).
Dense surrogate reward construction via distillation or imitation: By treating expert or teacher models as reward-shaped agents, on-policy distillation transfers dense, per-token surrogate rewards to students (“reward-density principle”) (Xu et al., 12 May 2026).

The dense policy paradigm is not limited to a single domain or model type, but rather generalizes as a credit assignment and sequence modeling principle.

2. Mathematical Formulation and Training Objectives

RL and Supervised Learning Formulations

In dense policy reinforcement learning, the reward for a trajectory $\tau$ is decomposed as a sum over per-step or per-token components:

$R_{\text{dense}}(\tau) = \sum_{t=1}^T r_t(s_t, a_t)$

where $r_t$ may be derived from environment signals, teacher policy divergence, test execution outcomes, or log-probability advances.
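To make the decomposition concrete, the sketch below contrasts a sparse terminal reward with a dense per-step signal that has the same total, and computes the return-to-go seen at each step. The helper name `returns_to_go`, the optional discount `gamma`, and the toy reward values are illustrative assumptions and are not taken from any of the cited frameworks.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every step t."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Toy trajectories (assumed values): both sum to 1.0 over T = 4 steps.
sparse = [0.0, 0.0, 0.0, 1.0]   # reward only at the end
dense = [0.2, 0.0, 0.3, 0.5]    # reward spread over the steps

dense_total = sum(dense)         # R_dense(tau) as the plain sum of r_t
sparse_rtg = returns_to_go(sparse)
dense_rtg = returns_to_go(dense)
```

With the sparse signal every step sees the same return, so the steps cannot be told apart; with the dense signal the return-to-go differs across steps, which is what allows step-level credit assignment.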

VeRPO—Verifiable Dense Reward Construction

For code generation with execution-feedback, VeRPO introduces a dense reward derived from the (weighted) partial success of unit tests:

$R_{\text{dense}}(\tau) = \sum_{j=1}^N w_j \cdot I_j(\tau)$

where $w_j$ reflects inferred difficulty and $I_j(\tau)$ indicates whether test $u_j$ passes. Weights are adaptively estimated as $w_j = \exp(-\alpha \rho_j)$, with empirical pass rate $\rho_j$; kernel density normalization further corrects for test similarity. Rewards are propagated at both per-turn and trajectory scope, with dual-level standardized advantages optimized under a PPO-style surrogate. This structure reduces the degenerate group ratio from 60–70% (GRPO) to <25% (VeRPO), enabling non-zero gradients whenever any test is passed (Wang et al., 7 Jan 2026).
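A minimal sketch of the weighted partial-credit reward described above, assuming the pass rates $\rho_j$ are already estimated. The function names, the value of `alpha`, and the toy pass rates are hypothetical; the kernel density normalization and the dual-level advantage computation are deliberately omitted.

```python
import math
from typing import List


def difficulty_weights(pass_rates: List[float], alpha: float = 1.0) -> List[float]:
    """w_j = exp(-alpha * rho_j): tests that are rarely passed get larger weights."""
    return [math.exp(-alpha * rho) for rho in pass_rates]


def dense_reward(passed: List[bool], weights: List[float]) -> float:
    """R_dense = sum_j w_j * I_j, where I_j is 1 if test j passed and 0 otherwise."""
    return sum(w for ok, w in zip(passed, weights) if ok)


def outcome_reward(passed: List[bool]) -> float:
    """Outcome-only reward: 1 only when every test passes."""
    return 1.0 if all(passed) else 0.0


# Assumed empirical pass rates for three unit tests (easy, medium, hard).
pass_rates = [0.9, 0.5, 0.1]
weights = difficulty_weights(pass_rates, alpha=1.0)

# A rollout that passes the first two tests but fails the hardest one.
passed = [True, True, False]
partial = dense_reward(passed, weights)
outcome = outcome_reward(passed)
```

In this toy case the outcome-only reward is zero, while the weighted reward is positive, so the rollout still contributes a learning signal.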

FIPO—Dense Advantage via Discounted Future-KL

FIPO computes a per-token advantage by modulating the standard outcome-based advantage $\hat{A}_t$ with a future KL-divergence term that is weighted by a discount factor. The objective is a PPO-style clipped ratio surrogate, which re-weights each token based on its downstream influence (Ma et al., 20 Mar 2026).
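The clipped ratio surrogate mentioned above can be sketched as the standard PPO objective averaged over tokens. The per-token advantages are taken as given inputs here; how FIPO derives them from the future KL term is not reproduced. The function name, the clipping range `eps`, and the toy numbers are assumptions for illustration.

```python
import math
from typing import List


def clipped_surrogate(
    logp_new: List[float],
    logp_old: List[float],
    advantages: List[float],
    eps: float = 0.2,
) -> float:
    """Mean over tokens of min(ratio * A_t, clip(ratio, 1 - eps, 1 + eps) * A_t)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # pi_new / pi_old per token
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)       # pessimistic bound
    return total / len(advantages)


# Two tokens with assumed per-token advantages.
logp_old = [-1.0, -1.0]
logp_new = [-1.0 + math.log(2.0), -1.0]   # first token is twice as likely now
advantages = [1.0, -0.5]

objective = clipped_surrogate(logp_new, logp_old, advantages)
unchanged = clipped_surrogate(logp_old, logp_old, advantages)
```

The first token's ratio of 2 is clipped to 1.2, so a token with a large positive advantage cannot pull the policy arbitrarily far in one update.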

Imitation and IRL View: Dense-Path REINFORCE

Supervised fine-tuning (SFT) for LLMs is formally equivalent to inverse Q-learning over demonstration trajectories, making the log-probability a per-token dense reward. Defining a baseline-relative version of this log-probability reward enables unconstrained REINFORCE on the dense, shaped signal, circumventing length biases and sequence sparsity (Li et al., 2 Oct 2025).
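The following sketch illustrates REINFORCE on a per-token log-probability reward. It assumes, purely for illustration, that "baseline-relative" means the difference between token log-probabilities under a fine-tuned model and under a baseline model; the exact expression used by Li et al. may differ. All function names and numbers are hypothetical.

```python
from typing import List


def baseline_relative_rewards(
    logp_model: List[float], logp_baseline: List[float]
) -> List[float]:
    """ASSUMED form: r_t = log p_model(a_t | s_t) - log p_baseline(a_t | s_t)."""
    return [m - b for m, b in zip(logp_model, logp_baseline)]


def returns_to_go(rewards: List[float]) -> List[float]:
    """Undiscounted return-to-go G_t = sum_{k >= t} r_k."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


def reinforce_surrogate(logp_policy: List[float], rewards: List[float]) -> float:
    """sum_t G_t * log pi(a_t | s_t); its gradient is the REINFORCE estimate."""
    return sum(g * lp for g, lp in zip(returns_to_go(rewards), logp_policy))


# Toy per-token log-probabilities for a three-token response.
logp_model = [-0.5, -2.0, -0.2]
logp_baseline = [-1.0, -1.5, -0.7]

rewards = baseline_relative_rewards(logp_model, logp_baseline)
returns = returns_to_go(rewards)
surrogate = reinforce_surrogate(logp_model, rewards)
```

Because each token carries its own reward, the weight on a token's log-probability depends on what follows it rather than on a single sequence-level score.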

Dense Policy for Sequence Generation

In robotic action generation, Dense Policy constructs a full trajectory of $T$ actions in $O(\log T)$ network passes. At each level, it produces action tokens for every position in that level's sequence and iteratively “densifies” the sequence via bidirectional refinement. The loss is typically a regression loss over predicted vs. ground-truth actions across levels (Su et al., 17 Mar 2025, Su et al., 19 Sep 2025).
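The logarithmic pass count can be illustrated with a toy densification loop. This is a sketch under strong assumptions: the midpoint interpolation below merely stands in for the learned bidirectional network, actions are scalars, and the starting keyframes and horizon are arbitrary. None of the names come from the cited papers.

```python
from typing import List, Tuple


def densify_once(actions: List[float]) -> List[float]:
    """Insert one new action between every adjacent pair.

    The midpoint is a placeholder for what a learned model would predict
    from both neighbours (the bidirectional context).
    """
    out: List[float] = []
    for left, right in zip(actions, actions[1:]):
        out.append(left)
        out.append(0.5 * (left + right))
    out.append(actions[-1])
    return out


def densify(keyframes: List[float], horizon: int) -> Tuple[List[float], int]:
    """Repeat densification until the sequence has at least `horizon` actions."""
    seq, passes = list(keyframes), 0
    while len(seq) < horizon:
        seq = densify_once(seq)
        passes += 1
    return seq, passes


# Two sparse keyframes expanded to a 17-step trajectory.
trajectory, passes = densify([0.0, 1.0], horizon=17)
```

The sequence grows as 2, 3, 5, 9, 17, so four passes fill a 17-step trajectory, whereas a next-token generator would need one pass per missing action.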

3. Algorithmic Structures and Architectural Variants

Dense Policy implementations vary with domain, but share key algorithmic innovations:

Bidirectional, Coarse-to-Fine Generation: Action or state sequences are initialized with skeleton keyframes, with missing elements recursively interpolated and refined. This enables both global receptive-field dependencies and efficient scaling; for a horizon of $T$ actions, inference time scales as $O(\log T)$ versus $O(T)$ for next-token models (Su et al., 17 Mar 2025). DSPv2 extends this paradigm to whole-body mobile manipulation by fusing 3D (sparse-conv) and multi-view 2D (DINOv2+LoRA) scene features, yielding tightly coordinated multi-DOF plans (Su et al., 19 Sep 2025).
Dense Reward Signal Propagation: In VeRPO, dynamic difficulty weighting and kernel-density correction generate robust gradients, while trajectory-level “global anchors” tie the learning signal to end-to-end correctness, preventing reward gaming (Wang et al., 7 Jan 2026).
Surrogate Reward Distillation: On-policy distillation applies local trust-region updates to the student policy under dense per-token teacher rewards, provided that the teacher is reward-shaped and the student-teacher KL is small (conditions C1, C2) (Xu et al., 12 May 2026).

4. Empirical Results and Comparative Performance

Empirical studies establish the practical impact of dense policy frameworks in several domains.

Code Generation and Reasoning

VeRPO improves pass@1 from 47.87% (GRPO outcome-only) to 62.12% on Qwen3-8B (multi-turn), with negligible computation overhead and degenerate-group ratio <25% (Wang et al., 7 Jan 2026).
FIPO breaks through length plateaus in LLM reasoning, extending average solution length from ~4,000 to >10,000 tokens and raising AIME 2024 pass@1 to a peak of 58.0%, outperforming other large-scale baselines (Ma et al., 20 Mar 2026).
The four-stage pipeline built on the reward-density principle achieves MATH-500 scores of up to 79.3%, versus 75.9% for direct GRPO, in LLM post-training; ablations confirm that every stage (teacher RL, forward-KL, OPD) is load-bearing (Xu et al., 12 May 2026).
Dense-Path REINFORCE delivers +2–8 percentage point win-rate improvements and consistent MT-Bench gains (e.g., 5.74→6.01 on LLaMA-3.1-8B), surpassing SFT and prior demo-only baselines (Li et al., 2 Oct 2025).

Robotic Manipulation

Dense Policy in manipulation (3D): Success rates reach 72% on Adroit-Door, 85% on DexArt-Laptop, and 85% on real-world PutBread, substantially exceeding diffusion and next-token baselines (Su et al., 17 Mar 2025).
DSPv2 (whole-body): Achieves 80% pick/60% place on Pick&Place and 100% on Deliver (vs. 45–55% for baselines); shows a drop of ≤10pp in generalization (“distribution shift”) tests, while baselines drop 30–50pp (Su et al., 19 Sep 2025).

5. Theoretical and Practical Significance

Dense policy architectures directly address common limitations in RL, imitation, and LLM fine-tuning:

Gradient degeneracy reduction: By ensuring nontrivial gradient signals even for partially successful outcomes, dense policy updates avoid vanishing-gradient regimes endemic to sparse reward RL (Wang et al., 7 Jan 2026).
Granular, context-sensitive credit assignment: Discounted or shaped per-token rewards, as in FIPO and Dense-Path REINFORCE, enable the optimizer to reward early pivotal decisions and penalize non-contributory actions within long sequences (Ma et al., 20 Mar 2026, Li et al., 2 Oct 2025).
Improved sample efficiency and training stability: Dense rewards smooth out variance across batches, stabilize learning under changing task difficulty, and support robust generalization (Xu et al., 12 May 2026, Su et al., 19 Sep 2025).
Logarithmic-time sequence generation: Bidirectional, coarse-to-fine densification architectures support rapid inference and strong global sequence consistency, highly advantageous for high-DOF or long-horizon action planning (Su et al., 17 Mar 2025, Su et al., 19 Sep 2025).
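The gradient-degeneracy point can be illustrated with group-standardized advantages of the kind used in GRPO-style training. This is a hedged sketch: the standardization below is a common simplified form, and the reward values are invented for illustration.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within a group of rollouts for the same prompt.

    If every rollout has the same reward, the spread is zero and all
    advantages collapse to zero, so the group contributes no gradient.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < eps:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def is_degenerate(rewards: List[float]) -> bool:
    """A group is degenerate when all of its advantages are zero."""
    return all(a == 0.0 for a in group_advantages(rewards))


# Four rollouts that all fail the full test suite (assumed values).
outcome_only = [0.0, 0.0, 0.0, 0.0]      # binary pass/fail reward
partial_credit = [0.0, 0.25, 0.5, 0.25]  # fraction-style dense reward

sparse_degenerate = is_degenerate(outcome_only)
dense_degenerate = is_degenerate(partial_credit)
dense_adv = group_advantages(partial_credit)
```

Under the binary reward the whole group is wasted, while the partial-credit reward ranks the same four rollouts and yields non-zero advantages.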

6. Limitations and Open Directions

Current dense policy advances have acknowledged limitations:

Extension to large-scale, multi-domain vision-language-action frameworks remains an unsolved challenge, with potential stability and scalability issues anticipated for billion-parameter-scale backbones (Su et al., 17 Mar 2025).
The full theoretical properties of bidirectional, coarse-to-fine (discrete or hybrid) sequence modeling for control are open research topics.
In code generation, reward densification must balance partial credit with global anchors to avoid “gaming” partial credit at the expense of end-to-end correctness (Wang et al., 7 Jan 2026).
For distillation-based dense policies, a reward-shaped teacher and a small student-teacher KL are necessary preconditions; violating either leads to loss of dense signal informativeness or high-variance updates (Xu et al., 12 May 2026).

This suggests that densification and dense credit assignment are broadly applicable research directions, but that they require careful domain-specific tuning and architecture adjustments.

7. Comparative Table of Representative Dense Policy Frameworks

Framework	Domain	Dense Criterion	Main Empirical Gain
VeRPO	Code Generation	Weighted test execution	+14.25pp pass@1 over GRPO
FIPO	LLM RL	Discounted future-KL (token)	+6pp AIME24 pass@1, CoT: 4k→10k+
Dense-Path REINFORCE	LLM demonstration	Baseline-relative log-prob	+2–8pp win-rate, +0.2–0.5 MT-Bench
Dense Policy	Robot Manipulation	Coarse-to-fine keyframe refill	72% Adroit-Door (vs. 62% diffusion)
DSPv2	Whole-body Manipulation	3D–2D fused, log-keyframe densification	80%/60% Pick&Place (vs. 50/25%)
OPD Bridge	LLM post-training	Dense teacher KL-surrogate	+3.4pp MATH-500 vs. direct GRPO

Each approach demonstrates that dense reward or sequence-modeling structures, when appropriately tuned and grounded, yield substantial performance, efficiency, and robustness improvements relative to traditional policy learning paradigms.