ProphRL: Post-Training VLA with World Model
- ProphRL is a post-training framework for Vision-Language-Action policies that integrates a learned action-conditioned world model to predict future robot video outcomes.
- It employs the FA-GRPO objective and FlowScale mechanism to aggregate flow-step likelihoods and stabilize gradients, addressing the mismatch between imitation and task success.
- The framework bridges imitation learning and RL, with reported success gains of 5–17% on public benchmarks and 24–30% on real robots, while mitigating sim-to-real and distribution-shift challenges.
ProphRL is a post-training framework for Vision-Language-Action (VLA) policies that combines a learned world model, an RL objective adapted to flow-based action heads, and a gradient-rescaling mechanism for stable optimization. In the formulation introduced by "Reinforcing Action Policies by Prophesying" (Zhang et al., 25 Nov 2025), ProphRL consists of three tightly coupled components: Prophet, a learned action-conditioned world model that predicts future robot videos from current observations and candidate actions; FA-GRPO, an RL objective that aggregates internal flow-step likelihoods at the environment-action level; and FlowScale, a stepwise reweighting method that rescales per-step gradients in the flow head. The framework is designed to address the mismatch between imitation-trained VLAs, which optimize demonstration likelihood, and downstream robot control, which is evaluated by task completion.
1. Conceptual scope and problem setting
ProphRL is situated in the context of VLA policies that map language and visual input to robot actions but are typically trained purely by supervised imitation. The motivating claim is that imitation-only VLAs maximize the likelihood of demonstrated actions rather than task success, creating an objective mismatch for long-horizon manipulation. The paper argues that such policies are brittle under distribution shift, accumulate errors over time, and observe too few failures to learn recovery behavior (Zhang et al., 25 Nov 2025).
The framework is explicitly intended as a practical alternative to both large-scale real-robot online RL and classical hand-engineered simulation. Real-robot RL is presented as expensive because it consumes hardware time, offers limited parallelism, and often needs human monitoring. Conventional simulators are described as difficult to engineer for contact-rich manipulation and prone to sim-to-real gaps, especially for RGB-based VLAs. ProphRL instead relies on a learned visual simulator that allows the policy to practice on imagined rollouts generated from predicted future observations.
This design places ProphRL between imitation-only post-training and direct real-world reinforcement learning. A plausible implication is that the framework is best understood as a world-model-based RL layer for already competent VLA policies rather than as a replacement for supervised pretraining.
2. System architecture and component decomposition
ProphRL is defined by three components: Prophet, FA-GRPO, and FlowScale. Prophet is the learned simulator; FA-GRPO is the RL objective; FlowScale is the optimization stabilizer. The end-to-end loop is: the VLA predicts an action chunk, Prophet predicts the corresponding future video clip, a reward model scores the rollout, and the policy is updated by FA-GRPO with FlowScale reweighting (Zhang et al., 25 Nov 2025).
The paper describes Prophet as an action-to-video model. Operationally, it takes an initial image, an action chunk, and a visual history buffer, then predicts the next video chunk. Those predicted frames are recursively fed back into the policy and world model to continue a closed-loop rollout. This makes the learned model rollout-ready in the sense required for RL post-training.
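The closed-loop rollout described above can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: `imagined_rollout`, `toy_policy`, and `toy_world_model` are hypothetical names, frames are replaced by one-element lists, and the 60-frame history and 20-step chunk sizes are taken from the configuration reported later in the article.

```python
from collections import deque
from typing import Callable, List, Sequence

Frame = List[float]              # toy stand-in for an RGB image
ActionChunk = List[List[float]]  # one low-level action per predicted frame


def imagined_rollout(
    policy: Callable[[Frame], ActionChunk],
    world_model: Callable[[Sequence[Frame], ActionChunk], List[Frame]],
    first_frame: Frame,
    num_chunks: int,
    history_len: int = 60,
) -> List[Frame]:
    """Alternate policy and world-model calls, feeding predictions back in."""
    history = deque([first_frame], maxlen=history_len)  # bounded visual history
    video = [first_frame]
    for _ in range(num_chunks):
        observation = history[-1]              # last predicted frame is the next observation
        actions = policy(observation)          # the policy proposes an action chunk
        new_frames = world_model(list(history), actions)
        history.extend(new_frames)             # oldest frames fall out of the buffer
        video.extend(new_frames)
    return video


def toy_policy(observation: Frame) -> ActionChunk:
    # Hypothetical policy: always emits a 20-step chunk of small positive deltas.
    return [[0.1] for _ in range(20)]


def toy_world_model(history: Sequence[Frame], actions: ActionChunk) -> List[Frame]:
    # Hypothetical dynamics: each action shifts the scalar "frame" by its value.
    frames, last = [], history[-1]
    for action in actions:
        last = [last[0] + action[0]]
        frames.append(last)
    return frames


video = imagined_rollout(toy_policy, toy_world_model, [0.0], num_chunks=4)
```

The bounded `deque` is one simple way to keep the conditioning context at a fixed length while the rollout grows; a real system would store image tensors or latents instead of scalars.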
FA-GRPO and FlowScale address a specific architectural feature of the underlying VLAs: their action heads are flow-based. In such heads, an environment action is internally generated through multiple denoising or flow steps. The paper argues that naïvely applying PPO- or GRPO-style RL to those internal steps creates a mismatch because the robot executes environment actions, not denoising substeps. FA-GRPO therefore aggregates log-probabilities over internal flow steps first and then forms PPO-style ratios at the action level. FlowScale then rescales per-step gradients to compensate for the heteroscedasticity induced by the noise schedule.
This decomposition is central to the identity of ProphRL. It is not merely a world-model RL system, and it is not merely a modified GRPO objective. It is a coordinated framework in which the world model supplies synthetic trajectories and the RL method is specialized to the action-generation mechanics of flow-based VLAs.
3. Prophet world model
Prophet is a history-aware, action-conditioned video world model initialized from Cosmos-Predict2-2B-Video2World. It uses the Wan2.1 video autoencoder, with temporal/spatial compression factor , latent sizes , , , , and model size 2.058B parameters with DiT channel dimension (Zhang et al., 25 Nov 2025).
Its diffusion training objective is given as
$\mathcal{L}_{\text{diff}} = \mathbb{E}_{\mathbf{z}_0,\epsilon\sim\mathcal{N}(0,\mathbf{I}),t} \Big[\left\| \epsilon - \epsilon_\theta\big(\mathbf{z}_t,t,f\big) \right\|_{2}^{2} \Big],$
with
The model predicts future latent video conditioned on actions and observation context.
Action conditioning is dual-pathway. First, the scalar action stream flattens an action chunk and embeds it as
$f_{\text{sa}} = \phi(c_{1:T}) \in \mathbb{R}^{D_m}.$
Second, Prophet constructs action frames by projecting end-effector pose and orientation into image space and rendering them on a black canvas. The resulting action-video latent is
0
and is processed by lightweight 3D convolutions before being added as conditioning. This allows the model to receive both compact numeric action information and geometry-aware image-space action information.
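To make the image-space action pathway more concrete, the sketch below projects a 3D end-effector position into pixel coordinates and draws it on a black canvas. It is only an illustration under stated assumptions: a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, a position already expressed in the camera frame, and a filled dot as the marker. The paper's actual rendering, which also encodes orientation, is not reproduced here.

```python
from typing import List, Sequence, Tuple


def project_point(p_cam: Sequence[float], fx: float, fy: float,
                  cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


def render_action_frame(p_cam: Sequence[float], width: int, height: int,
                        fx: float, fy: float, cx: float, cy: float,
                        radius: int = 2) -> List[List[int]]:
    """Draw the projected end-effector position as a white dot on a black canvas."""
    canvas = [[0] * width for _ in range(height)]
    u, v = project_point(p_cam, fx, fy, cx, cy)
    ui, vi = round(u), round(v)
    for row in range(vi - radius, vi + radius + 1):
        for col in range(ui - radius, ui + radius + 1):
            inside_image = 0 <= row < height and 0 <= col < width
            inside_dot = (row - vi) ** 2 + (col - ui) ** 2 <= radius ** 2
            if inside_image and inside_dot:
                canvas[row][col] = 255
    return canvas


# Hypothetical 64x48 camera; a point on the optical axis lands on the principal point.
frame = render_action_frame((0.0, 0.0, 1.0), width=64, height=48,
                            fx=100.0, fy=100.0, cx=32.0, cy=24.0)
```

Rendering one such frame per action step yields an "action video" that is spatially aligned with the camera view, which is the property the geometry-aware pathway relies on.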
The low-level action tensor is
1
with
2
where 3 is translational delta, 4 is rotational delta in Euler angles, and 5 is gripper open ratio. Pose updates are written as
6
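As a rough illustration of how a delta action of this kind could be applied, the sketch below updates a pose of the form (x, y, z, roll, pitch, yaw) with a 7D action. The additive update, the angle wrapping, and the clamping of the gripper ratio to [0, 1] are assumptions made for the example; `apply_delta_action` and `wrap_angle` are hypothetical helpers, and the paper's exact update rule may differ (for instance, by composing rotations rather than adding Euler angles).

```python
import math
from typing import List, Sequence, Tuple


def wrap_angle(angle: float) -> float:
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def apply_delta_action(pose: Sequence[float],
                       action: Sequence[float]) -> Tuple[List[float], float]:
    """Apply a 7D delta action to a 6D pose.

    pose:   (x, y, z, roll, pitch, yaw)
    action: (dx, dy, dz, droll, dpitch, dyaw, gripper_open_ratio)
    """
    if len(pose) != 6 or len(action) != 7:
        raise ValueError("expected a 6D pose and a 7D action")
    position = [p + d for p, d in zip(pose[:3], action[:3])]
    # Assumption: small Euler deltas are added component-wise, then wrapped.
    angles = [wrap_angle(a + d) for a, d in zip(pose[3:], action[3:6])]
    # Assumption: the gripper open ratio is a fraction in [0, 1].
    gripper = min(max(action[6], 0.0), 1.0)
    return position + angles, gripper


new_pose, gripper = apply_delta_action(
    [0.0, 0.0, 0.0, 0.0, 0.0, 3.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.5, 1.3],
)
```

In the usage line, the yaw of 3.0 rad plus a delta of 0.5 rad exceeds pi and is wrapped to a negative angle, while the out-of-range gripper value is clamped to 1.0.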
Prophet is pretrained on a heterogeneous robot dataset mixture totaling over 31M sampled trajectories, including AgiBot, DROID, LIBERO, and filtered Open-X subsets. Few-shot adaptation uses LoRA rank 16 on 8 H200 GPUs. The model predicts 20 future frames per chunk, conditions on 20-step action chunks, and uses a 60-frame history buffer (Zhang et al., 25 Nov 2025).
The paper reports that Prophet can simulate both successes and plausible failures, including missed grasps, slipping grasps, drift after contact, interactions with irrelevant objects, and contorted arm poses. This suggests that the world model is intended not only as a predictor of nominal behavior but also as a generator of failure cases that are informative for RL.
4. RL objective: FA-GRPO and FlowScale
The policy’s flow-based action head factorizes log-likelihood over internal flow steps: 7 The key methodological point is that the internal 8-steps do not advance environment time. FA-GRPO therefore aggregates them before constructing importance ratios. The paper defines
9
0
and
1
The FA-GRPO objective is written as
2
where 3 is a variable-length mask and 4 is an action-level advantage broadcast across action dimensions.
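The order of operations described here (aggregate flow-step log-probabilities first, form the ratio second) can be illustrated with a small sketch. It is not the paper's objective verbatim: summation as the aggregation rule, the clipping threshold `clip_eps`, the per-action scalar advantage, and the function name `fa_grpo_surrogate` are assumptions for the example.

```python
import math
from typing import Sequence


def fa_grpo_surrogate(
    logp_new: Sequence[Sequence[float]],  # [num_actions][num_flow_steps]
    logp_old: Sequence[Sequence[float]],  # same shape, from the behavior policy
    advantages: Sequence[float],          # one advantage per environment action
    mask: Sequence[int],                  # 1 for valid actions, 0 for padding
    clip_eps: float = 0.2,
) -> float:
    """PPO-style clipped surrogate with one importance ratio per environment action."""
    total, count = 0.0, 0
    for lp_new, lp_old, adv, valid in zip(logp_new, logp_old, advantages, mask):
        if not valid:
            continue  # padded actions in variable-length rollouts are skipped
        # Aggregate over internal flow steps BEFORE forming the ratio.
        log_ratio = sum(lp_new) - sum(lp_old)
        ratio = math.exp(log_ratio)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
        count += 1
    return total / max(count, 1)


# With identical old and new log-probabilities every ratio equals 1,
# so the surrogate reduces to the masked mean of the advantages.
same = [[-1.0, -2.0], [-0.5, -0.5], [-3.0, -1.0]]
on_policy_value = fa_grpo_surrogate(same, same, [1.0, 0.5, 2.0], [1, 1, 0])

# A large ratio combined with a positive advantage is capped at 1 + clip_eps.
clipped_value = fa_grpo_surrogate([[0.0, 0.0]], [[-1.0, -1.0]], [1.0], [1])
```

Forming a separate ratio for every flow step would instead clip each denoising substep independently, which is the mismatch with environment-level actions that the action-level formulation is meant to avoid.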
Rewards are trajectory-level outputs of a frozen reward model: 5 These rewards are group-normalized as
6
with group statistics computed over the sampled rollout set, and advantages are then broadcast by
7
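Group normalization and broadcasting can be sketched as follows. This follows the common GRPO convention of subtracting the group mean and dividing by the group standard deviation; the use of the population standard deviation, the stabilizer `eps`, and the helper names are assumptions rather than details confirmed by the source.

```python
import statistics
from typing import List, Sequence


def group_normalize(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize trajectory-level rewards within one sampled rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


def broadcast_advantage(advantage: float, num_actions: int) -> List[float]:
    """Copy one trajectory-level advantage to every action in that trajectory."""
    return [advantage] * num_actions


# Four imagined rollouts of the same task, scored with binary rewards.
advantages = group_normalize([1.0, 0.0, 0.0, 1.0])
per_action = broadcast_advantage(advantages[0], num_actions=3)

# A group in which every rollout gets the same reward carries no learning signal.
flat = group_normalize([1.0, 1.0, 1.0])
```

The second case shows why a group in which all rollouts succeed or all fail contributes nothing to the update: every normalized advantage is zero.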
FlowScale modifies this objective by introducing per-flow-step weights: 8 with weights constructed from the noise schedule via
9
0
followed by normalization, mixing with a uniform baseline using 1, and clipping to 2.
The rationale given in the paper is that low-noise refinement steps have much larger score norms and therefore dominate gradients. FlowScale approximately balances
3
across 4, yielding the heuristic target
5
This does not alter the rollout semantics or reward definition; it only rebalances optimization across the internal denoising process.
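The normalize, mix, and clip pipeline can be sketched as below. The raw weight used here (proportional to the noise level of each flow step, on the assumption that per-step gradient magnitude grows roughly like the inverse noise level) is a stand-in and not necessarily the paper's expression; the mixing coefficient `alpha` and the clip bounds `w_min` and `w_max` are hypothetical values.

```python
from typing import List, Sequence


def flowscale_weights(sigmas: Sequence[float], alpha: float = 0.5,
                      w_min: float = 0.5, w_max: float = 2.0) -> List[float]:
    """Per-flow-step weights that down-weight low-noise steps."""
    if not sigmas or any(s <= 0 for s in sigmas):
        raise ValueError("sigmas must be a non-empty sequence of positive values")
    # Assumption: gradient scale ~ 1 / sigma, so a weight ~ sigma balances the steps.
    raw = list(sigmas)
    mean_raw = sum(raw) / len(raw)
    normalized = [r / mean_raw for r in raw]                  # weights average to 1
    mixed = [alpha * w + (1.0 - alpha) for w in normalized]   # blend with uniform weights
    return [min(max(w, w_min), w_max) for w in mixed]         # bound the extremes


# Three flow steps from high noise to low noise.
weights = flowscale_weights([1.0, 0.5, 0.1])

# With alpha = 1 (no uniform mixing) the extreme steps hit the clip bounds.
clipped = flowscale_weights([1.0, 0.001, 0.001, 0.001], alpha=1.0)
```

Mixing with a uniform baseline keeps every step contributing to the update, and clipping prevents any single step from being silenced or from dominating, whatever the shape of the noise schedule.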
5. Training pipeline and reward modeling
The ProphRL training loop begins from an imitation-trained VLA. For each rollout, the policy receives the current observation 6, predicts chunked low-level actions, and Prophet generates the corresponding future video clip. The last predicted frame is then used as the next observation for the policy, producing a closed-loop imagined trajectory. After rollout completion, a frozen reward model assigns a scalar score, which is normalized across the sampled group and used in FA-GRPO optimization (Zhang et al., 25 Nov 2025).
Reward modeling differs across domains. On LIBERO, the paper uses a visually adapted reward model trained by executing the same policy in the simulator and the world model with identical actions, using simulator success as labels and world-model videos as inputs. On BRIDGE and the real robot, it uses Qwen2.5-VL-72B zero-shot as the reward model, sampling 20 frames and evaluating five times with voting to produce a binary success/failure score. For LIBERO, one reward-model variant also predicts an estimated completion step for temporal masking.
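The voting scheme for the zero-shot reward model can be illustrated with a short sketch. The `judge` callable is a hypothetical stand-in for a vision-language model query and does not correspond to any real API; uniform frame sampling and a strict majority threshold are likewise assumptions, since the source only states that 20 frames are sampled and five evaluations are combined by voting.

```python
from typing import Callable, List, Sequence


def sample_frame_indices(num_frames: int, num_samples: int = 20) -> List[int]:
    """Pick evenly spaced frame indices, including the first and last frame."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def voted_reward(frames: Sequence[object],
                 judge: Callable[[Sequence[object]], bool],
                 num_votes: int = 5, num_samples: int = 20) -> float:
    """Query the judge several times on the same clip and return a binary reward."""
    indices = sample_frame_indices(len(frames), num_samples)
    clip = [frames[i] for i in indices]
    successes = sum(1 for _ in range(num_votes) if judge(clip))
    return 1.0 if successes > num_votes / 2 else 0.0


# Hypothetical judge that replays a fixed sequence of noisy verdicts.
verdicts = iter([True, False, True, True, False])
reward = voted_reward(list(range(81)), lambda clip: next(verdicts))
indices = sample_frame_indices(81)
```

Repeating the query and voting is a simple way to reduce the variance of a stochastic judge before its output is used as a binary reward.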
The VLA backbones evaluated are VLA-Adapter-0.5B, Pi0.5-3B, and OpenVLA-OFT-7B. RL uses 7, total batch size 256, mini-batch size 128, 100 RL steps, and 8 H200 GPUs. The VLA policies use single-image input per step and lightweight flow action heads with 7D delta-action outputs. SFT uses batch size 64, learning rate 8, AdamW, weight decay 0.1, and 200k steps; task-specific real-robot SFT uses 200 trajectories per task, batch size 16, and 50k steps per task (Zhang et al., 25 Nov 2025).
A recurring methodological claim is that reward-model recall matters more than precision. The paper reports that high recall is most important for effective learning, whereas moderate false-positive rates are tolerable. This suggests that ProphRL is especially sensitive to missing successful imagined trajectories, because those are the key positive signals available to the policy.
6. Empirical performance, ablations, and limitations
On SimplerEnv-WidowX/BRIDGE, ProphRL improves all three VLA variants. For VLA-Adapter-0.5B, overall average success rises from 23.3 ± 2.2 to 41.0 ± 2.4 with FA-GRPO + FlowScale (+17.7 points). For Pi0.5-3B, it rises from 38.9 ± 2.6 to 51.0 ± 1.2 (+12.1 points). For OpenVLA-OFT-7B, it rises from 25.0 ± 1.8 to 30.9 ± 0.6 (+5.9 points). These absolute improvements in success rate correspond to the paper’s reported 5–17% success gains on public benchmarks (Zhang et al., 25 Nov 2025).
On the real UR30e robot, the gains are larger. VLA-Adapter-0.5B improves from 35.8 ± 3.1 to 60.4 ± 0.7 overall (+24.6 points), Pi0.5-3B from 52.1 ± 3.8 to 82.1 ± 0.7 (+30.0 points), and OpenVLA-OFT-7B from 35.4 ± 0.7 to 62.9 ± 0.7 (+27.5 points). These absolute improvements are the source of the paper’s 24–30% gains on real robots (Zhang et al., 25 Nov 2025).
Ablations indicate that Prophet is stronger than a generic Cosmos baseline as a rollout model. For VLA-Adapter on BRIDGE, baseline SFT yields 23.3 overall success, Cosmos + FA-GRPO + FlowScale reaches 37.1, and Prophet + FA-GRPO + FlowScale reaches 41.0. In few-shot world-model fine-tuning, Cosmos-Few reaches 32.3 while Prophet-Few reaches 36.5. Additional few-shot RL results show 34.7 overall success with 10 images/task and 41.0 with 100 images/task, starting from the same 23.3 baseline.
On LIBERO, ProphRL improves over imitation but remains below RL in a strong simulator. Baseline VLA-Adapter overall success is 79.9 ± 2.2. Simulator RL with FA-GRPO + FlowScale reaches 90.1 ± 3.5, while Prophet-based RL with the same optimizer reaches 84.5 ± 1.1. The paper interprets this as evidence that world-model RL is useful when no good simulator exists, but does not outperform a high-fidelity simulator where one is available.
The paper also evaluates Prophet directly. On pretrained held-out validation, it reports, for example, PSNR 27.05, SSIM 0.8916, tSSIM 0.7666, EPE mean 0.2959, and cosine mean 0.2144 on AgiBot, and PSNR 26.29, SSIM 0.9075, tSSIM 0.8639, EPE mean 0.1660, and cosine mean 0.4164 on LIBERO. For BRIDGE and custom real-robot settings, Prophet is consistently reported as best or near-best on visual fidelity and strongest on action-consistency metrics such as EPE and cosine flow alignment.
Several limitations are explicit. The framework is computationally heavy because RL repeatedly queries a 2B world model. Long-horizon rollouts accumulate geometric and contact drift. Reward-model noise can degrade policy learning. Gains from world-model RL are smaller than gains from simulator RL on LIBERO. FlowScale is motivated by an approximate Gaussian rationale rather than a fully general derivation. The paper therefore presents ProphRL as a practical route to VLA post-training under constrained real-world interaction, not as a final solution to world-model bias or flow-based RL optimization.
A plausible synthesis is that ProphRL should be understood as a systems contribution rather than a single algorithmic novelty. Its significance lies in showing that a few-shot adaptable action-to-video world model, an action-level GRPO variant, and noise-schedule-aware gradient balancing can be combined into a workable RL post-training stack for modern VLAs (Zhang et al., 25 Nov 2025).