MP1 Algorithm: One-Step Generative Policy Learning
- MP1 Algorithm is a generative policy learning method for robotic manipulation that leverages the MeanFlow paradigm to compute one-step trajectories using 3D point cloud and state data.
- It utilizes an interval-averaged velocity formulation solved via a U-Net model along with dispersive regularization to enable efficient few-shot generalization.
- Empirical results demonstrate MP1's superior performance with lower inference latency and higher success rates compared to diffusion-based and traditional flow-based approaches.
MP1 is a generative policy learning algorithm for robotic manipulation that uses the MeanFlow paradigm to generate trajectories in a single network function evaluation (1-NFE), conditioned on 3D point cloud observations and robot state histories. It addresses a trade-off between two families of policies: diffusion-based policies, which sample slowly because they denoise iteratively, and classical flow-based policies, which rely on consistency constraints. MP1 instead learns an interval-averaged velocity, modeled by a U-Net and trained with a dispersive regularizer that improves few-shot generalization (Sheng et al., 14 Jul 2025).
1. Theoretical Foundations: MeanFlow and the MeanFlow Identity
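Traditional Flow Matching learns the instantaneous velocity field $v(z_t, t)$ that transports the probability density $p_t$ under the continuity equation $\partial_t p_t + \nabla \cdot (p_t v) = 0$, so generating a sample requires solving an ODE during inference. MP1 instead models the interval-averaged velocity over $[r, t]$:
$$u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau$$
Rather than estimating $v$ pointwise and integrating, MP1 exploits the MeanFlow Identity:
$$u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)$$
where the total derivative is $\frac{d}{dt} u(z_t, r, t) = v(z_t, t)\, \partial_z u + \partial_t u$. This relates the interval average to the local velocity, enabling direct, closed-form trajectory prediction without iterated integration or structural consistency losses. Because inference involves no ODE solver, it incurs no numerical integration error.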
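The identity follows in a few lines from the definition of the average velocity. The derivation below is a sketch that writes $u(z_t, r, t)$ for the average velocity over $[r, t]$ and $v(z_t, t)$ for the instantaneous velocity, and treats $r$ as fixed; the symbols are chosen here for illustration.

```latex
\begin{aligned}
(t - r)\, u(z_t, r, t) &= \int_r^t v(z_\tau, \tau)\, d\tau
  && \text{(definition of the interval average)} \\
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) &= v(z_t, t)
  && \text{(differentiate both sides with respect to } t\text{)} \\
u(z_t, r, t) &= v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)
  && \text{(rearrange)}
\end{aligned}
```

Setting $r = t$ removes the correction term and recovers $u = v$, the classical flow matching case.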
2. Policy Architecture and One-Step Inference
The MP1 network receives:
- A sequence of raw point-clouds
- A sequence of proprioceptive robot states
Separate encoders extract visual and state features, which are concatenated into the conditioning code $c$. The downstream network is a U-Net that, given a noisy trajectory $z_t$, interval endpoints $(r, t)$, and condition $c$, predicts the interval-averaged velocity $u_\theta(z_t, r, t \mid c)$.
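At inference, with $r = 0$, $t = 1$, and a single Gaussian sample $z_1 = \epsilon \sim \mathcal{N}(0, I)$, the generated trajectory is
$$z_0 = z_1 - u_\theta(z_1, 0, 1 \mid c)$$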
This constitutes "true" 1-NFE (one network function evaluation): a single forward pass suffices, and no numerical integration is required.
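The one-step rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the reported implementation: `oracle_mean_velocity` is a hypothetical stand-in for the trained U-Net, built for a toy setting in which every demonstration equals a single target trajectory, so the exact average velocity is known in closed form. The condition is reduced to that target array for simplicity.

```python
import numpy as np


def oracle_mean_velocity(z_t, r, t, target):
    """Hypothetical stand-in for a trained u_theta(z_t, r, t | c).

    Toy assumption: every demonstration equals `target`, so the straight
    path z_t = (1 - t) * target + t * eps has the constant velocity
    eps - target = (z_t - target) / t, which is also its interval average.
    """
    return (z_t - target) / t


def one_step_sample(u_fn, cond, shape, rng):
    """Generate a trajectory with a single evaluation of u_fn (1-NFE)."""
    z1 = rng.standard_normal(shape)      # z_1 = eps ~ N(0, I)
    r, t = 0.0, 1.0                      # cover the whole interval at once
    return z1 - (t - r) * u_fn(z1, r, t, cond)


rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 8).reshape(4, 2)   # 4 waypoints, 2 action dims
trajectory = one_step_sample(oracle_mean_velocity, target, target.shape, rng)
print(np.allclose(trajectory, target))
```

With a learned network the output would only approximate a demonstration-like trajectory, but the call pattern is the same: one noise draw and one forward pass.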
3. Training Objective: Classifier-Free Guidance and Dispersive Loss
MP1 uses two primary objective terms:
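(a) CFG regression loss:
$$\mathcal{L}_{\text{cfg}}(\theta) = \mathbb{E}\, \big\| u_\theta(z_t, r, t \mid c) - \operatorname{sg}(u_{\text{tgt}}) \big\|_2^2$$
with $z_t = (1 - t)\, z_0 + t\, \epsilon$, where $\operatorname{sg}$ denotes stop-gradient, and
$$u_{\text{tgt}} = \tilde{v}_t - (t - r)\big(\tilde{v}_t\, \partial_z u_\theta + \partial_t u_\theta\big)$$
Classifier-Free Guidance (CFG) is incorporated by randomly dropping the conditioning $c$ with some probability, so that the same network also provides an unconditional prediction. The conditional and unconditional velocities are then mixed with guidance scale $\omega$:
$$\tilde{v}_t = \omega\, v_t + (1 - \omega)\, u_\theta(z_t, t, t \mid \varnothing), \qquad v_t = \epsilon - z_0$$
which is used in $u_{\text{tgt}}$ for regression.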
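The regression target can be illustrated with a small NumPy sketch. Two simplifications are assumptions made here: the conditioning and CFG mixing are omitted, and the total derivative is approximated by a central finite difference rather than the automatic-differentiation Jacobian-vector product one would normally use for a neural network. The function names and the step size `h` are hypothetical.

```python
import numpy as np


def meanflow_target(u_fn, z_t, r, t, v_t, h=1e-3):
    """Regression target u_tgt = v_t - (t - r) * d/dt u(z_t, r, t).

    The total derivative v_t * du/dz + du/dt is the directional derivative
    of u along (dz, dr, dt) = (v_t, 0, 1). It is approximated here by a
    central difference instead of an autodiff Jacobian-vector product.
    """
    forward = u_fn(z_t + h * v_t, r, t + h)
    backward = u_fn(z_t - h * v_t, r, t - h)
    du_dt = (forward - backward) / (2.0 * h)
    return v_t - (t - r) * du_dt


def regression_loss(u_fn, z0, eps, r, t):
    """Mean squared error between the prediction and the fixed target."""
    z_t = (1.0 - t) * z0 + t * eps       # point on the straight noise path
    v_t = eps - z0                       # instantaneous velocity of that path
    u_tgt = meanflow_target(u_fn, z_t, r, t, v_t)   # treated as a constant
    return float(np.mean((u_fn(z_t, r, t) - u_tgt) ** 2))


rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 2))         # toy "demonstration" trajectory
eps = rng.standard_normal((4, 2))        # Gaussian noise sample


def exact_u(z, r, t):
    """Exact average velocity when all data equal z0 (toy assumption)."""
    return (z - z0) / t


print(regression_loss(exact_u, z0, eps, r=0.2, t=0.7))
```

For the exact average velocity the loss is zero up to rounding, which is the fixed point the training objective aims for. In an autodiff framework the target would additionally be wrapped in a stop-gradient.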
(b) Dispersive Loss:
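$$\mathcal{L}_{\text{disp}} = \log \mathbb{E}_{i, j}\!\left[\exp\!\left(-\frac{\| h_i - h_j \|_2^2}{\tau}\right)\right]$$
where $h_i, h_j$ are down-block latent representations for different batch samples and $\tau$ is a temperature. Minimizing this term repels embedding vectors from one another in latent space, which discourages latent collapse and improves generalization in few-shot learning.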
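A minimal NumPy sketch of such a repulsive term follows. It assumes an InfoNCE-style form, the log of the batch mean of $\exp(-\|h_i - h_j\|_2^2 / \tau)$ over all pairs, with squared Euclidean distance and a temperature `tau`; the exact distance, temperature, and layer used in MP1 are not given in the text above, so these choices are illustrative.

```python
import numpy as np


def dispersive_loss(latents, tau=1.0):
    """Log of the batch mean of exp(-||h_i - h_j||^2 / tau) over all pairs.

    Assumed form (InfoNCE-style, squared L2 distance); lower values mean
    the embeddings in the batch are more spread out.
    """
    h = latents.reshape(latents.shape[0], -1)          # flatten to (B, D)
    sq_norms = np.sum(h * h, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * h @ h.T
    sq_dist = np.maximum(sq_dist, 0.0)                 # clip rounding noise
    return float(np.log(np.mean(np.exp(-sq_dist / tau))))


rng = np.random.default_rng(0)
collapsed = np.ones((16, 32))              # every sample shares one embedding
spread = rng.standard_normal((16, 32))     # embeddings far from each other
print(dispersive_loss(collapsed), dispersive_loss(spread))
```

Collapsed embeddings give the maximum value of zero, while well-separated embeddings give a negative value, so minimizing the term pushes batch samples apart.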
The total objective is:
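$$\mathcal{L} = \mathcal{L}_{\text{cfg}} + \lambda\, \mathcal{L}_{\text{disp}}$$
where $\lambda$ weights the dispersive term.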
4. Empirical Performance and Ablation Results
On Adroit and Meta-World benchmarks, MP1 outperforms both DP3 (diffusion-based, 10 NFE) and FlowPolicy (flow-based, 1 NFE) in both average success rate and latency:
| Method | NFE | Avg. Success (%) | Avg. Inference Time (ms) |
|---|---|---|---|
| DP3 | 10 | 68.7 | 132.2 |
| FlowPolicy | 1 | 71.6 | 12.6 |
| MP1 | 1 | 78.9 | 6.8 |
Ablation studies show:
- Removing Dispersive Loss reduces average success by ~5%.
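- Performance is maximized at an intermediate interval ratio (the fraction of training pairs with $r \neq t$) rather than at the extreme where $r = t$ for every pair, which reduces to classical flow matching. This confirms the benefit of interval-averaged flows.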
- MP1 maintains superior performance even in few-shot imitation regimes (2–5 demonstrations), with near-state-of-the-art results by 10 demonstrations, attributable to improved discriminative capacity in the latent space.
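The interval-ratio ablation can be made concrete with a sketch of how training pairs $(r, t)$ might be drawn. It assumes that the interval ratio denotes the fraction of pairs with $r \neq t$ and that endpoints are sampled uniformly on $[0, 1]$; the sampling distribution is not stated above, and the function and parameter names are hypothetical.

```python
import numpy as np


def sample_intervals(n, ratio_r_neq_t, rng):
    """Draw n pairs (r, t) with 0 <= r <= t <= 1.

    A fraction `ratio_r_neq_t` of the pairs keeps r < t (interval-averaged
    targets); the rest collapses to r = t (instantaneous targets, as in
    classical flow matching).
    """
    a, b = rng.random(n), rng.random(n)
    r, t = np.minimum(a, b), np.maximum(a, b)     # order the endpoints
    keep_interval = rng.random(n) < ratio_r_neq_t
    r = np.where(keep_interval, r, t)             # collapse the remainder
    return r, t


rng = np.random.default_rng(0)
r, t = sample_intervals(10_000, ratio_r_neq_t=0.5, rng=rng)
print(float(np.mean(r < t)))   # empirical fraction of pairs with r != t
```

Setting the ratio to 0 makes every pair satisfy $r = t$, which is the flow matching limit the ablation compares against.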
In real-world robotic manipulation tasks on the ARX R5 dual-arm robot, MP1 achieves the highest task success rates and the fastest average completion times compared with both baselines.
5. Comparison with Alternative Generative Policy Methods
MP1 achieves:
- Roughly 19x lower inference latency than the diffusion-based DP3 (6.8 ms vs. 132.2 ms), with a higher average success rate (78.9% vs. 68.7%).
- Nearly 2x faster inference (6.8 ms vs. 12.6 ms) and a 7.3-percentage-point higher average success rate than the flow-based FlowPolicy.
- True one-step policy generation, with no consistency loss or numerical ODE-solver artifacts.
This is enabled by its local learning strategy (via the MeanFlow Identity), one-step trajectory computation (1-NFE), and lightweight, batch-wide dispersive regularization.
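The headline ratios can be re-derived from the benchmark table in Section 4. The snippet below only repeats that arithmetic; the dictionary layout is an illustrative choice.

```python
# Values copied from the benchmark table: (avg. success %, avg. inference ms).
results = {
    "DP3": (68.7, 132.2),
    "FlowPolicy": (71.6, 12.6),
    "MP1": (78.9, 6.8),
}

mp1_success, mp1_ms = results["MP1"]
comparison = {}
for name in ("DP3", "FlowPolicy"):
    success, ms = results[name]
    comparison[name] = {
        "speedup": round(ms / mp1_ms, 2),                 # latency ratio
        "success_gain": round(mp1_success - success, 1),  # percentage points
    }
    print(name, comparison[name])
```

The computed speedups are about 19.4x over DP3 and about 1.85x over FlowPolicy, with success gains of 10.2 and 7.3 percentage points respectively.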
6. Implementation and Evaluation Protocols
The reported implementation uses:
- Batch size 128 with the AdamW optimizer
- Farthest-point sampling down to 512 or 1024 points
- Downsampled input images
- Training for 3000 epochs (Adroit) or 1000 (Meta-World), with evaluation every 200 epochs, and top 5 checkpoint selection per seed.
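Farthest-point sampling, used above to reduce each point cloud to 512 or 1024 points, can be sketched as follows. This is a generic NumPy version of the standard greedy algorithm, not the reported preprocessing code; the function name and the `start` parameter are assumptions.

```python
import numpy as np


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so each new point is farthest from those chosen.

    points: (N, D) array of coordinates; returns an integer array of length k.
    """
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    selected[0] = start
    min_sq_dist = np.full(n, np.inf)     # squared distance to nearest pick
    for i in range(1, k):
        diff = points - points[selected[i - 1]]
        sq_dist = np.einsum("ij,ij->i", diff, diff)
        min_sq_dist = np.minimum(min_sq_dist, sq_dist)
        selected[i] = int(np.argmax(min_sq_dist))
    return selected


rng = np.random.default_rng(0)
cloud = rng.standard_normal((2048, 3))            # synthetic point cloud
subset = cloud[farthest_point_sampling(cloud, 512)]
print(subset.shape)
```

Compared with uniform random subsampling, this keeps sparse regions of the cloud represented, at a cost of O(N k) distance evaluations.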
The benchmark experiments use 10 expert demonstrations per task; the real-world experiments use 20 human demonstrations per task and cover five tasks. MP1 is deployed with a U-Net backbone and 3D-conditioned inputs on NVIDIA RTX 4090 hardware.
7. Context, Limitations, and Prospective Directions
By integrating the MeanFlow Identity into policy learning, MP1 avoids both multi-step iterative sampling and structural consistency constraints. This permits high-frequency control together with robust generalization when only a few demonstrations are available. The main empirical findings show not only improved task performance but also a substantial drop in inference time, which is critical for real-world closed-loop robotic control.
A plausible implication is that interval-averaged flow architectures with dispersive regularization will become the design of choice for settings where latency, sample efficiency, and robust few-shot generalization are all priorities (Sheng et al., 14 Jul 2025). The need for only a single function evaluation at deployment may facilitate embedded and resource-constrained robotic systems. However, the long-term impact and generalization beyond the tested benchmarks will depend on continued evaluation across more heterogeneous tasks and sensor modalities.