MP1 Algorithm: One-Step Generative Policy Learning

Updated 24 June 2026

The MP1 algorithm is a generative policy learning method for robotic manipulation that uses the MeanFlow paradigm to generate action trajectories in a single step from 3D point cloud and robot state data.
It learns an interval-averaged velocity with a U-Net model and adds dispersive regularization to improve few-shot generalization.
Empirical results show lower inference latency and higher success rates than diffusion-based and earlier flow-based approaches.

MP1 is a generative policy learning algorithm for robotic manipulation that leverages the MeanFlow paradigm to deliver one-step (1-NFE) trajectory generation for high-dimensional, context-rich policy inference using 3D point cloud observations and robot state histories. It addresses the trade-off between the slow, iterative sampling of diffusion-based policies and the consistency constraints required by classical flow-based policies. To do so it introduces interval-averaged velocity learning, which is solved efficiently using a U-Net-based model and a dispersive regularizer for few-shot generalization (Sheng et al., 14 Jul 2025).

1. Theoretical Foundations: MeanFlow and the MeanFlow Identity

Traditional Flow Matching learns the instantaneous velocity field $v(z_t, t)$ of the ODE $\frac{dz_t}{dt} = v(z_t, t)$, which must then be integrated numerically during inference. MP1 instead models the interval-averaged velocity:

$u(z_t, r, t) \triangleq \frac{1}{t - r} \int_r^t v(z_\tau, \tau) \, d\tau$

Rather than estimating $v$ pointwise and integrating, MP1 exploits the MeanFlow Identity:

$u(z_t, r, t) = v(z_t, t) - (t - r)\frac{d}{dt}u(z_t, r, t)$

where the total derivative is $\frac{d}{dt}u = v(z_t, t)\,\partial_z u + \partial_t u$. This relates the interval average $u$ to the local velocity, so the network can be trained from local quantities alone and then predict a trajectory directly, without iterated integration or structural consistency losses. The approach eliminates numerical ODE-solver errors at inference.
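The identity follows from differentiating $(t - r)\,u(z_t, r, t) = \int_r^t v(z_\tau, \tau)\,d\tau$ with respect to $t$, which gives $u + (t - r)\frac{d}{dt}u = v(z_t, t)$. As a rough numerical check, the sketch below evaluates both sides on a toy one-dimensional field $v(z, t) = z$, for which the average velocity has a closed form. The field, the function names, and the finite-difference step are assumptions made for illustration and are not part of MP1.

```python
import math

def velocity(z, t):
    """Toy instantaneous velocity field v(z, t) = z (hypothetical example)."""
    return z

def mean_velocity(z_t, r, t):
    """Exact interval-averaged velocity for dz/dt = z.

    The ODE solution is z_tau = z_t * exp(tau - t), so the displacement over
    [r, t] is z_t * (1 - exp(r - t)); dividing by (t - r) gives u.
    """
    if t == r:
        return velocity(z_t, t)  # the interval average reduces to v when r == t
    return z_t * (1.0 - math.exp(r - t)) / (t - r)

def total_derivative(z_t, r, t, h=1e-5):
    """Central finite difference for d/dt u(z_t, r, t) along the flow.

    Shifting t by h shifts z_t by h * v(z_t, t), so this single directional
    difference approximates v * du/dz + du/dt.
    """
    v = velocity(z_t, t)
    forward = mean_velocity(z_t + h * v, r, t + h)
    backward = mean_velocity(z_t - h * v, r, t - h)
    return (forward - backward) / (2.0 * h)

def identity_residual(z_t, r, t):
    """u - [v - (t - r) * du/dt]; close to zero when the identity holds."""
    lhs = mean_velocity(z_t, r, t)
    rhs = velocity(z_t, t) - (t - r) * total_derivative(z_t, r, t)
    return lhs - rhs

if __name__ == "__main__":
    for z_t, r, t in [(1.0, 0.0, 1.0), (-2.5, 0.2, 0.7), (0.3, 0.5, 0.9)]:
        print(z_t, r, t, abs(identity_residual(z_t, r, t)) < 1e-6)
```

In a learned setting the closed-form `mean_velocity` would be replaced by the network, and the directional derivative would typically come from automatic differentiation rather than finite differences.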

2. Policy Architecture and One-Step Inference

The MP1 network receives:

A sequence of raw point-clouds $P \in \mathbb{R}^{n_o \times n_p \times 3}$
A sequence of proprioceptive robot states $S \in \mathbb{R}^{n_o \times s_d}$

Separate encoders extract visual ($f_v$) and state ($f_s$) features, which are concatenated as the conditional code $c = [f_v, f_s]$. The downstream network is a U-Net which, given a noisy trajectory $z_t$, interval endpoints $(r, t)$, and condition $c$, predicts the interval-averaged velocity $u_\theta(z_t, r, t \mid c)$.

At inference, with $(r, t) = (0, 1)$ and a single Gaussian sample $z_1 \sim \mathcal{N}(0, I)$, the trajectory conditioned on the encoded observation $c$ is obtained as

$z_0 = z_1 - u_\theta(z_1, 0, 1 \mid c)$

This constitutes "true" 1-NFE (one network function evaluation): a single forward pass suffices, and no numerical integration is required.
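The one-step rule follows from the definition of the interval average: $(t - r)\,u(z_t, r, t) = z_t - z_r$, so setting $r = 0$ and $t = 1$ gives $z_0 = z_1 - u(z_1, 0, 1)$. The sketch below contrasts that single evaluation with multi-step Euler integration on a toy scalar field $v(z, t) = z$. The closed-form `mean_velocity` is a stand-in for a trained network, the conditioning input is omitted, and all names are hypothetical.

```python
import math

def velocity(z, t):
    """Toy instantaneous velocity v(z, t) = z (hypothetical example)."""
    return z

def mean_velocity(z_t, r, t):
    """Exact average velocity of dz/dt = z over [r, t]; stands in for u_theta."""
    if t == r:
        return velocity(z_t, t)
    return z_t * (1.0 - math.exp(r - t)) / (t - r)

def one_step_sample(z_1, u):
    """z_0 = z_1 - (1 - 0) * u(z_1, r=0, t=1): a single call to u (1 NFE)."""
    return z_1 - u(z_1, 0.0, 1.0)

def euler_sample(z_1, v, n_steps):
    """Integrate dz/dt = v backwards from t = 1 to t = 0 with n_steps calls to v."""
    z = z_1
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        z = z - dt * v(z, t)
    return z

if __name__ == "__main__":
    z_1 = 2.0
    exact = z_1 * math.exp(-1.0)  # analytic solution of the toy ODE at t = 0
    print("one-step error:", abs(one_step_sample(z_1, mean_velocity) - exact))
    for n in (1, 10, 1000):
        print(f"euler {n:4d} steps error:", abs(euler_sample(z_1, velocity, n) - exact))
```

With an exact average velocity the single evaluation lands on the ODE solution, whereas Euler integration only approaches it as the number of function evaluations grows. A learned $u_\theta$ would of course carry its own approximation error.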

3. Training Objective: Classifier-Free Guidance and Dispersive Loss

MP1 uses two primary objective terms:

(a) CFG regression loss:

$\mathcal{L}_{\mathrm{cfg}}(\theta) = \mathbb{E}\left[\left\| u_\theta(z_t, r, t \mid c) - \mathrm{sg}(u_{\mathrm{tgt}}) \right\|_2^2\right]$

with $\mathrm{sg}(\cdot)$ denoting the stop-gradient operator, and

$u_{\mathrm{tgt}} = v(z_t, t) - (t - r)\left(v(z_t, t)\,\partial_z u_\theta + \partial_t u_\theta\right)$

Classifier-Free Guidance (CFG) is incorporated by randomly dropping the conditioning $c$ with some probability during training, so that the same network also learns an unconditional prediction. The resulting guided velocity is used in $u_{\mathrm{tgt}}$ for regression.
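A minimal sketch of how a regression target of the form $u_{\mathrm{tgt}} = v - (t - r)\frac{d}{dt}u_\theta$ can be assembled for one training sample. Several pieces are assumptions rather than details confirmed by the source: the linear path $z_t = (1 - t)x + t\epsilon$ with $v = \epsilon - x$, uniform sampling of $(r, t)$, the drop probability `p_drop`, and the use of a finite difference where an autodiff Jacobian-vector product would normally be used. The guidance mixing of conditional and unconditional predictions is omitted; only the condition dropout is shown.

```python
import random

def meanflow_target(u_theta, z_t, r, t, v, cond, h=1e-4):
    """u_tgt = v - (t - r) * d/dt u_theta, with d/dt taken along (v, 0, 1).

    A central finite difference stands in for a Jacobian-vector product.
    The result is treated as a constant (stop-gradient) in the loss.
    """
    forward = u_theta(z_t + h * v, r, t + h, cond)
    backward = u_theta(z_t - h * v, r, t - h, cond)
    du_dt = (forward - backward) / (2.0 * h)
    return v - (t - r) * du_dt

def make_training_sample(x, rng, p_drop=0.1, cond="context"):
    """Build one (z_t, r, t, v, cond) tuple for a scalar 'trajectory' x.

    Assumed, not taken from the source: linear interpolation path, uniform
    (r, t) with r <= t, and p_drop = 0.1. None marks a dropped condition.
    """
    eps = rng.gauss(0.0, 1.0)
    r, t = sorted((rng.random(), rng.random()))
    z_t = (1.0 - t) * x + t * eps
    v = eps - x
    if rng.random() < p_drop:
        cond = None
    return z_t, r, t, v, cond

def regression_loss(u_theta, sample):
    """Squared error between the prediction and the fixed MeanFlow target."""
    z_t, r, t, v, cond = sample
    target = meanflow_target(u_theta, z_t, r, t, v, cond)
    return (u_theta(z_t, r, t, cond) - target) ** 2

if __name__ == "__main__":
    rng = random.Random(0)
    toy_model = lambda z, r, t, cond: 0.3 * z + 0.1 * (t - r)  # hypothetical u_theta
    losses = [regression_loss(toy_model, make_training_sample(1.5, rng)) for _ in range(4)]
    print(losses)
```

For a model that ignores $z$ and $t$ the derivative term vanishes and the target reduces to $v$, which matches the flow-matching limit $r = t$.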

(b) Dispersive Loss:

$\mathcal{L}_{\mathrm{disp}} = \log \mathbb{E}_{i,j}\left[\exp\left(-\frac{\|h_i - h_j\|_2^2}{\tau}\right)\right]$

where $h_i, h_j$ are down-block latent representations for different batch samples and $\tau$ is a temperature. This repels embedding vectors in latent space, improving generalization in few-shot learning and discouraging latent collapse.

The total objective combines the CFG regression loss from (a) with the dispersive loss, weighted by a coefficient $\lambda$:

$\mathcal{L} = \mathcal{L}_{\mathrm{cfg}} + \lambda\,\mathcal{L}_{\mathrm{disp}}$
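To make the repulsion effect concrete, the sketch below implements one common form of dispersive regularizer: the logarithm of the batch-averaged $\exp(-\|h_i - h_j\|_2^2 / \tau)$ over all pairs. This InfoNCE-style form with squared L2 distance, the temperature `tau`, and the weight `lam` are assumptions chosen for illustration and may differ from the exact variant and values used in MP1.

```python
import math

def dispersive_loss(latents, tau=0.5):
    """log of the mean of exp(-||h_i - h_j||^2 / tau) over all pairs (i, j).

    `latents` is a list of equal-length feature vectors, one per batch sample.
    The value is 0 when all latents coincide and decreases as they spread out.
    """
    n = len(latents)
    total = 0.0
    for h_i in latents:
        for h_j in latents:
            sq_dist = sum((a - b) ** 2 for a, b in zip(h_i, h_j))
            total += math.exp(-sq_dist / tau)
    return math.log(total / (n * n))

def total_objective(regression_loss, latents, lam=0.5, tau=0.5):
    """Regression loss plus a weighted dispersive term (lam is hypothetical)."""
    return regression_loss + lam * dispersive_loss(latents, tau)

if __name__ == "__main__":
    collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print("collapsed:", dispersive_loss(collapsed))
    print("spread:   ", dispersive_loss(spread))
```

Because minimizing the term rewards larger pairwise distances, it penalizes the collapsed configuration relative to the spread one without needing positive pairs or extra network parameters.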

4. Empirical Performance and Ablation Results

On Adroit and Meta-World benchmarks, MP1 outperforms both DP3 (diffusion-based, 10 NFE) and FlowPolicy (flow-based, 1 NFE) in both average success rate and latency:

Method	NFE	Avg. Success (%)	Avg. Inference Time (ms)
DP3	10	68.7	132.2
FlowPolicy	1	71.6	12.6
MP1	1	78.9	6.8

Ablation studies show:

Removing Dispersive Loss reduces average success by ~5%.
Performance is maximized at intermediate interval ratios rather than in the degenerate case $r = t$ (where $u$ reduces to $v$, i.e., classical flow matching), confirming the benefit of interval-averaged flows.
MP1 maintains superior performance even in few-shot imitation regimes (2–5 demonstrations), with near-state-of-the-art results by 10 demonstrations, attributable to improved discriminative capacity in the latent space.

In real-world robotic manipulation tasks on the ARX R5 dual-arm robot, MP1 achieves the highest task success rates and the fastest average completion times among the three methods.

5. Comparison with Alternative Generative Policy Methods

MP1 achieves:

19x reduction in inference latency versus diffusion-based DP3, with improved success rates ($+10.2\%$).
Nearly 2x faster inference and $7.3\%$ higher success than the explicit flow-based FlowPolicy.
True one-step policy generation, with no consistency loss or numerical ODE-solver artifacts.

This is enabled by its local learning strategy (via the MeanFlow Identity), one-step trajectory computation (1-NFE), and lightweight, batch-wide dispersive regularization.
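The headline ratios can be recomputed directly from the benchmark table. The short script below does that arithmetic; the dictionary layout and function name are illustrative, while the numbers are the ones reported in the table.

```python
# Values copied from the benchmark table (average success in %, latency in ms).
results = {
    "DP3": {"nfe": 10, "success": 68.7, "latency_ms": 132.2},
    "FlowPolicy": {"nfe": 1, "success": 71.6, "latency_ms": 12.6},
    "MP1": {"nfe": 1, "success": 78.9, "latency_ms": 6.8},
}

def compare(baseline, method="MP1"):
    """Return (latency speedup, success gain in percentage points) of method over baseline."""
    b, m = results[baseline], results[method]
    speedup = b["latency_ms"] / m["latency_ms"]
    gain = m["success"] - b["success"]
    return round(speedup, 2), round(gain, 1)

if __name__ == "__main__":
    for name in ("DP3", "FlowPolicy"):
        speedup, gain = compare(name)
        print(f"MP1 vs {name}: {speedup}x faster, +{gain} points success")
```

The success differences are absolute percentage points between table entries, not relative improvements.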

6. Implementation and Evaluation Protocols

The reported implementation uses:

Batch size 128 with the AdamW optimizer
Farthest-point sampling for 512 or 1024 points
Downsampled images
Training for 3000 epochs (Adroit) or 1000 (Meta-World), with evaluation every 200 epochs, and top 5 checkpoint selection per seed.
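Farthest-point sampling reduces a raw point cloud to a fixed number of well-spread points before encoding. The sketch below is a plain greedy version in pure Python, written for clarity rather than speed; the function name, the fixed starting index, and the tiny example cloud are assumptions, and a practical pipeline would use a vectorized or GPU implementation.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart.

    Each new point is the one with the largest distance to its nearest
    already-chosen point. Runs in O(len(points) * k) distance evaluations.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen

if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0),
             (10.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    print(farthest_point_sampling(cloud, 3))
```

For the point counts listed above, `k` would be 512 or 1024 and `points` would hold one observation's raw cloud.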

The protocol uses 10 expert demonstrations per task in the simulated benchmarks and 20 human demonstrations per task in the real-world setting, with assessment on five real tasks. MP1 is deployed with a U-Net backbone and 3D-conditioned inputs on NVIDIA RTX 4090 hardware.

7. Context, Limitations, and Prospective Directions

By integrating the MeanFlow Identity into the policy learning process, MP1 bypasses the need for multi-step iterative sampling and structural consistency constraints. This permits both high-frequency control and robust generalization when only a few demonstrations are available. The main empirical findings demonstrate not only improved task performance but also a substantial drop in inference time, which is critical for real-world closed-loop robotic control.

A plausible implication is that interval-averaged flow architectures with dispersive regularization will become the design of choice for settings where latency, sample efficiency, and robust few-shot generalization are all priorities (Sheng et al., 14 Jul 2025). The need for only a single function evaluation at deployment may facilitate embedded and resource-constrained robotic systems. However, the long-term impact and generalization beyond the tested benchmarks will depend on continued evaluation across more heterogeneous tasks and sensor modalities.

References (1)

MP1: Mean Flow Tames Policy Learning in 1-step for Robotic Manipulation (2025)
