
ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training (2509.01819v1)

Published 1 Sep 2025 in cs.RO

Abstract: This paper introduces ManiFlow, a visuomotor imitation learning policy for general robot manipulation that generates precise, high-dimensional actions conditioned on diverse visual, language and proprioceptive inputs. We leverage flow matching with consistency training to enable high-quality dexterous action generation in just 1-2 inference steps. To handle diverse input modalities efficiently, we propose DiT-X, a diffusion transformer architecture with adaptive cross-attention and AdaLN-Zero conditioning that enables fine-grained feature interactions between action tokens and multi-modal observations. ManiFlow demonstrates consistent improvements across diverse simulation benchmarks and nearly doubles success rates on real-world tasks across single-arm, bimanual, and humanoid robot setups with increasing dexterity. The extensive evaluation further demonstrates the strong robustness and generalizability of ManiFlow to novel objects and background changes, and highlights its strong scaling capability with larger-scale datasets. Our website: maniflow-policy.github.io.


Summary

  • The paper introduces a unified visuomotor imitation framework that leverages consistency flow training to generate efficient robot manipulation policies.
  • It employs an innovative DiT-X transformer architecture with Beta time sampling, addressing inference inefficiencies and generalization issues.
  • Results demonstrate significant performance improvements on multi-task benchmarks and real-world platforms, achieving robust and efficient dexterous control.

ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training

Introduction and Motivation

ManiFlow presents a unified visuomotor imitation learning framework for general robot manipulation, targeting high-dimensional, dexterous action generation conditioned on multi-modal inputs (visual, language, proprioceptive). The method leverages flow matching with a continuous-time consistency training objective, enabling efficient and robust policy learning for complex manipulation tasks, including bimanual and humanoid scenarios. ManiFlow addresses key limitations of prior flow matching and diffusion-based policies, such as inference inefficiency, poor generalization, and inadequate multi-modal conditioning, by introducing architectural and algorithmic innovations.

Figure 1: ManiFlow policy architecture processes 2D/3D visual, robot state, and language inputs, outputting action sequences via a DiT-X transformer and flow matching with consistency training.

Methodology

Flow Matching and Consistency Training

ManiFlow builds upon the flow matching paradigm, where the policy learns to predict the velocity from noise to data along a straight ODE path. The core loss is:

\mathcal{L}_\mathrm{FM}(\theta) = \mathbb{E}_{x_0, x_1 \sim \mathcal{D}}\left[\left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2\right]
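This objective can be sketched in a few lines. The snippet below is a minimal illustration, assuming the straight interpolation path $x_t = (1-t)x_0 + t x_1$; `v_theta` is a stand-in for the policy network, not the paper's actual model.

```python
import torch

def flow_matching_loss(v_theta, x0, x1):
    """Sketch of the flow matching objective: regress the predicted
    velocity at a random point on the straight noise-to-data path toward
    the path's constant velocity (x1 - x0)."""
    t = torch.rand(x0.shape[0], 1)      # uniform timesteps, one per sample
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight ODE path
    target = x1 - x0                    # constant velocity of that path
    return ((v_theta(x_t, t) - target) ** 2).mean()
```

A network that exactly predicts `x1 - x0` drives this loss to zero, which is what makes the straight-path parameterization convenient for training.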

To improve sample efficiency and enable few-step inference, ManiFlow incorporates a continuous-time consistency training objective. This enforces self-consistency along the ODE trajectory, allowing the model to generate high-quality actions in 1-2 inference steps without teacher distillation. The consistency loss is:

\mathcal{L}_\mathrm{CT}(\theta) = \mathbb{E}_{t, \Delta t \sim \mathcal{U}[0,1]}\left[\left\| v_\theta(x_t, t, \Delta t) - \tilde{v}_{\text{target}} \right\|^2\right]

The training alternates between flow matching and consistency objectives, with EMA stabilization for target velocity estimation.

Figure 2: ManiFlow consistency training samples intermediate points along the flow path, enforcing self-consistency and accurate mapping to the target.
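The alternation with an EMA target can be sketched as follows. This is a simplified illustration under assumed conventions: the exact form of the target velocity $\tilde{v}_{\text{target}}$ and the model's $\Delta t$ conditioning follow the paper, but the concrete computation here is an assumption.

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """EMA stabilization: the target network slowly tracks the online weights."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

def consistency_loss(model, ema_model, x0, x1):
    """Sketch: sample (t, dt), then regress the online velocity at x_t
    toward a target velocity taken from the EMA model a small step
    further along the same straight path."""
    B = x0.shape[0]
    t = torch.rand(B, 1)
    dt = torch.rand(B, 1) * (1.0 - t)              # keep t + dt inside [0, 1]
    x_t = (1.0 - t) * x0 + t * x1
    x_next = (1.0 - t - dt) * x0 + (t + dt) * x1   # same path, one step ahead
    with torch.no_grad():
        v_target = ema_model(x_next, t + dt, dt)   # EMA target velocity
    return ((model(x_t, t, dt) - v_target) ** 2).mean()
```

Because the target comes from a frozen EMA copy rather than a distilled teacher, this matches the paper's claim of few-step generation without teacher distillation.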

Time Space Sampling Strategies

ManiFlow systematically ablates timestep sampling strategies, demonstrating that Beta distribution sampling (emphasizing the high-noise regime) yields superior performance for robotic control compared to uniform, logit-normal, mode, and cosine-mapped schedules. Continuous Δt sampling further improves consistency training efficacy.

Figure 3: Comparison of timestep sampling strategies for flow matching, highlighting the empirical and theoretical distributions for Beta, logit-Normal, Mode, and Cosmap.
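Skewed timestep sampling is straightforward to implement. The sketch below uses illustrative Beta shape parameters, not the paper's reported values, and assumes the convention that $t = 0$ is the noise end of the path; which end counts as "high-noise" depends on that convention.

```python
import numpy as np

def sample_timesteps(n, alpha=1.0, beta=1.5, rng=None):
    """Sketch of Beta-distributed timestep sampling for flow matching
    training. Beta(1, 1.5) skews mass toward t = 0, i.e. toward the
    noisier end of the path under the x_0-is-noise convention."""
    rng = rng or np.random.default_rng(0)
    return rng.beta(alpha, beta, size=n)
```

Swapping the shape parameters (or the convention) moves the emphasis to the other end of the schedule, which is exactly the axis the paper's ablation varies.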

DiT-X Transformer Architecture

The DiT-X block extends the DiT and MDT architectures by introducing AdaLN-Zero conditioning to both self-attention and cross-attention layers. This enables adaptive, fine-grained feature modulation between action tokens and multi-modal inputs, crucial for handling low-dimensional control signals and high-dimensional perceptual/language features.

Figure 4: DiT-X block applies AdaLN-Zero conditioning to cross-attention, enabling adaptive feature interactions for multi-modal policy learning.
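The key mechanism, AdaLN-Zero applied to cross-attention, can be sketched as below. This is a minimal single-layer illustration under assumed shapes and initialization details, not the paper's DiT-X block; the defining property is that the zero-initialized gate makes the block start as an identity mapping.

```python
import torch
import torch.nn as nn

class AdaLNZeroCrossAttn(nn.Module):
    """Sketch of AdaLN-Zero conditioning on a cross-attention layer:
    a conditioning vector produces shift/scale/gate terms, with the
    modulation projection zero-initialized so the residual branch
    contributes nothing at the start of training."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_mod = nn.Linear(dim, 3 * dim)   # -> shift, scale, gate
        nn.init.zeros_(self.to_mod.weight)      # the "Zero" in AdaLN-Zero
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, cond, context):
        # x: (B, T, dim) action tokens; cond: (B, dim); context: (B, S, dim)
        shift, scale, gate = self.to_mod(cond).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, context, context)
        return x + gate.unsqueeze(1) * attn_out  # gated residual update
```

At initialization the gate is zero, so the output equals the input; conditioning then learns how strongly multi-modal context should modulate the action tokens.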

Figure 5: DiT-X achieves faster convergence and higher accuracy in language-conditioned multi-task learning compared to DiT and MDT baselines.

Perception and Multi-Modal Conditioning

ManiFlow supports both 2D image and 3D point cloud inputs. The 3D encoder eschews max pooling to preserve fine-grained geometric information, critical for dexterous manipulation. Empirical results show strong performance with sparse point clouds (128 points) in calibrated scenes and improved robustness with denser clouds (4096 points) in unstructured environments. Color augmentation is essential for generalization in real-world deployment.
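The preprocessing described can be sketched as below. The uniform subsampling and color-jitter scheme here are assumptions chosen for illustration; the paper specifies only the point budgets (128 vs. 4096) and that color augmentation matters.

```python
import numpy as np

def preprocess_cloud(points, colors, n_points=128, jitter=0.1, rng=None):
    """Sketch: subsample a point cloud to a fixed budget (128 points for
    calibrated scenes, 4096 for unstructured ones) and jitter colors as
    a simple stand-in for the color augmentation the paper relies on."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(points.shape[0], size=n_points,
                     replace=points.shape[0] < n_points)
    noise = rng.uniform(-jitter, jitter, size=colors[idx].shape)
    return points[idx], np.clip(colors[idx] + noise, 0.0, 1.0)
```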

Experimental Results

Simulation Benchmarks

ManiFlow is evaluated on Adroit, DexArt, RoboTwin, and MetaWorld, covering single-arm, bimanual, and language-conditioned multi-task scenarios. ManiFlow consistently outperforms diffusion and flow matching baselines:

  • 2D image input: 43.4% and 45.6% improvement over diffusion and flow matching, respectively.
  • 3D point cloud input: 15.9% and 11.0% improvement.
  • MetaWorld multi-task: 31.4% and 34.9% improvement, with up to 125% gain on very hard tasks.

Figure 6: ManiFlow scaling comparison on the Pick Apple Messy task, demonstrating superior data efficiency and performance with increasing demonstrations.

Robustness and Generalization

ManiFlow demonstrates strong generalization to novel objects, backgrounds, and environmental perturbations, outperforming large-scale pre-trained models (π₀) in domain-randomized bimanual tasks.

Figure 7: Visualization of domain randomized evaluation on RoboTwin 2.0, including clutter, novel objects, lighting, and table height changes.

Real-World Experiments

ManiFlow is deployed on Franka, bimanual xArm, and Unitree H1 humanoid platforms, achieving a 69.6% average success rate across 8 tasks, almost double that of DP3. ManiFlow excels in high-dexterity tasks (pouring, handover, sorting) and adapts to unseen objects and scene variations.

Figure 8: Real-robot results across three platforms, with ManiFlow nearly doubling DP3's performance and robust execution visualized in 3D point clouds.

Figure 9: ManiFlow maintains robustness under real-world perturbations, including viewpoint changes, novel objects, and distractors.

Ablations and Scaling

ManiFlow achieves high success rates with only 1-2 inference steps, compared to 10 steps required by diffusion/flow matching baselines. The DiT-X block and Beta time sampling are critical for performance. ManiFlow exhibits strong scaling behavior, leveraging larger datasets more effectively than prior models.
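The few-step inference regime reduces to a coarse ODE integration. The Euler scheme below is a sketch of the general idea, assuming the learned velocity field convention used earlier; the actual sampler details are not specified in this summary.

```python
import numpy as np

def few_step_inference(v_theta, x0, steps=2):
    """Sketch of few-step action generation: integrate the learned
    velocity field from noise x0 with a small number of Euler steps.
    Consistency training is what makes 1-2 steps sufficient."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x = x + dt * v_theta(x, k * dt)  # one Euler step along the ODE
    return x
```

For a perfectly learned straight-path field the result is step-count invariant, which is why self-consistency along the trajectory translates directly into fewer inference steps.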

Implementation Considerations

  • Computational Requirements: ManiFlow's DiT-X block introduces modest overhead but is offset by reduced inference steps and improved sample efficiency.
  • Deployment: Supports both 2D and 3D visual modalities, with flexible action horizon and observation history. EMA stabilization is essential for consistency training.
  • Limitations: Performance in contact-rich tasks is limited by lack of tactile sensing; future work should integrate tactile and VLM-based modalities.

Implications and Future Directions

ManiFlow demonstrates that consistency flow training and adaptive transformer architectures can substantially improve dexterous manipulation policy learning, enabling robust, efficient, and generalizable control from limited demonstrations. The approach is extensible to other domains (navigation, mobile manipulation) and can benefit from integration with reinforcement learning and additional sensory modalities.

Conclusion

ManiFlow advances the state-of-the-art in general robot manipulation by combining flow matching with continuous-time consistency training and a novel DiT-X transformer architecture. The method achieves strong empirical results across simulation and real-world benchmarks, particularly in challenging dexterous and bimanual tasks. ManiFlow's architectural and algorithmic innovations set a new standard for efficient, robust, and scalable policy learning in robotics.
