MSG: Multi-Stream Generative Policies for Sample-Efficient Robotic Manipulation (2509.24956v1)

Published 29 Sep 2025 in cs.RO, cs.AI, and cs.LG

Abstract: Generative robot policies such as Flow Matching offer flexible, multi-modal policy learning but are sample-inefficient. Although object-centric policies improve sample efficiency, they do not fully resolve this limitation. In this work, we propose Multi-Stream Generative Policy (MSG), an inference-time composition framework that trains multiple object-centric policies and combines them at inference to improve generalization and sample efficiency. MSG is model-agnostic and inference-only, hence widely applicable to various generative policies and training paradigms. We perform extensive experiments both in simulation and on a real robot, demonstrating that our approach learns high-quality generative policies from as few as five demonstrations, resulting in a 95% reduction in demonstrations, and improves policy performance by 89% compared to single-stream approaches. Furthermore, we present comprehensive ablation studies on various composition strategies and provide practical recommendations for deployment. Finally, MSG enables zero-shot object instance transfer. We make our code publicly available at https://msg.cs.uni-freiburg.de.

Summary

  • The paper introduces MSG, a multi-stream, object-centric framework that reduces demonstration requirements by up to 95% in robotic manipulation tasks.
  • It leverages ensemble and flow composition strategies to integrate local models, leading to an 89% improvement over single-stream approaches.
  • MSG demonstrates robust zero-shot transfer using keypoint-based object estimation and validates its performance on both simulated and real-world tasks.

Multi-Stream Generative Policies for Sample-Efficient Robotic Manipulation

Introduction and Motivation

The paper introduces Multi-Stream Generative Policy (MSG), a framework for sample-efficient robotic manipulation that leverages object-centric, multi-stream policy learning within generative models. The motivation stems from the observation that while generative policies such as Diffusion and Flow Matching can model complex, multimodal behaviors, they are typically sample-inefficient, often requiring on the order of 100 demonstrations to achieve robust performance. In contrast, probabilistic, object-centric approaches can generalize from as few as five demonstrations by structuring learning around multiple object-centric streams. MSG extends this principle to generative policy learning, enabling inference-time composition of multiple object-centric models to improve both generalization and sample efficiency.

Figure 1: MSG learns high-quality policies from as few as five demonstrations by composing multiple object-centric models at inference, enabling sample-efficient generalization.

Technical Approach

Multi-Stream Object-Centric Policy Learning

MSG decomposes the policy learning problem into multiple object-centric streams, each modeling the robot's end-effector trajectory in a local coordinate frame associated with a task-relevant object or the robot itself. Each stream is trained independently using generative policy learning (e.g., Flow Matching), and at inference, the local models are transformed into the world frame and composed to yield a joint policy.
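
As a rough illustration of this setup, the snippet below sketches how a pose predicted by one stream in its object-centric frame could be re-expressed in the world frame before composition. It assumes SE(3) poses stored as 4x4 homogeneous matrices; the function and variable names are illustrative and not taken from the released code.

```python
import numpy as np

def local_to_world(T_world_frame: np.ndarray, action_local: np.ndarray) -> np.ndarray:
    """Map an end-effector pose predicted in an object-centric frame into the world frame.

    T_world_frame: 4x4 homogeneous transform of the object frame expressed in world coordinates.
    action_local:  4x4 homogeneous end-effector pose predicted by one stream in that frame.
    """
    return T_world_frame @ action_local

# Hypothetical usage: each stream predicts in its own frame, and all predictions are
# expressed in the world frame before being composed into a single action.
# world_poses = [local_to_world(T_wf, policy.sample(obs)) for T_wf, policy in zip(frames, streams)]
```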

The key insight is that different object-centric frames provide complementary information at different stages of a manipulation task. For example, in the OpenMicrowave task, the end-effector frame is informative during the approach, while the microwave frame becomes critical during the interaction phase. The combined model maintains high precision throughout the trajectory by leveraging the most informative stream at each stage.

Figure 2: The informativeness of local models varies over the trajectory; the combined model maintains high precision by leveraging the most relevant stream at each time step.

Inference-Time Model Composition

MSG is model-agnostic and operates entirely at inference, making it compatible with a range of generative policy frameworks. The composition of object-centric models is performed via two main strategies:

  1. Ensemble-Based Composition: Each stream generates a sample in its local frame, which is transformed to the world frame and combined via a weighted sum (using geodesic interpolation for pose). This approach is effective for unimodal, near-Gaussian distributions; a minimal sketch follows this list.
  2. Flow Composition: The vector fields predicted by each stream are combined at each integration step, allowing the flows to guide the sample towards a common mode. This is particularly advantageous for multimodal distributions and can be further enhanced with MCMC-based sampling; see the sketch after Figure 4.
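
The following is a minimal sketch of the ensemble-style composition, assuming each stream's sample has already been expressed in the world frame. It uses SciPy's weighted rotation mean as a stand-in for the geodesic pose interpolation described above; the exact averaging used in the paper may differ.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def ensemble_compose(positions, quats, weights):
    """Weighted composition of world-frame pose samples from several streams.

    positions: (N, 3) array of predicted end-effector positions.
    quats:     (N, 4) array of predicted orientations (xyzw quaternions).
    weights:   (N,)   non-negative stream weights.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    pos = (weights[:, None] * np.asarray(positions)).sum(axis=0)  # weighted mean translation
    rot = R.from_quat(quats).mean(weights=weights)                # weighted rotation average
    return pos, rot.as_quat()
```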

Figure 3: The ensemble approach integrates the individual streams before composing them, while flow composition jointly integrates the flow fields at each step.

Figure 4: For unimodal targets, the ensemble approximates the product distribution well; for multimodal targets, flow composition (with MCMC) encourages convergence to a common mode.
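
To make the flow-composition idea concrete, here is a hedged sketch of Euler integration over a weighted sum of per-stream velocity fields. It omits the MCMC refinement and sample-alignment details discussed in the paper, and the fixed step count and constant weights are illustrative assumptions.

```python
import numpy as np

def compose_flows(streams, x0, weights, n_steps=20):
    """Euler integration of a weighted combination of per-stream velocity fields.

    streams: list of callables v_i(x, t) -> velocity, each already expressed in the world frame.
    x0:      initial sample drawn from the prior.
    weights: per-stream weights (kept fixed here; they could also vary with t).
    """
    x = np.asarray(x0, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = sum(wi * vi(x, t) for wi, vi in zip(w, streams))  # combine the vector fields
        x = x + dt * v                                        # Euler step along the composed flow
    return x
```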

Stream Weighting Strategies

MSG supports several strategies for weighting the contribution of each stream during composition:

  • Progress-Based Scheduling: Weights are scheduled as a function of predicted skill progress (e.g., exponential, linear, threshold).
  • Variance-Based Weighting: Each stream predicts its own uncertainty (log variance), and weights are assigned inversely proportional to variance.
  • Parallel Sampling: Multiple particles are propagated in parallel, and empirical variances are used to compute weights.

Figure 5: Progress-based weighting schedules for stream composition.

These strategies allow MSG to flexibly adapt to task structure, either via simple schedules or by learning data-driven weighting directly from demonstrations.
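
As an illustration of the variance-based option, the snippet below converts per-stream predicted log-variances into normalized inverse-variance weights. This is a minimal sketch of the general idea, not the paper's exact weighting rule.

```python
import numpy as np

def inverse_variance_weights(log_vars):
    """Turn per-stream predicted log-variances into normalized composition weights.

    log_vars: (N,) array with one predicted log-variance per stream; lower variance
              (higher confidence) yields a larger weight.
    """
    precisions = np.exp(-np.asarray(log_vars, dtype=float))  # 1 / sigma^2
    return precisions / precisions.sum()

# Example: the stream reporting the lowest uncertainty dominates the composition.
# inverse_variance_weights([-2.0, 0.0, 1.0]) -> approx. [0.84, 0.11, 0.04]
```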

Experimental Evaluation

Simulation Experiments

MSG is evaluated on a suite of RLBench tasks, including both single-object and multi-object scenarios requiring high precision and long-horizon reasoning.

Figure 6: RLBench and real-world tasks used for evaluation, covering a range of manipulation challenges.

MSG consistently outperforms all baselines, including standard Flow Matching, single-stream object-centric models, and Affordance-Centric Policy Learning (ACPL). With as few as five demonstrations, MSG achieves success rates that surpass standard generative policies trained on 100 demonstrations. On average, MSG reduces demonstration requirements by 95% and improves policy performance by 89% relative to single-stream approaches.

Data Scaling and Ablation

MSG maintains its performance advantage even as the number of demonstrations increases, indicating that its benefits are not limited to low-data regimes.

Figure 7: MSG maintains superior performance even as the number of demonstrations increases.

A comprehensive ablation study reveals that:

  • The exponential progress-based schedule is generally optimal, but variance-based and parallel sampling strategies offer robust, data-driven alternatives.
  • Flow composition with MCMC further improves performance, especially for tasks with high precision requirements or multimodal distributions.
  • Proper sample alignment and custom priors are critical for effective flow composition.

Real-World Experiments

MSG is validated on a Franka Emika Panda robot across four real-world tasks. Object frames are estimated using DINO-based keypoint detection, enabling zero-shot transfer to novel object instances and environments. MSG achieves strong performance with only 10 demonstrations, significantly outperforming both standard and single-stream object-centric baselines.
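
The paper estimates object frames from DINO-based keypoints; one common way to turn matched keypoints into a frame is a least-squares rigid alignment (Kabsch), sketched below under that assumption. This is an illustrative reconstruction, not necessarily the authors' exact pipeline.

```python
import numpy as np

def rigid_transform_from_keypoints(src, dst):
    """Least-squares rigid alignment (Kabsch) between matched 3D keypoint sets.

    src, dst: (N, 3) arrays of corresponding keypoints, e.g. canonical object
              keypoints and their detections in the robot's world frame.
    Returns a 4x4 homogeneous transform mapping src into dst.
    """
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    Rm = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = np.eye(4)
    T[:3, :3] = Rm
    T[:3, 3] = dst_c - Rm @ src_c
    return T
```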

Theoretical and Practical Implications

MSG demonstrates that multi-stream, object-centric decomposition is a powerful inductive bias for generative policy learning, enabling sample-efficient generalization and robust policy composition. The inference-only, model-agnostic design allows MSG to be integrated with emerging generative policy frameworks without retraining, and its compatibility with keypoint-based object pose estimation enables zero-shot transfer across objects and environments.

The results challenge the prevailing assumption that generative policies are inherently sample-inefficient, showing that with appropriate object-centric structure and multi-stream composition, generative models can match or exceed the sample efficiency of probabilistic approaches.

Future Directions

Potential avenues for future research include:

  • Extending MSG to handle more complex, highly multimodal tasks via mixture models or hierarchical stream composition.
  • Integrating MSG with vision-based policies that operate directly from raw sensory input, leveraging advances in keypoint and pose estimation.
  • Exploring online adaptation and continual learning within the MSG framework to further reduce demonstration requirements in dynamic environments.

Conclusion

MSG establishes a new paradigm for sample-efficient generative policy learning in robotic manipulation by leveraging multi-stream, object-centric decomposition and inference-time model composition. The approach achieves strong generalization from minimal data, is compatible with a wide range of generative models, and supports zero-shot transfer to novel objects and environments. The findings have significant implications for the design of scalable, data-efficient robotic learning systems.
