
Mini Diffuser: Fast Multi-task Diffusion Policy Training Using Two-level Mini-batches (2505.09430v2)

Published 14 May 2025 in cs.RO and cs.LG

Abstract: We present a method that reduces, by an order of magnitude, the time and memory needed to train multi-task vision-language robotic diffusion policies. This improvement arises from a previously underexplored distinction between action diffusion and the image diffusion techniques that inspired it: In image generation, the target is high-dimensional. By contrast, in action generation, the dimensionality of the target is comparatively small, and only the image condition is high-dimensional. Our approach, \emph{Mini Diffuser}, exploits this asymmetry by introducing \emph{two-level minibatching}, which pairs multiple noised action samples with each vision-language condition, instead of the conventional one-to-one sampling strategy. To support this batching scheme, we introduce architectural adaptations to the diffusion transformer that prevent information leakage across samples while maintaining full conditioning access. In RLBench simulations, Mini-Diffuser achieves 95\% of the performance of state-of-the-art multi-task diffusion policies, while using only 5\% of the training time and 7\% of the memory. Real-world experiments further validate that Mini-Diffuser preserves the key strengths of diffusion-based policies, including the ability to model multimodal action distributions and produce behavior conditioned on diverse perceptual inputs. Code available at mini-diffuse-actor.github.io

Summary

  • The paper introduces Mini Diffuser, a method that combines two-level minibatching with architectural adaptations to train multi-task diffusion policies an order of magnitude faster.
  • Mini Diffuser achieves comparable performance (95%) to state-of-the-art diffusion policies while using only 5% of the training time and 7% of the memory.
  • This approach democratizes resource-intensive robotic policy training, making it feasible on consumer-grade hardware in less than a day and opening avenues for future efficiency gains.

Mini Diffuser: An Innovative Approach to Multi-Task Diffusion Policies in Robotic Manipulation

The paper Mini Diffuser: Train a Multi-Task Diffusion Policy on RLBench-18 in One Day with One GPU introduces an efficient approach to training multi-task robotic policies with diffusion models. Exploiting the contrast between image diffusion, where the generation target is high-dimensional, and action diffusion, where the target is low-dimensional and only the conditioning is high-dimensional, the authors dramatically reduce computational cost and training time while delivering performance comparable to current state-of-the-art models.

Diffusion models, which rose to prominence in tasks such as image generation, are increasingly applied to decision-making scenarios, notably robotic control. These models rely on computationally intensive iterative denoising, which makes not only inference but also training expensive. The method introduced by the authors addresses this inefficiency by exploiting the asymmetry between the low-dimensional robot action space and the high-dimensional conditioning inputs, primarily vision-language observations.
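To make the bottleneck concrete, here is a minimal sketch of a conventional one-to-one diffusion-policy training step. This is not the authors' implementation; the encoder, denoiser, and noise schedule are illustrative stand-ins. The point is that every noised action sample pays for a full pass over its (expensive) condition.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_condition(obs, W):
    """Stand-in for an expensive vision-language condition encoder (illustrative)."""
    return np.tanh(obs @ W)

def denoise(cond_feat, noisy_action, t):
    """Stand-in for the denoising network's noise prediction (illustrative)."""
    return 0.1 * noisy_action + 0.01 * cond_feat[..., : noisy_action.shape[-1]] * t

B, obs_dim, act_dim, T = 8, 128, 7, 100
obs = rng.standard_normal((B, obs_dim))          # one observation per sample
actions = rng.standard_normal((B, act_dim))      # one target action per sample
W = rng.standard_normal((obs_dim, 64))

# Conventional one-to-one sampling: one noised target per condition.
t = rng.integers(1, T + 1, size=(B, 1))          # per-sample diffusion timestep
noise = rng.standard_normal((B, act_dim))
alpha_bar = np.exp(-0.02 * t)                    # toy noise schedule
noisy = np.sqrt(alpha_bar) * actions + np.sqrt(1 - alpha_bar) * noise

cond = encode_condition(obs, W)                  # B full encoder passes of work
pred = denoise(cond, noisy, t)
loss = np.mean((pred - noise) ** 2)              # standard noise-prediction loss
```

With low-dimensional actions, nearly all of the work in this step is the condition encoding, which is exactly the redundancy the paper targets.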

Key Innovations and Results

Two-Level Minibatching and Architecture Adaptations: The paper's primary innovation is two-level minibatching, in which multiple noised action samples are paired with a single vision-language condition. Because the expensive condition encoding is computed once and reused across samples, redundant computation is eliminated and sample throughput rises sharply. Architectural modifications to the diffusion transformer, including masked attention and adaptive normalization, prevent information leakage between samples while preserving each sample's full access to the conditioning.
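The batching scheme can be sketched as follows. This is a toy NumPy illustration under assumed shapes, not the paper's code: the condition is encoded once per observation (level 1) and broadcast across K independent noised action samples (level 2), and a simple attention mask shows one way cross-sample leakage can be blocked, in the spirit of the paper's masked attention.

```python
import numpy as np

rng = np.random.default_rng(0)

B, K, obs_dim, act_dim, T = 8, 16, 128, 7, 100
obs = rng.standard_normal((B, obs_dim))
actions = rng.standard_normal((B, act_dim))

# Level 1: encode each vision-language condition ONCE.
W = rng.standard_normal((obs_dim, 64))
cond = np.tanh(obs @ W)                          # (B, 64), B encoder passes total

# Level 2: pair K independent (noise, timestep) draws with each condition.
t = rng.integers(1, T + 1, size=(B, K, 1))
noise = rng.standard_normal((B, K, act_dim))
alpha_bar = np.exp(-0.02 * t)                    # toy noise schedule
noisy = np.sqrt(alpha_bar) * actions[:, None, :] + np.sqrt(1 - alpha_bar) * noise

# The shared condition broadcasts over the K samples, so the expensive
# encoding is amortized over B*K training targets instead of B.
pred = 0.1 * noisy + 0.01 * cond[:, None, :act_dim] * t   # toy denoiser
loss = np.mean((pred - noise) ** 2)

# Illustrative attention mask under a toy token layout (B condition tokens
# followed by B*K action tokens, one token each): every action sample may
# attend to itself and to its own condition, but never to other samples.
n_cond, n_act = B, B * K
mask = np.zeros((n_act, n_cond + n_act), dtype=bool)
mask[np.arange(n_act), np.repeat(np.arange(B), K)] = True   # own condition
mask[np.arange(n_act), n_cond + np.arange(n_act)] = True    # self
```

Because each condition now anchors K training targets, the effective batch grows K-fold while the dominant encoder cost stays fixed, which is the source of the reported time and memory savings.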

Performance Metrics: Mini Diffuser achieves approximately 95% of the performance of state-of-the-art diffusion policies while using only 5% of the training time and 7% of the memory typically required. These figures underscore the efficacy of the proposed system: a substantial reduction in computational overhead with minimal impact on task success rates. This makes it feasible to train capable multi-task diffusion policies on consumer-grade hardware, in about 13 hours on a single RTX 4090 GPU.

Implications and Future Directions

The practical implications of the Mini Diffuser extend into both economic and methodological realms. Reducing the computational load in training complex robotic policies allows researchers and practitioners to leverage more modest resources to achieve high-quality models. This democratization of resource-intensive processes could expand the accessibility of advanced robotic manipulation systems across varying institutions.

Theoretically, the innovations presented promote further inquiry into architectural adaptability concerning diffusion methodologies. The asymmetric nature of the conditioning versus action spaces invites deeper exploration into level-specific batching strategies, potentially leading to efficiencies in other domains where similar dynamics might exist.

The paper also points to further improvements, such as integrating step-skipping techniques to accelerate inference. Extending the approach to dynamic task settings with real-time velocity control could likewise provide a robust foundation for future developments in diffusion-based robot control.

In conclusion, the Mini Diffuser represents a significant stride in the efficient training of diffusion policies for robotic manipulation. This work sets a foundation for pursuing accelerated experimentation and robust deployment in real-world environments, bridging gaps between computational demand and model efficacy in complex multi-task scenarios.
