- The paper demonstrates that discrete diffusion policies, with native inpainting pre-training, enable fine-tuning-free asynchronous action chunking in real-time control.
- Experimental results reveal significant improvements in solve rates, throughput, and reduced inference steps in both simulated and real-world robotic tasks.
- The approach outperforms flow-matching RTC by lowering computational overhead and enhancing closed-loop control efficiency, achieving notable success rates on challenging robotic tasks.
DiscreteRTC: Discrete Diffusion as Native Asynchronous Policy for Dynamic Control
Problem Context and Motivation
Effective real-time control in robotics mandates policies capable of issuing actions without pausing for inference, as real-world environments continue to evolve independently of computational delays. Traditional synchronous action chunking, even with high-throughput architectures, imposes essential pauses between action chunk generation, leading to discontinuities and degraded performance on dynamic tasks. Real-Time Chunking (RTC) partially addresses this by framing inter-chunk transitions as inpainting, yet extant approaches leverage flow-matching policy heads that are ill-suited for such structural requirements. The present work introduces DiscreteRTC, which recasts discrete diffusion models as inherently superior executors for asynchronous, real-time action chunking by exploiting their intrinsic inpainting capabilities.
Flow-Matching RTC: Structural Limitations
Extending RTC to flow-matching policies necessitates inference-time correction via ΠGDM, soft-masking schedules, and often additional fine-tuning to adapt to inpainting scenarios. This structural mismatch introduces four primary deficits:
- Lack of Inpainting Pre-training: Flow-matching policies' standard training corrupts action chunks homogeneously, yielding ineffectual pre-training for the inconsistent noise distributions encountered during inpainting, thus precluding straightforward transfer to asynchronous tasks.
- Inpainting-Specific Fine-Tuning: Effective RTC requires dedicated post-hoc fine-tuning, e.g., action-suffix conditioning, to introduce inpainting-relevant noise structure, which imposes additional engineering and computational burden and can degrade generative quality.
- Heuristic, Non-Adaptive Guidance: Online inference with RTC requires hand-designed masking schedules, fixed across inference scenarios, resulting in suboptimal, rigid adaptation to asynchronous transitions.
- Excessive Inference Overhead: Correction terms for prefix-alignment (ΠGDM) essentially double the computational load in deployment, conflicting with RTC's goal of lower-latency closed-loop control.
Consequently, scaling flow-matching policy capacity or training data does not alleviate these challenges, as demonstrated experimentally by the limited effectiveness of pre-training abstractions and the subpar throughput/success rates under varied latency regimes.
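The mismatch described in the first deficit can be made concrete with a small sketch. It is illustrative only: the interpolation convention (t = 1 is clean data, t = 0 is pure noise), the chunk shape, and the helper names `uniform_corrupt` and `inpainting_input` are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 8, 2                                   # chunk length and action dim (assumed)
chunk = rng.uniform(-1.0, 1.0, size=(H, D))   # a toy "clean" action chunk


def uniform_corrupt(x, t, rng):
    """Training-style corruption: ONE scalar noise level t for the whole chunk."""
    noise = rng.normal(size=x.shape)
    return t * x + (1.0 - t) * noise          # assumed convention: t=1 clean, t=0 noise


def inpainting_input(x, n_prefix, rng):
    """Inference-style input: a clean prefix followed by a fully noisy suffix."""
    t = np.zeros((x.shape[0], 1))
    t[:n_prefix] = 1.0                        # per-position noise level, not a scalar
    noise = rng.normal(size=x.shape)
    return t * x + (1.0 - t) * noise, t.ravel()


x_train = uniform_corrupt(chunk, t=0.5, rng=rng)           # every step equally noisy
x_infer, t_per_step = inpainting_input(chunk, n_prefix=3, rng=rng)
print(t_per_step)                             # mixed noise levels within one chunk
```

Under these assumptions, a model trained only on inputs like `x_train` never sees a chunk whose positions carry different noise levels, which is the situation `x_infer` represents.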
Discrete Diffusion Policies: Structure and Suitability
Discrete diffusion policies tokenize continuous action spaces and, through a native mask-inpaint-mimic training paradigm, are compelled from the outset to reconstruct partially observed action sequences. This paradigm intrinsically aligns with the requirements of chunk-inpainting in real-time asynchronous settings:
- Pre-training on Inpainting: The standard masked sequence modeling objective directly supports robust inpainting during inference for asynchronous execution, allowing performance to scale gracefully with model/data size.
- Fine-tuning-Free Deployment: Asynchronous policy transitions require no further loss engineering or fine-tuning schedules; native unmasking suffices.
- Natural Early-Exit Guidance: Unmasking can be truncated as soon as executable action tokens are available, and unfinished regions serve as an adaptive schedule for the next inference. There is no dependence on fixed weighting or externally tuned heuristics.
- Lower Per-Inference Overhead: Partial unmasking and early-stopping yield substantial reductions in computation compared to denoisers requiring full trajectory reconstruction at each chunk boundary.
The proposed DiscreteRTC harnesses these properties, replacing flow-matching heads in the RTC framework with discrete diffusion action heads, and empirically evaluates its architectural and operational merits.
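A minimal sketch of prefix-frozen unmasking with early exit, as described in the list above, might look like the following. It is not the authors' implementation: the `MASK` sentinel, the random `toy_logits` stand-in for the action head, the max-confidence selection rule, and all parameter names are assumptions chosen for illustration.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a still-masked action token


def toy_logits(tokens, vocab_size, rng):
    """Stand-in for a discrete diffusion action head (random logits, illustration only)."""
    return rng.normal(size=(tokens.shape[0], vocab_size))


def inpaint_chunk(prefix, horizon, n_exec, vocab_size=16, per_step=2, seed=0):
    """Fill a chunk whose first len(prefix) tokens are frozen.

    Stops as soon as the first n_exec tokens are unmasked (early exit);
    later positions may remain MASK and can be filled by the next inference.
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(horizon, MASK, dtype=np.int64)
    tokens[: len(prefix)] = prefix                 # committed actions are never rewritten
    steps = 0
    while np.any(tokens[:n_exec] == MASK):         # early-exit condition
        masked = np.flatnonzero(tokens == MASK)
        logits = toy_logits(tokens, vocab_size, rng)
        z = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        conf, pred = probs.max(axis=1), probs.argmax(axis=1)
        # Unmask the per_step most confident positions among those still masked.
        chosen = masked[np.argsort(-conf[masked])][:per_step]
        tokens[chosen] = pred[chosen]
        steps += 1
    return tokens, steps


tokens, steps = inpaint_chunk(prefix=np.array([3, 1, 4]), horizon=8, n_exec=5)
print(tokens, steps)
```

Because each iteration unmasks at least one position, the loop terminates after at most `ceil((horizon - len(prefix)) / per_step)` steps. Any positions still equal to `MASK` on return correspond to the "unfinished regions" that the next inference call would pick up.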
Experimental Results
Simulated Dynamic Control: Kinetix
In Kinetix, across varied inference delays, DiscreteRTC demonstrates robust superiority over ContinuousRTC and other action-chunking baselines (Naive Async, Bidirectional Decoding). Key metrics articulated in the experiments include:
- Solve Rates: DiscreteRTC achieves consistently higher average solve rates under all tested delay conditions.
- Throughput: DiscreteRTC completes more episodes per time window, which is attributed to the natural asynchrony and parallelism of diffusion-based inpainting.
- Iteration Steps: Fewer unmasking steps are required per inference, especially under longer delays, substantiating a lower inference and action-latency footprint.
Extended ablations confirm that DiscreteRTC outperforms flow-based policies even when those are fine-tuned specifically for RTC (e.g., Training-Time Continuous RTC, REMAC), and that incorporating advanced tokenizers or action representations (VFASH, FAST) can provide further gains.
Real-World Robotic Manipulation
Deploying on a UR5e platform, DiscreteRTC's advantages compound:
- Success Rate: On the hardest Dynamic Pick task, DiscreteRTC attains a 95% success rate, exceeding ContinuousRTC by 50% under identical experimental conditions.
- Inference Cost: The asynchronous cost with DiscreteRTC is 206 ms (compared to ~303 ms for discrete sync, and ~256 ms for ContinuousRTC inflated by ΠGDM), corresponding to approximately 0.7x the cost of generating fresh action sequences.
- Reactive Execution: Synchronous baselines fail entirely (0% success rate), confirming that chunk-inpainting-enabled asynchrony is structurally mandated for high-frequency closed-loop control in non-stationary tasks.
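The "approximately 0.7x" figure can be checked against the latencies quoted above. The short calculation below assumes that "generating fresh action sequences" refers to the ~303 ms discrete synchronous baseline; that reading is an interpretation of the summary, not something it states explicitly.

```python
# Latencies as quoted in the list above, in milliseconds.
discrete_async_ms = 206.0
discrete_sync_ms = 303.0    # assumed to be the "fresh action sequence" cost
continuous_rtc_ms = 256.0

ratio_vs_sync = discrete_async_ms / discrete_sync_ms
ratio_vs_continuous = discrete_async_ms / continuous_rtc_ms

print(f"{ratio_vs_sync:.2f} {ratio_vs_continuous:.2f}")  # prints "0.68 0.80"
```

On this reading, the asynchronous cost is about 0.68 of a fresh discrete generation and about 0.80 of the ContinuousRTC cost.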
These empirical results robustly validate DiscreteRTC's hypothesis: discrete diffusion models, by virtue of joint pre-training and architectural structure, are natural asynchronous chunk executors, outperforming established flow-matching RTC solutions in both computational efficiency and closed-loop task completion.
Theoretical and Practical Implications
Theoretically, this work challenges the previous flow-matching hegemony in structural policies for vision-language-action models (VLAs) and reveals that the inpainting nature of discrete diffusion, when properly aligned with the operational requirements of asynchronous real-time control, provides a direct and transferable advantage. Practically, this enables:
- Fine-tuning-Free Deployment: DiscreteRTC allows scaling to larger policy networks and massive pre-training datasets without the need for inpainting-specific retraining.
- Modular Integration with Existing Pipelines: Compatible with emerging action tokenization methods and unified VLA architectures.
- Latency Reduction in Hard Real-Time Settings: The reduction in inference time and explicit handling of chunk transitions make DiscreteRTC suitable for deployment in highly dynamic, latency-sensitive applications (e.g., mobile manipulation, multi-agent systems, and locomotion).
Future Research Directions
The paper identifies limitations and several open lines of investigation that would further enhance the utility of DiscreteRTC:
- Temporal Tokenizers: Advances in action tokenization (e.g., FAST, OAT) may address the sequence-length and bandwidth inefficiency of k-bin quantization.
- Unified Architectures: Joint VLM-diffusion backbones for observation and action would remove modular bottlenecks and more closely couple inference dynamics with input reasoning.
- Adaptive Unmasking Schedules: Instead of max-confidence decoding, more structured unmasking (e.g., autoregressive block selection) could exploit the implicit schedule induced by RTC and further minimize inference cost.
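For readers unfamiliar with k-bin quantization, the sketch below shows one common form: uniform per-dimension binning with decoding to bin centers. The bin count, the action range [-1, 1], and the function names are assumptions for illustration; the paper's exact tokenizer may differ.

```python
import numpy as np


def tokenize(actions, k=256, low=-1.0, high=1.0):
    """Map each continuous action value to one of k uniform bins (one token per value)."""
    a = np.clip(actions, low, high)
    idx = np.floor((a - low) / (high - low) * k).astype(np.int64)
    return np.minimum(idx, k - 1)             # the value `high` falls into the last bin


def detokenize(tokens, k=256, low=-1.0, high=1.0):
    """Decode tokens back to the centers of their bins."""
    width = (high - low) / k
    return low + (tokens + 0.5) * width


H, D = 8, 6                                   # toy chunk: 8 steps, 6 action dims (assumed)
rng = np.random.default_rng(0)
chunk = rng.uniform(-1.0, 1.0, size=(H, D))
tokens = tokenize(chunk)
recon = detokenize(tokens)

print(tokens.size)                            # H * D tokens for a single chunk
print(np.abs(recon - chunk).max())            # bounded by half a bin width
```

The token count grows as H times D, which is the sequence-length inefficiency that the temporally compact tokenizers named in the list aim to reduce.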
Conclusion
DiscreteRTC demonstrates that discrete diffusion policies, via their native masked sequence modeling paradigm, are structurally and empirically better suited for asynchronous, inpainting-based action chunk generation in dynamic RL and robotics control. The approach yields pronounced gains in real-world and simulated benchmarks, obviates heuristic guidance and fine-tuning engineering, and sets a new baseline for low-latency, high-throughput vision-language-action models under real-time constraints. Further research in temporally compact tokenizations, backbone unification, and optimal unmasking schedules will likely compound these benefits in the near future.