Dynamic object manipulation with Vision-Language-Action models

Determine whether and how Vision-Language-Action models can reliably perform dynamic object manipulation tasks that require rapid perception, temporal anticipation, and continuous control despite inference latency and continuously evolving object states.

Background

The paper argues that dynamic object manipulation imposes stricter real-time demands than static tasks: inference latency desynchronizes perception from action, and moving objects require temporal anticipation. Existing VLA models have demonstrated strong generalization in static settings but struggle when objects move during perception and execution, failing through delayed reactions and misaligned actions.

To address these issues, the authors propose DynamicVLA, a compact VLA architecture with Continuous Inference and Latent-aware Action Streaming designed to reduce the perception–execution gap. The Dynamic Object Manipulation (DOM) benchmark is introduced to provide large-scale dynamic data in simulation and the real world, emphasizing the need for approaches that can meet real-time constraints while maintaining multimodal reasoning.
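The abstract does not spell out how the perception–execution gap is closed mechanically. One common pattern for latency-aware action streaming, sketched minimally here under assumptions of my own (the class name `ActionStreamer`, a fixed inference latency measured in control steps, and action chunks as plain lists are all illustrative, not from the paper), is to keep executing the previously predicted chunk while new inference runs, and to drop the leading actions of each fresh chunk that target states which have already passed:

```python
from collections import deque

class ActionStreamer:
    """Hypothetical sketch: overlap inference with execution by streaming
    buffered actions while a new chunk is computed, and discarding the
    stale prefix of each fresh chunk to compensate for inference latency."""

    def __init__(self, latency_steps):
        # Inference delay expressed in control steps (assumed fixed here).
        self.latency_steps = latency_steps
        self.buffer = deque()

    def on_new_chunk(self, chunk):
        # The first `latency_steps` actions were planned for states that
        # elapsed during inference; skip them and adopt the fresh remainder.
        self.buffer = deque(chunk[self.latency_steps:])

    def next_action(self):
        # Execute from the buffer; return None if it runs dry before the
        # next chunk arrives (a real controller might hold the last action).
        return self.buffer.popleft() if self.buffer else None
```

For example, with a latency of two control steps, a freshly predicted 8-action chunk begins executing at its third action, so the executed command stays aligned with the object's current (not stale) state.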

References

Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control.

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation  (2601.22153 - Xie et al., 29 Jan 2026) in Abstract