Dynamic object manipulation with Vision-Language-Action models
Determine whether and how Vision-Language-Action (VLA) models can reliably perform dynamic object manipulation tasks that demand rapid perception, temporal anticipation, and continuous control, despite inference latency and continuously evolving object states.
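The core difficulty named here — acting on a scene that keeps changing while the model is still computing — can be illustrated with a minimal latency-compensation sketch. This is a generic constant-velocity extrapolation assumed purely for illustration, not the method proposed in the paper; the function names and the proportional "policy" are hypothetical:

```python
def extrapolate_state(position: float, velocity: float, latency_s: float) -> float:
    """Predict where a moving object will be once the model's inference
    finishes, assuming roughly constant velocity over the latency window.
    (Hypothetical helper for illustration; not from the cited paper.)"""
    return position + velocity * latency_s


def control_step(position: float, velocity: float,
                 latency_s: float, gain: float = 1.0) -> float:
    """One step of a latency-aware controller: command toward the object's
    *predicted* position rather than its last observed one, so the action
    remains valid by the time it is executed."""
    predicted = extrapolate_state(position, velocity, latency_s)
    # Toy proportional command toward the predicted target.
    return gain * predicted


# Example: object observed at x = 0.5 m moving at 0.2 m/s;
# model inference takes 150 ms.
cmd = control_step(0.5, 0.2, 0.150)
```

Without the extrapolation step, the command would target x = 0.5 m, a position the object has already left by the time the action is applied — the mismatch the research question highlights.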
References
Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control.
— DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
(2601.22153 - Xie et al., 29 Jan 2026) in Abstract