Extending VLAs and data pipelines beyond rigid-body dynamics

Extend Vision-Language-Action models and dynamic manipulation data-collection pipelines to tasks involving non-rigid or fluid objects with continuously evolving states, in both simulation and real-world settings.

Background

The current dynamic object manipulation (DOM) pipeline and experiments assume rigid-body state estimation, which simplifies perception and control. However, many real-world dynamic manipulation tasks involve non-rigid materials or fluids whose states are hard to represent and track, complicating both simulation and real-world deployment.
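To make the representational gap concrete, here is a minimal, hypothetical sketch (not from the paper): a rigid body's state is a fixed-size pose, whereas a deformable or fluid object's state must track many continuously evolving degrees of freedom, commonly approximated as a particle set.

```python
import numpy as np

def rigid_state(position, quaternion):
    """Rigid-body state: 3-D position + unit quaternion (7 numbers, fixed size)."""
    q = np.asarray(quaternion, dtype=float)
    return np.concatenate([np.asarray(position, dtype=float),
                           q / np.linalg.norm(q)])

def deformable_state(particles):
    """Deformable/fluid state: an (N, 3) particle cloud. N can be large,
    and the cloud evolves continuously as the material deforms."""
    return np.asarray(particles, dtype=float).reshape(-1, 3)

rigid = rigid_state([0.1, 0.0, 0.3], [0.0, 0.0, 0.0, 1.0])
cloth = deformable_state(np.random.default_rng(0).random((512, 3)))
print(rigid.shape)  # (7,)      -- constant regardless of object size
print(cloth.shape)  # (512, 3)  -- grows with resolution, changes over time
```

The fixed 7-dimensional pose is what makes rigid-body perception and control tractable; the particle representation is one common stand-in for the "continuously evolving states" the authors describe.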

The authors highlight that accommodating these complex dynamics will require new modeling and state-estimation approaches that maintain real-time performance and remain compatible with the proposed VLA execution mechanisms.
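One practical ingredient such approaches might need is a way to compress an arbitrarily large particle state into a fixed-size observation a VLA policy can consume at a fixed rate. As a hedged illustration (an assumption, not a method from the paper), greedy farthest-point sampling does this:

```python
import numpy as np

def farthest_point_sample(points, k, seed=0):
    """Greedily pick k points that cover the cloud, reducing an arbitrarily
    large particle state to a fixed-size observation for a policy."""
    pts = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(pts)))]          # random starting point
    dist = np.linalg.norm(pts - pts[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                  # farthest from chosen set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return pts[chosen]

cloud = np.random.default_rng(1).random((2048, 3))  # simulated fluid particles
obs = farthest_point_sample(cloud, 128)
print(obs.shape)  # (128, 3): fixed-size input regardless of particle count
```

Because the output size is constant, downstream inference cost stays bounded, which is one route to the real-time compatibility the authors call for.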

References

Our data pipeline assumes rigid-body state estimation, whereas many dynamic tasks involve non-rigid or fluid dynamics with continuously evolving states that are difficult to represent in both simulation and the real world. Extending VLA models and data pipelines to such settings remains an open challenge.

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation  (2601.22153 - Xie et al., 29 Jan 2026) in Section: Discussion and Future Work