- The paper introduces a hybrid particle-grid representation that combines Lagrangian particles with Eulerian grids to capture global shape and motion dynamics.
- The paper employs a PointNet encoder, neural velocity field prediction, and grid-to-particle integration to forecast particle motions under incomplete observations.
- The paper demonstrates improved 3D action-conditioned video prediction (via 3D Gaussian Splatting rendering) and model-based planning (via MPC), using training data extracted from multi-view RGB-D videos with foundation vision models.
This paper introduces Particle-Grid Neural Dynamics, a framework for learning predictive dynamics models of deformable objects directly from real-world RGB-D video recordings of robot interactions. The core idea is a hybrid particle-grid representation that overcomes limitations of prior approaches: physics-based simulators struggle with real-world system identification and state estimation, while graph-based learning models are sensitive to sparse representations and partial observations.
The proposed framework represents a deformable object as a set of particles (a Lagrangian representation) while simultaneously utilizing a fixed spatial grid (an Eulerian representation). The dynamics model learns a function that maps the history of particle positions and velocities, together with the robot's action, to the particles' future velocities. This function is parameterized by a neural network comprising several key components (a simplified code sketch of one prediction step follows the component list):
- Point Encoder: A neural network (specifically, a PointNet architecture) processes the input particle positions and velocities to extract rich, per-particle latent features. This encoder is designed to capture global shape information and historical motion, making the model more robust to incomplete observations.
- Neural Velocity Field: An MLP predicts a spatial velocity field on a fixed 3D grid. It takes grid locations and locality-aware particle features (averaged features of nearby particles) as input and outputs a velocity vector for each grid point. This step regularizes the velocity predictions and enhances spatial continuity.
- Grid-to-Particle (G2P) Integration: Velocities predicted at the grid points are transferred back to the particles using a B-spline interpolation kernel, similar to techniques used in the Material Point Method (MPM). This step yields the predicted velocity of each particle.
- Controlling Deformation: External interactions from the robot or environment (like ground contact) are incorporated using two methods:
- Grid Velocity Editing (GVE): For grasping interactions and ground contact, velocities on grid points near the interaction surface are directly modified to satisfy physical constraints (e.g., matching the gripper velocity, or projecting the velocity at the ground plane to model contact and friction).
- Robot Particles (RP): For nonprehensile actions (e.g., pushing), the robot gripper is represented as additional particles, and the dynamics are learned on this augmented point cloud. This allows the model to learn the effect of pushing without rigidly constraining object particle motion.
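The following is a minimal PyTorch sketch of one prediction step in this style. It is not the paper's implementation: the PointNet encoder is replaced by a small per-particle MLP with a global max-pool, particle-to-grid feature averaging and the grid-to-particle transfer use nearest-node scattering instead of a B-spline kernel, and grid velocity editing is shown only as a simple ground-plane clamp. All names (`ParticleGridStep`, `grid_res`, `hist`) and the normalization of positions to the unit cube are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParticleGridStep(nn.Module):
    """One dynamics step: encode particles, predict a grid velocity field,
    edit grid velocities for contact, then transfer velocities back to particles."""

    def __init__(self, feat_dim=64, grid_res=32, hist=3):
        super().__init__()
        self.grid_res = grid_res
        # Stand-in for the PointNet encoder: per-particle MLP over position + velocity history.
        self.point_mlp = nn.Sequential(
            nn.Linear(3 + 3 * hist, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
        # Velocity-field MLP: grid location + pooled local particle feature -> velocity vector.
        self.vel_mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 3))

    def forward(self, pos, vel_hist, dt=0.02):
        # pos: (N, 3) particle positions normalized to [0, 1]^3
        # vel_hist: (N, 3 * hist) concatenated velocity history
        feats = self.point_mlp(torch.cat([pos, vel_hist], dim=-1))        # (N, F)
        feats = feats + feats.max(dim=0, keepdim=True).values             # crude global context

        # Particle-to-grid: scatter-average particle features onto nearest grid nodes.
        G = self.grid_res
        idx = (pos * (G - 1)).long().clamp(0, G - 1)
        flat = idx[:, 0] * G * G + idx[:, 1] * G + idx[:, 2]              # flattened node index
        grid_feat = torch.zeros(G ** 3, feats.shape[-1])
        count = torch.zeros(G ** 3, 1)
        grid_feat.index_add_(0, flat, feats)
        count.index_add_(0, flat, torch.ones(pos.shape[0], 1))
        occupied = count.squeeze(-1) > 0
        grid_feat[occupied] = grid_feat[occupied] / count[occupied]

        # Neural velocity field: predict a velocity on every occupied grid node.
        axes = torch.linspace(0.0, 1.0, G)
        coords = torch.stack(torch.meshgrid(axes, axes, axes, indexing="ij"), -1).reshape(-1, 3)
        grid_vel = torch.zeros(G ** 3, 3)
        grid_vel[occupied] = self.vel_mlp(
            torch.cat([coords[occupied], grid_feat[occupied]], dim=-1))

        # Grid velocity editing (toy example): forbid downward velocity near the ground plane.
        near_ground = coords[:, 2] < 0.02
        grid_vel[near_ground, 2] = grid_vel[near_ground, 2].clamp(min=0.0)

        # Grid-to-particle transfer (nearest node here; a B-spline kernel in the paper).
        new_vel = grid_vel[flat]                                          # (N, 3)
        return pos + dt * new_vel, new_vel
```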
The model is trained end-to-end on data collected from multi-view RGB-D videos of robots interacting with diverse objects. A key aspect of the data pipeline is the use of vision foundation models (Grounded-SAM-2 for object segmentation and CoTracker for dense, persistent 3D particle tracking) to extract supervision from the raw videos. The model is optimized by minimizing the Mean Squared Error (MSE) between predicted and ground-truth particle positions over a short prediction horizon.
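As a rough illustration of the training objective, the sketch below rolls the hypothetical `ParticleGridStep` module from the previous sketch through a short horizon and accumulates an MSE loss against the tracked particle trajectories; the horizon length and time step are placeholder values, not the paper's settings.

```python
import torch

def rollout_mse(model, traj, vel_hist, horizon=5, dt=0.02):
    # traj: (T, N, 3) particle positions tracked from the multi-view RGB-D videos
    # vel_hist: (N, 3 * hist) initial velocity history for each particle
    pos, loss = traj[0], 0.0
    for t in range(1, horizon + 1):
        pos, vel = model(pos, vel_hist, dt=dt)                 # predict the next particle state
        vel_hist = torch.cat([vel_hist[:, 3:], vel], dim=-1)   # roll the velocity history window
        loss = loss + ((pos - traj[t]) ** 2).mean()            # MSE against tracked positions
    return loss / horizon
```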
A significant practical application demonstrated in the paper is 3D action-conditioned video prediction. By combining the predicted particle motions with 3D Gaussian Splatting (3DGS) reconstructions of the object, the framework can render realistic videos of the predicted dynamics. The predicted particles serve as control points that deform the Gaussian kernels via Linear Blend Skinning (LBS), enabling realistic visualization of complex deformations.
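A minimal sketch of how predicted particle motion could drive the Gaussians via Linear Blend Skinning is given below. The k-nearest-particle inverse-distance weights and the translation-only blending are simplifying assumptions rather than the paper's exact formulation; full LBS would typically also transform the Gaussians' orientations.

```python
import torch

def lbs_deform(gauss_mu, particles_t0, particles_t1, k=8, eps=1e-6):
    # gauss_mu: (M, 3) Gaussian centers at time t0
    # particles_t0 / particles_t1: (N, 3) control-point positions before / after the motion
    dist = torch.cdist(gauss_mu, particles_t0)                # (M, N) pairwise distances
    knn_dist, knn_idx = dist.topk(k, dim=-1, largest=False)   # k nearest control points
    weights = 1.0 / (knn_dist + eps)
    weights = weights / weights.sum(dim=-1, keepdim=True)     # normalized skinning weights
    disp = particles_t1 - particles_t0                        # per-control-point displacement
    # Blend neighbor displacements and move each Gaussian center accordingly.
    return gauss_mu + (weights.unsqueeze(-1) * disp[knn_idx]).sum(dim=1)
```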
Furthermore, the learned dynamics model is shown to be effective for model-based planning when integrated into a Model Predictive Control (MPC) framework, specifically Model Predictive Path Integral (MPPI) control. Given a target state (e.g., a desired object shape), the dynamics model predicts the outcomes of sampled robot action sequences over a horizon, and the actions are optimized to minimize a distance (e.g., Chamfer Distance) to the target. This enables goal-conditioned manipulation of deformable objects.
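The sketch below outlines such an MPPI-style planning loop under simplifying assumptions: `rollout(state, actions)` is a hypothetical helper that unrolls the learned dynamics and returns predicted particle positions, the cost is a symmetric Chamfer distance, and the sample count, noise scale, and temperature are placeholder values.

```python
import torch

def chamfer(a, b):
    # Symmetric Chamfer distance between two point sets a: (Na, 3), b: (Nb, 3).
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def mppi_plan(rollout, state, target, act_dim=3, horizon=10,
              num_samples=64, sigma=0.02, temperature=0.1, iters=5):
    mean = torch.zeros(horizon, act_dim)                      # nominal action sequence
    for _ in range(iters):
        noise = sigma * torch.randn(num_samples, horizon, act_dim)
        actions = mean + noise                                # sampled action sequences
        costs = torch.stack([chamfer(rollout(state, a), target) for a in actions])
        weights = torch.softmax(-costs / temperature, dim=0)  # MPPI importance weights
        mean = (weights[:, None, None] * actions).sum(dim=0)  # weighted update of the mean
    return mean                                               # execute mean[0], then replan
```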
The experimental results show that Particle-Grid Neural Dynamics outperforms state-of-the-art physics-based (MPM) and learning-based (Graph-Based Neural Dynamics, GBND) baselines in dynamics prediction accuracy across a variety of challenging deformable objects, including ropes, cloth, plush toys, boxes, paper bags, and bread. The hybrid representation and global point encoder improve robustness under sparse visual observations (fewer camera views) and generalization to unseen object instances within the same category. The integration with 3DGS yields higher-fidelity video predictions, and the MPC application shows improved planning performance over the GBND baseline.
Despite its strengths, the framework has limitations: it assumes a fixed number of particles (making scenarios in which particles appear or disappear challenging), it learns physical properties implicitly rather than identifying them explicitly, and it relies on the accuracy of the upstream vision models used for data collection and rendering.
Overall, the Particle-Grid Neural Dynamics framework presents a practical and effective approach for learning deformable object dynamics from real-world visual data, enabling high-fidelity simulation, video prediction, and model-based robotic manipulation.