Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling (2410.18912v1)
Abstract: Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at https://gs-dynamics.github.io.
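The abstract describes interpolating the motions of dense 3D Gaussians from a sparse set of control particles predicted by the dynamics model. Below is a minimal sketch of one plausible way to do that blending, using k-nearest-neighbor inverse-distance weights. The function names and the specific weighting scheme are illustrative assumptions, not the paper's exact interpolation method (which, per the abstract, transfers full 3D transformations rather than raw displacements).

```python
import numpy as np

def knn_indices(query, points, k):
    """Return indices and distances of the k nearest points for each query."""
    # Pairwise Euclidean distances between queries and candidate points.
    d = np.linalg.norm(query[:, None, :] - points[None, :, :], axis=-1)
    order = np.argsort(d, axis=1)[:, :k]
    return order, np.take_along_axis(d, order, axis=1)

def interpolate_gaussian_motion(gaussian_pos, ctrl_pos, ctrl_motion, k=4, eps=1e-8):
    """Blend sparse control-particle displacements onto dense Gaussian centers.

    gaussian_pos: (G, 3) centers of the 3D Gaussians
    ctrl_pos:     (C, 3) control-particle positions
    ctrl_motion:  (C, 3) per-particle displacements predicted by the dynamics model
    Inverse-distance weighting is a simplifying assumption here, standing in
    for the paper's interpolation of 3D transformations.
    """
    idx, dist = knn_indices(gaussian_pos, ctrl_pos, k)
    w = 1.0 / (dist + eps)
    w /= w.sum(axis=1, keepdims=True)  # normalize weights per Gaussian
    # Weighted sum of the k neighbors' motions gives each Gaussian's motion.
    return (w[..., None] * ctrl_motion[idx]).sum(axis=1)
```

Because the weights are normalized, a rigid translation of all control particles is reproduced exactly on every Gaussian, which is a useful sanity check for any such blending scheme.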