- The paper introduces a novel method that fuses pre- and post-interaction point clouds to jointly model object geometry and articulation.
- It employs implicit neural representations with dense local feature decoding to overcome inaccuracies of global joint predictions.
- Experimental results demonstrate superior mobile part reconstruction and precise joint estimation on both synthetic and real-world datasets.
An Examination of Ditto: Constructing Digital Twins from Interactive Perception
The paper "Ditto: Building Digital Twins of Articulated Objects from Interaction" presents a novel method for constructing digital twins of real-world articulated objects from visual observations captured before and after an interaction. The method, named Ditto, addresses the challenge of accurately capturing both the geometry and kinematic properties of articulated objects, which are essential for their deployment in simulated environments.
Methodological Insights
Ditto leverages implicit neural representations to achieve joint modeling of part geometry and articulation. The system processes a pair of point clouds representing an object before and after manipulation. Through PointNet++ encodings and a subsequent self-attention mechanism, two sets of subsampled point features are fused to construct structured feature grids and planes. These features facilitate dense point feature decoding, effectively capturing both global and part-level object properties.
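The fusion step described above can be illustrated with a minimal, unlearned sketch. The code below is not the paper's implementation; it assumes two already-encoded per-point feature sets (one per observation) and fuses them with a single scaled dot-product attention pass, concatenating each point's feature with an attention-weighted summary of the other observation:

```python
import numpy as np

def cross_attention_fuse(feats_before, feats_after):
    """Fuse per-point features from the pre- and post-interaction
    observations with one (unlearned) scaled dot-product attention
    step. Both inputs have shape (n_points, dim)."""
    dim = feats_before.shape[1]
    # Attention scores: how strongly each "before" point attends
    # to each "after" point, shape (n_before, n_after).
    scores = feats_before @ feats_after.T / np.sqrt(dim)
    # Row-wise softmax over the "after" points.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Attention-weighted summary of the other observation.
    attended = weights @ feats_after
    # Fused feature: original feature plus cross-observation context.
    return np.concatenate([feats_before, attended], axis=1)

rng = np.random.default_rng(0)
before = rng.normal(size=(128, 32))  # hypothetical PointNet++-style features
after = rng.normal(size=(128, 32))
fused = cross_attention_fuse(before, after)
print(fused.shape)  # → (128, 64)
```

In the actual system the attention layers are learned and the fused features are scattered into structured feature grids and planes; this sketch only conveys the cross-observation information flow.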
At the core of Ditto's methodology are implicit decoders tasked with predicting part occupancies, segmentations, and joint parameters. The decoders recover part-level geometric detail and estimate joint types and parameters, using dense per-point predictions to improve robustness. This dense articulation estimation outperforms traditional approaches that regress a single global set of joint parameters, which are prone to inaccuracies.
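The robustness argument for dense prediction can be made concrete with a toy sketch. Assuming (hypothetically) that each point on the segmented mobile part emits a noisy unit-vector estimate of the joint axis, aggregating the per-point estimates cancels much of the noise that a single global regression would absorb:

```python
import numpy as np

def aggregate_joint_axis(per_point_axes, seg_mask):
    """Aggregate noisy per-point joint-axis predictions into one
    robust estimate: keep only points segmented as the mobile part,
    normalize each prediction, average, and re-normalize."""
    axes = per_point_axes[seg_mask]
    axes = axes / np.linalg.norm(axes, axis=1, keepdims=True)
    mean_axis = axes.mean(axis=0)
    return mean_axis / np.linalg.norm(mean_axis)

rng = np.random.default_rng(1)
true_axis = np.array([0.0, 0.0, 1.0])
# 200 mobile-part points, each predicting the axis with Gaussian noise.
noisy = true_axis + 0.2 * rng.normal(size=(200, 3))
mask = np.ones(200, dtype=bool)  # all points belong to the mobile part here
est = aggregate_joint_axis(noisy, mask)
angle_err = np.degrees(np.arccos(np.clip(est @ true_axis, -1.0, 1.0)))
print(round(angle_err, 2))  # small angular error despite noisy predictions
```

A single per-point (or global) prediction carries the full noise level, whereas the aggregated axis error shrinks roughly with the square root of the number of contributing points; the paper's learned dense decoding exploits the same redundancy.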
Experimental Evaluations
The evaluation of Ditto involves two articulated-object datasets: a synthetic dataset and the Shape2Motion dataset. Ditto is benchmarked against baselines such as A-SDF, correspondence-based methods, and global joint predictors. Metrics include Chamfer distance for geometric quality and angular and positional errors for articulation estimation. Results indicate Ditto's superior performance, particularly in mobile part reconstruction and accurate joint estimation. Notably, Ditto's dense local feature decoding successfully mitigates common inaccuracies associated with global joint predictions.
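For reference, the Chamfer distance used for geometric evaluation can be computed directly for small point sets. This is a generic sketch of the standard symmetric formulation (mean squared nearest-neighbor distance in both directions), not code from the paper:

```python
import numpy as np

def chamfer_distance(pc_a, pc_b):
    """Symmetric Chamfer distance between point sets of shape (n, 3)
    and (m, 3): mean squared distance from each point to its nearest
    neighbor in the other set, summed over both directions."""
    # Pairwise squared distances via broadcasting, shape (n, m).
    d2 = ((pc_a[:, None, :] - pc_b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(chamfer_distance(a, b))  # → 0.3333333333333333
```

The broadcast pairwise-distance matrix is O(n·m) in memory, so large reconstructions typically use a k-d tree for the nearest-neighbor queries instead.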
Implications and Future Directions
The work outlined in this paper reflects significant progress towards automating the creation of digital twins for interactive and embodied AI applications in virtual and mixed reality environments. The capacity to automatically generate kinematic trees with precise part and joint details could enhance the efficiency and scalability of AI simulations. Moreover, the category-agnostic nature of Ditto facilitates its application to a broad spectrum of objects without bespoke model training.
Looking ahead, advancements in autonomous interactive perception could further improve the realism and fidelity of recreated digital twins. Stronger active perception could empower virtual agents to autonomously explore and interact with physical environments, fostering innovations in the autonomy of robots and agents within simulated ecosystems.
Conclusion
The introduction of Ditto demonstrates a comprehensive approach to digital twin creation through proficient use of implicit neural representations and interactive perception. Its ability to accurately capture both the geometry and articulation of complex objects marks a significant stride in embodied AI research. Moreover, the implications of this research extend beyond academic inquiry, potentially transforming fields such as robotics, simulation training, and augmented reality by providing high-fidelity interactive object models.