PICO: Reconstructing 3D People In Contact with Objects (2504.17695v1)

Published 24 Apr 2025 in cs.CV

Abstract: Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild. Our data and code are available at https://pico.is.tue.mpg.de.

Summary

An Analysis of PICO: A Framework for 3D Human-Object Interaction Reconstruction

The paper presents PICO, a framework for reconstructing 3D human-object interaction (HOI) from a single color image. It targets the depth ambiguities, occlusions, and large variation in object shape and appearance found in natural images. The framework rests on two contributions: a new dataset, PICO-db, and an optimization-based fitting method, PICO-fit.

Contributions and Methodology

PICO-db Dataset:

One key contribution is PICO-db, a new dataset of natural images paired with dense 3D contact annotations on both the human body and the object. It builds on images from the DAMON dataset, whose contacts are annotated only on a canonical 3D body, and extends them with contact on the object. For each image, the authors retrieve a suitable 3D object mesh from a database using vision foundation models, then project DAMON's body contact patches onto the object with a method that needs only 2 clicks per patch. This minimal human input yields contact correspondences between bodies and objects.
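The mesh retrieval step can be pictured as a nearest-neighbor search over embeddings. The sketch below is illustrative only: the summary does not say which foundation models or similarity measure are used, so cosine similarity, the toy 3-dimensional embeddings, and the names `retrieve_mesh` and `mesh_db` are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_mesh(query_embedding, mesh_embeddings):
    """Return the id of the mesh whose embedding best matches the query.

    Hypothetical helper: assumes image and mesh embeddings live in the
    same space, which a real system would have to arrange.
    """
    return max(
        mesh_embeddings,
        key=lambda mesh_id: cosine_similarity(
            query_embedding, mesh_embeddings[mesh_id]
        ),
    )


# Toy database of mesh embeddings (made-up values for illustration).
mesh_db = {
    "chair_01": [0.9, 0.1, 0.0],
    "bicycle_03": [0.1, 0.9, 0.2],
    "skateboard_02": [0.0, 0.3, 0.9],
}

# Made-up embedding of the object crop in the input image.
image_embedding = [0.2, 0.8, 0.1]
best_mesh = retrieve_mesh(image_embedding, mesh_db)
```

Here `best_mesh` is the entry pointing in the direction closest to the query. A real pipeline would replace the toy vectors with features from a pretrained model.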

PICO-fit Method:

The authors propose a render-and-compare fitting method named PICO-fit, which recovers 3D body and object meshes in interaction. It infers contact for the SMPL-X body, retrieves a likely 3D object mesh and its contact from PICO-db, and uses that contact to iteratively fit the body and object meshes to image evidence via optimization. According to the authors, the method works well for many object categories that no existing method can tackle, which is important for scaling HOI understanding to in-the-wild images.
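To give a feel for how contact can drive such a fit, here is a deliberately reduced sketch. It is not the paper's objective: it optimizes only an object translation, uses only a contact term (no rendering or image evidence), and assumes the contact points are already paired one-to-one. All names are hypothetical.

```python
def contact_loss(body_pts, obj_pts, t):
    """Mean squared distance between body contacts and translated object contacts."""
    total = 0.0
    for b, o in zip(body_pts, obj_pts):
        total += sum((o[k] + t[k] - b[k]) ** 2 for k in range(3))
    return total / len(body_pts)


def fit_translation(body_pts, obj_pts, steps=200, lr=0.1):
    """Gradient descent on a 3D translation that pulls paired contacts together."""
    t = [0.0, 0.0, 0.0]
    n = len(body_pts)
    for _ in range(steps):
        # Analytic gradient of contact_loss with respect to t.
        grad = [0.0, 0.0, 0.0]
        for b, o in zip(body_pts, obj_pts):
            for k in range(3):
                grad[k] += 2.0 * (o[k] + t[k] - b[k]) / n
        t = [t[k] - lr * grad[k] for k in range(3)]
    return t


# Toy contact points: the object contacts are the body contacts
# shifted by (-0.5, 0.2, -1.0), so the ideal translation is (0.5, -0.2, 1.0).
obj_contacts = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
body_contacts = [[0.5, -0.2, 1.0], [1.5, -0.2, 1.0], [0.5, 0.8, 1.0]]

translation = fit_translation(body_contacts, obj_contacts)
final_loss = contact_loss(body_contacts, obj_contacts, translation)
```

A full render-and-compare method would add image-based terms to the loss and optimize over body pose and object rotation and scale as well. The contact term shown here only illustrates why paired contacts help against depth ambiguity.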

Results and Implications

The paper reports quantitative evaluations of PICO-fit against state-of-the-art methods on existing datasets, showing better generalization to unseen in-lab datasets and to in-the-wild images. Most notable are the improvements in Procrustes-Aligned Chamfer Distance for both human and object meshes. Qualitative comparisons likewise show more realistic reconstructions than competing methods.
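For readers unfamiliar with the metric, the following sketch computes a symmetric Chamfer distance between two point sets. Conventions vary (squared or unsquared distances, sum or average of the two directions), and the summary does not state which one the paper uses. This version assumes unsquared distances averaged over both directions, and it assumes the Procrustes alignment has already been applied to the inputs.

```python
import math


def nearest_distance(p, points):
    """Euclidean distance from p to its nearest neighbor in points."""
    return min(math.dist(p, q) for q in points)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a and b.

    Assumed convention: mean nearest-neighbor distance in each
    direction, then the average of the two directions.
    """
    a_to_b = sum(nearest_distance(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_distance(q, a) for q in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)


# Toy point sets standing in for sampled mesh surfaces.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
error = chamfer_distance(predicted, ground_truth)
```

This brute-force version is quadratic in the number of points. Evaluations on real meshes typically sample points from the surfaces and use a spatial index for the nearest-neighbor queries.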

This research is relevant to human-centric applications such as smart home technologies, mixed reality interfaces, and assistive robotics. The ability of PICO-fit to handle novel object classes broadens its applicability in real-world settings.

Future Directions

Future work could improve contact estimation and expand dataset coverage, which may allow contact regressors to be trained directly and improve robustness. Incorporating advances in vision-language models could also extend the object database beyond its current limits.

Conclusion

The authors present a framework for reconstructing complex human-object interactions that combines a new contact dataset with an optimization-based fitting method. Challenges remain in handling occlusions and highly ambiguous interactions, but PICO lays a foundation for comprehensive 3D interaction understanding in unconstrained environments. The paper marks a significant step toward reconstructing 3D human-object interaction from natural images, with promising implications for AI-centered interaction systems.
