RHOBIN Challenge: Reconstruction of Human Object Interaction (2401.04143v1)

Published 7 Jan 2024 in cs.CV

Abstract: Modeling the interaction between humans and objects has been an emerging research direction in recent years. Capturing human-object interaction is however a very challenging task due to heavy occlusion and complex dynamics, which requires understanding not only 3D human pose, and object pose but also the interaction between them. Reconstruction of 3D humans and objects has been two separate research fields in computer vision for a long time. We hence proposed the first RHOBIN challenge: reconstruction of human-object interactions in conjunction with the RHOBIN workshop. It was aimed at bringing the research communities of human and object reconstruction as well as interaction modeling together to discuss techniques and exchange ideas. Our challenge consists of three tracks of 3D reconstruction from monocular RGB images with a focus on dealing with challenging interaction scenarios. Our challenge attracted more than 100 participants with more than 300 submissions, indicating the broad interest in the research communities. This paper describes the settings of our challenge and discusses the winning methods of each track in more detail. We observe that the human reconstruction task is becoming mature even under heavy occlusion settings while object pose estimation and joint reconstruction remain challenging tasks. With the growing interest in interaction modeling, we hope this report can provide useful insights and foster future research in this direction. Our workshop website can be found at https://rhobin-challenge.github.io/.

Summary

The paper introduces the first RHOBIN challenge on 3D reconstruction of human-object interactions from monocular RGB images, with separate tracks for human reconstruction, object pose estimation, and joint reconstruction.
It finds that human reconstruction is becoming mature even under heavy occlusion, while object pose estimation and joint reconstruction remain challenging, with the joint task needing a further optimization step.
Results highlight techniques such as dense correspondence mapping, keypoint estimation, and model ensembling as ways to boost accuracy.

Introduction

Human reconstruction and object reconstruction have long been separate research fields in computer vision. Capturing human-object interaction sits at their intersection and is difficult because of heavy occlusion and complex dynamics. The RHOBIN Challenge, organized alongside the RHOBIN workshop to bring these communities together, presents three tracks centered on reconstruction from monocular RGB images. The level of participation indicates broad interest among researchers in tackling human-object interaction scenarios.

Challenge Structure

The challenge drew more than 100 participants and more than 300 submissions, and it was divided into three tracks. The first was 3D human reconstruction, where the aim was to infer the 3D structure of a human from 2D images. The second was estimating the 6DoF pose (rotation and translation) of rigid objects. The third, and perhaps the most complex, required participants to reconstruct the human and the object jointly while they interact.
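To make the 6DoF pose in the second track concrete, the following is a minimal sketch of how a rotation and a translation place a rigid object template in space. It is not the challenge's evaluation code: the function names, the toy three-vertex template, and the choice of a rotation about the z-axis are all assumptions made for illustration.

```python
import numpy as np


def rotation_about_z(angle_rad):
    """Return a 3x3 rotation matrix for a rotation about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([
        [c, -s, 0.0],
        [s, c, 0.0],
        [0.0, 0.0, 1.0],
    ])


def apply_pose(vertices, rotation, translation):
    """Apply a rigid 6DoF pose to template vertices.

    vertices:    (N, 3) points of the object template (hypothetical data)
    rotation:    (3, 3) rotation matrix (3 degrees of freedom)
    translation: (3,) translation vector (3 degrees of freedom)
    """
    # Each row v becomes R @ v + t; the transpose handles row-vector layout.
    return vertices @ rotation.T + translation


# Toy template: three points on the coordinate axes (illustrative only).
template = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

R = rotation_about_z(np.pi / 2)   # rotate 90 degrees about z
t = np.array([0.0, 0.0, 2.0])     # then shift 2 units along z

posed = apply_pose(template, R, t)
print(np.round(posed, 6))
```

Estimating the pose means recovering `R` and `t` from an image; this sketch only shows the forward direction, which is how a predicted pose would be turned back into object geometry.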

Observations and Results

The evaluation yielded several insights. Human reconstruction proved fairly mature even under heavy occlusion, whereas object pose estimation and joint reconstruction remained challenging. Direct regression techniques fared quite well, but achieving high accuracy in joint reconstruction required an additional optimization step. The best-performing entries relied on dense correspondence mapping and keypoint estimation. For human reconstruction alone, techniques such as data augmentation and model ensembling played a significant role in improving results.

Looking Ahead

Despite the progress in reconstructing humans and objects separately, joint interaction modeling still poses substantial challenges and calls for new approaches. Future work could turn to video-based techniques that exploit temporal information, or to methods that do not rely on predefined object templates. It could also extend to interactions involving multiple entities (people or objects) and incorporate richer cues such as appearance for more complete scene understanding. Future editions of the RHOBIN challenge are expected to encourage interdisciplinary collaboration and further advances in this field.
