Detailed 2D-3D Joint Representation for Human-Object Interaction (2004.08154v2)

Published 17 Apr 2020 in cs.CV and cs.LG

Abstract: Human-Object Interaction (HOI) detection lies at the core of action understanding. Besides 2D information such as human/object appearance and locations, 3D pose is also commonly used in HOI learning because of its view-independence. However, rough 3D body joints carry only sparse body information and are not sufficient to understand complex interactions. Thus, we need detailed 3D body shape to go further. Meanwhile, the interacted object in 3D is also not fully studied in HOI learning. In light of these, we propose a detailed 2D-3D joint representation learning method. First, we utilize a single-view human body capture method to obtain detailed 3D body, face and hand shapes. Next, we estimate the 3D object location and size with reference to the 2D human-object spatial configuration and object category priors. Finally, a joint learning framework and cross-modal consistency tasks are proposed to learn the joint HOI representation. To better evaluate the 2D ambiguity processing capacity of models, we propose a new benchmark named Ambiguous-HOI consisting of hard ambiguous images. Extensive experiments on a large-scale HOI benchmark and Ambiguous-HOI show impressive effectiveness of our method. Code and data are available at https://github.com/DirtyHarryLYL/DJ-RN.



Summary

The paper’s main contribution is the DJ-RN network that fuses detailed 2D and 3D features to overcome spatial ambiguities in HOI detection.
It uses a single-view human body capture method to recover detailed 3D body, face and hand shapes, and combines the 2D human-object spatial configuration with object category priors to estimate 3D object location and size.
Experimental results on HICO-DET and Ambiguous-HOI benchmarks demonstrate state-of-the-art performance, especially for rare interaction categories.

Detailed 2D-3D Joint Representation for Human-Object Interaction

The paper proposes a 2D-3D joint representation learning method for human-object interaction (HOI) detection. Its goal is to capture complex interactions that 2D approaches handle poorly by incorporating detailed 3D human body shapes and an estimated 3D location and size for each object. The joint representation is intended to mitigate the weaknesses of 2D-only methods, in particular ambiguity in spatial configuration and appearance under varying viewpoints.

Overview

The method builds a detailed 3D representation of the human that covers the shape of the whole body as well as the face and hands. The authors obtain these shapes with a single-view human body capture technique. In parallel, they estimate the 3D location and size of each object from the 2D human-object spatial configuration and predefined object category priors. Estimating only location and size sidesteps the difficulty of full 6D object pose estimation from a single-view image.
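To make the idea of lifting a 2D object box to a 3D location and size concrete, the following is a minimal sketch and not the authors' procedure. It assumes a pinhole camera with a known focal length and principal point, approximates the object as a sphere, and takes the sphere radius from a hypothetical per-category prior. The names `Sphere3D` and `box_to_sphere` and every numeric value are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sphere3D:
    """Object approximated as a sphere in camera coordinates (metres)."""
    x: float
    y: float
    z: float
    radius: float


def box_to_sphere(box, prior_radius_m, focal_px, cx, cy):
    """Lift a 2D box (x1, y1, x2, y2) in pixels to a 3D sphere.

    Assumption: pinhole camera, so an object of metric radius R at depth z
    appears with pixel radius r = focal_px * R / z. Solving for z gives the
    depth; the box centre is then back-projected along its viewing ray.
    """
    x1, y1, x2, y2 = box
    if x2 <= x1 or y2 <= y1:
        raise ValueError("box must have positive width and height")

    # Box centre and apparent radius in pixels (half of the longer side).
    u = (x1 + x2) / 2.0
    v = (y1 + y2) / 2.0
    radius_px = max(x2 - x1, y2 - y1) / 2.0

    # Depth from the size prior, then back-projection of the centre.
    z = focal_px * prior_radius_m / radius_px
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return Sphere3D(x=x, y=y, z=z, radius=prior_radius_m)


# Hypothetical example: a ball with an assumed 0.11 m radius prior.
sphere = box_to_sphere((300, 200, 340, 240), prior_radius_m=0.11,
                       focal_px=1000.0, cx=320.0, cy=240.0)
print(sphere)
```

A size prior alone fixes depth only up to the accuracy of the prior and the box, which is one reason the paper also refers to the 2D human-object spatial configuration when placing the object.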

The core of the method is the Detailed Joint Representation Network (DJ-RN), which comprises two feature extractors: the 2D Representation Network (2D-RN) and the 3D Representation Network (3D-RN). The 2D-RN processes visual appearance and spatial information, while the 3D-RN operates on the detailed 3D human body and the constructed 3D spatial configuration volume. The two streams are learned jointly, and several cross-modal consistency tasks tie them together: spatial alignment, body part attention consistency, and semantic consistency between the 2D and 3D features.
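As a rough illustration of what a cross-modal consistency task can look like, the sketch below penalizes disagreement between body part attention computed from a 2D stream and from a 3D stream, and then fuses the per-interaction scores of the two streams. It is an assumption-laden toy, not the DJ-RN implementation: the symmetric KL divergence, the weighted-average fusion, and the function names are all choices made here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_consistency_loss(part_logits_2d, part_logits_3d):
    """Symmetric KL between 2D and 3D body part attention distributions.

    Assumption: each stream emits one attention logit per body part, in the
    same part order. The loss is zero when both streams attend identically.
    """
    if len(part_logits_2d) != len(part_logits_3d):
        raise ValueError("both streams must score the same body parts")
    p = softmax(part_logits_2d)
    q = softmax(part_logits_3d)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def fuse_scores(scores_2d, scores_3d, weight_2d=0.5):
    """Late fusion: weighted average of per-interaction scores."""
    return [weight_2d * a + (1.0 - weight_2d) * b
            for a, b in zip(scores_2d, scores_3d)]


# Hypothetical attention logits over three body parts (e.g. hand, head, foot).
print(attention_consistency_loss([2.0, 0.5, 0.1], [1.8, 0.7, 0.0]))
print(fuse_scores([0.2, 0.8], [0.6, 0.4]))
```

During training, a term like this would be added to the classification losses of the two streams, so that each stream is nudged toward the body parts the other finds informative.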

Experimental Validation

The paper evaluates the approach on two benchmarks: the large-scale HICO-DET dataset and the newly introduced Ambiguous-HOI. HICO-DET is a widely used HOI detection benchmark, while Ambiguous-HOI consists of hard ambiguous images chosen to test how well a model handles 2D ambiguity.

The results show that DJ-RN achieves state-of-the-art performance on both datasets. The gain is most notable on the rare interaction categories, which suggests the joint representation helps with less frequent human-object interactions. The paper also compares the components of the network, reporting the individual contributions of the 2D and 3D modules and the effect of learning them jointly.

Implications and Future Work

The method has implications for action understanding and related tasks. By complementing 2D representations with detailed 3D reconstructions, it points toward more robust, view-independent interaction recognition models. The 3D volumes and detailed body shapes could also benefit applications such as imitation learning, interaction-based image captioning, and visual reasoning.

The paper suggests that future research could explore dynamic scenes involving both static and moving objects to advance interaction understanding, as well as the extension of the framework to video data for spatio-temporal action recognition. Additionally, improving the efficiency of single-view 3D reconstructions and integrating more sophisticated object and human attribute estimation techniques could further enhance the representational capability of the proposed approach.

In conclusion, the paper shows how combining 2D and 3D information benefits HOI detection and offers a promising way to model complex human-object interactions through joint 2D-3D representation learning.


GitHub

GitHub - DirtyHarryLYL/DJ-RN: As a part of HAKE project (HAKE-3D). Code for our CVPR2020 paper "Detailed 2D-3D Joint Representation for Human-Object Interaction". (100 stars)