- The paper presents a novel template-free tracking approach using CorrAE for human reconstruction and TOPNet for robust object pose estimation.
- It jointly optimizes human and object models to ensure temporally coherent interactions, improving the F-score to 0.5169 on the BEHAVE dataset.
- The method leverages the synthetic ProciGen-Video dataset to boost training robustness and outperforms traditional template-based baselines even under occlusions.
InterTrack: Tracking Human Object Interaction without Object Templates
The paper "InterTrack: Tracking Human Object Interaction without Object Templates" introduces a method for tracking dynamic human-object interactions from monocular RGB videos without relying on predefined object templates. This approach addresses the limitations of prior methods that either require object templates or lack temporal consistency across frames.
Methodology
The authors present a well-structured solution to the complex problem of tracking human-object interactions, decomposing the 4D tracking problem into two main components: per-frame pose tracking and global shape optimization.
- Human Reconstruction with Autoencoder (CorrAE): To obtain temporally consistent human points, the authors propose CorrAE, an autoencoder that predicts SMPL vertices directly from per-frame 3D human reconstructions, recovering coherent correspondences across frames. Its structured latent space also allows SMPL parameters to be predicted directly, disentangling human pose from shape.
- Object Tracking with Temporal Pose Estimator (TOPNet): For object tracking, the authors introduce TOPNet, a transformer-based network that leverages temporal context to predict object rotations, yielding smooth pose transitions even under occlusion. The predicted poses are then refined for consistency with both the per-frame 3D point clouds and the 2D silhouettes.
- Joint Optimization for Plausible Interaction: The final step in their pipeline involves jointly optimizing the human and object models to ensure realistic interactions, especially focusing on contact points predicted from initial reconstructions. This joint optimization further consolidates the coherency and accuracy of the reconstructed human-object interactions.
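TOPNet itself is a learned transformer, but the temporal-consistency objective it targets can be illustrated with a much simpler baseline. The sketch below (a hypothetical numpy example, not the paper's network) smooths a sequence of noisy per-frame rotation quaternions by windowed averaging, and measures the resulting frame-to-frame jitter:

```python
import numpy as np

def smooth_rotations(quats, window=5):
    """Moving-average smoothing of unit quaternions (illustrative
    baseline, not TOPNet): average a temporal window, renormalize.
    Assumes all quaternions lie in the same hemisphere."""
    quats = np.asarray(quats, dtype=float)
    half = window // 2
    out = np.empty_like(quats)
    for t in range(len(quats)):
        lo, hi = max(0, t - half), min(len(quats), t + half + 1)
        avg = quats[lo:hi].mean(axis=0)
        out[t] = avg / np.linalg.norm(avg)
    return out

def frame_to_frame_angle(quats):
    """Mean angular change between consecutive frames, in radians."""
    dots = np.abs(np.sum(quats[:-1] * quats[1:], axis=1)).clip(0.0, 1.0)
    return np.mean(2 * np.arccos(dots))
```

On a jittery rotation sequence, the smoothed trajectory has a strictly lower mean frame-to-frame angle, which is the kind of temporal coherence TOPNet learns to produce directly from video.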
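The contact term in the joint optimization stage can likewise be sketched in isolation. Assuming predicted contact correspondences between human and object points (a deliberate simplification: the paper's full energy also includes silhouette and temporal smoothness terms), the object translation minimizing the summed squared contact distances has a closed form, the mean offset:

```python
import numpy as np

def contact_translation(obj_pts, human_pts):
    """Closed-form object translation minimizing the sum of squared
    distances between paired contact points (illustrative sketch of
    a contact term only; the paper optimizes a richer energy).
    obj_pts, human_pts: (N, 3) arrays of corresponding contacts."""
    obj_pts = np.asarray(obj_pts, dtype=float)
    human_pts = np.asarray(human_pts, dtype=float)
    # d/dt sum ||(o_i + t) - h_i||^2 = 0  =>  t = mean(h_i - o_i)
    return (human_pts - obj_pts).mean(axis=0)
```

In the actual pipeline such contact constraints are optimized jointly with the human model rather than solved independently per frame, which is what enforces plausible, stable interactions over time.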
Synthetic Data Generation: ProciGen-Video
A significant contribution of this work is the generation of a synthetic dataset, ProciGen-Video, comprising 8.5k synthetic interaction sequences totaling roughly 10 hours of video, which is crucial for training video-based methods. It offers a diverse range of object shapes and human interactions, enabling better generalization and robustness in real-world applications.
Experimental Results
Quantitative Metrics:
The experiments conducted on BEHAVE and InterCap datasets demonstrate that InterTrack significantly outperforms previous template-based and template-free methods. For instance, on the BEHAVE dataset, InterTrack achieved a combined F-score of 0.5169, improving upon the previous best of 0.4622 by HDM. The authors also show that training on their synthetic ProciGen-Video dataset helps boost performance, demonstrating effective transfer learning capabilities.
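The combined F-score reported above is the standard point-cloud reconstruction metric: precision and recall of points within a distance threshold, combined harmonically. A brute-force numpy version is sketched below (the threshold value is illustrative, not the paper's evaluation setting):

```python
import numpy as np

def f_score(pred, gt, threshold=0.01):
    """F-score between two point clouds at a distance threshold.
    precision: fraction of predicted points within threshold of GT;
    recall: fraction of GT points within threshold of a prediction."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Pairwise distances (brute force; use a KD-tree for large clouds).
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < threshold).mean()
    recall = (d.min(axis=0) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A perfect reconstruction scores 1.0, so InterTrack's 0.5169 versus HDM's 0.4622 corresponds to a meaningfully larger fraction of surface points recovered within the evaluation threshold.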
Comparison with Baselines:
InterTrack surpasses both CHORE and VisTracker in tracking accuracy. Since both of these baselines depend on object templates while InterTrack does not, this marks a considerable improvement. The extensive evaluation highlights InterTrack's robustness under occlusions and dynamic motions.
Implications and Future Directions
Practical Implications:
This method holds substantial potential for real-world applications where predefined object templates are unavailable or impractical to obtain. Fields like augmented reality, human-robot interaction, and behavior analysis could benefit from the enhanced accuracy and robustness provided by InterTrack's template-free tracking mechanism.
Theoretical Implications:
On the theoretical front, InterTrack underscores the importance of temporal consistency in tracking and the effective use of synthetic data for training complex models. This work also opens avenues for exploring advanced neural architectures and optimization methods to further push the boundaries of dynamic interaction tracking.
Future Developments:
Possible future directions include:
- Integrating texture modeling for a more comprehensive understanding of interactions.
- Expanding synthetic data generation to include more diverse objects and interactions, enhancing model robustness.
- Extending the framework to multi-human, multi-object scenarios and incorporating deformable object tracking.
In conclusion, InterTrack represents a significant step forward in tracking human-object interactions without object templates. Its innovative use of synthetic data and advanced neural network architectures establishes a new benchmark in dynamic scene understanding. The release of the ProciGen-Video dataset is a noteworthy contribution that promises to stimulate further research and innovation in this domain.