- The paper introduces DMMGAN, an attention-based GAN that predicts multiple diverse 3D human motion sequences with high accuracy.
- The model integrates a transformer-based encoder, GRUs, and a WGAN-GP discriminator to efficiently estimate both pose and hip trajectories.
- Empirical evaluations demonstrate improved Average Displacement Error (ADE) and Final Displacement Error (FDE) compared to state-of-the-art methods.
Overview of "DMMGAN: Diverse Multi Motion Prediction of 3D Human Joints using Attention-Based Generative Adversarial Network"
The paper introduces DMMGAN, a model designed to address shortcomings in human motion prediction by jointly estimating diverse future 3D poses and global motion trajectories. It targets a key limitation of existing generative models: their predictions are typically constrained to poses relative to a fixed joint, ignoring the body's displacement through space, and often lack diversity. DMMGAN addresses both issues with a generative adversarial network (GAN) architecture coupled with an attention-based transformer that predicts full 3D joint trajectories.
Model Architecture and Methodology
DMMGAN leverages the latest advancements in attention mechanisms and generative models to predict multiple potential future motions based on past 3D joint data. The model is built upon three main components:
- 3D Pose Module: This module uses a transformer-based encoder to summarize past motion and gated recurrent units (GRUs) to efficiently decode multiple diverse sequences of future human 3D poses.
- Hip Prediction Module: Going beyond pose-only prediction, this module estimates a hip trajectory for each predicted pose, situating the body's movement in a global frame. The past hip trajectory is encoded with a transformer, and the prediction is conditioned on the corresponding predicted 3D pose.
- Discriminator Module: Employing a Wasserstein GAN with gradient penalty (WGAN-GP), this component encourages realistic human motions by learning to distinguish generated sequences from real ones drawn from the Human3.6M dataset.
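The encode-with-attention, decode-with-GRU pattern behind the pose module can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation: all class names, layer sizes, and the mean-pooled context conditioning are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class MotionGenerator(nn.Module):
    """Illustrative sketch: transformer encoder over past poses,
    GRU decoder conditioned on a noise vector for diversity."""

    def __init__(self, n_joints=17, d_model=64, z_dim=16):
        super().__init__()
        self.in_proj = nn.Linear(n_joints * 3, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # The decoder input concatenates the pooled history context with a
        # latent code z, so different z samples yield different futures.
        self.gru = nn.GRU(d_model + z_dim, d_model, batch_first=True)
        self.out_proj = nn.Linear(d_model, n_joints * 3)

    def forward(self, past, z, horizon):
        # past: (B, T_past, n_joints*3) flattened joint positions
        # z:    (B, z_dim) latent noise controlling which future is produced
        h = self.encoder(self.in_proj(past))        # (B, T_past, d_model)
        ctx = h.mean(dim=1)                         # pooled history context
        step = torch.cat([ctx, z], dim=-1)          # (B, d_model + z_dim)
        dec_in = step.unsqueeze(1).repeat(1, horizon, 1)
        out, _ = self.gru(dec_in)                   # (B, horizon, d_model)
        return self.out_proj(out)                   # (B, horizon, n_joints*3)
```

Sampling several `z` vectors for the same observed history produces the multiple candidate futures that the discriminator and diversity losses then shape.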
The training process includes a unique combination of supervised losses (Best Loss, Teacher Forcing Loss, Similarity Loss, and Joint Loss) and the unsupervised discriminator loss, aiming to balance diversity with prediction accuracy while preserving joint consistency.
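Two of these terms can be sketched concretely: a best-of-K reconstruction loss (one common reading of a "Best Loss", which supervises only the sample closest to ground truth so the remaining samples stay free to cover other futures) and the standard WGAN-GP gradient penalty. Both functions below are illustrative assumptions about the loss shapes, not the paper's exact formulation.

```python
import torch

def best_of_k_loss(preds, gt):
    # preds: (K, B, T, D) -- K diverse samples per sequence; gt: (B, T, D)
    # Penalize only the closest sample, preserving diversity in the others.
    errs = ((preds - gt.unsqueeze(0)) ** 2).mean(dim=(2, 3))  # (K, B)
    return errs.min(dim=0).values.mean()

def gradient_penalty(critic, real, fake):
    # WGAN-GP term: push the critic's gradient norm toward 1 on random
    # interpolates between real and generated motion sequences.
    b = real.size(0)
    eps = torch.rand(b, 1, 1, device=real.device)
    x = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(x).sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return ((grad.flatten(1).norm(dim=1) - 1) ** 2).mean()
```

The total generator objective would then weight such supervised terms against the adversarial critic score, trading accuracy against diversity as the paper describes.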
Experimental Evaluation
The DMMGAN model is evaluated against existing methods such as DLow and STPOTR. Results show that DMMGAN not only produces more diverse future motion predictions (measured by Average Pairwise Distance, APD) but also improves prediction accuracy, with marked gains in Average Displacement Error (ADE) and Final Displacement Error (FDE). Qualitative analysis further shows DMMGAN surpassing the prior state of the art in predicting diverse possible futures, which is especially valuable for robotics and autonomous systems where anticipating plausible human trajectories is critical.
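The three metrics have standard definitions in the multi-motion prediction literature; a small NumPy sketch (array shapes are illustrative assumptions) makes them concrete:

```python
import numpy as np

def ade(pred, gt):
    # pred, gt: (T, J, 3) joint positions over T frames.
    # Average Displacement Error: mean per-joint error over all frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def fde(pred, gt):
    # Final Displacement Error: mean per-joint error at the last frame only.
    return float(np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean())

def apd(samples):
    # samples: (K, T, J, 3) -- K diverse predictions for the same history.
    # Average Pairwise Distance: mean L2 distance between sample pairs,
    # the diversity measure reported by DLow-style evaluations.
    k = samples.shape[0]
    flat = samples.reshape(k, -1)
    dists = [np.linalg.norm(flat[i] - flat[j])
             for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(dists))
```

For a multi-sample model, ADE/FDE are typically reported for the best sample per sequence, while APD is computed across all K samples, so the two numbers jointly capture the accuracy/diversity trade-off discussed above.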
Implications and Future Directions
The dual capability of diversifying human motion prediction while accurately determining poses and trajectories makes DMMGAN especially suitable for real-world applications like human-robot interactions, surveillance, and autonomous driving. Practically, this could lead to enhanced safety protocols by enabling systems to anticipate multiple human movements proactively.
On the theoretical frontier, this work suggests a robust framework that might inspire further research into transformer-based generative models for more granular motion estimation tasks. Extending this model to incorporate real-time data streams and deploying it in uncontrolled environments could be significant future endeavors. Additionally, integrating data from different sensory inputs such as video for end-to-end multimodal prediction could also be a valuable line of investigation.
This work lays down substantial groundwork in pushing boundaries within human motion prediction, balancing computational efficiency with model output diversity and accuracy.