Back to MLP: A Simple Baseline for Human Motion Prediction (2207.01567v3)

Published 4 Jul 2022 in cs.CV and cs.AI

Abstract: This paper tackles the problem of human motion prediction, consisting in forecasting future body poses from historically observed sequences. State-of-the-art approaches provide good results, however, they rely on deep learning architectures of arbitrary complexity, such as Recurrent Neural Networks (RNN), Transformers or Graph Convolutional Networks (GCN), typically requiring multiple training stages and more than 2 million parameters. In this paper, we show that, after combining with a series of standard practices, such as applying Discrete Cosine Transform (DCT), predicting residual displacement of joints and optimizing velocity as an auxiliary loss, a light-weight network based on multi-layer perceptrons (MLPs) with only 0.14 million parameters can surpass the state-of-the-art performance. An exhaustive evaluation on the Human3.6M, AMASS, and 3DPW datasets shows that our method, named siMLPe, consistently outperforms all other approaches. We hope that our simple method could serve as a strong baseline for the community and allow re-thinking of the human motion prediction problem. The code is publicly available at https://github.com/dulucas/siMLPe.

Summary

The paper presents siMLPe, an MLP-based model that challenges conventional complex architectures in human motion prediction.
It leverages techniques like Discrete Cosine Transform and residual displacement prediction, achieving lower MPJPE scores on benchmarks such as Human3.6M.
The approach reduces model parameters by 20-60x while maintaining state-of-the-art performance, advocating simpler design strategies.

A Simple Baseline for Human Motion Prediction Using MLPs

This paper presents an approach to human motion prediction built on a multi-layer perceptron (MLP) network, siMLPe, intended to serve as a strong yet straightforward baseline. The method departs from the prevailing trend of using complex architectures such as Recurrent Neural Networks (RNNs), Graph Convolutional Networks (GCNs), and Transformers. These architectures are effective, but they typically require multiple training stages and more than 2 million parameters, which makes them resource-intensive and harder to interpret or modify.

Key Methodological Contributions

The authors show that human motion can be accurately predicted by an MLP architecture combined with standard practices: applying the Discrete Cosine Transform (DCT), predicting the residual displacement of joints, and optimizing velocity as an auxiliary loss. The siMLPe network consists only of fully connected layers, layer normalization, and transpose operations. Apart from layer normalization, every operation is linear, so the network contains no non-linear activation functions. With only 0.14 million parameters, it is far smaller than competing models without compromising performance.
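The pipeline described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dct_matrix`, `forward`, `loss`), the tensor sizes, the omission of biases and learned normalization parameters, the axis over which layer normalization is applied, the exact form of the loss, and the fact that the sketch outputs as many frames as it receives are all choices made here for brevity.

```python
import math
import random


def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n); its transpose is the inverse DCT."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
                     for t in range(n)])
    return rows


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def forward(x, w_in, w_blocks, w_out):
    """x: T frames x C channels. Returns T predicted frames (assumed layout)."""
    d = dct_matrix(len(x))
    h = matmul(d, x)                  # DCT along the time axis, shape (T, C)
    h = matmul(h, w_in)               # fully connected layer over channels
    h = transpose(h)                  # shape (C, T)
    for w in w_blocks:                # FC over time + layer norm + skip connection
        z = layer_norm(matmul(h, w))
        h = [[a + b for a, b in zip(r, s)] for r, s in zip(h, z)]
    h = transpose(h)                  # back to shape (T, C)
    h = matmul(h, w_out)              # fully connected layer over channels
    delta = matmul(transpose(d), h)   # inverse DCT
    last = x[-1]                      # residual displacement: add last observed pose
    return [[dv + lv for dv, lv in zip(row, last)] for row in delta]


def velocity(seq):
    """Frame-to-frame differences of a sequence of poses."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(seq, seq[1:])]


def loss(pred, target):
    """Position error plus auxiliary velocity error, each a mean per-frame L2 norm."""
    def mean_l2(a, b):
        return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)
    return mean_l2(pred, target) + mean_l2(velocity(pred), velocity(target))


def rand(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, C = 10, 6                          # hypothetical sizes for illustration
x = rand(T, C)
pred = forward(x, rand(C, C), [rand(T, T) for _ in range(2)], rand(C, C))
```

Because the network predicts a displacement that is added to the last observed pose, a network whose final layer outputs zeros simply repeats that pose, which gives training a sensible starting point.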

Evaluation and Results

The paper reports an exhaustive evaluation of siMLPe on the Human3.6M, AMASS, and 3DPW datasets. Accuracy is measured with the Mean Per Joint Position Error (MPJPE), on which siMLPe consistently achieves lower error than existing state-of-the-art models while using 20 to 60 times fewer parameters than its counterparts. The simplicity of the model therefore does not come at the cost of effectiveness.
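For reference, MPJPE is the Euclidean distance between predicted and ground-truth joint positions, averaged over joints and frames. The sketch below assumes poses are stored as nested lists indexed `[frame][joint][xyz]`; the function name and data layout are illustrative choices, not taken from the paper's code.

```python
import math


def mpjpe(pred, target):
    """Mean Per Joint Position Error over all frames and joints.

    pred, target: nested lists indexed [frame][joint][xyz], same shape.
    The result is in the same unit as the inputs (e.g. millimetres).
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for pred_joint, target_joint in zip(pred_frame, target_frame):
            total += math.dist(pred_joint, target_joint)
            count += 1
    return total / count


# Two frames, two joints: every predicted joint is offset by (3, 4, 0).
target = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]] for _ in range(2)]
pred = [[[3.0, 4.0, 0.0], [4.0, 5.0, 1.0]] for _ in range(2)]
error = mpjpe(pred, target)
```

Here every joint is displaced by a 3-4-5 offset, so the error is 5 in whatever unit the coordinates use.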

Implications for Future Research

A critical implication of this work is its challenge to the AI community to reconsider the complexity often added to predictive models: simpler, more efficient alternatives can match or exceed them. The success of MLPs here is a reminder that model selection should weigh complexity against performance.

Looking forward, this approach could lead to more accessible deployment of motion prediction models in practical applications, such as autonomous vehicles, robotics, and surveillance systems, due to its lightweight nature. It also opens avenues for further research into enhancing simple models through optimized learning techniques or integrations with other machine learning paradigms.

Conclusion

The authors encourage revisiting simpler architectures for sequence prediction tasks. By achieving state-of-the-art results with a minimalist framework, their method sets a precedent for applying established, straightforward architectures to complex predictive tasks efficiently. This work not only contributes a high-performing model but also enriches the discussion on the necessity and implications of model complexity in AI research.