A Mixture of Experts Approach to 3D Human Motion Prediction (2405.06088v1)
Abstract: This project addresses the challenge of human motion prediction, a critical area for applications such as autonomous vehicle movement detection. Previous works have emphasized the need for low inference times to provide real-time performance in such applications. Our primary objective is to critically evaluate existing model architectures, identifying their advantages and opportunities for improvement, by replicating the state-of-the-art (SOTA) Spatio-Temporal Transformer model as closely as possible given computational constraints. These models have surpassed the limitations of RNN-based models and have demonstrated the ability to generate plausible motion sequences over both short- and long-term horizons through the use of spatio-temporal representations. We also propose a novel architecture that addresses the challenge of real-time inference speed by incorporating a Mixture of Experts (MoE) block within the Spatio-Temporal (ST) attention layer. The particular variant used is Soft MoE, a fully differentiable sparse Transformer approach that has shown promise in enabling larger model capacity at lower inference cost. We make our code publicly available at https://github.com/edshieh/motionprediction
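The routing idea behind the Soft MoE block mentioned above can be sketched briefly. The following is a minimal NumPy illustration of the Soft MoE forward pass (dispatch tokens to expert slots, run experts, recombine), not the authors' implementation; the parameter name `phi`, the shapes, and the toy experts are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(X, phi, experts):
    """One Soft MoE layer over a token matrix.

    X:       (n, d) input tokens
    phi:     (d, e*s) learnable slot parameters (e experts, s slots each)
    experts: list of e callables, each mapping (s, d) -> (s, d)
    """
    logits = X @ phi                 # (n, e*s) token-slot affinities
    D = softmax(logits, axis=0)      # dispatch weights: softmax over tokens
    slots = D.T @ X                  # (e*s, d) each slot is a soft mix of tokens
    e = len(experts)
    s = slots.shape[0] // e
    outs = np.concatenate(
        [experts[i](slots[i * s:(i + 1) * s]) for i in range(e)], axis=0
    )                                # (e*s, d) expert outputs per slot
    C = softmax(logits, axis=1)      # combine weights: softmax over slots
    return C @ outs                  # (n, d) every token gets a soft mix of slots
```

Because every token contributes to every slot with a soft weight, the layer is fully differentiable (no hard top-k routing), while each expert still only processes its own small set of slots, which is the source of the inference-cost savings.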
References:
- Real-time human motion capture with multiple depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2016. The University of British Columbia.
- EgoLocate: Real-time motion capture, localization, and mapping with sparse body-mounted sensors. arXiv preprint arXiv:2305.01599, 2023. URL https://arxiv.org/abs/2305.01599.
- SPOTR: Spatio-temporal pose transformers for human motion prediction. arXiv preprint arXiv:2303.06277, March 2023. URL https://arxiv.org/abs/2303.06277.
- A spatio-temporal transformer for 3d human motion prediction. arXiv preprint arXiv:2004.08692, 2020. URL https://arxiv.org/abs/2004.08692.
- GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020. URL https://arxiv.org/abs/2006.16668.
- From sparse to soft mixtures of experts. arXiv preprint arXiv:2308.00951, 2023. URL https://arxiv.org/abs/2308.00951.
- On human motion prediction using recurrent neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4674–4683, 2017.
- Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4346–4354. IEEE, 2015.
- Deep inertial poser: Predicting full body pose from sparse inertial measurements. Max-Planck-Gesellschaft, 2020. URL https://dip.is.tue.mpg.de/.
- fairmotion: Tools to load, process and visualize motion capture data. GitHub, 2020. URL https://github.com/facebookresearch/fairmotion.
- Structured prediction helps 3d human motion modelling. arXiv preprint arXiv:1910.09070, 2019. URL https://arxiv.org/pdf/1910.09070.pdf.
- eth-ait. Motion transformer. https://github.com/eth-ait/motion-transformer, 2024. Accessed: 2024-04-23.
- Soft MoE (Mixture of Experts). GitHub repository, 2023. URL https://github.com/lucidrains/soft-moe-pytorch.
- Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, Dec 2017. URL https://arxiv.org/abs/1706.03762.
- Scaling vision with sparse mixture of experts. arXiv preprint arXiv:2106.05974, 2021. URL https://arxiv.org/abs/2106.05974.
- AMASS: Archive of motion capture as surface shapes. In The IEEE International Conference on Computer Vision (ICCV), Oct 2019. URL https://amass.is.tue.mpg.de.
- QuaterNet: A quaternion-based recurrent model for human motion. In British Machine Vision Conference 2018 (BMVC 2018), page 299, Newcastle, UK, Sept 3–6, 2018. Northumbria University.