A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis (2308.07301v2)
Abstract: The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which addresses these challenges with a single unified architecture. Our model achieves performance comparable to or better than the state of the art in each field. Inspired by Vision Transformers (ViTs), UNIMASK-M decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem in which each task corresponds to a different masking pattern given as input. Because the model is explicitly informed about which joints are masked, UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly for long transition periods. More information can be found on the project website https://evm7.github.io/UNIMASKM-page/
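To make the idea of "different masking patterns" concrete, the following is a minimal sketch of how forecasting, inbetweening, and occlusion could each be expressed as a boolean mask over a (frames, joints) grid. It is not the authors' implementation: the function names, the array layout, the zero-filling of masked entries, and the sequence sizes are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical layout: motion has shape (frames, joints, 3).
# In every mask below, True means "masked / unknown to the model".

def forecasting_mask(num_frames, num_joints, num_observed):
    """Motion prediction: the first num_observed frames are known, the rest are masked."""
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    mask[num_observed:] = True
    return mask

def inbetweening_mask(num_frames, num_joints, num_context, num_target=1):
    """Inbetweening: leading context frames and trailing target key-poses are known."""
    mask = np.ones((num_frames, num_joints), dtype=bool)
    mask[:num_context] = False
    mask[num_frames - num_target:] = False
    return mask

def occlusion_mask(num_frames, num_joints, ratio, seed=0):
    """Occlusion: individual joints are dropped at random with probability `ratio`."""
    rng = np.random.default_rng(seed)
    return rng.random((num_frames, num_joints)) < ratio

def apply_mask(motion, mask):
    """Zero out masked joints and return the mask as well, so that a model
    can be told explicitly which entries are missing (an assumed encoding)."""
    masked_motion = np.where(mask[..., None], 0.0, motion)
    return masked_motion, mask

# Usage with made-up sizes: 5 frames, 3 joints, 2 observed frames.
motion = np.ones((5, 3, 3))
masked_motion, mask = apply_mask(motion, forecasting_mask(5, 3, 2))
```

Under this framing a single reconstruction model only needs the masked sequence and the mask itself; switching tasks amounts to switching the function that builds the mask.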