
A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis (2308.07301v2)

Published 14 Aug 2023 in cs.CV, cs.GR, and cs.RO

Abstract: The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains performance comparable to or better than the state of the art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly for long transition periods. More information can be found on the project website: https://evm7.github.io/UNIMASKM-page/
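
The abstract outlines the core formulation at a high level: split each pose into body-part patches (ViT-style), mask the patches that a given task leaves unknown, and train a transformer to reconstruct them. The sketch below illustrates that idea in PyTorch; it is a minimal conceptual sketch, not the authors' implementation, and the class name, body-part grouping, feature dimensions, and mask-token scheme are all illustrative assumptions.

```python
# Conceptual sketch (not the paper's code): pose-conditioned motion synthesis
# cast as masked reconstruction over body-part "patches".
import torch
import torch.nn as nn

class MaskedMotionReconstructor(nn.Module):
    def __init__(self, joints_per_part=5, feat_dim=6, d_model=256,
                 num_layers=6, num_heads=8, max_tokens=500):
        super().__init__()
        patch_dim = joints_per_part * feat_dim            # one body part in one frame
        self.embed = nn.Linear(patch_dim, d_model)        # patch embedding, as in ViT
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, patch_dim)         # reconstruct each patch

    def forward(self, patches, mask):
        # patches: (B, T*P, joints_per_part*feat_dim) body-part tokens over T frames
        # mask:    (B, T*P) bool, True where the token is unknown and must be synthesized
        x = self.embed(patches)
        # Replace unknown tokens with a learned mask token, so the model is
        # explicitly told which joints are missing (masked or occluded).
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = x + self.pos[:, : x.size(1)]
        return self.head(self.encoder(x))                 # (B, T*P, patch_dim)

# Different masking patterns express different tasks, e.g. masking all tokens of
# future frames yields motion prediction, while masking the tokens between known
# key-poses yields motion in-betweening.
B, T, P = 2, 20, 5
patches = torch.randn(B, T * P, 5 * 6)
mask = torch.zeros(B, T * P, dtype=torch.bool)
mask[:, (T // 2) * P:] = True                             # mask the second half of the sequence
recon = MaskedMotionReconstructor()(patches, mask)        # (B, T*P, 30)
```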
