HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations (2403.03561v1)
Abstract: It is especially challenging to achieve real-time human motion tracking on a standalone VR Head-Mounted Display (HMD) such as Meta Quest and PICO. In this paper, we propose HMD-Poser, the first unified approach to recover full-body motions using scalable sparse observations from an HMD and body-worn IMUs. In particular, it supports a variety of input scenarios, such as HMD, HMD+2IMUs, HMD+3IMUs, etc. This input scalability accommodates users who prefer high tracking accuracy as well as those who prefer ease of wear. HMD-Poser uses a lightweight temporal-spatial feature learning network so that the model runs in real time on HMDs. Furthermore, HMD-Poser performs online body shape estimation to improve the position accuracy of body joints. Extensive experimental results on the challenging AMASS dataset show that HMD-Poser achieves new state-of-the-art results in both accuracy and real-time performance. We also build a new free-dancing motion dataset to evaluate HMD-Poser's on-device performance and to investigate the performance gap between synthetic data and real-captured sensor data. Finally, we demonstrate HMD-Poser with a real-time avatar-driving application on a commercial HMD. Our code and free-dancing motion dataset are available at https://pico-ai-team.github.io/hmd-poser