MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors (2403.17610v2)
Abstract: Foot contact is an important cue for human motion capture, understanding, and generation. Existing datasets tend to annotate dense foot contact either by visual matching with thresholding or by incorporating pressure signals. However, these approaches either suffer from low accuracy or are designed only for small-range, slow motion. There is still a lack of a vision-pressure multimodal dataset covering large-range, fast human motion with accurate and dense foot-contact annotation. To fill this gap, we propose a Multimodal MoCap Dataset with Vision and Pressure sensors, named MMVP. MMVP provides accurate, dense plantar pressure signals synchronized with RGBD observations, which is especially useful for plausible shape estimation, robust pose fitting without foot drift, and accurate global translation tracking. To validate the dataset, we propose an RGBD-P SMPL fitting method and a monocular-video-based baseline framework, VP-MoCap, for human motion capture. Experiments demonstrate that our RGBD-P SMPL fitting results significantly outperform purely visual motion capture. Moreover, VP-MoCap outperforms SOTA methods in foot-contact and global translation estimation accuracy. We believe the configuration of the dataset and the baseline frameworks will stimulate research in this direction and provide a good reference for MoCap applications in various domains. Project page: https://metaverse-ai-lab-thu.github.io/MMVP-Dataset/.
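The abstract describes deriving dense foot-contact annotation from synchronized plantar pressure signals. As a minimal illustrative sketch (not the paper's actual annotation pipeline), one common way to obtain such labels is to threshold each insole region's pressure relative to its observed peak; the function name, array shapes, and threshold value below are assumptions for illustration only.

```python
import numpy as np

def contact_labels_from_pressure(pressure_frames, threshold=0.02):
    """Derive per-region foot-contact labels from insole pressure frames.

    pressure_frames: (T, R) array of plantar pressure readings for T frames
                     and R sensing regions (e.g., heel, midfoot, toes per foot).
    threshold:       fraction of each region's peak pressure above which the
                     region is considered in contact (hypothetical value).
    Returns a (T, R) boolean array of contact labels.
    """
    pressure = np.asarray(pressure_frames, dtype=np.float32)
    # Normalize each region by its own peak so the threshold is scale-independent
    # across sensors with different sensitivities.
    peak = pressure.max(axis=0, keepdims=True) + 1e-8
    normalized = pressure / peak
    return normalized > threshold
```

In a fitting pipeline, such per-frame contact labels could then act as constraints that penalize vertical motion or sliding of foot vertices flagged as in contact, which is the general idea behind avoiding foot drift in pressure-aware SMPL fitting.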