MoCap-to-Visual Domain Adaptation for Efficient Human Mesh Estimation from 2D Keypoints (2404.07094v1)
Abstract: This paper presents Key2Mesh, a model that takes a set of 2D human pose keypoints as input and estimates the corresponding body mesh. Because this process does not involve any visual (i.e., RGB image) data, the model can be trained on large-scale motion capture (MoCap) datasets, thereby overcoming the scarcity of image datasets with 3D labels. To apply the model to RGB images, we first run an off-the-shelf 2D pose estimator to obtain the 2D keypoints and then feed these keypoints to Key2Mesh. To improve performance on RGB images, we apply an adversarial domain adaptation (DA) method to bridge the gap between the MoCap and visual domains. Crucially, our DA method does not require 3D labels for visual data, which enables adaptation to target sets without costly labels. We evaluate Key2Mesh on the task of estimating 3D human meshes from 2D keypoints in the absence of paired RGB images and mesh labels. Our results on the widely used H3.6M and 3DPW datasets show that Key2Mesh sets a new state of the art, outperforming other models in PA-MPJPE on both datasets and in MPJPE and PVE on 3DPW. Thanks to its simple architecture, the model runs at least 12x faster than the prior state-of-the-art model, LGD. Additional qualitative samples and code are available on the project website: https://key2mesh.github.io/.
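The abstract reports results in MPJPE (mean per-joint position error) and PA-MPJPE (the same error after Procrustes alignment). The paper's own evaluation code is not reproduced here; the following is a minimal NumPy sketch of how these two metrics are commonly computed for a single pose. The function names and the `(J, 3)` array layout are assumptions made for illustration, not taken from the Key2Mesh code base.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Align pred (J, 3) to gt (J, 3) with the best scale, rotation, translation."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g  # center both point sets

    # SVD of the 3x3 cross-covariance gives the optimal rotation.
    u, s, vt = np.linalg.svd(p.T @ g)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    corr = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(corr) @ u.T

    # Optimal isotropic scale for the centered, rotated prediction.
    scale = (s * corr).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(17, 3))  # hypothetical 17-joint skeleton
    pred = gt + rng.normal(scale=0.05, size=gt.shape)
    print("MPJPE:", mpjpe(pred, gt))
    print("PA-MPJPE:", pa_mpjpe(pred, gt))
```

Because PA-MPJPE removes global scale, rotation, and translation before measuring error, it isolates the quality of the articulated pose, whereas MPJPE also penalizes global misalignment. Under the same assumptions, a per-vertex error such as PVE can be sketched by passing mesh vertices instead of joints to `mpjpe`.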