Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation (2312.07051v2)
Abstract: Automatic estimation of 3D human pose from monocular RGB images is a challenging and unsolved problem in computer vision. Supervised approaches rely heavily on laborious annotations and generalize poorly because of the limited diversity of 3D pose datasets. To address these challenges, we propose a unified framework that leverages mask as supervision for unsupervised 3D pose estimation. Built on general unsupervised segmentation algorithms, the proposed model employs skeleton and physique representations that exploit accurate pose information in a coarse-to-fine manner. Unlike previous unsupervised approaches, we organize the human skeleton in a fully unsupervised way, which enables the processing of annotation-free data and provides ready-to-use estimation results. Comprehensive experiments demonstrate state-of-the-art pose estimation performance on the Human3.6M and MPI-INF-3DHP datasets. Further experiments on in-the-wild datasets illustrate that the model can draw on additional data to improve its performance. Code will be available at https://github.com/Charrrrrlie/Mask-as-Supervision.