SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation (2404.02041v2)
Abstract: We present a new self-supervised approach, SelfPose3d, for estimating 3d poses of multiple persons from multiple camera views. Unlike current state-of-the-art fully-supervised methods, our approach does not require any 2d or 3d ground-truth poses and uses only the multi-view input images from a calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d human pose estimator. We propose two self-supervised learning objectives: self-supervised person localization in 3d space and self-supervised 3d pose estimation. We achieve self-supervised 3d person localization by training the model on synthetically generated 3d points, serving as 3d person root positions, and on the projected root-heatmaps in all the views. We then model the 3d poses of all the localized persons with a bottleneck representation, map them onto all views obtaining 2d joints, and render them using 2d Gaussian heatmaps in an end-to-end differentiable manner. Afterwards, we use the corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive supervision attention mechanism to guide the self-supervision. Our experiments and analysis on three public benchmark datasets, including Panoptic, Shelf, and Campus, show the effectiveness of our approach, which is comparable to fully-supervised methods. Code: https://github.com/CAMMA-public/SelfPose3D. Video demo: https://youtu.be/GAqhmUIr2E8.
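The two geometric steps the abstract describes, mapping 3d points onto each calibrated view and rendering the resulting 2d locations as Gaussian heatmaps, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `project_points` and `render_gaussian_heatmap`, the pinhole model without lens distortion, the per-pixel max used to merge several people into one heatmap, and the toy camera values are all assumptions made for the example.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project (N, 3) world points to (N, 2) pixels with a pinhole camera, x ~ K (R X + t)."""
    cam = points_3d @ R.T + t          # world -> camera coordinates
    uvw = cam @ K.T                    # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

def render_gaussian_heatmap(points_2d, height, width, sigma=2.0):
    """Render one heatmap as the per-pixel max over 2d Gaussians centred at points_2d."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width))
    for u, v in points_2d:
        g = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)
    return heatmap

# Hypothetical single view: identity extrinsics and made-up intrinsics.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)

# Two synthetic 3d root positions (in camera-aligned world coordinates).
roots = np.array([[0.0, 0.0, 2.0],
                  [0.2, -0.1, 4.0]])

uv = project_points(roots, K, R, t)             # 2d root locations in this view
heatmap = render_gaussian_heatmap(uv, 48, 64)   # root heatmap for this view
```

Both functions use only smooth arithmetic, so the same steps written in an automatic-differentiation framework would let gradients flow from a heatmap loss back to the 3d points, which is the property the abstract refers to as end-to-end differentiable rendering. In a multi-view setup the projection would be repeated once per camera with that camera's own `K`, `R`, and `t`.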
Authors: Vinkle Srivastav, Keqi Chen, Nicolas Padoy