NToP: NeRF-Powered Large-scale Dataset Generation for 2D and 3D Human Pose Estimation in Top-View Fisheye Images (2402.18196v2)
Abstract: Human pose estimation (HPE) from top-view fisheye cameras is a promising and innovative application domain. However, datasets capturing this viewpoint are extremely scarce, especially those with high-quality 2D and 3D keypoint annotations. To address this gap, we leverage the Neural Radiance Fields (NeRF) technique to establish a comprehensive pipeline that generates human pose datasets from existing 2D and 3D datasets, specifically tailored to the top-view fisheye perspective. Using this pipeline, we create a novel dataset, NToP570K (NeRF-powered Top-view human Pose dataset for fisheye cameras, with over 570 thousand images), and extensively evaluate its efficacy in enhancing neural networks for 2D and 3D top-view human pose estimation. After finetuning on our training set, a pretrained ViTPose-B model achieves an improvement in AP of 33.3% on our validation set for 2D HPE. A similarly finetuned HybrIK-Transformer model achieves a 53.7 mm reduction in PA-MPJPE for 3D HPE on the validation set.
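The 3D result above is reported in PA-MPJPE (Procrustes-aligned mean per-joint position error). As a rough illustration of what that metric measures, here is a minimal sketch that assumes the commonly used definition: the predicted joints are aligned to the ground truth with the best-fitting similarity transform (scale, rotation, translation), then the per-joint Euclidean distances are averaged. The function name `pa_mpjpe` and the 17-joint toy data are illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np


def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error.

    pred, gt: arrays of shape (J, 3) in the same unit (e.g. mm).
    Returns the mean joint distance after similarity alignment of pred to gt.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Centre both joint sets so that only rotation and scale remain to be solved.
    mu_pred = pred.mean(axis=0)
    mu_gt = gt.mean(axis=0)
    p = pred - mu_pred
    g = gt - mu_gt

    # Optimal rotation from the SVD of the 3x3 cross-covariance (Kabsch/Umeyama).
    u, s, vt = np.linalg.svd(p.T @ g)
    d = -1.0 if np.linalg.det(vt.T @ u.T) < 0 else 1.0  # avoid reflections
    correction = np.diag([1.0, 1.0, d])
    rotation = vt.T @ correction @ u.T

    # Optimal isotropic scale for the chosen rotation.
    scale = (s * np.diag(correction)).sum() / (p ** 2).sum()

    aligned = scale * p @ rotation.T + mu_gt
    return np.linalg.norm(aligned - gt, axis=1).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(17, 3))                      # toy skeleton, 17 joints
    noisy = gt + rng.normal(scale=0.05, size=gt.shape)  # toy prediction
    print(pa_mpjpe(noisy, gt))
```

Because the alignment removes global rotation, scale, and translation, the metric reflects errors in the articulated pose itself rather than in camera-relative placement.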