
Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction (2312.17106v1)

Published 28 Dec 2023 in cs.CV

Abstract: We address the challenges of estimating 3D human poses from multiple views under occlusion and with limited overlap between views. We approach multi-view, single-person 3D human pose reconstruction as a regression problem and propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences. The encoder refines 2D skeleton joints detected across different views and times, fusing multi-view and temporal information through global self-attention. We enhance the encoder by incorporating a geometry-biased attention mechanism that leverages geometric relationships between views. Additionally, we use the detection scores provided by the 2D pose detector to further guide the encoder's attention according to the reliability of the 2D detections. The decoder then regresses the 3D pose sequence from these refined tokens, using pre-defined queries for each joint. To improve generalization to unseen scenes and resilience to missing joints, we apply scene centering, synthetic views, and token dropout. We conduct extensive experiments on three public benchmark datasets: Human3.6M, CMU Panoptic, and Occlusion-Persons. Our results demonstrate the efficacy of our approach, particularly in occluded scenes and when few views are available, which are traditionally challenging scenarios for triangulation-based methods.
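The abstract describes attention that is biased both by the geometric relationship between views and by 2D detection scores. The sketch below illustrates one way such a bias could enter scaled dot-product attention. It is not the authors' implementation: the function name `biased_attention`, the negative squared-distance form of the geometric bias, the log-confidence term, and the parameters `gamma` and `eps` are all assumptions made for illustration. The input `pair_dist` stands in for whatever pairwise geometric distance between tokens the method uses (for example, a distance between camera rays).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(q, k, v, pair_dist, conf, gamma=1.0, eps=1e-6):
    """Single-head attention with additive geometry and confidence biases.

    q, k, v   : (n, d) arrays, one row per 2D joint token.
    pair_dist : (n, n) array of pairwise geometric distances (assumed input).
    conf      : (n,) array of detector scores in [0, 1], one per key token.
    gamma     : hypothetical scale of the geometric penalty.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                     # standard scaled dot product
    logits = logits - (gamma ** 2) * pair_dist ** 2   # far-apart tokens attend less
    logits = logits + np.log(conf + eps)[None, :]     # unreliable keys are down-weighted
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

# Small example with random tokens and one undetected joint (score 0).
rng = np.random.default_rng(0)
n, d = 6, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
points = rng.normal(size=(n, 3))  # placeholder geometry for the example
pair_dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
conf = np.array([0.9, 0.8, 0.95, 0.0, 0.7, 0.85])
out, weights = biased_attention(q, k, v, pair_dist, conf)
print(out.shape, weights.sum(axis=-1))
```

Because both biases are added to the logits before the softmax, setting `pair_dist` to zero and `conf` to one recovers ordinary attention, which makes the effect of each term easy to ablate in this sketch.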
