
Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption (2404.11291v1)

Published 17 Apr 2024 in cs.CV

Abstract: Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration, but overlook the modeling of close interactions. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. The main challenge of this task is the lack of visual information caused by depth ambiguity and severe inter-person occlusion. To address this, we propose to leverage knowledge from proxemic behavior and physics to compensate for the missing visual information. This is based on the observation that human interaction follows specific patterns governed by social proxemics. Specifically, we first design a latent representation based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) to model human interaction. A proxemics- and physics-guided diffusion model is then introduced to denoise the initial distribution. We design the diffusion model with two branches, each representing one individual, so that the interaction can be modeled via cross-attention. With the learned VQ-VAE priors and physical constraints as additional information, our approach is capable of estimating accurate poses that are also plausible in terms of proxemics and physics. Experimental results on Hi4D, 3DPW, and CHI3D demonstrate that our method outperforms existing approaches. The code is available at https://github.com/boycehbz/HumanInteraction.
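
The dual-branch idea in the abstract, where each person has their own feature stream and the streams exchange information through cross-attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attention`, `dual_branch_step`), the feature sizes, the single attention head, the shared projection weights, and the residual update are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Queries come from one person's branch; keys and values come
    # from the other person's branch (single head, no masking).
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def dual_branch_step(feat_a, feat_b, params):
    # Hypothetical update: each branch attends to the other and adds
    # the result residually. Sharing `params` across both branches is
    # an assumption of this sketch, not something the abstract states.
    upd_a, _ = cross_attention(feat_a, feat_b, *params)
    upd_b, _ = cross_attention(feat_b, feat_a, *params)
    return feat_a + upd_a, feat_b + upd_b

rng = np.random.default_rng(0)
T, D = 16, 32  # assumed number of frames and feature width
feat_a = rng.normal(size=(T, D))  # per-frame features of person A
feat_b = rng.normal(size=(T, D))  # per-frame features of person B
params = tuple(rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))

out_a, out_b = dual_branch_step(feat_a, feat_b, params)
print(out_a.shape, out_b.shape)
```

Because the two branches are treated symmetrically in this sketch, swapping the two inputs simply swaps the two outputs; a real model would typically add multiple heads, normalization, and conditioning on image features and the diffusion timestep.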

Authors (6)
  1. Buzhen Huang
  2. Chen Li
  3. Chongyang Xu
  4. Liang Pan
  5. Yangang Wang
  6. Gim Hee Lee
