DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos (2303.13397v5)

Published 23 Mar 2023 in cs.CV, cs.AI, cs.HC, and cs.MM

Abstract: Human mesh recovery (HMR) provides rich human body information for various real-world applications. While image-based HMR methods have achieved impressive results, they often struggle to recover humans in dynamic scenarios, producing temporally inconsistent and non-smooth 3D motion predictions because they do not model human motion. In contrast, video-based approaches leverage temporal information to mitigate this issue. In this paper, we present DiffMesh, an innovative motion-aware diffusion-like framework for video-based HMR. DiffMesh establishes a bridge between diffusion models and human motion, efficiently generating accurate and smooth output mesh sequences by incorporating human motion within the forward and reverse processes of the diffusion model. Extensive experiments on the widely used Human3.6M [15] and 3DPW [48] datasets demonstrate the effectiveness and efficiency of DiffMesh. Visual comparisons in real-world scenarios further highlight DiffMesh's suitability for practical applications.

References (64)
  1. Posetrack: A benchmark for human pose estimation and tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5167–5176, 2018.
  2. Articulated 3d head avatar generation using text-to-image diffusion models. arXiv preprint arXiv:2307.04859, 2023.
  3. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
  4. Diffusiondet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19830–19843, 2023.
  5. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In ECCV, 2020.
  6. Beyond static features for temporally consistent 3d human pose and shape from a video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1964–1973, 2021a.
  7. Ilvr: Conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938, 2021b.
  8. Diffusion models in vision: A survey. arXiv preprint arXiv:2209.04747, 2022.
  9. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  10. Diffpose: Spatiotemporal diffusion model for video-based human pose estimation. arXiv preprint arXiv:2307.16687, 2023.
  11. Distribution-aligned diffusion for human mesh recovery. arXiv preprint arXiv:2308.13369, 2023.
  12. Diffpose: Toward more reliable 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13041–13051, 2023.
  13. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  14. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  15. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
  16. End-to-end recovery of human shape and pose. In CVPR, 2018.
  17. Learning 3d human dynamics from video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5614–5623, 2019.
  18. Flame: Free-form language-based motion synthesis & editing. arXiv preprint arXiv:2209.00349, 2022.
  19. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  20. Vibe: Video inference for human body pose and shape estimation. In CVPR, 2020.
  21. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2252–2261, 2019a.
  22. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, 2019b.
  23. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3383–3393, 2021a.
  24. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3383–3393, 2021b.
  25. D&d: Learning human dynamics from dynamic camera. In European Conference on Computer Vision, 2022.
  26. Niki: Neural inverse kinematics with invertible neural networks for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12933–12942, 2023a.
  27. Hybrik-x: Hybrid analytical-neural inverse kinematics for whole-body mesh recovery. arXiv preprint arXiv:2304.05690, 2023b.
  28. Ego-body pose estimation via ego-head pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17142–17151, 2023c.
  29. One-stage 3d whole-body mesh recovery with component aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21159–21168, 2023.
  30. End-to-end human pose and mesh reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1954–1963, 2021.
  31. SMPL: A skinned multi-person linear model. ACM TOG, 2015.
  32. 3d human motion estimation via motion compression and refinement. In Proceedings of the Asian Conference on Computer Vision, 2020.
  33. Amass: Archive of motion capture as surface shapes. In ICCV, 2019.
  34. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3D Vision (3DV), 2017 Fifth International Conference on. IEEE, 2017.
  35. Avatarstudio: Text-driven editing of 3d dynamic human head avatars. arXiv e-prints, pages arXiv–2306, 2023.
  36. Automatic differentiation in pytorch. 2017.
  37. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  38. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  39. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  40. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
  41. Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. arXiv preprint arXiv:2303.11579, 2023.
  42. Global-to-local modeling for video-based 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8887–8896, 2023.
  43. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  44. Human mesh recovery from monocular images via a skeleton-disentangled representation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5349–5358, 2019.
  45. Human motion diffusion model. 2023.
  46. Recovering 3d human mesh from monocular images: A survey. arXiv preprint arXiv:2203.01923, 2022.
  47. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  48. Recovering accurate 3d human pose in the wild using imus and a moving camera. In European Conference on Computer Vision (ECCV), 2018.
  49. Encoder-decoder with multi-level attention for 3d human shape and pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13033–13042, 2021.
  50. Capturing humans in motion: temporal-attentive 3d human pose and shape estimation from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13211–13220, 2022.
  51. Diffusion models: A comprehensive survey of methods and applications. arXiv preprint arXiv:2209.00796, 2022.
  52. Capturing the motion of every joint: 3d human pose and shape estimation with independent tokens. arXiv preprint arXiv:2303.00298, 2023.
  53. Gator: Graph-aware transformer with motion-disentangled regression for human mesh recovery from a 2d pose. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023a.
  54. Co-evolution of pose and mesh for 3d human body estimation from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023b.
  55. Deciwatch: A simple baseline for 10× efficient 2d and 3d pose estimation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part V, pages 607–624. Springer, 2022a.
  56. Smoothnet: a plug-and-play network for refining human poses in videos. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part V, pages 625–642. Springer, 2022b.
  57. Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In Proceedings of the IEEE International Conference on Computer Vision, 2021.
  58. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022.
  59. Modiff: Action-conditioned 3d motion generation with denoising diffusion probabilistic models. arXiv preprint arXiv:2301.03949, 2023.
  60. A lightweight graph transformer network for human mesh reconstruction from 2d human pose. arXiv preprint arXiv:2111.12696, 2021.
  61. Potter: Pooling attention transformer for efficient human mesh recovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1611–1620, 2023a.
  62. Feater: An efficient network for human reconstruction via feature map-based transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
  63. Diff3dhpe: A diffusion model for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2092–2102, 2023.
  64. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
