
FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations (2211.14309v3)

Published 25 Nov 2022 in cs.CV and cs.LG

Abstract: We present a generative approach to forecast long-term future human behavior in 3D, requiring only weak supervision from readily available 2D human action data. This is a fundamental task enabling many downstream applications. The required ground-truth data is hard to capture in 3D (mocap suits, expensive setups) but easy to acquire in 2D (simple RGB cameras). Thus, we design our method to only require 2D RGB data at inference time while being able to generate 3D human motion sequences. We use a differentiable 2D projection scheme in an autoregressive manner for weak supervision, and an adversarial loss for 3D regularization. Our method predicts long and complex human behavior sequences (e.g., cooking, assembly) consisting of multiple sub-actions. We tackle this in a semantically hierarchical manner, jointly predicting high-level coarse action labels together with their low-level fine-grained realizations as characteristic 3D human poses. We observe that these two action representations are coupled in nature, and joint prediction benefits both action and pose forecasting. Our experiments demonstrate the complementary nature of joint action and 3D pose prediction: our joint approach outperforms each task treated individually, enables robust longer-term sequence prediction, and improves over alternative approaches to forecast actions and characteristic 3D poses.
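The weak-supervision idea in the abstract (predict in 3D, compare in 2D) can be illustrated with a minimal sketch. This is not the authors' implementation: the pinhole camera model, the `focal_length` parameter, and the function names `project_to_2d` and `reprojection_loss` are assumptions chosen for illustration. NumPy is used for clarity; in an automatic-differentiation framework the same arithmetic would be differentiable with respect to the predicted 3D joints.

```python
import numpy as np

def project_to_2d(joints_3d, focal_length=1.0):
    """Pinhole projection (an assumed camera model) of 3D joints.

    joints_3d: array of shape (..., J, 3) in camera coordinates, with z > 0.
    Returns an array of shape (..., J, 2).
    """
    joints_3d = np.asarray(joints_3d, dtype=float)
    xy = joints_3d[..., :2]
    z = joints_3d[..., 2:3]
    return focal_length * xy / z

def reprojection_loss(pred_3d, target_2d, focal_length=1.0):
    """Mean squared 2D distance between projected 3D joints and 2D targets."""
    pred_2d = project_to_2d(pred_3d, focal_length)
    target_2d = np.asarray(target_2d, dtype=float)
    return float(np.mean(np.sum((pred_2d - target_2d) ** 2, axis=-1)))

# Hypothetical three-joint pose in camera coordinates.
pose_3d = np.array([[0.0, 0.0, 2.0],
                    [0.5, -0.5, 2.0],
                    [1.0, 1.0, 4.0]])
pose_2d = project_to_2d(pose_3d)

# A uniformly scaled copy of the pose projects to the same 2D points,
# so the 2D loss alone cannot tell the two 3D poses apart.
loss_scaled = reprojection_loss(2.0 * pose_3d, pose_2d)

# A sideways shift changes the projection and is penalized.
loss_shifted = reprojection_loss(pose_3d + np.array([0.2, 0.0, 0.0]), pose_2d)
```

In this sketch the scaled pose has zero 2D loss even though it differs in 3D. That depth ambiguity is one reason a 2D-only objective is usually paired with a 3D prior; the abstract names an adversarial loss for this regularization role.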

Authors (3)
  1. Christian Diller (4 papers)
  2. Thomas Funkhouser (66 papers)
  3. Angela Dai (84 papers)
Citations (2)
