
Generative Hierarchical Temporal Transformer for Hand Pose and Action Modeling (2311.17366v3)

Published 29 Nov 2023 in cs.CV

Abstract: We present a novel unified framework that concurrently tackles recognition and future prediction for human hand pose and action modeling. Previous works generally provide isolated solutions for either recognition or prediction, which not only increases the complexity of integration in practical applications but, more importantly, fails to exploit the synergy between the two tasks and yields suboptimal performance in each. To address this problem, we propose a generative Transformer VAE architecture to model hand pose and action, where the encoder and decoder capture recognition and prediction respectively, and their connection through the VAE bottleneck mandates the learning of consistent hand motion from the past to the future and vice versa. Furthermore, to faithfully model the semantic dependency and different temporal granularity of hand pose and action, we decompose the framework into two cascaded VAE blocks: the first and second blocks model short-span poses and long-span actions respectively, and are connected by a mid-level feature representing a sub-second series of hand poses. This decomposition into block cascades facilitates capturing both short-term and long-term temporal regularity in pose and action modeling, and enables training the two blocks separately to fully utilize datasets with annotations of different temporal granularities. We train and evaluate our framework across multiple datasets; results show that our joint modeling of recognition and prediction improves over isolated solutions, and that our semantic and temporal hierarchy facilitates long-term pose and action modeling.
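
To make the "VAE bottleneck" and the short-span/long-span split in the abstract more concrete, here is a minimal sketch in plain Python. It shows only the generic VAE building blocks (the reparameterized sample and the KL term from standard variational auto-encoders) plus a hypothetical helper that groups frames into short clips. It is not the paper's architecture: the function names, the clip length, and the list-based vectors are all assumptions made for illustration, and the Transformer encoder and decoder are omitted entirely.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension.

    mu and log_var are plain lists of floats standing in for encoder outputs.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def split_into_clips(frames, clip_len):
    """Hypothetical helper: group a frame sequence into consecutive short clips.

    In a two-level hierarchy, each clip would feed a short-span block and the
    resulting per-clip features would feed a long-span block.
    The last clip may be shorter than clip_len.
    """
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


if __name__ == "__main__":
    rng = random.Random(0)
    z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)
    print(len(z), kl_to_standard_normal([0.0, 1.0], [0.0, -2.0]))
    print(split_into_clips(list(range(7)), 3))
```

In a full model, the KL term would be added to a reconstruction or prediction loss, and the sampled `z` would be passed to a decoder; both are left out here to keep the sketch self-contained.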
