HumMUSS: Human Motion Understanding using State Space Models (2404.10880v1)
Abstract: Understanding human motion from video is essential for a range of applications, including pose estimation, mesh recovery and action recognition. While state-of-the-art methods predominantly rely on transformer-based architectures, these approaches have limitations in practical scenarios. Transformers are slower when predicting sequentially on a continuous stream of frames in real time, and they do not generalize to new frame rates. In light of these constraints, we propose a novel attention-free spatiotemporal model for human motion understanding that builds upon recent advancements in state space models. Our model not only matches the performance of transformer-based models in various motion understanding tasks but also brings added benefits, such as adaptability to different video frame rates and faster training when working with longer sequences of keypoints. Moreover, the proposed model supports both offline and real-time applications. For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches while maintaining their high accuracy.
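The abstract does not spell out the mechanism behind these properties, so the following is only a minimal sketch of how a generic state space layer can behave this way, not the HumMUSS architecture itself. It assumes a real-valued diagonal state matrix and zero-order-hold discretization; the names `discretize`, `run_recurrent`, `run_convolution` and the parameters `a`, `b`, `c` are hypothetical. The same continuous-time parameters can be run either as a convolution over a whole clip (offline) or as a recurrence with a fixed-size state (streaming), and a change of frame rate is handled by changing only the step size `dt`.

```python
import numpy as np

def discretize(a, b, dt):
    """Zero-order-hold discretization of x'(t) = a*x(t) + b*u(t) (diagonal a)."""
    a_bar = np.exp(a * dt)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def run_recurrent(a, b, c, u, dt):
    """Streaming mode: one sample at a time, memory is a fixed-size state."""
    a_bar, b_bar = discretize(a, b, dt)
    x = np.zeros_like(a)
    ys = []
    for u_k in u:
        x = a_bar * x + b_bar * u_k   # state update
        ys.append(np.sum(c * x))      # readout
    return np.array(ys)

def run_convolution(a, b, c, u, dt):
    """Offline mode: the same model applied as one causal convolution."""
    a_bar, b_bar = discretize(a, b, dt)
    length = len(u)
    # kernel[k] = sum_n c_n * a_bar_n**k * b_bar_n
    powers = a_bar[None, :] ** np.arange(length)[:, None]
    kernel = powers @ (c * b_bar)
    return np.convolve(u, kernel)[:length]

# Hypothetical parameters: four stable (negative) modes.
a = -np.array([1.0, 2.0, 4.0, 8.0])
b = np.ones(4)
c = np.full(4, 0.25)

def signal(t):
    """A slow 0.5 Hz test signal standing in for one keypoint coordinate."""
    return np.sin(np.pi * t)

# Six seconds of the same motion sampled at 30 fps and at 60 fps.
u30 = signal(np.arange(180) / 30.0)
u60 = signal(np.arange(360) / 60.0)

y30 = run_recurrent(a, b, c, u30, dt=1 / 30)
y60 = run_recurrent(a, b, c, u60, dt=1 / 60)        # step size matches frame rate
y60_wrong = run_recurrent(a, b, c, u60, dt=1 / 30)  # step size left unchanged

# Outputs at index 2k+1 of the 60 fps run share a timestamp with index k at 30 fps.
err_matched = np.max(np.abs(y60[1::2] - y30))
err_mismatched = np.max(np.abs(y60_wrong[1::2] - y30))
```

In this sketch, `err_matched` stays small because only `dt` changed between the two runs, while `err_mismatched` is larger because the model is told the wrong sampling interval. The recurrent and convolutional paths compute the same outputs, which is what allows a single set of parameters to serve both offline and real-time use under these assumptions.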
Authors: Arnab Kumar Mondal, Stefano Alletto, Denis Tome