- The paper presents the Strided Transformer Encoder (STE), which replaces the fully-connected layers in the Transformer's feed-forward network with strided convolutions to progressively reduce sequence length and capture local temporal dependencies.
- It employs a full-to-single supervision scheme that enforces both temporal smoothness and precision, achieving state-of-the-art results on benchmark datasets.
- Empirical results on Human3.6M and HumanEva-I datasets demonstrate reduced MPJPE and P-MPJPE metrics with fewer parameters, highlighting its potential for real-time applications.
Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation
The paper titled "Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation" introduces a Transformer-based approach to 3D human pose estimation from video. It addresses the challenge of efficiently lifting redundant 2D joint sequences to accurate 3D poses by introducing the Strided Transformer model.
Key Contributions
- Strided Transformer Encoder (STE):
- The proposed STE modifies the Vanilla Transformer Encoder (VTE) by replacing the fully-connected layers in the feed-forward network with strided convolutions. This adjustment progressively reduces the sequence length and enhances the model's ability to aggregate information from local temporal contexts, yielding significant computational savings.
- This approach allows the model to efficiently condense a long sequence into a single-vector representation while modeling both global and local dependencies in a hierarchical manner.
- Full-to-Single Supervision Scheme:
- A dual-scale supervision mechanism is introduced: supervision at the full-sequence scale enforces temporal smoothness, while supervision at the single target-frame scale sharpens the accuracy of the estimated 3D pose.
- This scheme provides robust temporal constraints, refining the model's ability to output smoother and more accurate 3D poses.
- Empirical Evidence and Results:
- The authors demonstrate the effectiveness of their method with state-of-the-art results on the Human3.6M and HumanEva-I datasets. The Strided Transformer achieves these results with fewer parameters compared to existing methods.
- Notable reductions in both MPJPE (Mean Per-Joint Position Error) and P-MPJPE (Procrustes-aligned MPJPE) are reported relative to baseline methods, showcasing the model's proficiency in capturing and utilizing temporal contexts.
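The STE idea described above can be sketched as a single encoder layer whose feed-forward network is built from temporal convolutions, with the second convolution strided so the layer halves the sequence length. This is a minimal illustrative sketch in PyTorch under assumed dimensions and layer choices, not the authors' implementation:

```python
import torch
import torch.nn as nn


class StridedEncoderLayer(nn.Module):
    """Transformer encoder layer whose feed-forward network uses temporal
    convolutions, the second of which is strided so the layer shrinks the
    sequence (illustrative sketch, not the paper's exact architecture)."""

    def __init__(self, d_model=256, n_heads=8, d_ff=512, stride=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # FFN as 1-D convolutions over time; the strided conv pools
        # local temporal context while reducing the sequence length.
        self.conv1 = nn.Conv1d(d_model, d_ff, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_ff, d_model, kernel_size=3,
                               stride=stride, padding=1)
        # Downsample the residual branch to the same reduced length.
        self.pool = nn.MaxPool1d(stride, stride, ceil_mode=True)

    def forward(self, x):                         # x: (B, T, d_model)
        a, _ = self.attn(x, x, x)                 # global self-attention
        x = self.norm1(x + a)
        y = x.transpose(1, 2)                     # (B, d_model, T)
        y = self.conv2(torch.relu(self.conv1(y)))
        r = self.pool(x.transpose(1, 2))          # strided residual path
        return self.norm2((r + y).transpose(1, 2))  # (B, ~T/stride, d_model)
```

Stacking a few such layers condenses a long 2D-pose sequence into a short (ultimately single-frame) representation, which matches the hierarchical full-to-single reduction the paper describes.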
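The full-to-single supervision scheme can be illustrated as a two-term objective: one L2 (MPJPE-style) loss over the full predicted sequence for smoothness, plus one on the single target frame for precision. The function below is a hedged sketch; the weights `w_full` and `w_single` and the choice of the centre frame as the target are illustrative assumptions:

```python
import torch


def full_to_single_loss(pred_seq, pred_single, gt_seq,
                        w_full=1.0, w_single=1.0):
    """Dual-scale supervision sketch (not the authors' exact loss).

    pred_seq:    (B, T, J, 3) 3D poses predicted for the whole sequence
    pred_single: (B, J, 3)    3D pose predicted for the target frame
    gt_seq:      (B, T, J, 3) ground-truth 3D poses
    """
    # Full-sequence term: per-joint Euclidean error over every frame,
    # encouraging temporally smooth predictions.
    loss_full = torch.norm(pred_seq - gt_seq, dim=-1).mean()
    # Single-frame term: error on the centre (target) frame only,
    # sharpening the final estimate.
    target = gt_seq[:, gt_seq.shape[1] // 2]
    loss_single = torch.norm(pred_single - target, dim=-1).mean()
    return w_full * loss_full + w_single * loss_single
```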
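For reference, the two evaluation metrics used above are standard in 3D pose estimation and can be computed as follows: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and P-MPJPE is the same error after a rigid Procrustes alignment (removing translation, scale, and rotation):

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: mean Euclidean distance between
    predicted and ground-truth joints. pred, gt: (J, 3), typically in mm."""
    return np.linalg.norm(pred - gt, axis=-1).mean()


def p_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: translation, scale, and rotation
    are removed before measuring the residual error."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the cross-covariance matrix (Kabsch).
    U, s, Vt = np.linalg.svd(p.T @ g)
    if np.linalg.det(Vt.T @ U.T) < 0:   # correct an improper reflection
        Vt[-1] *= -1
        s[-1] *= -1
    R = Vt.T @ U.T
    scale = s.sum() / (p ** 2).sum()    # optimal uniform scale
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```

Because P-MPJPE discounts global orientation and scale errors, it is usually lower than MPJPE and isolates the quality of the pose's internal structure.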
Theoretical and Practical Implications
The paper has considerable implications for modeling long-range dependencies in video-based 3D pose estimation. Integrating strided convolutions within the Transformer architecture presents a promising direction for reducing computational overhead while maintaining, or even improving, accuracy.
Practically, this model's reduced parameter count combined with high accuracy suggests potential applications in real-time systems where computational efficiency is critical. Additionally, the full-to-single supervision scheme could be adapted to other tasks involving temporal sequences, such as action recognition or video surveillance.
Future Directions
Future explorations might include further refinement of attention mechanisms within the Transformer to enhance efficiency without compromising accuracy. Additionally, extending the model to multi-view or multi-person scenarios would be a logical progression, potentially broadening its applicability in real-world settings.
Overall, this paper offers a substantial contribution to the field of 3D human pose estimation, presenting an effective method for exploiting temporal context and demonstrating significant potential for future research and application developments.