
STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion (2401.01730v1)

Published 3 Jan 2024 in cs.CV

Abstract: The recovery of 3D human mesh from monocular images has advanced significantly in recent years. However, existing models usually ignore spatial and temporal information, which may lead to mesh-image misalignment and temporal discontinuity. For this reason, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence clues from human motion via an attention-based Temporal Coherence Fusion Module (TCFM). As for spatial mesh-alignment evidence, we extract fine-grained local information through predicted mesh projection on the feature maps. Based on the spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition, we propose an Average Pooling Module (APM) to let the model focus on the entire input sequence rather than just the target frame, which remarkably improves the smoothness of recovery results from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF, which achieves a state-of-the-art trade-off between precision and smoothness. Our code and more video results are available on the project page: https://yw0208.github.io/staf/


Summary

  • The paper demonstrates that the STAF model significantly improves video-based 3D human mesh recovery by integrating spatio-temporal alignment.
  • The methodology employs three key modules—TCFM, SAFM, and APM—to address spatial misalignment and temporal discontinuity in mesh recovery.
  • Experimental results on benchmarks including 3DPW, MPII3D, and Human3.6M show superior performance compared to prior state-of-the-art models.

STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion

In recent years, the recovery of 3D human mesh from monocular images has advanced significantly. The paper "STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion" targets two persistent weaknesses of existing models: misalignment between the recovered mesh and the image, and temporal discontinuity across frames. The proposed Spatio-Temporal Alignment Fusion (STAF) model leverages attention-based mechanisms to enhance coherence and alignment across video frames, achieving a state-of-the-art trade-off between precision and smoothness.

Problem Statement and Motivation

Video-based human mesh recovery holds considerable promise for applications such as motion monitoring, virtual try-on, and VR. Despite recent progress, existing models often suffer from misalignment between the recovered mesh and the underlying image, as well as temporal discontinuity between frames. These shortcomings limit the practical usability of such models, particularly in time-sensitive applications. The paper addresses both problems by embedding spatio-temporal coherence into the recovery pipeline.

Methodology

The core contributions of this paper are encapsulated in the Spatio-Temporal Alignment Fusion (STAF) model. The methodology comprises three key components: the Temporal Coherence Fusion Module (TCFM), the Spatial Alignment Fusion Module (SAFM), and the Average Pooling Module (APM).

Temporal Coherence Fusion Module (TCFM): This module enhances the model's ability to capture long-range temporal dependencies without sacrificing the spatial coherence of the features. Unlike conventional approaches that struggle with long-range dependencies, TCFM employs a self-attention mechanism, supplemented by an additional self-similarity matrix. This matrix guides the encoding process, preserving more accurate temporal correlations.
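
The paper's exact formulation is more involved; the following PyTorch fragment is only a minimal sketch of the underlying idea, namely temporal self-attention whose logits are biased by a self-similarity matrix over frame features. The function name and the cosine-similarity form of the bias are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tcfm_attention(feats):
    # feats: (T, C) per-frame features for a clip of T frames.
    q = k = v = feats                              # shared projections for brevity
    logits = q @ k.t() / feats.shape[-1] ** 0.5    # (T, T) scaled dot-product logits
    # Self-similarity matrix (illustrative): cosine similarity between frames.
    sim = F.cosine_similarity(feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)
    weights = F.softmax(logits + sim, dim=-1)      # similarity-guided attention weights
    return weights @ v                             # temporally fused features, (T, C)

fused = tcfm_attention(torch.randn(16, 2048))      # e.g. 16 frames of 2048-d features
```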

Spatial Alignment Fusion Module (SAFM): The SAFM enhances the spatial feature representation of each target frame through a multi-stage adjacent feature fusion mechanism. By incorporating human spatial information extracted via projection sampling of the initial meshes onto the feature maps, the module effectively refines the mesh-alignment cues.
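
The projection-sampling step can be pictured with PyTorch's grid_sample: project the current mesh estimate into the image plane and read off features at the projected vertex locations, yielding per-vertex alignment evidence. This sketch assumes normalized 2D vertex coordinates and illustrates the mechanics rather than reproducing SAFM itself.

```python
import torch
import torch.nn.functional as F

def sample_mesh_features(feat_map, verts_2d):
    # feat_map: (1, C, H, W) spatial feature map of one frame.
    # verts_2d: (N, 2) projected mesh vertices, normalized to [-1, 1]
    #           as grid_sample expects.
    grid = verts_2d.view(1, 1, -1, 2)                  # (1, 1, N, 2) sampling grid
    sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (1, C, 1, N)
    return sampled.view(feat_map.shape[1], -1).t()     # (N, C) per-vertex features

feats = sample_mesh_features(torch.randn(1, 256, 56, 56),
                             torch.rand(6890, 2) * 2 - 1)  # 6890 SMPL vertices
```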

Average Pooling Module (APM): To address temporal discontinuity, the APM reduces the target frame's over-reliance on positional information by pooling features across the entire input sequence. This not only significantly enhances smoothness, but also improves the overall robustness and precision of the recovered meshes.
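
In its simplest reading, the APM averages frame features over the whole clip and fuses the result with the target frame's own feature. The sketch below captures that idea; the concatenation-based fusion is a hypothetical choice for illustration.

```python
import torch

def apm_fuse(frame_feats, target_idx):
    # frame_feats: (T, C) features for every frame of the input clip.
    context = frame_feats.mean(dim=0)            # (C,) clip-level average-pooled context
    target = frame_feats[target_idx]             # (C,) target-frame feature
    return torch.cat([target, context], dim=-1)  # (2C,) fused representation

fused = apm_fuse(torch.randn(16, 2048), target_idx=8)
```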

Experimental Evaluation

The experimental validation of STAF was conducted on three standard benchmark datasets: 3DPW, MPII3D, and Human3.6M. Compared to state-of-the-art models such as VIBE, TCMR, and MPS-Net, STAF demonstrated superior performance in terms of PA-MPJPE (Procrustes-aligned mean per-joint position error), MPJPE, and PVE (per-vertex error), while achieving a better trade-off between precision and smoothness.
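
For reference, the reported metrics follow standard definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, PA-MPJPE applies a rigid Procrustes alignment before measuring that distance, and the acceleration error compares second-order finite differences of the joint trajectories. A minimal NumPy sketch of two of these:

```python
import numpy as np

def mpjpe(pred, gt):
    # pred, gt: (T, J, 3) 3D joint positions in mm.
    # Mean per-joint position error: average Euclidean joint distance.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def accel_error(pred, gt):
    # Second-order finite differences approximate per-joint acceleration;
    # their mean discrepancy is the standard temporal-jitter metric.
    a_pred = pred[:-2] - 2 * pred[1:-1] + pred[2:]
    a_gt = gt[:-2] - 2 * gt[1:-1] + gt[2:]
    return np.linalg.norm(a_pred - a_gt, axis=-1).mean()
```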

Results on 3DPW and MPII3D: On 3DPW, STAF achieved a PA-MPJPE of 48.0 mm, an MPJPE of 80.6 mm, and a PVE of 95.3 mm, improving on previous models such as MPS-Net. Additionally, STAF's acceleration error of 8.2 mm/s² reflects a significant reduction in temporal jitter.

Results on Human3.6M: Evaluations on Human3.6M confirmed the robustness of STAF, with a PA-MPJPE of 44.5 mm and an MPJPE of 70.4 mm. Although the acceleration error was slightly higher than that of models like TCMR and MPS-Net, the precision metrics highlight the benefit of incorporating spatio-temporal alignment.

Implications and Future Work

The development of STAF provides a critical stepping stone in video-based human mesh recovery, addressing long-standing issues of temporal and spatial coherence. Practically, this can benefit applications requiring high precision and smoothness in human motion, such as VR, gaming, and surveillance systems.

Theoretically, the introduction of mechanisms like TCFM and SAFM paves the way for further research in integrating temporal and spatial data effectively. Future developments may explore the refinement of these modules or their application to other domains requiring spatio-temporal data processing. Exploring larger datasets and more diverse scenarios will also help generalize the approach and validate its applicability across various environments.

In conclusion, the STAF model presents a sophisticated and effective solution to the challenges in 3D human mesh recovery from video, demonstrating notable improvements in both precision and temporal smoothness. This work not only contributes to the immediate goals of human-centered computer vision but also opens avenues for future innovations in the field.
