SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations (2504.18332v1)

Published 25 Apr 2025 in cs.CV and cs.HC

Abstract: The growing applications of AR/VR increase the demand for real-time full-body pose estimation from Head-Mounted Displays (HMDs). Although HMDs provide joint signals from the head and hands, reconstructing a full-body pose remains challenging due to the unconstrained lower body. Recent advancements often rely on conventional neural networks and generative models to improve performance in this task, such as Transformers and diffusion models. However, these approaches struggle to strike a balance between achieving precise pose reconstruction and maintaining fast inference speed. To overcome these challenges, a lightweight and efficient model, SSD-Poser, is designed for robust full-body motion estimation from sparse observations. SSD-Poser incorporates a well-designed hybrid encoder, State Space Attention Encoders, to adapt the state space duality to complex motion poses and enable real-time realistic pose reconstruction. Moreover, a Frequency-Aware Decoder is introduced to mitigate jitter caused by variable-frequency motion signals, remarkably enhancing the motion smoothness. Comprehensive experiments on the AMASS dataset demonstrate that SSD-Poser achieves exceptional accuracy and computational efficiency, showing outstanding inference efficiency compared to state-of-the-art methods.

Summary

The paper introduces SSD-Poser, a novel model leveraging State Space Duality for real-time, accurate full-body avatar pose estimation using sparse signals from AR/VR HMDs.
SSD-Poser utilizes a State Space Attention Encoder (SSAE) for efficient spatiotemporal feature extraction and a Frequency-Aware Decoder (FAD) to minimize motion jitter from variable-frequency signals.
Evaluated on the AMASS dataset, SSD-Poser demonstrates superior accuracy and computational efficiency over existing methods, reducing MPJPE, MPJRE, and MPJVE, with significant implications for immersive AR/VR experiences.

Overview of SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations

The paper "SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations" addresses the challenges of real-time full-body pose estimation in augmented reality (AR) and virtual reality (VR) environments, using sparse signals from Head-Mounted Displays (HMDs). The authors propose a novel model, SSD-Poser, which leverages the State Space Duality (SSD) framework for efficient and accurate pose reconstruction.

The research identifies a central trade-off in existing pose estimation methods: achieving high inference speed while maintaining reconstruction accuracy. Existing approaches, often built on Transformers or on generative models such as diffusion models, tend to deliver either speed or precision, but not both. SSD-Poser aims to resolve this trade-off with a lightweight architecture designed for computational efficiency.

Key Components and Innovations

The SSD-Poser model introduces two main components:

State Space Attention Encoder (SSAE): This hybrid encoder combines the SSD framework with attention mechanisms from Transformers. The SSAE efficiently extracts dynamic spatiotemporal features while managing computational overhead, thus enabling real-time pose estimation without sacrificing accuracy.
Frequency-Aware Decoder (FAD): To address motion jitter caused by variable-frequency signals, the FAD processes these signals to produce smooth and realistic motion. The decoder integrates a Frequency-Aware Feature Extractor that separates low- and high-frequency features and refines the output to minimize jitter.
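As a rough illustration of the low/high-frequency separation described above, the Python sketch below splits one joint coordinate over time into a smooth component and a residual, then attenuates the residual. The moving-average filter, the function names `split_frequencies` and `damp_jitter`, and the `window` and `high_gain` parameters are all assumptions made for illustration. This summary does not say how the paper's Frequency-Aware Feature Extractor is implemented, and the actual model presumably works on learned features rather than a fixed filter.

```python
from typing import List, Tuple


def split_frequencies(signal: List[float], window: int = 5) -> Tuple[List[float], List[float]]:
    """Split a 1-D signal into a low-frequency part and a high-frequency residual.

    Illustrative stand-in only: a centered moving average acts as the low-pass
    filter, and the residual (signal minus average) is treated as high-frequency.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(signal)
    low = []
    for t in range(n):
        # The averaging window shrinks near the start and end of the sequence.
        start, stop = max(0, t - half), min(n, t + half + 1)
        low.append(sum(signal[start:stop]) / (stop - start))
    high = [x - l for x, l in zip(signal, low)]
    return low, high


def damp_jitter(signal: List[float], window: int = 5, high_gain: float = 0.2) -> List[float]:
    """Recombine the two bands with the high-frequency band scaled by high_gain."""
    low, high = split_frequencies(signal, window)
    return [l + high_gain * h for l, h in zip(low, high)]


# Hypothetical jittery coordinate of a single joint over eight frames.
jittery = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
smoothed = damp_jitter(jittery, window=3, high_gain=0.2)
```

With `high_gain=1.0` the input is returned unchanged, and with `high_gain=0.0` only the smooth component remains, so the parameter trades responsiveness against smoothness.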

Numerical Results

SSD-Poser was evaluated on the AMASS dataset, where it is reported to outperform existing methods in both accuracy and computational efficiency. In particular, it reduces the Mean Per Joint Position Error (MPJPE), Mean Per Joint Rotation Error (MPJRE), and Mean Per Joint Velocity Error (MPJVE). These three metrics measure reconstruction accuracy; the efficiency claim rests separately on inference speed.
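For readers unfamiliar with the three metrics named above, the following Python sketch shows one common way to compute them. It is not taken from the paper: the data layout (a sequence of frames, each a list of per-joint 3-D positions or 3x3 rotation matrices), the 60 fps default, the finite-difference velocity, and the geodesic rotation angle are assumptions, and the paper's exact conventions and units may differ.

```python
import math
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def mpjpe(pred: Sequence[Sequence[Vec3]], gt: Sequence[Sequence[Vec3]]) -> float:
    """Mean Euclidean distance between predicted and true joint positions."""
    dists = [math.dist(p, g) for pf, gf in zip(pred, gt) for p, g in zip(pf, gf)]
    return sum(dists) / len(dists)


def velocities(seq: Sequence[Sequence[Vec3]], fps: float) -> List[List[List[float]]]:
    """Finite-difference joint velocities between consecutive frames."""
    return [
        [[(b - a) * fps for a, b in zip(j0, j1)] for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def mpjve(pred: Sequence[Sequence[Vec3]], gt: Sequence[Sequence[Vec3]], fps: float = 60.0) -> float:
    """Mean distance between predicted and true joint velocities (fps is assumed)."""
    return mpjpe(velocities(pred, fps), velocities(gt, fps))


def rotation_angle_deg(r_pred: Mat3, r_gt: Mat3) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(r_pred^T @ r_gt) equals the sum of element-wise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.degrees(math.acos(cos_angle))


def mpjre(pred: Sequence[Sequence[Mat3]], gt: Sequence[Sequence[Mat3]]) -> float:
    """Mean per-joint rotation error in degrees (one possible convention)."""
    angles = [rotation_angle_deg(p, g) for pf, gf in zip(pred, gt) for p, g in zip(pf, gf)]
    return sum(angles) / len(angles)
```

Note that a prediction offset from the ground truth by a constant vector has a nonzero MPJPE but zero MPJVE, which is why the velocity metric is reported separately.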

Implications and Future Work

The implications of this work are significant for AR/VR environments, where realistic and responsive avatar interaction is critical. By providing a more efficient and accurate pose estimation method, SSD-Poser enhances the user's immersive experience.

In theoretical terms, this research opens avenues for further exploration of state space modeling in computer vision applications. Future work could refine the dual formulation used in SSD or integrate more advanced attention mechanisms to better capture long-range dependencies.
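The dual formulation mentioned above can be illustrated with a toy example. The Python sketch below takes a state space model with a single scalar state and computes its output in two ways: as a step-by-step recurrence, and as multiplication by a lower-triangular, attention-like matrix. It is a simplified illustration of the general duality idea, not the paper's encoder; the function names and the scalar parameters `a`, `b`, and `c` are assumptions chosen for brevity.

```python
from typing import List


def ssm_recurrent(x: List[float], a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Recurrent view: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h = 0.0
    y = []
    for x_t, a_t, b_t, c_t in zip(x, a, b, c):
        h = a_t * h + b_t * x_t
        y.append(c_t * h)
    return y


def ssm_matrix(x: List[float], a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Matrix view: y_t = sum over s <= t of c_t * (a_{s+1} * ... * a_t) * b_s * x_s."""
    n = len(x)
    y = []
    for t in range(n):
        total = 0.0
        for s in range(t + 1):
            decay = 1.0
            for k in range(s + 1, t + 1):
                decay *= a[k]  # accumulated state decay between steps s and t
            total += c[t] * decay * b[s] * x[s]
        y.append(total)
    return y


# Hypothetical inputs; both views produce the same outputs up to rounding.
x = [1.0, 2.0, 3.0, 4.0]
a = [0.9, 0.5, 0.8, 0.7]
b = [1.0, 0.5, 2.0, 1.0]
c = [1.0, 1.0, 0.5, 2.0]
y_rec = ssm_recurrent(x, a, b, c)
y_mat = ssm_matrix(x, a, b, c)
```

The recurrent view costs time linear in the sequence length, which suits streaming inference, while the matrix view exposes the same computation in a form that resembles causally masked attention.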

Additionally, the authors' successful deployment of the SSD framework in pose estimation suggests potential applications in other areas requiring real-time interaction, such as robotics or industrial automation.