- The paper introduces motion priors learned from Kinect data and modeled as Gaussian mixtures for robust, single-camera upper body pose estimation.
- It employs a recursive Bayesian filtering framework that fuses temporal information with simple head and hand detections to infer complete poses.
- The proposed method attains near-real-time efficiency and accuracy comparable to multi-camera systems, enhancing its applicability in dynamic environments.
Overview of "Single Camera Pose Estimation Using Bayesian Filtering and Kinect Motion Priors"
The paper by Burke and Lasenby addresses upper body pose estimation from monocular vision, a task that traditionally demands intricate models and extensive geometric constraints. The authors present a method that reduces computational complexity by using motion priors derived from Kinect sensor data. Gaussian mixture models (GMMs) represent probable human poses, enabling efficient pose estimation within a Bayesian filtering framework.
Methodology and Techniques
The proposed system integrates several advanced techniques and models:
- Pose Prior Modeling:
- The authors employ GMMs to encapsulate the distribution of human poses, learned from a dataset gathered with Kinect sensors. This captures how likely particular postures are and reduces the effective dimensionality of the pose estimation problem (a fitting sketch appears after this list).
- Bayesian Filtering Framework:
- A recursive Bayesian filtering strategy incorporates temporal information: pose estimates are updated recursively as measurements arrive, which significantly improves robustness in dynamic environments (the underlying recursion is written out after this list).
- Motion Modeling:
- Motion is modeled as a mixture of discrete Ornstein-Uhlenbeck processes: each state follows random walk dynamics but tends to drift toward frequently observed poses (see the single-step sketch after this list). This formulation keeps the filter computationally tractable while reflecting realistic biomechanical behavior.
- Observation and Transition Models:
- Only head and hand positions are measured, using simple detectors; these measurements are combined with the GMM-based transition model to infer joints that are not directly observed, such as the elbows and shoulders (a conditioning sketch is given after this list).
- Computational Efficiency:
- A significant contribution is the derivation and implementation of a mixture Kalman filter, which reduces computational cost dramatically compared with sampling-based methods such as particle filtering (a single filter step is sketched after this list).
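To make the pose prior concrete, the following is a minimal sketch of fitting a GMM to a set of Kinect-recorded upper body pose vectors. The file name, number of components, and pose encoding are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical dataset: each row is a flattened upper body pose vector
# (e.g., 3D positions of head, hands, elbows, shoulders) recorded with a Kinect.
poses = np.load("kinect_upper_body_poses.npy")   # shape (N, D); assumed file

# Fit a Gaussian mixture pose prior; the component count is a modeling choice.
pose_prior = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
pose_prior.fit(poses)

# The prior can now score how plausible a candidate pose is (log-likelihood).
log_prior = pose_prior.score_samples(poses[:1])
```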
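The recursive Bayesian filtering step referenced above follows the standard predict-update recursion; in the paper, the Gaussian mixture structure of the prior and transition model is what keeps this integral tractable.

```latex
p(x_t \mid y_{1:t}) \;\propto\; p(y_t \mid x_t)
\int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, \mathrm{d}x_{t-1}
```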
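A single step of the mean-reverting (Ornstein-Uhlenbeck-style) motion model can be sketched as below; the drift rate `theta`, noise scale `sigma`, and the name `mu_k` (the mean pose of the currently active mixture component) are illustrative assumptions.

```python
import numpy as np

def ou_step(x, mu_k, theta=0.1, sigma=0.01, rng=None):
    """One discrete Ornstein-Uhlenbeck step: a random walk on the pose vector x
    that drifts toward the component mean pose mu_k (illustrative parameters)."""
    if rng is None:
        rng = np.random.default_rng()
    drift = theta * (mu_k - x)                    # pull toward the frequently observed pose
    noise = sigma * rng.standard_normal(x.shape)  # random walk component
    return x + drift + noise
```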
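The observation model's key idea, inferring unobserved joints from head and hand detections, can be illustrated with standard Gaussian conditioning on a single mixture component; the function name, index arrays, and variable names below are assumptions for illustration.

```python
import numpy as np

def infer_hidden_joints(mu, Sigma, obs_idx, hid_idx, y):
    """Conditional mean/covariance of unobserved joints (e.g., elbows and
    shoulders) given observed head/hand coordinates y, for a single Gaussian
    component N(mu, Sigma) of the pose prior."""
    mu_o, mu_h = mu[obs_idx], mu[hid_idx]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_ho = Sigma[np.ix_(hid_idx, obs_idx)]
    S_hh = Sigma[np.ix_(hid_idx, hid_idx)]
    gain = S_ho @ np.linalg.inv(S_oo)        # regression of hidden joints on observed ones
    mu_cond = mu_h + gain @ (y - mu_o)       # conditional mean of hidden joints
    S_cond = S_hh - gain @ S_ho.T            # conditional covariance
    return mu_cond, S_cond
```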
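Finally, a single predict-update step of a mixture Kalman filter can be sketched as below: each mixture component runs its own linear-Gaussian Kalman filter and is reweighted by the likelihood of the head/hand measurement. The linear dynamics `A`, per-component offsets `b_list`, and noise covariances `Q`, `R` are assumed forms for illustration, not the paper's exact derivation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_kalman_step(states, covs, weights, y, A, b_list, Q, H, R):
    """One step of a bank of Kalman filters, one per pose-prior component.
    Dynamics per component k: x' = A x + b_k + noise(Q);
    observation: y = H x + noise(R), with H selecting head/hand coordinates."""
    new_states, new_covs, new_weights = [], [], []
    for x, P, w, b in zip(states, covs, weights, b_list):
        # Predict with the component's mean-reverting dynamics.
        x_pred = A @ x + b
        P_pred = A @ P @ A.T + Q
        # Update with the head/hand measurement y.
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x_post = x_pred + K @ (y - H @ x_pred)
        P_post = (np.eye(len(x)) - K @ H) @ P_pred
        # Reweight the component by its predictive measurement likelihood.
        lik = multivariate_normal.pdf(y, mean=H @ x_pred, cov=S)
        new_states.append(x_post)
        new_covs.append(P_post)
        new_weights.append(w * lik)
    new_weights = np.array(new_weights)
    return new_states, new_covs, new_weights / new_weights.sum()
```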
Experimental Results and Evaluation
The experimental evaluation conducted in the paper demonstrates the effectiveness of the proposed method. The system achieves reliable 3D pose estimation in cluttered environments with potential camera movement, comparing favorably to state-of-the-art 2D pose estimation techniques. The advantages include:
- Accuracy: The probabilistic framework allows indirect estimation of joint positions that are otherwise difficult to localize, such as the elbows and shoulders, with accuracy comparable to results from multi-camera setups.
- Efficiency: The mixture Kalman filter adaptation provides swift computation, processing at near-real-time speeds, significantly outperforming particle filters in both speed and resource utilization.
Implications and Future Directions
Practically, the approach suggests a path for developing efficient, markerless human-robot interaction systems where payloads and computational resources are limited. Theoretically, it indicates the potential for improving pose estimation models by incorporating priors learned from other sensing technologies (e.g., depth sensors).
Looking ahead, the framework could be extended to incorporate additional priors or to handle larger deviations from expected poses. The method might also be expanded to handle multiple subjects or more complex interactions, broadening its applicability to AI-driven applications.
This paper contributes to the ongoing evolution of monocular vision systems, nudging them closer to the accuracy levels typically achievable only with more elaborate setups. Its balance of efficiency and precision will likely influence subsequent research in computer vision, particularly in gesture and motion recognition contexts.