Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation (1504.07159v1)

Published 27 Apr 2015 in cs.CV

Abstract: We propose a new learning-based method for estimating 2D human pose from a single image, using Dual-Source Deep Convolutional Neural Networks (DS-CNN). Recently, many methods have been developed to estimate human pose by using pose priors that are estimated from physiologically inspired graphical models or learned from a holistic perspective. In this paper, we propose to integrate both the local (body) part appearance and the holistic view of each local part for more accurate human pose estimation. Specifically, the proposed DS-CNN takes a set of image patches (category-independent object proposals for training and multi-scale sliding windows for testing) as the input and then learns the appearance of each local part by considering their holistic views in the full body. Using DS-CNN, we achieve both joint detection, which determines whether an image patch contains a body joint, and joint localization, which finds the exact location of the joint in the image patch. Finally, we develop an algorithm to combine these joint detection/localization results from all the image patches for estimating the human pose. The experimental results show the effectiveness of the proposed method by comparing to the state-of-the-art human-pose estimation methods based on pose priors that are estimated from physiologically inspired graphical models or learned from a holistic perspective.

Authors (4)

Xiaochuan Fan
Kang Zheng
Yuewei Lin
Song Wang

Summary

Dual-Source Deep Neural Networks for Human Pose Estimation

The paper introduces a method for 2D human pose estimation based on Dual-Source Deep Convolutional Neural Networks (DS-CNN). Its main contribution is to combine the appearance of each local body part with a holistic view of that part within the full body. The goal is to improve both the detection and the localization of body joints in a single image, a persistent challenge in computer vision because of variation in pose, appearance, and camera angle, as well as occlusion.

Methodology and Architecture

The DS-CNN architecture uses two parallel streams of convolutional layers, one for part patches and one for body patches. A part patch covers a local body part, while the corresponding body patch shows the whole pose, including the context around that part. The model can therefore account for both the local features of a patch and its relation to the full body. The network structure is based on established deep architectures such as AlexNet, adapted to process the two input types simultaneously.
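The pairing of a part patch with its body patch can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names `crop_and_resize` and `make_dual_source_input`, the box format `(x0, y0, x1, y1)`, the 64-pixel output size, and the nearest-neighbour resizing are hypothetical and are not taken from the paper.

```python
import numpy as np

def crop_and_resize(image, box, out_size):
    """Crop box = (x0, y0, x1, y1) from an H x W x C image and resize it
    to out_size x out_size with nearest-neighbour sampling."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    # Map each output row/column back to a source row/column index.
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return crop[rows][:, cols]

def make_dual_source_input(image, part_box, body_box, out_size=64):
    """Build the two inputs for one candidate patch (hypothetical helper):
    the local part patch and the holistic body patch that contains it."""
    part_patch = crop_and_resize(image, part_box, out_size)
    body_patch = crop_and_resize(image, body_box, out_size)
    return part_patch, body_patch

# Example: a dummy 200 x 100 RGB image, a 30 x 30 part box, a full-body box.
image = np.zeros((200, 100, 3), dtype=np.uint8)
part_patch, body_patch = make_dual_source_input(
    image, part_box=(30, 40, 60, 70), body_box=(0, 0, 100, 200)
)
```

Both outputs share the same spatial size, so each can be fed to its own stream without further preprocessing.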

Each input passes through its own stack of convolutional layers; the resulting features are then concatenated and processed by fully-connected layers. The network performs two tasks jointly: joint detection, which decides whether a patch contains a body joint, and joint localization, which determines the precise location of the joint within the patch.
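The data flow described above (two streams, concatenation, shared fully-connected layers, two outputs) can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the authors' network: each convolutional stream is replaced by a single linear layer with ReLU, the layer sizes are arbitrary, the weights are random, and the detection head is assumed to output one class per joint plus a background class. The joint count of 14 matches the LSP annotation.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_JOINTS = 14   # LSP annotates 14 joints per person
FEAT_DIM = 32     # hypothetical per-stream feature size
HIDDEN_DIM = 64   # hypothetical fully-connected width

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(in_dim):
    """Random weights; a linear layer stands in for each conv stack."""
    def w(m, n):
        return rng.normal(0.0, 0.01, size=(m, n))
    return {
        "part": w(in_dim, FEAT_DIM),             # part-patch stream
        "body": w(in_dim, FEAT_DIM),             # body-patch stream
        "fc": w(2 * FEAT_DIM, HIDDEN_DIM),       # shared layer after concat
        "det": w(HIDDEN_DIM, NUM_JOINTS + 1),    # assumed: joints + background
        "loc": w(HIDDEN_DIM, 2 * NUM_JOINTS),    # (x, y) offset per joint
    }

def forward(params, part_patches, body_patches):
    n = part_patches.shape[0]
    part_feat = relu(part_patches.reshape(n, -1) @ params["part"])
    body_feat = relu(body_patches.reshape(n, -1) @ params["body"])
    fused = np.concatenate([part_feat, body_feat], axis=1)
    hidden = relu(fused @ params["fc"])
    det_probs = softmax(hidden @ params["det"])                 # detection
    loc = (hidden @ params["loc"]).reshape(n, NUM_JOINTS, 2)    # localization
    return det_probs, loc

# Example: a batch of 4 patch pairs, each 16 x 16 RGB.
parts = rng.random((4, 16, 16, 3))
bodies = rng.random((4, 16, 16, 3))
params = init_params(in_dim=16 * 16 * 3)
det_probs, loc = forward(params, parts, bodies)
```

Sharing the fused representation between the two heads is what lets a single forward pass answer both questions for a patch: which joint, if any, it contains, and where that joint lies inside it.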

Experimental Validation

The paper evaluates DS-CNN on the Leeds Sports Pose (LSP) and Frames Labeled in Cinema (FLIC) datasets. The model improves on existing methods, including DeepPose and several part-based approaches. The authors attribute the gain to the use of both global and local information for joint detection and localization. The improvement is most visible on highly articulated, challenging poses.

Implications and Future Directions

The DS-CNN framework has both theoretical and practical implications. Theoretically, it suggests that combining a holistic view with local detail can improve pose estimation by resolving ambiguities caused by occluded limbs and overlapping parts. Practically, the method could be extended to real-time applications such as surveillance and human-computer interaction, where more reliable and accurate pose detection is needed.

Future work could extend the dual-stream architecture to estimate three-dimensional pose directly, or combine it with higher-level modeling techniques such as graphical models to further refine predictions. Studying how DS-CNN scales and how it can be optimized for real-world, multi-person pose estimation would also help establish its usefulness across a wider range of applications.

Overall, the proposed Dual-Source Deep Neural Network advances human pose estimation and illustrates how integrating holistic and local information within a single network can help in complex vision tasks.
