- The paper introduces the Multi-view Pose transformer (MvP), directly regressing multi-person 3D poses without relying on intermediate volumetric representations.
- It leverages hierarchical query embeddings and a projective attention mechanism to efficiently fuse multi-view cues with geometric guidance.
- MvP achieves significant performance gains, including 92.3% AP25 on the Panoptic dataset, indicating strong potential for real-time applications.
An Expert Overview of "Direct Multi-view Multi-person 3D Pose Estimation"
The paper "Direct Multi-view Multi-person 3D Pose Estimation" introduces a novel approach termed Multi-view Pose transformer (MvP) for estimating 3D poses of multiple people from multi-view images. The primary innovation lies in directly regressing multi-person 3D poses without the need for intermediate volumetric representations or separate 2D pose processing, marking a departure from traditional methods.
Key Features of MvP
- Direct Regression: Rather than reconstructing poses from intermediate volumetric representations or separate 2D estimates, MvP directly regresses 3D joint locations. Skeleton joints are represented as learnable query embeddings that interact with multi-view image features to produce the final 3D poses.
- Hierarchical Query Embedding: The query embeddings are organized hierarchically into person-level and joint-level components, giving a compact encoding of person-joint relationships and letting joint-level knowledge be shared across person instances, which helps MvP generalize to varied scenes (see the first sketch after this list).
- Projective Attention Mechanism: A second novel component is projective attention, which fuses multi-view information under geometric guidance: each estimated 3D joint is projected into a 2D anchor in every camera view, and features are gathered around those anchors (see the second sketch after this list). MvP additionally employs a RayConv operation to incorporate camera-ray geometry into the feature space, further improving attention accuracy (third sketch below).
- Efficiency and Accuracy: MvP outperforms previous state-of-the-art methods, notably achieving 92.3% AP25 on the Panoptic dataset and surpassing the previous leading approach by 9.8%, while being more computationally efficient.
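To make the hierarchical query design concrete, the sketch below shows one plausible way to compose person-level and joint-level embeddings into a full query set. It is a minimal PyTorch illustration, not the authors' implementation; the class name, embedding counts, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalQueries(nn.Module):
    """Hypothetical composition of person-level and shared joint-level query embeddings."""
    def __init__(self, num_people: int = 10, num_joints: int = 15, dim: int = 256):
        super().__init__()
        # person-level embeddings: one per person candidate (counts are assumed)
        self.person_embed = nn.Embedding(num_people, dim)
        # joint-level embeddings: shared by every person candidate
        self.joint_embed = nn.Embedding(num_joints, dim)

    def forward(self) -> torch.Tensor:
        # broadcast-sum person and joint embeddings into a (P * J, D) query set
        p = self.person_embed.weight[:, None, :]   # (P, 1, D)
        j = self.joint_embed.weight[None, :, :]    # (1, J, D)
        return (p + j).flatten(0, 1)               # (P*J, D), fed to the decoder

queries = HierarchicalQueries()()  # torch.Size([150, 256]) with the defaults above
```

Because the joint embeddings are shared across all person slots, joint-level knowledge learned from one person instance automatically benefits the others.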
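Projective attention can be illustrated in the same spirit: project each query's current 3D joint estimate into every camera view, sample features near the resulting 2D anchor, and fuse the per-view samples with attention weights. The sketch below simplifies the mechanism to a single sample point per view and assumes pinhole projection matrices in pixel coordinates; the function names and shapes are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def project_points(xyz: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Project (Q, 3) world points with a (3, 4) camera matrix to (Q, 2) pixel coords."""
    homo = torch.cat([xyz, torch.ones_like(xyz[:, :1])], dim=-1)   # (Q, 4)
    uvw = homo @ proj.T                                            # (Q, 3)
    return uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)

def projective_attention(joints_3d, feat_maps, proj_mats, attn_weights):
    """
    joints_3d:    (Q, 3)       current 3D joint estimate per query
    feat_maps:    (V, C, H, W) per-view feature maps
    proj_mats:    (V, 3, 4)    world-to-pixel projection matrix per view
    attn_weights: (Q, V)       learned attention weights (softmax over views)
    returns       (Q, C)       fused multi-view feature per query
    """
    V, C, H, W = feat_maps.shape
    sampled = []
    for v in range(V):
        uv = project_points(joints_3d, proj_mats[v])               # (Q, 2) 2D anchors
        # normalize anchors to [-1, 1] as required by grid_sample
        grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
        grid = grid.view(1, -1, 1, 2)                              # (1, Q, 1, 2)
        f = F.grid_sample(feat_maps[v:v + 1], grid, align_corners=True)  # (1, C, Q, 1)
        sampled.append(f[0, :, :, 0].T)                            # (Q, C)
    sampled = torch.stack(sampled, dim=1)                          # (Q, V, C)
    return (attn_weights.unsqueeze(-1) * sampled).sum(dim=1)       # (Q, C)
```

Restricting attention to features around each projected anchor, rather than to entire feature maps, is what keeps the fusion both geometry-aware and efficient.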
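The RayConv idea can likewise be sketched as a CoordConv-style augmentation: each pixel's camera-ray direction is appended to the feature map as extra channels before a convolution, so view-specific geometry travels with the visual features. The back-projection below assumes a simple pinhole model with intrinsics K and rotation R; it is an assumed simplification, not the paper's exact operator.

```python
import torch
import torch.nn as nn

def compute_rays(K_inv: torch.Tensor, R: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Unit ray direction in world coordinates for every pixel, returned as (3, H, W)."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)  # (3, H*W)
    rays = R.T @ (K_inv @ pix)                       # camera frame -> world frame
    rays = rays / rays.norm(dim=0, keepdim=True)     # normalize to unit length
    return rays.reshape(3, H, W)

class RayConv(nn.Module):
    """Convolution applied to features concatenated with per-pixel ray directions."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + 3, out_channels, kernel_size=1)

    def forward(self, feats: torch.Tensor, rays: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); rays: (3, H, W), broadcast over the batch dimension
        rays = rays.unsqueeze(0).expand(feats.size(0), -1, -1, -1)
        return self.conv(torch.cat([feats, rays], dim=1))
```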
Implications and Speculations
The direct regression framework of MvP bypasses the computationally intensive intermediate stages of traditional pipelines, giving it clear advantages in processing speed and scalability. This is especially beneficial in scenes with many individuals and in real-time applications such as surveillance and virtual reality systems.
The use of a transformer architecture tailored for 3D pose estimation, with an emphasis on direct spatial correspondence and efficient information aggregation, suggests potential broader applications of similar frameworks in other computer vision tasks requiring multi-view spatial reasoning, such as autonomous driving and robotics.
Future Prospects in AI
MvP's transformer-based architecture is well suited to the complex spatial relationships in multi-view data, and it may continue to evolve with further research: self-supervised learning could reduce its dependence on annotated training data, multi-task designs could jointly handle related representations such as human poses and meshes, and unsupervised domain adaptation could improve robustness across diverse environmental conditions and camera setups.
In conclusion, "Direct Multi-view Multi-person 3D Pose Estimation" presents a framework that substantially streamlines multi-person 3D pose estimation from multi-view inputs. Its key innovations, the efficient direct regression paradigm and the projective attention mechanism, are substantial contributions to the field and merit further exploration and adaptation across computational visual perception applications.