- The paper presents FLEX, a framework that removes the need for extrinsic camera parameters by predicting view-invariant 3D joint rotations and bone lengths.
- It employs a multi-view fusion layer with multi-head attention to integrate inputs, achieving competitive MPJPE scores on standard datasets.
- The results demonstrate FLEX's robustness in dynamic, uncontrolled environments, significantly simplifying multi-view setups for practical applications.
Overview of "FLEX: Extrinsic Parameters-free Multi-view 3D Human Motion Reconstruction"
The paper introduces FLEX, a multi-view 3D human motion reconstruction framework that operates without extrinsic camera parameters. Traditional multi-view methods rely heavily on precise camera parameters to resolve occlusions and depth ambiguities. FLEX sidesteps this requirement by exploiting the fact that the 3D angles between skeletal parts and the bone lengths are invariant to camera position. This removes a significant barrier in real-world, dynamic, and uncontrolled capture settings where cameras are non-static, such as sporting events. FLEX processes multi-view video streams, fuses their features via a novel multi-view fusion layer, and reconstructs a single, consistent 3D skeletal motion characterized by temporally coherent joint rotations shared across all views.
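To make the view-invariant representation concrete, the sketch below shows minimal forward kinematics: given per-joint rotations and bone lengths, 3D joint positions follow by chaining transforms down the kinematic tree, with no reference to any camera. The toy five-joint skeleton and the NumPy implementation are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Hypothetical 5-joint tree: 0=pelvis -> 1=spine -> 2=neck; 0 -> 3=l_hip; 0 -> 4=r_hip
PARENTS = {1: 0, 2: 1, 3: 0, 4: 0}           # child joint -> parent joint
OFFSET_DIRS = {1: np.array([0., 1., 0.]),    # unit bone directions in the rest pose
               2: np.array([0., 1., 0.]),
               3: np.array([1., 0., 0.]),
               4: np.array([-1., 0., 0.])}

def forward_kinematics(rotations, bone_lengths):
    """rotations: {joint: 3x3 local rotation}; bone_lengths: {joint: length of bone to parent}."""
    positions = {0: np.zeros(3)}             # root at the origin; global position is a separate problem
    global_rot = {0: rotations[0]}
    for j in sorted(PARENTS):                # parents are processed before children here
        p = PARENTS[j]
        global_rot[j] = global_rot[p] @ rotations[j]
        positions[j] = positions[p] + global_rot[p] @ (bone_lengths[j] * OFFSET_DIRS[j])
    return positions

# With identity rotations, the neck sits two bone lengths above the pelvis:
rots = {j: np.eye(3) for j in range(5)}
lens = {j: 0.3 for j in range(1, 5)}
print(forward_kinematics(rots, lens)[2])     # [0.  0.6 0. ]
```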
Methodology
FLEX is realized as a deep convolutional network that directly predicts 3D joint rotations and bone lengths. Because these quantities are view-invariant, they yield a single representation that is consistent across camera perspectives without the extrinsic parameters that encode camera rotation and translation. To integrate inputs from several video streams, FLEX uses a fusion mechanism composed of a multi-view convolutional layer and multi-head attention, which lets evidence from different views compensate for occlusions and ambiguities in any single view.
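The paper's exact layer is not reproduced here; the following PyTorch sketch gives one plausible reading of multi-view fusion via multi-head attention, with the feature dimension, residual connection, and mean pooling all assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """Fuses per-view features with self-attention across views (illustrative sketch)."""
    def __init__(self, feat_dim=256, num_heads=8):
        super().__init__()
        # Attention lets each view attend to the others, so a joint occluded
        # in one view can borrow evidence from views where it is visible.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, view_feats):
        # view_feats: (batch, num_views, feat_dim), one feature vector per camera
        fused, _ = self.attn(view_feats, view_feats, view_feats)
        fused = self.norm(view_feats + fused)   # residual connection
        return fused.mean(dim=1)                # single fused feature per sample

# Usage: fuse features from 4 cameras
fusion = MultiViewFusion()
x = torch.randn(2, 4, 256)                      # batch of 2, 4 views, 256-dim features
print(fusion(x).shape)                          # torch.Size([2, 256])
```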
The network's architecture comprises two branches: one predicts dynamic features, namely per-frame 3D joint rotations, and the other predicts a static skeleton represented by bone lengths, which are constant for a given subject. Temporal consistency comes from processing sequences of frames rather than single frames, producing smooth motion. The framework is evaluated with established metrics such as Mean Per Joint Position Error (MPJPE) and achieves competitive performance relative to state-of-the-art methods, particularly when extrinsic parameters are unavailable.
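A schematic of the two-branch split described above; the 6D rotation parameterization and all layer sizes are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Dynamic branch: per-frame joint rotations. Static branch: one bone-length set per sequence."""
    def __init__(self, feat_dim=256, num_joints=17):
        super().__init__()
        self.rotation_head = nn.Linear(feat_dim, num_joints * 6)  # 6D rotation per joint
        self.bone_head = nn.Linear(feat_dim, num_joints)          # one length per bone

    def forward(self, seq_feats):
        # seq_feats: (batch, frames, feat_dim) fused multi-view features
        rotations = self.rotation_head(seq_feats)       # (B, T, J*6), varies per frame
        # Pool over time: bone lengths are constant for a subject
        bones = self.bone_head(seq_feats.mean(dim=1))   # (B, J)
        return rotations, bones.abs()                   # lengths must be non-negative

head = TwoBranchHead()
rots, bones = head(torch.randn(2, 50, 256))             # 50-frame sequence
print(rots.shape, bones.shape)                          # torch.Size([2, 50, 102]) torch.Size([2, 17])
```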
Results and Implications
Quantitative evaluations show that FLEX outperforms existing methods when extrinsic camera parameters are unavailable, achieving superior MPJPE scores. Without these parameters, it delivers strong accuracy on Human3.6M and KTH Multi-view Football II, and leads on the Ski-Pose PTZ-Camera dataset, where pan-tilt-zoom cameras move during capture. It remains effective in dynamic, multi-person scenes, demonstrating robustness in complex, real-world scenarios, and it stays competitive even when extrinsic parameters are known, indicating good generalization.
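For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints; the self-contained sketch below (with dummy arrays) shows the computation.

```python
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (frames, joints, 3) arrays of 3D joint positions, e.g. in mm."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

pred = np.random.rand(100, 17, 3) * 1000       # dummy predictions (mm)
gt = pred + np.random.randn(100, 17, 3)        # dummy ground truth near the predictions
print(f"MPJPE: {mpjpe(pred, gt):.2f} mm")
```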
From a theoretical perspective, FLEX's success implies that extrinsic camera parameters, often deemed essential, can be made unnecessary by emphasizing intrinsic, view-invariant properties of human motion. Practically, the approach simplifies multi-view system setup, encouraging broader adoption in fields such as animation and sports analysis.
Future Directions
The FLEX framework opens several avenues for further work. Future research could address estimating global root position without relying on camera intrinsics, extending compatibility to different skeleton topologies, or adapting the method for real-time reconstruction. FLEX's potential to infer inter-camera transformations as a by-product is another promising direction.
In conclusion, FLEX marks a notable shift in multi-view human motion reconstruction, showing that problem formulations long bound to camera calibration can be relaxed, and opening new avenues for research and application in real-world dynamic settings.