- The paper introduces an anchor-free multiview aggregation technique using feature perspective transformation to enhance pedestrian detection.
- It employs large kernel convolutions for spatial aggregation, capturing extensive contextual information without extra inference structures.
- Empirical results show a 14.1% MODA improvement on Wildtrack, highlighting its robustness in crowded, heavily occluded environments.
An Overview of Multiview Detection with Feature Perspective Transformation
The paper, "Multiview Detection with Feature Perspective Transformation," presents a novel approach to multiview pedestrian detection by addressing the challenges posed by occlusion and crowdedness. The research introduces MVDet, a method which circumvents some of the limitations associated with traditional anchor-based methods for multiview aggregation and proposes a more integrated approach to spatial aggregation using deep learning techniques.
Key Methodological Contributions
The authors provide two significant contributions: an anchor-free multiview aggregation technique and a fully convolutional approach to spatial aggregation.
- Anchor-free Multiview Aggregation: Traditional methods often rely on predefined anchor boxes for multiview aggregation, which can introduce inaccuracies because they assume fixed human height and width. MVDet instead projects each view's convolutional feature map onto the ground plane via a perspective transformation and represents every ground plane location with feature vectors sampled directly from the projected maps. This removes the dependency on predefined anchor boxes and avoids errors caused by incorrect anchor dimensions, particularly in dynamic environments with varied human postures (a minimal sketch of this projection follows this list).
- Spatial Aggregation using Large Kernel Convolutions: MVDet handles spatial aggregation with a fully convolutional network, applying large kernel convolutions on the ground plane feature map. This aggregates information from spatial neighbors without additional structures such as Conditional Random Fields (CRFs) or mean-field inference, which earlier techniques used to handle occlusion and overlapping detections. The broader spatial context improves decisions about pedestrian occupancy on the ground plane (see the second sketch below).
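The projection step can be illustrated with a short PyTorch sketch. This is not the authors' released code: the homographies, grid dimensions, and helper names below are illustrative assumptions, and only the general idea of warping per-view feature maps onto a shared ground plane grid and concatenating them follows the paper.

```python
import torch
import torch.nn.functional as F


def warp_to_ground_plane(feat, H_img_from_ground, grid_h, grid_w):
    """Sample one view's feature map at the pixel locations where each
    ground plane cell projects, yielding a [C, grid_h, grid_w] map."""
    C, Hf, Wf = feat.shape
    # Homogeneous coordinates (x, y, 1) for every ground plane grid cell.
    ys, xs = torch.meshgrid(
        torch.arange(grid_h, dtype=torch.float32),
        torch.arange(grid_w, dtype=torch.float32),
        indexing="ij",
    )
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)

    # Project ground cells into the view; clamp only guards against division by zero.
    uv = (H_img_from_ground @ grid.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

    # Rescale pixel coordinates to [-1, 1] as required by grid_sample.
    uv[:, 0] = uv[:, 0] / (Wf - 1) * 2 - 1
    uv[:, 1] = uv[:, 1] / (Hf - 1) * 2 - 1
    sample_grid = uv.reshape(1, grid_h, grid_w, 2)

    # Cells that fall outside this view are filled with zeros by default.
    warped = F.grid_sample(feat.unsqueeze(0), sample_grid, align_corners=True)
    return warped.squeeze(0)


def aggregate_views(per_view_feats, homographies, grid_h, grid_w):
    """Concatenate the warped feature maps of all views channel-wise."""
    warped = [warp_to_ground_plane(f, H, grid_h, grid_w)
              for f, H in zip(per_view_feats, homographies)]
    return torch.cat(warped, dim=0)  # [num_views * C, grid_h, grid_w]
```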
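Similarly, here is a minimal sketch of the spatial aggregation head, assuming stacked dilated 3x3 convolutions as one way to realize a large receptive field; the layer configuration and channel counts are illustrative, not the published architecture.

```python
import torch.nn as nn


class GroundPlaneHead(nn.Module):
    """Fully convolutional head over the concatenated ground plane features."""

    def __init__(self, in_channels, hidden=512):
        super().__init__()
        # Stacked dilated 3x3 convolutions widen the effective receptive field
        # (one way to obtain a "large kernel") without any CRF or mean-field
        # post-processing; the head stays end-to-end trainable.
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=4, dilation=4),
        )

    def forward(self, ground_plane_feat):
        # ground_plane_feat: [B, num_views * C, grid_h, grid_w]
        # returns:           [B, 1, grid_h, grid_w] occupancy logits
        return self.head(ground_plane_feat)
```

Thresholding the sigmoid of these logits and applying non-maximum suppression on the ground plane would then yield the final pedestrian locations.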
Evaluation and Results
The empirical evaluation was performed on two datasets: Wildtrack and a newly introduced synthetic dataset, MultiviewX. On Wildtrack, MVDet reached 88.2% MODA, a 14.1% improvement over the previous best result. On MultiviewX, MVDet also achieved competitive results across different crowdedness and occlusion levels. These findings highlight the effectiveness of the anchor-free approach, particularly in real-world scenarios (Wildtrack) where variability in human dimensions poses significant challenges.
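For context, MODA (Multiple Object Detection Accuracy) penalizes misses and false positives relative to the number of ground-truth objects. A minimal computation following the standard CLEAR-MOT definition (this snippet is illustrative, not taken from the paper's evaluation code) looks like:

```python
def moda(misses, false_positives, num_ground_truth):
    """MODA = 1 - (total misses + total false positives) / total ground truth."""
    errors = sum(m + fp for m, fp in zip(misses, false_positives))
    return 1.0 - errors / sum(num_ground_truth)


# Example: 3 misses and 3 false positives over 45 ground-truth pedestrians.
print(moda([2, 1], [0, 3], [20, 25]))  # 1 - 6/45, roughly 0.867
```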
Implications and Future Directions
The research offers substantial implications for improving pedestrian detection systems in dynamic and crowded environments. By integrating multiview cues without rigid anchor structures, MVDet provides a more adaptable and generalizable approach, leading to better detection in complex urban settings. Furthermore, the use of large kernel convolutions for spatial aggregation simplifies the overall pipeline while preserving end-to-end trainability.
For future developments, MVDet could inspire new methodologies in multiview tasks beyond pedestrian detection, potentially expanding into applications like automated surveillance or traffic monitoring. Additionally, the integration of other sensory modalities, such as depth information or thermal imaging, could further improve detection accuracy and robustness.
Overall, the innovations presented in this paper provide a compelling argument for moving towards more flexible, anchor-free approaches in multiview detection tasks, leveraging the representational power of learned feature maps and the spatial capabilities of convolutional neural networks.