- The paper introduces a novel architecture combining Convolutional Neural Networks (CNNs) with Conditional Random Fields (CRFs) and high-order potential terms to effectively handle occlusions in multi-camera multi-target detection, especially in dense environments.
- Unlike traditional methods, the architecture utilizes CNN-generated features and high-order CRF terms for robust occlusion reasoning, enabling end-to-end training in both supervised and unsupervised settings.
- Evaluations on multiple datasets demonstrate that the proposed model achieves superior detection performance over baseline systems in crowded scenes, highlighting its potential for improved surveillance and tracking applications.
Deep Occlusion Reasoning for Multi-Camera Multi-Target Detection
The paper presents an innovative architecture designed to improve Multi-Camera Multi-Target (MCMT) detection in crowded environments. Traditional MCMT pipelines face significant challenges under high-density conditions, primarily because occlusions hinder accurate detection. The authors address these challenges by integrating Convolutional Neural Networks (CNNs) with Conditional Random Fields (CRFs) to explicitly model occlusions, leveraging high-order CRF terms that significantly enhance robustness in crowded scenes.
Key Contributions and Architecture
The architecture combines CNNs with CRFs to handle occlusions effectively in multi-camera people detection. This moves beyond the sequence followed by conventional MCMT systems, in which detections are first inferred independently in 2D in each view and then mapped onto a shared reference frame for correspondence and 3D localization. The key innovation is the introduction of high-order CRF terms that represent occlusions more accurately than prior deep-learning detectors, which rely predominantly on single-camera data. Unlike older algorithms that depend on background subtraction for input, the proposed model uses CNN-generated features throughout the detection process.
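To make the shared-reference-frame idea concrete, the sketch below shows the standard way ground-plane grid locations are mapped into an individual camera view using a calibration homography. This is a generic illustration of the multi-view geometry, not the paper's implementation; the homography `H` and the sample points are hypothetical.

```python
import numpy as np

def project_to_view(ground_pts, H):
    """Map ground-plane points (x, y) into one camera's image plane.

    ground_pts : (N, 2) array of ground-plane coordinates
    H          : 3x3 homography from the ground plane to the image,
                 assumed known from camera calibration
    Returns (N, 2) pixel coordinates.
    """
    # Lift to homogeneous coordinates: (x, y) -> (x, y, 1).
    pts_h = np.c_[ground_pts, np.ones(len(ground_pts))]
    img = pts_h @ H.T
    # Perspective divide back to 2D pixel coordinates.
    return img[:, :2] / img[:, 2:3]

# Hypothetical homography that just translates the plane by (2, 3).
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0],
              [0.0, 0.0, 1.0]])
pix = project_to_view(np.array([[1.0, 1.0]]), H)
```

With one such homography per camera, the same ground-plane occupancy grid can be scored against evidence from every view simultaneously, which is what allows the model to reason about occlusions jointly rather than per camera.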
Technical Overview
The authors propose a CRF energy model comprising three types of potentials: high-order, unary, and pairwise. High-order potentials leverage a novel approach in which agreement between generative and discriminative models is computed at each image pixel, facilitating robust occlusion reasoning. The unary potentials provide initial probability estimates at each location on the ground plane, while the pairwise potentials enforce spatial constraints that model physical proximity between detected individuals. The integration of these potentials yields an end-to-end trainable system that can operate in both supervised and unsupervised settings.
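A toy version of such an energy over binary ground-plane occupancy variables can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact formulation: the unary costs are assumed to come from a CNN, the pairwise term is reduced to a fixed penalty for detections closer than a minimum distance, and the high-order term is left as an abstract callable scoring agreement between a generative rendering of the occupancy and the observed image evidence.

```python
import numpy as np

def crf_energy(z, unary_logp, coords, min_dist=0.5, lam_pair=10.0,
               high_order=None, lam_high=1.0):
    """Toy CRF energy over binary occupancy variables z on a ground grid.

    z          : 0/1 occupancy for each grid cell
    unary_logp : per-cell cost of occupancy (e.g. -log p from a CNN; assumed)
    coords     : (G, 2) ground-plane coordinates of the grid cells
    high_order : optional callable z -> cost, standing in for the paper's
                 image-level generative/discriminative agreement term
    """
    z = np.asarray(z, dtype=float)
    # Unary term: cost of declaring each cell occupied.
    e = float(np.dot(z, unary_logp))
    # Pairwise term: penalize pairs of detections closer than min_dist,
    # modeling the physical fact that two people cannot overlap.
    occ = np.flatnonzero(z)
    for idx, i in enumerate(occ):
        for j in occ[idx + 1:]:
            if np.linalg.norm(coords[i] - coords[j]) < min_dist:
                e += lam_pair
    # High-order term: disagreement between generated and observed evidence.
    if high_order is not None:
        e += lam_high * high_order(z)
    return e
```

In this sketch, a configuration placing two detections 0.2 m apart pays the pairwise penalty, while the same number of detections spread out does not, which is the qualitative behavior the pairwise potentials encode.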
Methodology
The paper outlines how the model is trained and evaluated, emphasizing a Mean-Field inference framework that approximates the posterior distribution over possible detections. Inference seeks the Maximum-a-Posteriori (MAP) estimate and uses natural gradient descent schemes to iteratively refine occupancy probability estimates across grid locations on the ground plane. The proposed CRF model also supports unsupervised training, relying on inter-view consistency to improve detection performance without labeled data.
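The iterative refinement can be illustrated with a minimal mean-field loop over the same toy energy sketched above: each cell's occupancy probability q_i is repeatedly updated against the expected cost induced by its neighbors. This is a generic damped mean-field scheme under assumed unary logits and a fixed exclusion penalty, not the paper's natural-gradient implementation.

```python
import numpy as np

def mean_field(unary_logit, coords, min_dist=0.5, lam_pair=10.0,
               n_iters=20, step=0.5):
    """Damped mean-field updates for q_i = P(z_i = 1) on a ground grid.

    unary_logit : per-cell evidence for occupancy (assumed CNN output)
    coords      : (G, 2) ground-plane coordinates of the grid cells
    """
    G = len(unary_logit)
    q = 1.0 / (1.0 + np.exp(-unary_logit))  # initialize from unaries
    # Precompute which cell pairs violate the minimum-distance constraint.
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    close = (d < min_dist) & ~np.eye(G, dtype=bool)
    for _ in range(n_iters):
        # Expected pairwise cost of occupying cell i, given current beliefs.
        pair = lam_pair * (close * q[None, :]).sum(axis=1)
        q_new = 1.0 / (1.0 + np.exp(-(unary_logit - pair)))
        q = (1 - step) * q + step * q_new  # damping stabilizes the iteration
    return q
```

Run on two conflicting nearby cells where one has stronger evidence, the updates suppress the weaker hypothesis while the stronger one survives; an isolated cell simply keeps its unary belief. Because each update is a differentiable function of the CNN outputs, unrolling a fixed number of iterations is what makes end-to-end training possible.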
Results and Implications
Through comprehensive evaluations on multiple datasets including ETHZ, EPFL, and PETS 2009, the proposed model demonstrated superior performance compared to baseline systems such as POM and CNN-based monocular detection systems. The results highlight the efficacy of high-order CRF terms in resolving occlusions, particularly in dense scenes where traditional methods falter. Practical implications suggest significant improvements in surveillance and tracking applications, where accurate detection across crowded environments is crucial.
Future Directions
The paper concludes by addressing limitations and potential areas for enhancement. Specifically, future work could explore deeper multi-view CNN networks where raw image data is pooled early on to exploit appearance consistency across different views, alongside the CRF framework. The integration of additional temporal consistency mechanisms could also be beneficial in further improving trajectory predictions and localization accuracy.
In summary, this work provides a robust framework for multi-camera detection, effectively handling occlusions and density challenges while setting a foundation for future developments in multi-target tracking systems.