
Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting (1908.10937v1)

Published 28 Aug 2019 in cs.CV

Abstract: Crowd counting presents enormous challenges in the form of large variation in scales within images and across the dataset. These issues are further exacerbated in highly congested scenes. Approaches based on straightforward fusion of multi-scale features from a deep network seem to be obvious solutions to this problem. However, these fusion approaches do not yield significant improvements in the case of crowd counting in congested scenes. This is usually due to their limited abilities in effectively combining the multi-scale features for problems like crowd counting. To overcome this, we focus on how to efficiently leverage information present in different layers of the network. Specifically, we present a network that involves: (i) a multi-level bottom-top and top-bottom fusion (MBTTBF) method to combine information from shallower to deeper layers and vice versa at multiple levels, (ii) scale complementary feature extraction blocks (SCFB) involving cross-scale residual functions to explicitly enable flow of complementary features from adjacent conv layers along the fusion paths. Furthermore, in order to increase the effectiveness of the multi-scale fusion, we employ a principled way of generating scale-aware ground-truth density maps for training. Experiments conducted on three datasets that contain highly congested scenes (ShanghaiTech, UCF_CC_50, and UCF-QNRF) demonstrate that the proposed method is able to outperform several recent methods in all the datasets.

Citations (168)

Summary

  • The paper proposes a bidirectional fusion framework that integrates spatial details and semantic context for improved crowd counting.
  • It employs Scale Complementary Feature Blocks to extract complementary features, reducing ambiguity in complex, congested scenes.
  • Experimental results demonstrate lower MAE and MSE on benchmark datasets, underlining its practical potential for advanced surveillance and urban planning.

Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting

Crowd counting from single images, especially in highly congested scenes, poses significant challenges due to variations in scales, occlusions, perspective changes, and background clutter. The paper "Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting" addresses the inadequacies of existing multi-scale feature fusion methods in effectively handling these issues.

The authors introduce a multi-level bottom-top and top-bottom fusion (MBTTBF) method for crowd counting that enables a more efficient exchange of information between network layers. Traditional fusion schemes based on simple concatenation or addition of features have been shown to be sub-optimal, especially in congested scenes. The proposed architecture allows detailed spatial information and high-level semantic context to flow bidirectionally across the network layers.

Key Components of the MBTTBF Method

  • Multi-Level Fusion: The architecture consists of two main fusion paths: bottom-top and top-bottom. The bottom-top path progressively integrates spatial details from shallow to deep layers, aiding in accurate localization. Conversely, the top-bottom path propagates semantic context, which assists in noise suppression in lower layers. This hierarchical multi-level fusion contributes to a more balanced feature representation.
  • Scale Complementary Feature Blocks (SCFB): These blocks utilize cross-scale residual functions to extract complementary features between adjacent layers. Unlike traditional fusion methods, SCFBs aim to reduce feature ambiguity by computing residuals and ensuring that each layer contributes relevant information to the final density map.
  • Scale-Aware Density Maps: A novel approach to generate scale-aware ground-truth density maps is proposed. By combining superpixel segmentation and crowd-density approximation in a Markov Random Field framework, this method improves feature robustness against scale variations.
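The bidirectional fusion and cross-scale residual idea can be sketched with plain NumPy arrays standing in for feature maps. This is an illustrative toy, not the paper's implementation: the actual SCFBs learn the residual functions with conv layers, and all function names here are hypothetical.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling (stand-in for a learned upsampling layer).
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(x):
    # 2x2 average pooling (stand-in for a strided convolution).
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def scfb(shallow, deep):
    # Cross-scale residual: each scale is complemented by the other scale
    # resampled to its resolution (the paper learns these residuals).
    shallow_out = shallow + upsample2x(deep)
    deep_out = deep + downsample2x(shallow)
    return shallow_out, deep_out

def mbttbf(features):
    # features: list ordered shallow -> deep, each level half the
    # resolution of the previous one.
    # Bottom-top pass: push spatial detail into deeper layers.
    bt = list(features)
    for i in range(len(bt) - 1):
        _, bt[i + 1] = scfb(bt[i], bt[i + 1])
    # Top-bottom pass: push semantic context back into shallower layers.
    tb = list(bt)
    for i in range(len(tb) - 1, 0, -1):
        tb[i - 1], _ = scfb(tb[i - 1], tb[i])
    return tb

# Three toy "feature maps" at decreasing resolution.
levels = [np.ones((8, 8)), np.ones((4, 4)) * 2, np.ones((2, 2)) * 4]
fused = mbttbf(levels)
print([f.shape for f in fused])  # each level keeps its own resolution
```

Note how the two passes are asymmetric: the bottom-top loop only updates the deeper member of each pair, and the top-bottom loop only the shallower one, so information accumulates in both directions without resolution changes at any level.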

Experimental Results

The authors evaluated their approach on three large datasets characterized by high-density crowd scenes: ShanghaiTech, UCF_CC_50, and UCF-QNRF. The proposed method demonstrated superior performance over several recent methods, consistently achieving lower mean absolute error (MAE) and mean squared error (MSE) values across all datasets. The results underscore the significance of the multi-level bidirectional fusion approach in extracting more informative features conducive to precise crowd counting.
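For reference, both metrics are computed from per-image counts, where a predicted count is the sum over the estimated density map. The sketch below uses made-up numbers; note that in the crowd-counting literature "MSE" conventionally denotes the root of the mean squared error.

```python
import numpy as np

def mae(pred_counts, gt_counts):
    # Mean absolute error over per-image crowd counts.
    diff = np.asarray(pred_counts) - np.asarray(gt_counts)
    return float(np.mean(np.abs(diff)))

def mse(pred_counts, gt_counts):
    # "MSE" as reported in crowd-counting papers: root mean squared error.
    diff = np.asarray(pred_counts) - np.asarray(gt_counts)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical per-image counts for three test images.
pred = [105.0, 980.0, 42.0]
gt = [100.0, 1000.0, 40.0]
print(mae(pred, gt))  # 9.0
```

MAE reflects average counting accuracy, while the squared-error metric penalizes large per-image deviations more heavily, which is why both are reported together on these benchmarks.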

Implications and Future Research Directions

The paper's contributions mark a significant advance in crowd counting methodology, with implications for surveillance systems, urban planning, and event management applications. Future research may extend the fusion framework to other domains such as object detection or semantic segmentation, exploring its efficacy in diverse visual understanding tasks. Additionally, integrating attention mechanisms into the fusion process could further refine feature aggregation and accommodate dynamic scene changes more robustly.

In conclusion, the introduction of a multi-level bottom-top and top-bottom feature fusion framework represents a pivotal step toward more accurate crowd density estimation, paving the way for future explorations in complex environments with varying scales and densities.