Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing (1804.03287v3)

Published 10 Apr 2018 in cs.CV

Abstract: Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily on visually understanding humans in crowded scenes, such as group behavior analysis, person re-identification and autonomous driving, etc. To this end, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which is recently defined as the multi-human parsing task. In this paper, we present a new large-scale database "Multi-Human Parsing (MHP)" for algorithm development and evaluation, and advance the state-of-the-art in understanding humans in crowded scenes. MHP contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels, involving 2-26 persons per image and captured in real-world scenes from various viewpoints, poses, occlusion, interactions and background. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network (GAN)-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, and serves as a strong baseline to drive the future research for multi-human parsing.

Citations (163)


Summary

The paper presents a Nested Adversarial Network that integrates semantic saliency, instance-agnostic parsing, and instance-aware clustering for precise multi-human parsing.
It introduces the extensive MHP v2.0 dataset with 25,403 images and 58 annotated categories, capturing complex human interactions and occlusions.
Evaluation shows significant improvements in metrics such as $\mathrm{AP}^{p}$ and PCP, underscoring the method's practical value for real-world human scene analysis.

Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing

Analyzing humans in crowded scenes strains current visual perception algorithms, which must separate individual people and label their body parts and clothing at the same time. This paper, "Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing," addresses the shortcomings of existing methods for parsing scenes with multiple interacting people. It introduces the Multi-Human Parsing (MHP v2.0) dataset and proposes a novel Nested Adversarial Network (NAN) model that improves parsing accuracy through adversarial learning and a nested network design.

Dataset Contributions

The MHP v2.0 dataset stands out for its scale and annotation detail: 25,403 images labeled with 58 fine-grained semantic categories, including body parts and fashion items, with 2 to 26 persons per image. The images were captured in real-world scenes and vary in viewpoint, pose, occlusion, interaction, and background. The dataset surpasses existing multi-human parsing datasets in both the number of semantic categories and the number of images, which should make models trained on it more robust and more applicable in practice.

Methodological Advances

The NAN model is a multi-stage parsing framework built from three GAN-like sub-networks that perform semantic saliency prediction, instance-agnostic parsing, and instance-aware clustering, respectively. The sub-networks form a nested structure and are trained jointly, end to end, via gradient backpropagation. This nested design separates individual instances and parses their semantic parts concurrently, addressing challenges specific to crowded scenes without relying on inefficient pre-processing or post-processing routines.
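The data flow between the three stages can be illustrated with a minimal sketch. It is not the authors' implementation: every function below is a hypothetical, hand-written stand-in for a learned sub-network (a brightness threshold instead of saliency prediction, a second threshold instead of part parsing, connected components instead of learned instance clustering), and the adversarial training is omitted entirely. It only shows how each later stage consumes the outputs of the earlier ones.

```python
from collections import deque


def saliency_stage(image, threshold=0.5):
    """Toy stand-in for saliency prediction: 1 marks 'human' foreground pixels."""
    return [[1 if value > threshold else 0 for value in row] for row in image]


def parsing_stage(image, saliency):
    """Toy stand-in for instance-agnostic parsing.

    Salient pixels get part label 1 or 2 (assumed labels); background stays 0.
    """
    return [
        [(1 if value > 0.75 else 2) if salient else 0
         for value, salient in zip(image_row, saliency_row)]
        for image_row, saliency_row in zip(image, saliency)
    ]


def clustering_stage(saliency, parsing):
    """Toy stand-in for instance-aware clustering.

    Groups parsed foreground pixels into 4-connected components; each
    component receives its own instance id (1, 2, ...), background stays 0.
    """
    height, width = len(parsing), len(parsing[0])
    instances = [[0] * width for _ in range(height)]
    next_id = 0
    for y in range(height):
        for x in range(width):
            if saliency[y][x] and parsing[y][x] and not instances[y][x]:
                next_id += 1
                instances[y][x] = next_id
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and parsing[ny][nx] and not instances[ny][nx]:
                            instances[ny][nx] = next_id
                            queue.append((ny, nx))
    return instances


def parse_multi_human(image):
    """Run the three stages in order, feeding earlier outputs forward."""
    saliency = saliency_stage(image)
    parsing = parsing_stage(image, saliency)
    instances = clustering_stage(saliency, parsing)
    return saliency, parsing, instances


# Two bright blobs separated by a dark column act as two "persons".
image = [
    [0.9, 0.6, 0.0, 0.8],
    [0.7, 0.0, 0.0, 0.6],
]
saliency, parsing, instances = parse_multi_human(image)
print(parsing)    # [[1, 2, 0, 1], [2, 0, 0, 2]]
print(instances)  # [[1, 1, 0, 2], [1, 0, 0, 2]]
```

In the toy output, the part map says which semantic label each pixel carries, while the instance map says which person it belongs to; multi-human parsing needs both at once.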

Evaluation and Results

The NAN model outperforms state-of-the-art methods on several datasets, including MHP v2.0, MHP v1.0, PASCAL-Person-Part, and Buffy. Improvements are reported on metrics such as Average Precision based on part ($\mathrm{AP}^{p}$) and Percentage of Correctly parsed semantic Parts (PCP). Because the model produces these results efficiently, it is a plausible candidate for real-world applications that require fast, accurate interpretation of human scenes.
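To make the PCP idea concrete, here is a minimal sketch of a PCP-style score for one person. It rests on several assumptions that are not stated in this summary: a part counts as correctly parsed when its pixel IoU with the ground truth reaches a threshold (0.5 here, chosen only for illustration), the predicted person has already been matched to the ground-truth person, and each part is given as a set of pixel coordinates. The function names are hypothetical, and the benchmark's exact protocol may differ.

```python
def part_iou(pred_pixels, gt_pixels):
    """Intersection over union of two sets of (row, col) pixel coordinates."""
    union = pred_pixels | gt_pixels
    if not union:
        return 0.0
    return len(pred_pixels & gt_pixels) / len(union)


def pcp_for_person(pred_parts, gt_parts, iou_threshold=0.5):
    """Fraction of ground-truth parts whose predicted mask reaches the IoU threshold.

    pred_parts and gt_parts map a part label to a set of (row, col) pixels.
    Parts missing from the prediction are treated as empty masks.
    """
    if not gt_parts:
        return 0.0
    correct = sum(
        1
        for label, gt_pixels in gt_parts.items()
        if part_iou(pred_parts.get(label, set()), gt_pixels) >= iou_threshold
    )
    return correct / len(gt_parts)


# Toy ground truth and prediction for a single, already-matched person.
gt_parts = {
    "head": {(0, 0), (0, 1)},
    "torso": {(1, 0), (1, 1), (2, 0), (2, 1)},
    "left-arm": {(1, 2), (2, 2)},
}
pred_parts = {
    "head": {(0, 0), (0, 1)},          # IoU 1.0
    "torso": {(1, 0), (1, 1), (2, 0)},  # IoU 0.75
    "left-arm": {(3, 3)},               # IoU 0.0
}
pcp = pcp_for_person(pred_parts, gt_parts)
print(round(pcp, 3))  # 0.667
```

Two of the three parts pass the threshold, so this person scores 2/3. A dataset-level figure would then aggregate such per-person scores, again subject to the benchmark's own matching and averaging rules.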

Implications and Future Directions

The research has both practical and theoretical implications. Practically, it strengthens the parsing capabilities needed in domains such as video surveillance, autonomous navigation, and social interaction analysis. Theoretically, it extends adversarial learning to a complex perceptual task, suggesting that nested adversarial networks may be useful for other problems of this kind. Looking forward, larger multi-human parsing benchmarks and better semantic segmentation models could further refine human-centric analysis.

Overall, this paper delivers significant advancements in understanding and parsing humans in crowded scenes, setting a new standard in dataset complexity and methodological innovation.
