Unified Perceptual Parsing for Scene Understanding (1807.10221v1)

Published 26 Jul 2018 in cs.CV

Abstract: Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires the machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes. Models are available at \url{https://github.com/CSAILVision/unifiedparsing}.

Authors (5)
  1. Tete Xiao (19 papers)
  2. Yingcheng Liu (7 papers)
  3. Bolei Zhou (134 papers)
  4. Yuning Jiang (106 papers)
  5. Jian Sun (415 papers)
Citations (1,630)

Summary

Unified Perceptual Parsing for Scene Understanding

The paper "Unified Perceptual Parsing for Scene Understanding" by Tete Xiao et al. introduces the task of Unified Perceptual Parsing (UPP) alongside a novel multi-task framework, UPerNet, designed to tackle this problem. This task aims to enable machine vision systems to parse a comprehensive set of visual concepts from a single image, ranging from high-level scene categorization to fine-grained material and texture identification. The paper specifically addresses the challenges of learning from heterogeneous image annotations and proposes a unified model to integrate diverse perceptual tasks.

Core Contributions

  1. Definition of Unified Perceptual Parsing (UPP): The authors define UPP as the simultaneous recognition of multiple visual concepts from a single image. These concepts span various perceptual levels including scenes, objects, object parts, materials, and textures.
  2. Development of UPerNet: The authors introduce UPerNet, a robust multi-task framework that effectively learns from heterogeneous image annotations. The network is designed to handle diverse annotations, leveraging hierarchical features from various stages of the network.
  3. Integration of Existing Datasets: The paper consolidates several datasets (ADE20K, Pascal-Context, Pascal-Part, OpenSurfaces, and DTD) into a unified benchmark termed Broden+, which contains the rich array of visual annotations needed to train UPerNet.
  4. Training Strategy: To handle heterogeneous annotations, each training batch is drawn from a single data source at a time, and only the parameters on the paths relevant to that source's tasks are updated. This keeps concepts a source does not annotate from contributing noisy gradients.
  5. Benchmarking and Results: The paper rigorously benchmarks UPerNet on the constituent tasks, demonstrating its efficacy in simultaneously parsing scenes, objects, parts, materials, and textures. The results indicate that UPerNet achieves competitive performance on individual tasks, with little degradation relative to specialized single-task models.
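The hierarchical design behind UPerNet can be illustrated with a small NumPy sketch: coarse backbone features drive scene classification, while a fused high-resolution map drives pixel-level tasks such as object and material segmentation. This is a simplified illustration, not the authors' implementation; the feature sizes and the fusion rule here are assumptions (UPerNet itself uses an FPN with a Pyramid Pooling Module and learned convolutions).

```python
import numpy as np

def upsample(x, factor):
    # Nearest-neighbor upsampling of a (C, H, W) feature map.
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

# Hypothetical backbone feature maps at strides 4, 8, 16, 32
# for a 32x32 input with 8 channels.
feats = {s: np.random.rand(8, 32 // s, 32 // s) for s in (4, 8, 16, 32)}

# FPN-style fusion: upsample every level to stride 4 and sum,
# giving one high-resolution map for pixel-level tasks.
fused = sum(upsample(feats[s], s // 4) for s in (4, 8, 16, 32))

# Task heads attach at different perceptual levels:
#  - scene classification reads a globally pooled coarse descriptor;
#  - object/part/material segmentation reads the fused high-res map.
scene_logits = feats[32].mean(axis=(1, 2))   # shape (8,)
pixel_features = fused                        # shape (8, 8, 8)
```

The design choice this mirrors is that different concepts live at different depths: scene category is a global, coarse property, while materials and parts need fine spatial resolution.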

Experimental Findings

The experimental section details a comprehensive evaluation of UPerNet's performance:

  • Object Parsing: UPerNet's ability to parse objects from the ADE20K and Pascal-Context datasets shows competitive mIoU and pixel accuracy results, comparable with state-of-the-art methods like PSPNet.
  • Material Parsing: The model exhibits strong performance on OpenSurfaces for material parsing, even improving results when trained alongside other perceptual tasks.
  • Joint Learning Impact: Despite the heterogeneous nature of the annotations, UPerNet efficiently integrates multi-task learning without significant adverse effects on individual task performances.
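The per-source update rule that makes this joint learning possible (each batch drawn from one data source, with losses applied only to the heads that source annotates) can be sketched in plain Python. The source-to-task mapping and function names below are illustrative, not the authors' code.

```python
# Map each data source to the tasks it annotates (illustrative subset).
SOURCE_TASKS = {
    "ADE20K": {"object", "scene"},
    "OpenSurfaces": {"material"},
    "DTD": {"texture"},
}

def mask_losses(source, task_losses):
    """Zero out losses for tasks the current batch's source does not
    annotate, so the corresponding heads receive no gradient."""
    active = SOURCE_TASKS[source]
    return {t: (loss if t in active else 0.0)
            for t, loss in task_losses.items()}

losses = {"object": 1.2, "scene": 0.4, "material": 0.9, "texture": 0.7}
masked = mask_losses("OpenSurfaces", losses)
# For an OpenSurfaces batch, only the material loss survives.
```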

Visual Knowledge Discovery

One of the most compelling sections of the paper illustrates how UPerNet can be used to discover implicit visual knowledge within scenes. By analyzing the trained network's outputs, it is possible to infer and visualize relationships among different perceptual levels. For example, the model can identify that certain objects co-occur in specific scene types, or that particular materials frequently compose certain objects. This ability to unify perceptual parsing enhances scene understanding, contributing valuable insights to semantic segmentation and beyond.
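One simple way such relationships could be mined, sketched below, is to count co-occurrences between predicted labels across the parsed images; the parse records here are hypothetical, and the paper's own analysis is richer than raw counting.

```python
from collections import Counter

# Hypothetical per-image parsing results: a predicted scene label plus
# the set of object labels found in the segmentation map.
parses = [
    {"scene": "kitchen", "objects": {"sink", "cabinet", "floor"}},
    {"scene": "kitchen", "objects": {"sink", "stove", "floor"}},
    {"scene": "office",  "objects": {"desk", "chair", "floor"}},
]

# Count scene-object co-occurrences; frequent pairs surface implicit
# knowledge such as "sinks tend to appear in kitchens".
cooc = Counter((p["scene"], obj) for p in parses for obj in p["objects"])
```

The same counting pattern extends to object-material and object-part pairs, since the unified model predicts all of these concepts for the same pixels.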

Implications and Future Directions

The implications of this research are multifaceted. Practically, UPerNet can be deployed in various applications requiring comprehensive scene understanding, from autonomous vehicles to robotic vision systems. Theoretically, it opens new avenues for exploring hierarchical feature representations and multi-task learning strategies in deep neural networks. The successful integration of diverse visual tasks sets a precedent for future models to holistically approach scene understanding.

Future developments could delve further into enhancing the training strategies for even better integration of heterogeneous annotations. Additionally, expanding the variety of visual concepts and refining the model's ability to parse complex real-world scenes would be natural progressions.

Conclusion

The paper presents significant advancements in the field of computer vision by proposing a unified perceptual parsing approach. The UPerNet framework effectively addresses the tasks of parsing scenes, objects, parts, materials, and textures, forming a comprehensive model suitable for rich scene understanding. Its potential for discovering and leveraging visual knowledge further accentuates its contribution, providing a robust foundation for future research and applications.
