Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers
The paper "Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers" proposes a method for semantic image segmentation, labeled as scene parsing in computer vision, focusing on the task of labeling each pixel within an image as pertaining to a particular object category. This task combines aspects of detection, segmentation, and recognition, posing unique challenges due to the dependencies on both local and global context within the visual scene.
The proposed method introduces a multiscale convolutional network that extracts features from raw pixel data across multiple scales of a contrast-normalized Laplacian pyramid. This removes the need for manually designed features: the feature extractor is trained end to end. The resulting multiscale feature maps serve as input to the subsequent segmentation and labeling stages.
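To make the multiscale input concrete, the sketch below builds a small contrast-normalized Laplacian pyramid with NumPy/SciPy on a grayscale image. It is a minimal illustration under assumed settings, not the paper's implementation: the number of scales, Gaussian widths, and normalization details are illustrative choices, and the shared-weight convolutional network that the paper applies to each pyramid level is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def local_contrast_normalize(img, sigma=4.0, eps=1e-6):
    """Subtractive then divisive local normalization: remove the local
    mean, then divide by the local standard deviation (floored so flat
    regions are not blown up)."""
    local_mean = gaussian_filter(img, sigma)
    centered = img - local_mean
    local_std = np.sqrt(gaussian_filter(centered ** 2, sigma))
    return centered / np.maximum(local_std, max(local_std.mean(), eps))

def laplacian_pyramid(img, n_scales=3):
    """Each level keeps the band-pass detail lost when the image is
    blurred and downsampled, normalized so all scales have comparable
    statistics before feature extraction."""
    levels = []
    current = img.astype(np.float64)
    for _ in range(n_scales):
        blurred = gaussian_filter(current, sigma=1.0)
        levels.append(local_contrast_normalize(current - blurred))
        current = zoom(blurred, 0.5)  # halve resolution for the next scale
    return levels
```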
A critical component of the method is a segmentation tree built from a graph of pixel dissimilarities, in which neighboring pixels are connected by edges weighted by color-based dissimilarity. Each node of the tree corresponds to an image segment, and each segment is represented by a spatial grid of feature vectors obtained by max pooling the multiscale features over the pixels it contains. The construction is computationally efficient, scaling linearly with the number of pixels.
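The per-segment pooling step can be sketched in a few lines. The fragment below max-pools a dense per-pixel feature map over an arbitrary integer segment map in a single pass over the pixels, which is what keeps the cost linear in the image size. For brevity it collapses each segment to one pooled vector rather than the spatial grid of vectors the paper uses.

```python
import numpy as np

def pool_segment_features(features, segment_ids):
    """Max-pool a dense per-pixel feature map over arbitrary segments.

    features    : (H, W, D) array of per-pixel feature vectors
    segment_ids : (H, W) integer map assigning each pixel to a segment

    Returns a (num_segments, D) array, one pooled vector per segment.
    """
    h, w, d = features.shape
    flat_feats = features.reshape(-1, d)
    flat_ids = segment_ids.ravel()
    n_segments = int(flat_ids.max()) + 1
    pooled = np.full((n_segments, d), -np.inf)
    # Unbuffered elementwise running max over all pixels of each segment.
    np.maximum.at(pooled, flat_ids, flat_feats)
    return pooled
```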
For classification, the method aggregates the pooled feature vectors and estimates a class distribution for each segment. An optimal cover algorithm then selects the subset of tree nodes that maximizes the "purity" of these class distributions, where a segment's purity is high when the entropy of its class distribution is low. The result is a segmentation in which each selected segment ideally contains a single object class.
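A minimal version of the entropy-based purity and the cover selection might look like the recursion below over the segmentation tree. The node attributes (`children`, `class_dist`, `size`) are assumptions for illustration, and the cost here simply weights each node's entropy by its pixel count; the paper's exact cost formulation may differ.

```python
import numpy as np

def entropy(dist, eps=1e-12):
    """Shannon entropy of a class histogram; lower entropy = purer segment."""
    p = np.asarray(dist, dtype=np.float64)
    p = p / max(p.sum(), eps)
    return float(-(p * np.log(p + eps)).sum())

def optimal_cover(node):
    """Choose a set of disjoint tree nodes covering the image that
    minimizes total size-weighted entropy (i.e., maximizes purity).

    Assumed node attributes: `children` (list of nodes), `class_dist`
    (histogram over classes), `size` (pixel count).
    Returns (cost, chosen_nodes)."""
    own_cost = node.size * entropy(node.class_dist)
    if not node.children:
        return own_cost, [node]
    child_cost, child_nodes = 0.0, []
    for child in node.children:
        cost, nodes = optimal_cover(child)
        child_cost += cost
        child_nodes.extend(nodes)
    # Keep this node only if it is purer than the combined best covers
    # of its children; otherwise defer to the subtree's choice.
    if own_cost <= child_cost:
        return own_cost, [node]
    return child_cost, child_nodes
```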
The research demonstrates significant advances over existing methods, yielding superior accuracy on several standard benchmarks: the Stanford Background Dataset, the SIFT Flow Dataset, and the Barcelona Dataset. The system achieves this without complex parameter tuning, and it parses images far faster than competing models, in under a second for a 320×240 image.
Implications and Future Directions
The method's reliance on a segmentation tree, an efficient multiscale feature extraction approach, and an entropy-based optimal cover selection opens new directions for enhancing fully automatic scene parsing systems. The hierarchical representation lends itself to extensions in which alternative or multiple segmentation trees capture more complex scene variations, or in which other graph structures replace the simple tree framework.
Further refinements might leverage advanced techniques such as structured learning to learn feature representations that further improve the initial segmentation and the robustness of the overall pipeline. The method's minimal parameter dependence and demonstrated speed encourage exploration of real-time semantic segmentation, particularly in domains that demand fast and accurate perception, such as autonomous driving and augmented reality. Through such improvements, the reach of convolutional networks in scene understanding can be expanded, making them integral components of intelligent visual systems.