Learning to Refine Object Segments (1603.08695v2)

Published 29 Mar 2016 in cs.CV

Abstract: Object segmentation requires both object-level information and low-level pixel data. This presents a challenge for feedforward networks: lower layers in convolutional nets capture rich spatial information, while upper layers encode object-level knowledge but are invariant to factors such as pose and appearance. In this work we propose to augment feedforward nets for object segmentation with a novel top-down refinement approach. The resulting bottom-up/top-down architecture is capable of efficiently generating high-fidelity object masks. Similarly to skip connections, our approach leverages features at all layers of the net. Unlike skip connections, our approach does not attempt to output independent predictions at each layer. Instead, we first output a coarse "mask encoding" in a feedforward pass, then refine this mask encoding in a top-down pass utilizing features at successively lower layers. The approach is simple, fast, and effective. Building on the recent DeepMask network for generating object proposals, we show accuracy improvements of 10-20% in average recall for various setups. Additionally, by optimizing the overall network architecture, our approach, which we call SharpMask, is 50% faster than the original DeepMask network (under 0.8s per image).

Citations (846)


Summary

The paper introduces a top-down refinement method that merges coarse predictions with intermediate features to enhance segmentation accuracy.
It presents SharpMask, a dual-pathway architecture that improves average recall by 10-20% and processes images 50% faster than DeepMask.
The study shows that a sequence of refinement modules, each doubling the spatial resolution of the mask encoding, yields sharper masks and holds up on small objects, with possible relevance to other dense prediction tasks.

Learning to Refine Object Segments

Object segmentation is a central problem in computer vision. It combines the challenges of object detection and semantic segmentation: each object instance must be delineated at the pixel level. The difficulty is that this requires both high-level object information and low-level pixel detail. Convolutional neural networks (CNNs) have been very successful at object detection, yet their structure, which interleaves convolutional and pooling layers, produces coarse feature maps in the upper layers, and these hinder pixel-accurate segmentation. This paper introduces SharpMask, a dual-pathway network architecture that augments a feedforward CNN with a top-down refinement process to improve segmentation quality.
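The loss of spatial detail caused by pooling can be illustrated with a toy example. The sketch below is not taken from the paper: the helper `max_pool_2x2` and the two 8x8 masks are hypothetical, and a real network pools learned feature maps rather than binary masks. It only shows that two inputs whose object boundary differs by one pixel can become identical after a single 2x2 pooling step.

```python
def max_pool_2x2(grid):
    """Downsample a 2D list by taking the max over non-overlapping 2x2 blocks."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Two hypothetical 8x8 binary masks; the object's left edge sits at
# column 3 in the first and at column 2 in the second.
mask_a = [[1 if c >= 3 else 0 for c in range(8)] for _ in range(8)]
mask_b = [[1 if c >= 2 else 0 for c in range(8)] for _ in range(8)]

pooled_a = max_pool_2x2(mask_a)  # 4x4
pooled_b = max_pool_2x2(mask_b)  # 4x4

# The inputs differ, but the one-pixel boundary shift is lost after pooling.
print(mask_a == mask_b, pooled_a == pooled_b)
```

This is the kind of information that a top-down pass can only recover by going back to the lower-layer features, which still have the original resolution.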

Approach Overview

SharpMask builds upon the DeepMask architecture, addressing its limitations in generating high-fidelity object masks. Instead of relying solely on feedforward predictions from upper CNN layers, SharpMask employs a hybrid bottom-up/top-down methodology. The primary distinctions and contributions of SharpMask are as follows:

Top-Down Refinement: After an initial coarse mask prediction from a feedforward pass, a top-down refinement pathway sequentially integrates intermediate feature maps, progressively enhancing the mask resolution.
Refinement Modules: Each module merges a coarse mask encoding with features from corresponding feedforward layers. These modules iteratively double the spatial resolution of the mask encoding, producing finer, more accurate segments.
Architectural Simplicity and Efficiency: Skip-connection architectures output an independent prediction at each layer. SharpMask instead carries a single mask encoding through the top-down pass and refines it at each step, which keeps the design simple and the computation low.
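The data flow described in the points above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single channel, a fixed weighted sum followed by a ReLU as a stand-in for the learned merge, and nearest-neighbour upsampling as a stand-in for whatever upsampling the network uses. The names `refine`, `top_down`, `w_mask`, and `w_feat` are hypothetical.

```python
def upsample_2x(grid):
    """Nearest-neighbour 2x upsampling of a 2D list (stand-in for the real upsampling)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))  # repeat the row vertically
    return out


def refine(mask_enc, features, w_mask=0.5, w_feat=0.5):
    """One hypothetical refinement step.

    Merges an HxW mask encoding with HxW features from the matching
    feedforward layer, then doubles the spatial resolution to 2H x 2W.
    The weighted sum + ReLU replaces the learned merge (an assumption).
    """
    if len(mask_enc) != len(features) or len(mask_enc[0]) != len(features[0]):
        raise ValueError("mask encoding and features must share a resolution")
    merged = [
        [max(0.0, w_mask * m + w_feat * f) for m, f in zip(m_row, f_row)]
        for m_row, f_row in zip(mask_enc, features)
    ]
    return upsample_2x(merged)


def top_down(coarse_enc, feature_pyramid):
    """Apply refinement steps in sequence, from the coarsest features to the finest."""
    enc = coarse_enc
    for feats in feature_pyramid:
        enc = refine(enc, feats)
    return enc


def zeros(n):
    return [[0.0] * n for _ in range(n)]


# Hypothetical sizes: a 2x2 coarse encoding and features at 2x2, 4x4, and 8x8.
coarse = [[1.0, 0.0], [0.0, 1.0]]
out = top_down(coarse, [zeros(2), zeros(4), zeros(8)])
print(len(out), len(out[0]))  # three doublings: 2 -> 4 -> 8 -> 16
```

Note that only one encoding flows through the loop. No per-layer prediction is produced and combined afterwards, which is the contrast with skip connections drawn above.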

Experimental Results

SharpMask's efficacy is evident through substantial improvements in key metrics over its predecessor, DeepMask, and other contemporary methods:

Average Recall (AR): SharpMask improves AR by 10-20% over DeepMask across the evaluated setups on the COCO dataset.
Speed: The enhanced architecture achieves a 50% speed improvement over DeepMask, processing images in under 0.8s on average.
Small Object Performance: When applying the model across multiple scales, SharpMask maintains robust performance even with smaller object instances, crucial for practical deployment in diverse real-world scenarios.
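For readers unfamiliar with the metric, the sketch below shows one common way to compute average recall for mask proposals. It assumes the COCO-style convention of averaging recall over IoU thresholds from 0.50 to 0.95 in steps of 0.05; the summary above does not spell out the exact protocol, so treat the thresholds as an assumption. Masks are represented as sets of pixel coordinates, and `mask_iou` and `average_recall` are hypothetical helper names.

```python
def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_recall(gt_masks, proposals, thresholds=None):
    """Mean recall over IoU thresholds (assumed COCO-style: 0.50, 0.55, ..., 0.95)."""
    if thresholds is None:
        thresholds = [t / 100 for t in range(50, 100, 5)]
    if not gt_masks:
        return 0.0
    # Best IoU achieved by any proposal for each ground-truth object.
    best = [
        max((mask_iou(gt, p) for p in proposals), default=0.0) for gt in gt_masks
    ]
    # Recall at a threshold = fraction of ground truths covered at that IoU.
    recalls = [sum(b >= t for b in best) / len(best) for t in thresholds]
    return sum(recalls) / len(recalls)


# Hypothetical example: one 4x4 ground-truth object and a proposal missing 3 pixels.
gt = {(r, c) for r in range(4) for c in range(4)}
proposal = gt - {(3, 1), (3, 2), (3, 3)}  # IoU = 13/16 = 0.8125
ar = average_recall([gt], [proposal])
print(round(ar, 3))
```

Because the average runs over increasingly strict IoU thresholds, sharper mask boundaries raise AR even when the same objects are found, which is why refinement can move this metric.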

The reported results indicate that SharpMask set a new state of the art for object proposal recall at the time of publication. The paper also examines different architectural choices for both the feedforward and top-down pathways and identifies configurations that balance speed and accuracy.

Theoretical and Practical Implications

The refinement architecture introduced by SharpMask is not specific to object segmentation and could be applied to other pixel-labeling tasks, such as depth estimation and edge detection. More generally, combining multi-scale feature maps with an efficient top-down refinement pass is a principle that may benefit a range of dense prediction tasks.

Future Directions

Further research could explore how well the top-down refinement modules generalize across network architectures, for example by pairing them with stronger backbones such as residual networks (ResNets). Improved training schemes and larger, more varied datasets could also extend SharpMask's capabilities and range of applications.

In conclusion, SharpMask is a simple and effective method for generating high-fidelity object segments efficiently. Its practical benefits and architectural ideas provide a basis for future research in efficient, accurate, and scalable object instance segmentation.
