- The paper introduces a novel synthesis method that generates large annotated instance detection datasets using patch-level realism.
- It details a process involving foreground segmentation, object blending, and data augmentation to enhance model robustness.
- Experimental results show that models trained on the synthetic data perform competitively, especially in cross-domain settings.
Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection
This paper introduces a novel method for generating large annotated datasets for instance detection with minimal effort, addressing a significant barrier to deploying object detection models in diverse environments. The authors propose focusing on patch-level realism rather than global scene consistency, exploiting the observation that modern object detectors rely primarily on local, region-based features rather than global scene context.
Methodology
The proposed synthesis process comprises several key steps:
- Image Collection: Object instance images are taken from existing datasets such as BigBIRD, and background scenes from datasets such as UW Scenes, providing diverse object viewpoints and background environments.
- Foreground Segmentation: A fully convolutional network (FCN) is trained for pixel-level foreground/background classification, automatically extracting object masks that separate each object from its background.
- Object Placement and Blending: Objects are then pasted onto background scenes. To remove the boundary artifacts that naive pasting produces, multiple blending techniques, including Gaussian blurring and Poisson blending, are applied to smooth the transitions and achieve realism at the patch level (a minimal pasting-and-blending sketch follows this list).
- Data Augmentation: Additional variability is introduced through rotations, truncations, occlusions, and the inclusion of distractor objects, capturing the diverse viewpoints and scenarios needed for robust training (see the augmentation sketch below).
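The pasting step can be illustrated with a minimal sketch using OpenCV and NumPy. It assumes an object crop and its binary mask are already available (e.g., produced by the segmentation network above); `paste_object` and its arguments are illustrative names, not the authors' code, and the three modes mirror the naive, Gaussian-feathered, and Poisson variants discussed in the paper.

```python
import cv2
import numpy as np

def paste_object(background, crop, mask, x, y, mode="gaussian"):
    """Paste an object crop onto a background at (x, y) and return the
    composited image plus the bounding box implied by the paste location.

    background: HxWx3 uint8 scene image
    crop:       hxwx3 uint8 object crop
    mask:       hxw   uint8 binary foreground mask (255 = object)
    """
    h, w = crop.shape[:2]
    out = background.copy()

    if mode == "poisson":
        # Poisson (seamless) blending; OpenCV expects the centre of the paste region.
        center = (x + w // 2, y + h // 2)
        out = cv2.seamlessClone(crop, out, mask, center, cv2.NORMAL_CLONE)
    else:
        alpha = mask.astype(np.float32) / 255.0
        if mode == "gaussian":
            # Feather the mask edge to soften the sharp boundary artifacts
            # that naive pasting leaves behind.
            alpha = cv2.GaussianBlur(alpha, (5, 5), 2)
        alpha = alpha[..., None]
        roi = out[y:y + h, x:x + w].astype(np.float32)
        blended = alpha * crop.astype(np.float32) + (1.0 - alpha) * roi
        out[y:y + h, x:x + w] = blended.astype(np.uint8)

    bbox = (x, y, x + w, y + h)  # the box annotation comes for free
    return out, bbox
```

Because the generator controls where each crop is placed, the bounding-box label requires no manual annotation; rendering the same placement under several `mode` values is then a one-line change.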
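Geometric augmentation before pasting can be sketched in the same spirit. The helper below applies a random 2D rotation and scale jointly to a crop and its mask so the two stay aligned; the function name and parameter ranges are illustrative choices, not values from the paper.

```python
import cv2
import numpy as np

def augment_crop(crop, mask, max_angle=30.0, scale_range=(0.8, 1.2), rng=None):
    """Randomly rotate and rescale an object crop together with its mask."""
    if rng is None:
        rng = np.random.default_rng()
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(*scale_range)

    h, w = crop.shape[:2]
    # Same affine transform for crop and mask so the mask stays valid.
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
    crop_aug = cv2.warpAffine(crop, M, (w, h), flags=cv2.INTER_LINEAR)
    mask_aug = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    return crop_aug, mask_aug
```

In this setup, truncation and occlusion would come from the placement itself: placing a crop so it extends past the image border truncates it, and pasting distractor objects over or near the targets produces occlusions, while the recorded boxes remain correct.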
Results and Analysis
The synthesized datasets are evaluated with state-of-the-art detection models such as Faster R-CNN (a minimal fine-tuning sketch follows the findings below). The key findings from the experiments can be summarized as follows:
- Competitive Performance: Models trained on the synthetic data perform competitively with models trained on real annotated data. In particular, the synthetic data yields better results than synthesis methods that aim for global scene consistency.
- Complementary Insights: Synthetic data addresses the long-tail distribution and poor viewpoint coverage inherent to manually curated datasets. In cross-domain settings, models trained on the synthesized data outperform those trained on real data, especially when real training data is limited.
- Blending and Training Robustness: Varying the blending technique improves model robustness by forcing the detector to focus on object features rather than blending artifacts. Synthesizing the same scene with several different blendings further strengthens this effect.
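To make the evaluation setup concrete, here is a minimal fine-tuning sketch using torchvision's Faster R-CNN implementation (not the authors' original training code). The class count and the dummy sample are hypothetical placeholders standing in for images and boxes produced by the cut-and-paste generator, and `weights="DEFAULT"` assumes a recent torchvision release.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Hypothetical label space: some number of object instances plus background.
num_classes = 1 + 11

# Start from a COCO-pretrained detector and swap in a new box-predictor head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# One dummy synthetic sample: an image tensor plus the box and label recorded
# when the object was pasted into the scene.
images = [torch.rand(3, 512, 512)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 260.0, 300.0]]),
            "labels": torch.tensor([3])}]

model.train()
loss_dict = model(images, targets)   # dict of RPN and box-head losses
loss = sum(loss_dict.values())
loss.backward()                      # an optimizer step would follow in a real loop
```

A training set built this way would typically include the same pasted scene rendered with several blending variants, in line with the robustness finding above.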
Implications and Future Developments
This research highlights the potential of efficiently generating training data for instance detection tasks, facilitating rapid deployment in diverse and personalized environments such as robotics and VR/AR applications. The simplicity and effectiveness of the method suggest that it could be readily adopted or combined with existing global-consistency-based synthesis approaches to create even more effective training datasets.
Future research can explore integrating the authors' techniques with those that ensure global scene consistency or apply advanced rendering models, expanding the applicability to broader scenarios and further improving generalization across various domains. The synthesis method presented here paves the way for scalable dataset creation, crucial for advancing object detection in ever-evolving environments.