- The paper introduces FreeMask, which generates synthetic image-mask pairs to overcome annotation bottlenecks and achieve competitive segmentation performance.
- It employs robust filtering and re-sampling strategies to mitigate artifacts and focus training on complex semantic masks.
- Experiments demonstrate that synthetic-image-trained models reach 48.3% mIoU on ADE20K and 49.3% mIoU on COCO-Stuff, nearly matching real-data performance.
Overview of "FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models"
The paper "FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models" offers a methodological contribution to the field of semantic segmentation by leveraging synthetic images generated through advanced generative models. The authors address the critical issue of the annotation bottleneck, which poses considerable challenges for training segmentation models due to the high cost and complexity involved in acquiring precise pixel-level labels. By utilizing synthetic images, the paper aims to enhance the performance of semantic segmentation models under fully-supervised conditions.
The core of the proposed method, termed FreeMask, is the generation of synthetic images conditioned on the semantic masks of existing datasets, so that every generated image inherits a dense annotation for free. The resulting image-mask pairs serve as additional training data, reducing dependence on expensive, labor-intensive annotation. A notable finding is that models trained solely on synthetic images perform comparably to models trained on real images: 48.3% mean Intersection-over-Union (mIoU) on ADE20K and 49.3% mIoU on COCO-Stuff, against 48.5% and 50.5% respectively for their real-data counterparts.
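To make the pipeline concrete, below is a minimal sketch of the generation loop, assuming a mask-conditioned synthesis model (the paper builds on FreestyleNet for this step). `generate_image`, the directory layout, and `copies_per_mask` are hypothetical placeholders, not the paper's code.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def generate_image(mask: np.ndarray, seed: int) -> np.ndarray:
    """Placeholder for a mask-conditioned generator (e.g. FreestyleNet).

    Expected to return an (H, W, 3) uint8 image whose layout follows `mask`.
    """
    raise NotImplementedError("plug in a mask-conditioned generative model")


def synthesize_pairs(mask_dir: str, out_dir: str, copies_per_mask: int = 20) -> None:
    """Turn every real annotation into several synthetic image-mask pairs.

    Because each image is generated *from* a semantic mask, the dense label
    comes for free: no human annotation is needed for the new images.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for mask_path in sorted(Path(mask_dir).glob("*.png")):
        mask = np.array(Image.open(mask_path))
        for i in range(copies_per_mask):
            image = generate_image(mask, seed=i)  # vary appearance per copy
            Image.fromarray(image).save(out / f"{mask_path.stem}_{i}.jpg")
```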
To improve the utility of synthetic images, the authors introduce two strategies. First, a robust filtering method handles the artifacts and errors inherent in synthesized images, which would otherwise degrade model performance: a model pre-trained on real data computes a loss for every pixel of each synthetic image, and pixels whose loss far exceeds the average loss of their class are marked as noisy and excluded from training. Second, a re-sampling technique weights training toward more complex semantic masks, so that harder scenes, which contribute more to the network's generalization, appear more often. A sketch of both steps follows.
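The sketch below implements both ideas under stated assumptions: per-pixel cross-entropy losses come from a segmentation model pre-trained on real data, pixels whose loss exceeds a tolerance times the running class-average loss are set to the ignore index, and each image's mean remaining loss serves as its hardness score. The `tolerance` value and the hardness-to-weight mapping are illustrative, not the paper's exact hyper-parameters.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

IGNORE_INDEX = 255  # standard "ignore" label in ADE20K/COCO-Stuff training


@torch.no_grad()
def filter_and_score(logits, mask, class_avg_loss, tolerance=1.25):
    """Noise filtering + hardness scoring for one synthetic image.

    logits:         (C, H, W) float predictions from a real-data model
    mask:           (H, W) long semantic mask the image was conditioned on
    class_avg_loss: (C,) running average per-class loss over synthetic data
    Returns the cleaned mask and a scalar hardness used for re-sampling.
    """
    loss = F.cross_entropy(
        logits.unsqueeze(0), mask.unsqueeze(0),
        ignore_index=IGNORE_INDEX, reduction="none",
    )[0]                                   # per-pixel loss, shape (H, W)
    valid = mask != IGNORE_INDEX
    safe = mask.clone()
    safe[~valid] = 0                       # avoid out-of-range indexing
    # A pixel far harder than is typical for its class is treated as a
    # synthesis artifact and removed from the training signal.
    noisy = valid & (loss > tolerance * class_avg_loss[safe])
    cleaned = mask.clone()
    cleaned[noisy] = IGNORE_INDEX
    keep = valid & ~noisy
    hardness = loss[keep].mean().item() if keep.any() else 0.0
    return cleaned, hardness


# Re-sampling: draw harder masks more often, e.g. with a weighted sampler,
# where `hardness_list` collects the scores returned above.
# sampler = WeightedRandomSampler(weights=hardness_list,
#                                 num_samples=len(hardness_list),
#                                 replacement=True)
```

The weighted sampler shown is one standard PyTorch way to realize loss-proportional re-sampling; the paper tunes the tolerance and the weighting per benchmark.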
The experiments show substantial gains when real and synthetic training data are combined. Two paradigms are evaluated: joint training, in which the model trains on an augmented dataset of over-sampled real images alongside their synthetic counterparts, and pre-training, in which the model first trains on synthetic images and is then fine-tuned on real ones. This integration yields marked improvements; for instance, joint training lifts the ADE20K score from 48.7% to 52.0% mIoU.
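Both paradigms are simple to express in code. The sketch below assumes PyTorch map-style datasets; `real_repeat` is an illustrative over-sampling ratio, not the paper's setting.

```python
from torch.utils.data import ConcatDataset, DataLoader


def build_joint_dataset(real_ds, synth_ds, real_repeat=4):
    """Joint training: repeat the real set so the (much larger) synthetic
    set does not dominate every epoch; `real_repeat` is illustrative."""
    return ConcatDataset([real_ds] * real_repeat + [synth_ds])


# Joint training:
#   loader = DataLoader(build_joint_dataset(real_ds, synth_ds),
#                       batch_size=16, shuffle=True)
# Pre-training alternative: train on synth_ds alone first, then fine-tune
# the resulting checkpoint on real_ds only.
```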
The implications of this research are significant for both theory and practice. By effectively doubling or tripling the accessible training data without prohibitive additional cost, the approach could democratize access to high-performing models, especially in resource-constrained settings. Looking ahead, stronger generative models and synthesis pipelines tailored to individual use cases could push further what is achievable in semantic segmentation and related computer-vision tasks.
This paper underlines the potential for synthetic data not only to supplement but, under the right conditions, to replace real training data, opening pathways for privacy-sensitive applications, domains with limited data access, and real-world scenarios where annotation resources are scarce. More broadly, it positions generative models as a practical engine for improving discriminative models, a promising direction for making machine learning both more effective and more widely accessible.