DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images (1901.07973v1)

Published 23 Jan 2019 in cs.CV

Abstract: Understanding fashion images has been advanced by benchmarks with rich annotations such as DeepFashion, whose labels include clothing categories, landmarks, and consumer-commercial image pairs. However, DeepFashion has non-negligible issues such as a single clothing item per image, sparse landmarks (4 to 8 only), and no per-pixel masks, leaving a significant gap from real-world scenarios. We fill in the gap by presenting DeepFashion2 to address these issues. It is a versatile benchmark of four tasks including clothes detection, pose estimation, segmentation, and retrieval. It has 801K clothing items where each item has rich annotations such as style, scale, viewpoint, occlusion, bounding box, dense landmarks and masks. There are also 873K Commercial-Consumer clothes pairs. A strong baseline is proposed, called Match R-CNN, which builds upon Mask R-CNN to solve the above four tasks in an end-to-end manner. Extensive evaluations are conducted with different criteria in DeepFashion2.

Summary

The paper constructs a richly annotated benchmark to robustly evaluate tasks such as detection, pose estimation, segmentation, and retrieval in fashion images.
The paper introduces Match R-CNN, an innovative extension of Mask R-CNN that integrates multi-stream features for improved analysis.
Extensive evaluations reveal challenges like occlusion and scale variations, highlighting areas for further research and development.

An Analytical Overview of the DeepFashion2 Benchmark

The research paper presents DeepFashion2, an extensive benchmark for understanding and analyzing fashion images that addresses gaps in currently available datasets. The authors detail the limitations of existing datasets such as the original DeepFashion: a single clothing item per image, sparse landmarks (4 to 8 per item), and no per-pixel masks, all of which make it a poor reflection of real-world scenarios.

Objectives and Contributions of DeepFashion2

DeepFashion2 seeks to advance four primary tasks: clothes detection, pose estimation, segmentation, and retrieval, supported by comprehensive annotations. The dataset comprises 801,000 clothing items across 491,000 images, each item annotated with style, scale, viewpoint, occlusion, bounding box, dense landmarks, and masks. It also includes 873,000 commercial-consumer clothing pairs, 3.5 times as many as the original DeepFashion dataset.
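
As a quick sanity check on these figures, the arithmetic below derives the average number of annotated items per image and the pair count that the 3.5x ratio would imply for the original DeepFashion. It is only a sketch on the rounded numbers quoted above; the variable names are illustrative, and the implied pair count is an estimate, not a figure reported in this overview.

```python
# Figures quoted in the overview (rounded to the nearest thousand).
NUM_ITEMS = 801_000
NUM_IMAGES = 491_000
NUM_PAIRS = 873_000
PAIR_RATIO_VS_DEEPFASHION = 3.5

# Average number of annotated clothing items per image.
items_per_image = NUM_ITEMS / NUM_IMAGES

# Pair count this ratio would imply for the original DeepFashion.
# Assumption: "3.5 times" is read as a plain ratio of rounded figures.
implied_deepfashion_pairs = NUM_PAIRS / PAIR_RATIO_VS_DEEPFASHION

print(f"items per image: {items_per_image:.2f}")
print(f"implied DeepFashion pairs: {implied_deepfashion_pairs:,.0f}")
```

On these rounded numbers the dataset averages roughly 1.6 items per image, which is consistent with its stated aim of moving beyond one item per image.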

The paper's contributions are threefold:

Construction of a versatile and richly annotated fashion benchmark that supports a diverse range of image analysis tasks.
Definition of a full spectrum of tasks with DeepFashion2, including a pioneering effort in clothing pose estimation through a detailed landmark and pose schema for 13 categories.
Introduction of Match R-CNN, an innovative extension of Mask R-CNN aimed at solving the proposed tasks in an end-to-end manner. Match R-CNN leverages multiple streams to integrate features learned from different facets of clothing images.
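
This overview does not spell out how Match R-CNN decides whether two clothing items match. The block below is therefore only a minimal sketch of one plausible pairwise matcher, assuming each item has already been reduced to a fixed-length feature vector: it takes the element-wise squared difference of the two vectors and passes it through a single logistic layer. The function name, weights, and bias are hypothetical and would be learned in practice.

```python
import math

def match_probability(feat_a, feat_b, weights, bias):
    """Score a pair of item features as match / no match.

    Assumed design (not confirmed by this overview): element-wise
    squared difference followed by one logistic (sigmoid) layer.
    """
    if not (len(feat_a) == len(feat_b) == len(weights)):
        raise ValueError("features and weights must have equal length")
    # Squared difference is 0 where the two features agree.
    sq_diff = [(a - b) ** 2 for a, b in zip(feat_a, feat_b)]
    logit = sum(w * d for w, d in zip(weights, sq_diff)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical weights: negative, so larger differences lower the score.
weights, bias = [-1.5, -1.5], 2.0
same = match_probability([1.0, 0.0], [1.0, 0.0], weights, bias)
different = match_probability([1.0, 0.0], [0.0, 1.0], weights, bias)
print(same > different)
```

With negative weights, identical features yield the highest probability and the score falls as the features diverge, which is the behavior a retrieval ranking needs.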

Empirical Evaluations and Insights

Extensive evaluations with Mask R-CNN demonstrate how difficult DeepFashion2 is. Detection results are broken down by subset (scale, occlusion level, zoom level, and viewpoint), giving a nuanced picture of the challenges posed by real-world fashion images. Accuracy drops significantly under heavy occlusion, scale variation, and viewpoint change, which pinpoints areas needing improvement in future work.
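
To make the per-subset breakdown concrete, the sketch below computes box intersection-over-union (IoU) and then a hit rate per subset label. It is a deliberate simplification: it counts a ground-truth box as found when its single predicted box reaches an IoU threshold, which is not the average-precision protocol a detection benchmark would normally report. The function names, the sample format, and the subset labels are assumptions for illustration.

```python
from collections import defaultdict

def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def hit_rate_by_subset(samples, iou_threshold=0.5):
    """samples: iterable of (subset_label, gt_box, pred_box_or_None)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for subset, gt_box, pred_box in samples:
        totals[subset] += 1
        if pred_box is not None and box_iou(gt_box, pred_box) >= iou_threshold:
            hits[subset] += 1
    return {subset: hits[subset] / totals[subset] for subset in totals}

# Toy data with hypothetical occlusion labels.
samples = [
    ("slight", (0, 0, 10, 10), (0, 0, 10, 10)),
    ("heavy", (0, 0, 10, 10), (5, 5, 15, 15)),
    ("heavy", (0, 0, 10, 10), (0, 0, 10, 9)),
]
rates = hit_rate_by_subset(samples)
print(rates)
```

Grouping by subset label in this way is what lets a single overall score be split into the occlusion, scale, zoom, and viewpoint comparisons discussed above.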

The landmark and pose estimation metrics indicate that clothing image analysis can be more challenging than human pose estimation, given the variability and non-rigid deformation of garments. Segmentation results also decline considerably under the same variations, emphasizing the need for more sophisticated segmentation approaches.
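
The overview does not say how landmark accuracy is scored. A common choice for keypoint benchmarks is the COCO-style object keypoint similarity (OKS), and the sketch below assumes that metric. The function name, the input layout, and the per-landmark constants are illustrative, not values taken from the paper.

```python
import math

def keypoint_similarity(gt, pred, area, kappas):
    """COCO-style object keypoint similarity for one clothing item.

    gt:     list of (x, y, labeled) ground-truth landmarks
    pred:   list of (x, y) predicted landmarks, same order as gt
    area:   item area in pixels, used as the squared object scale
    kappas: per-landmark tolerance constants (hypothetical values)
    """
    total, count = 0.0, 0
    for (gx, gy, labeled), (px, py), kappa in zip(gt, pred, kappas):
        if not labeled:
            continue  # unlabeled landmarks do not affect the score
        dist_sq = (gx - px) ** 2 + (gy - py) ** 2
        total += math.exp(-dist_sq / (2.0 * area * kappa ** 2))
        count += 1
    return total / count if count else 0.0

gt = [(0, 0, 1), (10, 0, 1), (5, 5, 0)]
pred = [(0, 0), (10, 3), (50, 50)]
oks = keypoint_similarity(gt, pred, area=100.0, kappas=[0.1, 0.1, 0.1])
print(oks)
```

Because the distance is normalized by the item area, the same pixel error costs more on a small item than on a large one, which is one reason scale variation affects landmark scores.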

In the clothes retrieval task, accuracy differs clearly depending on whether ground-truth or detected bounding boxes are used. Integrating classification and pose features significantly improves retrieval performance, showing the benefit of aggregating complementary features for this task.
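
A retrieval evaluation of this kind can be sketched as a top-k accuracy over a gallery. The block below is a minimal illustration under stated assumptions: features are fused by simple concatenation, similarity is cosine, and a query counts as a hit when an item with the same pair identifier appears among its k nearest gallery items. None of these choices is confirmed by this overview, and all names are hypothetical.

```python
import math

def fuse(class_feat, pose_feat):
    """Assumed fusion: concatenate classification and pose features."""
    return list(class_feat) + list(pose_feat)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_accuracy(queries, gallery, k):
    """queries, gallery: lists of (feature_vector, pair_id)."""
    if not queries:
        return 0.0
    hits = 0
    for q_feat, q_id in queries:
        # Rank gallery items from most to least similar to the query.
        ranked = sorted(gallery, key=lambda item: cosine(q_feat, item[0]),
                        reverse=True)
        if any(g_id == q_id for _, g_id in ranked[:k]):
            hits += 1
    return hits / len(queries)

gallery = [([1.0, 0.0], "a"), ([0.0, 1.0], "b"), ([1.0, 1.0], "c")]
queries = [([0.9, 0.1], "a"), ([0.1, 0.9], "c")]
top1 = top_k_accuracy(queries, gallery, k=1)
top2 = top_k_accuracy(queries, gallery, k=2)
print(top1, top2)
```

Swapping the feature passed to `top_k_accuracy` (classification only, pose only, or the output of `fuse`) is one way to reproduce the kind of feature comparison described above.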

Implications and Future Directions

DeepFashion2 is a significant step beyond existing datasets: it combines multiple clothing items per image with the rich annotations needed to improve fashion image analysis. Its breadth should encourage the development of more robust, adaptable models that can handle the variability of real-world apparel images.

As computer vision continues to evolve, DeepFashion2 could facilitate advances in areas such as fashion item generation with GANs, dynamic trend analysis, and more sophisticated domain adaptation techniques. Introducing additional evaluation metrics for model efficiency would open a path toward practical applications, making DeepFashion2 a useful resource for both academic research and industry.

In conclusion, DeepFashion2, with its extensive data and rigorous tasks, establishes itself as a key benchmark for fashion image understanding and provides fertile ground for future work in AI-driven fashion analysis.
