FashionFail: Addressing Failure Cases in Fashion Object Detection and Segmentation (2404.08582v1)
Abstract: In the realm of fashion object detection and segmentation for online shopping images, existing state-of-the-art fashion parsing models encounter limitations, particularly when exposed to non-model-worn apparel and close-up shots. To address these failures, we introduce FashionFail, a new fashion dataset with e-commerce images for object detection and segmentation. The dataset is efficiently curated using our novel annotation tool that leverages recent foundation models. The primary objective of FashionFail is to serve as a test bed for evaluating the robustness of models. Our analysis reveals the shortcomings of leading models, such as Attribute-Mask R-CNN and Fashionformer. Additionally, we propose a baseline approach using naive data augmentation to mitigate common failure cases and improve model robustness. Through this work, we aim to inspire and support further research in fashion item detection and segmentation for industrial applications. The dataset, annotation tool, code, and models are available at \url{https://rizavelioglu.github.io/fashionfail/}.
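The abstract's "annotation tool that leverages recent foundation models" suggests a prompt-to-box-to-mask pipeline: an open-set detector such as Grounding DINO proposes a bounding box from a text prompt (e.g., a product title), and a promptable segmenter such as SAM converts that box into a mask. The following is a minimal sketch of that idea, assuming the official `groundingdino` and `segment_anything` Python packages; the checkpoint paths, thresholds, and prompt are illustrative placeholders, not the authors' settings.

```python
# Sketch: text prompt -> box (Grounding DINO) -> mask (SAM).
# Checkpoint/config paths below are hypothetical; substitute real weights.
import torch
from groundingdino.util.inference import load_model, load_image, predict
from groundingdino.util import box_ops
from segment_anything import sam_model_registry, SamPredictor

dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# load_image returns the raw RGB array plus the model-ready tensor.
image_source, image = load_image("product.jpg")  # e-commerce product shot
h, w, _ = image_source.shape

# Box proposals grounded in a text prompt (e.g., the product title).
boxes, logits, phrases = predict(
    model=dino, image=image, caption="leather boot",
    box_threshold=0.35, text_threshold=0.25,
)
assert len(boxes) > 0, "detector returned no box for this prompt"

# Grounding DINO emits normalized cxcywh boxes; SAM expects absolute xyxy.
boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])

# Segment the highest-confidence box into a single mask.
predictor.set_image(image_source)
masks, scores, _ = predictor.predict(
    box=boxes_xyxy[0].numpy(), multimask_output=False,
)
```

In an annotation setting, a human would then only need to verify or reject the proposed mask rather than draw polygons from scratch, which is how such a tool can curate a dataset efficiently.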