- The paper introduces RetailKLIP, a method that finetunes an OpenCLIP backbone with metric learning for zero-shot retail product image classification.
- It applies the ArcFace loss during finetuning on an imbalanced, real-world dataset, producing embeddings that remove the need for retraining when new products are introduced.
- Results indicate that RetailKLIP achieves competitive accuracy while significantly reducing computational overhead in retail applications.
Abstract
The paper proposes an approach to classifying retail product images, specifically packaged grocery goods, for applications such as self-checkout systems. Traditional models require frequent retraining to accommodate new products; this method instead finetunes the vision encoder of a CLIP model so that new products can be classified from their embeddings alone, without incremental training, saving computational resources.
Introduction
Identifying retail products from images has applications in sectors such as self-checkout stores and supply chain management. Previous approaches rely heavily on finetuning deep models for a fixed product set, but the fast pace of product launches and packaging redesigns in retail forces frequent retraining. The paper proposes an end-to-end process for finetuning a CLIP model's vision encoder so that its embeddings support zero-shot classification of products unseen during training.
Datasets and Methodology
The work uses an in-house dataset, called RP6K, with over a million images of 6500 retail products. The dataset is imbalanced, mirroring real-world conditions, with the number of images per product varying widely. Existing datasets such as Grozi-120, CAPG-GP, and RP2K are used for evaluation. A ViT-L OpenCLIP model is finetuned on a single GPU with the ArcFace loss, a margin-based metric-learning objective that copes well with imbalanced data, yielding RetailKLIP. The finetuned encoder then generates image embeddings (vector representations) suitable for classifying retail product images in a zero-shot manner, meaning it can classify products it has not seen during training; a sketch of this setup follows below.
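As a concrete illustration, here is a minimal sketch of ArcFace finetuning on top of an OpenCLIP vision encoder. The checkpoint tag, embedding dimension, scale `s`, margin `m`, and learning rate are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import open_clip

class ArcFaceHead(nn.Module):
    """ArcFace: scaled softmax with an additive angular margin on the
    target class (Deng et al., 2019), used here as the finetuning loss."""
    def __init__(self, embed_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class centers.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to each sample's ground-truth class.
        onehot = F.one_hot(labels, num_classes=self.weight.shape[0]).bool()
        logits = torch.where(onehot, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)

# Load a pretrained ViT-L/14 OpenCLIP backbone (checkpoint tag is an assumption).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k")
head = ArcFaceHead(embed_dim=768, num_classes=6500)  # 6500 products in RP6K
optimizer = torch.optim.AdamW(
    list(model.visual.parameters()) + list(head.parameters()), lr=1e-5)

def train_step(images, labels):
    # images: preprocessed batch [B, 3, 224, 224]; labels: product ids [B]
    feats = model.encode_image(images)  # [B, 768] for ViT-L/14
    loss = head(feats, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training, the ArcFace head is discarded; only the finetuned encoder is kept, since classification is done directly in embedding space.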
Results and Discussion
The paper compares RetailKLIP against various other models, such as a fully finetuned ResNeXt-WSL and semi-supervised backbones with additional classification layers. Across the evaluation datasets, RetailKLIP achieves competitive accuracy, questioning the need for more resource-intensive models. Its key advantage is that classifying a new product requires no further training: the product's reference images are simply embedded and matched, as sketched below. This offers a significant speed and efficiency benefit for real-world applications, allowing swift integration of new retail products without the usual computational overhead and streamlining the maintenance of retail-oriented computer vision systems.
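The summary does not spell out the paper's exact evaluation protocol; a common way to realize zero-shot classification with such an encoder is nearest-neighbor matching against a gallery of reference embeddings. The sketch below reuses `model` and `preprocess` from the previous snippet; names such as `reference_images`, `classify`, and `new_product_images` are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(images):
    """L2-normalized embeddings from the finetuned encoder."""
    batch = torch.stack([preprocess(im) for im in images])
    return F.normalize(model.encode_image(batch), dim=-1)

# Build the gallery once from reference images of the known catalog.
gallery = embed(reference_images)   # [N, D]; hypothetical reference photos
gallery_labels = reference_labels   # product id for each gallery row

def classify(query_images):
    # Cosine similarity reduces to a dot product of normalized embeddings.
    sims = embed(query_images) @ gallery.T  # [B, N]
    return [gallery_labels[i] for i in sims.argmax(dim=-1).tolist()]

# Adding a new product needs no retraining: embed its reference images
# and append them to the gallery.
new_embs = embed(new_product_images)  # hypothetical new-product photos
gallery = torch.cat([gallery, new_embs])
gallery_labels = gallery_labels + [new_product_id] * len(new_embs)
```

Under this scheme, catalog updates are a matter of appending rows to the gallery, which is where the speed and maintenance benefits described above come from.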