
RetailKLIP: Finetuning OpenCLIP backbone using metric learning on a single GPU for Zero-shot retail product image classification (2312.10282v2)

Published 16 Dec 2023 in cs.CV

Abstract: Retail product or packaged grocery goods images need to be classified in various computer vision applications such as self-checkout stores, supply chain automation, and retail execution evaluation. Previous works explore ways to finetune deep models for this purpose. But because finetuning a large model, or even a linear layer on top of a pretrained backbone, requires running at least a few epochs of gradient descent for every new retail product added to the classification range, frequent retrainings are needed in a real-world scenario. In this work, we propose finetuning the vision encoder of a CLIP model so that its embeddings can be easily used for nearest-neighbor-based classification, while also achieving accuracy close to or exceeding full finetuning. A nearest-neighbor-based classifier needs no incremental training for new products, thus saving resources and wait time.

Citations (3)

Summary

  • The paper introduces RetailKLIP, a method that finetunes an OpenCLIP backbone with metric learning for zero-shot retail product image classification.
  • It leverages the ArcFace technique on an imbalanced, real-world dataset to eliminate the need for retraining when new products are introduced.
  • Results indicate that RetailKLIP achieves competitive accuracy while significantly reducing computational overhead in retail applications.

Abstract

The paper presents an approach to classifying retail product images, specifically packaged grocery goods, for applications such as self-checkout systems. Traditional models require frequent retraining to cover newly added products; this method instead finetunes the CLIP vision encoder so that nearest-neighbor lookup over its embeddings replaces incremental training when new products are introduced, improving efficiency and saving computational resources.

Introduction

Identifying retail products through image recognition is impactful in sectors such as self-checkout stores and supply chain management. Previous strategies rely heavily on finetuning deep models for this task, but the fast pace of product launches and packaging design changes in retail necessitates frequent model retraining. The paper proposes an end-to-end process for finetuning a CLIP model's vision encoder to overcome these challenges, yielding embeddings suited to zero-shot classification.

Datasets and Methodology

The work uses an in-house dataset, RP6K, with over a million images of 6,500 retail products. The dataset is imbalanced, mimicking real-world conditions, with wide variance in the number of images per product. Existing datasets such as Grozi-120, CAPG-GP, and RP2K are used for evaluation. A ViT-L OpenCLIP model is finetuned on a single GPU with the ArcFace loss, a metric-learning technique well suited to imbalanced data, to create RetailKLIP. The finetuned encoder then generates image embeddings (vector representations) suitable for categorizing retail product images in a zero-shot manner, meaning it can classify products it has not seen during training; a rough sketch of this setup follows.
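
As a rough illustration of this setup, the sketch below pairs an OpenCLIP ViT-L vision tower with an ArcFace margin head in PyTorch. This is a minimal sketch, not the paper's exact recipe: the pretrained tag, learning rate, margin and scale values, and the class count of 6,500 (taken from RP6K) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import open_clip

class ArcFaceHead(nn.Module):
    """ArcFace margin head: adds an angular margin m to the target-class
    logit before softmax cross-entropy (Deng et al., 2019)."""
    def __init__(self, embed_dim, num_classes, scale=30.0, margin=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale, self.margin = scale, margin

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class centers.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        # Apply the angular margin only to the ground-truth class.
        cos_margin = torch.cos(torch.where(target, theta + self.margin, theta))
        return self.scale * cos_margin

# Pretrained tag is illustrative; any ViT-L-14 OpenCLIP checkpoint works.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k")
encoder = model.visual                 # keep only the vision tower
head = ArcFaceHead(embed_dim=768, num_classes=6500)  # 6500 products as in RP6K

opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-5)

def train_step(images, labels):
    emb = encoder(images)              # (B, 768) image embeddings
    logits = head(emb, labels)         # margin-adjusted cosine logits
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

At inference time the head is discarded; only the encoder's embeddings are kept, which is what makes nearest-neighbor classification possible.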

Results and Discussion

The paper compares RetailKLIP with other models, including a fully finetuned ResNext-WSL and semi-supervised backbones with additional layers. Across several datasets, RetailKLIP achieves competitive results, challenging the need for more resource-intensive models. A notable advantage is that classifying a new product requires no further training, a significant speed and efficiency benefit for real-world applications: new retail products can be integrated without the usual computational overhead, streamlining the maintenance and update cycle of retail-oriented computer vision systems.
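
To make the no-retraining claim concrete, a minimal nearest-neighbor classifier over the finetuned embeddings might look like the sketch below. Names such as reference_images, reference_product_ids, and new_product_images are hypothetical stand-ins, and encoder is the finetuned vision tower from the previous sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(encoder, images):
    """L2-normalized embeddings, so cosine similarity is a plain dot product."""
    return F.normalize(encoder(images), dim=-1)

# Gallery: one or more reference embeddings per known product.
# reference_images / reference_product_ids are hypothetical placeholders.
gallery_emb = embed(encoder, reference_images)   # (N, 768)
gallery_ids = reference_product_ids              # (N,) product labels

def classify(query_images):
    q = embed(encoder, query_images)             # (B, 768)
    sims = q @ gallery_emb.T                     # cosine similarities
    return gallery_ids[sims.argmax(dim=1)]       # 1-nearest-neighbor label

# Adding a new product needs no gradient descent: just extend the gallery.
new_emb = embed(encoder, new_product_images)
gallery_emb = torch.cat([gallery_emb, new_emb])
gallery_ids = torch.cat([gallery_ids, new_product_ids])
```

Since adding a product is a single concatenation rather than a training run, the catalog can grow continuously without the wait time and compute cost of retraining.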