- The paper demonstrates that decision-level fusion enhances classification accuracy by combining separate text and image CNN classifiers.
- It leverages a large-scale dataset from Walmart.com with 1.2M products across nearly 2,900 categories to validate the multi-modal approach.
- The study reveals that while the text CNN outperforms the image CNN overall, the fusion strategy captures complementary strengths for improved results.
A Deep Multi-Modal Fusion Architecture for Product Classification in E-commerce
This paper addresses a central problem in e-commerce: classifying products accurately so that search and recommendation systems can surface them effectively. With a rapid influx of new products and a continually evolving category taxonomy, manual classification does not scale, which motivates robust machine learning solutions. The authors propose a multi-modal fusion architecture that leverages both text and image inputs to improve classification accuracy in this setting.
The research introduces a decision-level fusion approach, in contrast to the feature-level fusion methods established in prior work. Input-specific deep neural networks are first trained separately on the text and image sources; their strengths are then combined through a policy network. The policy network learns to switch between the text CNN and the image CNN on a per-product basis, according to which input type is predicted to classify that product correctly, thereby improving overall classification performance.
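To make the architecture concrete, the following is a minimal PyTorch sketch of this decision-level setup, under the assumption that the text CNN and image CNN are already trained and each emits a class-probability vector. The module names, layer sizes, and the shelf count of 2,900 are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

NUM_SHELVES = 2900  # illustrative; the paper reports nearly 2,900 shelf categories


class PolicyNetwork(nn.Module):
    """Decides, per product, whether to trust the text CNN or the image CNN,
    using the class-probability vectors of both classifiers as input."""

    def __init__(self, num_classes: int = NUM_SHELVES):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_classes, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # logits over {use text CNN, use image CNN}
        )

    def forward(self, text_probs: torch.Tensor, image_probs: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([text_probs, image_probs], dim=-1))


def fused_prediction(text_probs, image_probs, policy_logits):
    """Route each product to the classifier the policy prefers and take its top class."""
    use_image = policy_logits.argmax(dim=-1).bool()                   # True -> image CNN
    chosen_probs = torch.where(use_image.unsqueeze(-1), image_probs, text_probs)
    return chosen_probs.argmax(dim=-1)                                # predicted shelf id
```

Because the policy consumes only the two probability vectors, it is cheap to train and leaves both base classifiers untouched, which is the practical appeal of decision-level over feature-level fusion.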
One of the significant contributions of this work is the large-scale dataset collected from Walmart.com, comprising approximately 1.2 million products. The dataset poses a particularly challenging multi-class, multi-label problem, with nearly 2,900 possible shelf categories, since a single product can be assigned to several shelves at once.
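To make the label structure concrete, here is a minimal sketch of a multi-hot target encoding for such a multi-label setup. The shelf names, the `shelf_to_id` mapping, and the shelf count of 2,900 are illustrative assumptions, not values taken from the paper.

```python
import torch

# Hypothetical shelf-name -> index mapping; the real taxonomy has ~2,900 shelves.
shelf_to_id = {"Laptops": 0, "Computer Accessories": 1, "Electronics Deals": 2}


def encode_shelves(shelf_names, num_shelves=2900):
    """Build a multi-hot target vector: a product may belong to several shelves."""
    target = torch.zeros(num_shelves)
    for name in shelf_names:
        target[shelf_to_id[name]] = 1.0
    return target


y = encode_shelves(["Laptops", "Electronics Deals"])  # two active shelves
```

The paper highlights several critical findings: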
- Performance Differentials: The text CNN generally outperforms the image CNN on this dataset, achieving markedly higher top-1 accuracy. This result underscores the greater predictive power of textual metadata in e-commerce classification, since product titles and descriptions typically carry richer category signals than product images.
- Potential of Multi-Modal Fusion: Despite the text model's stronger overall performance, the error analysis showed that roughly 8% of products were classified correctly by the image model but misclassified by the text model. This disagreement created the opportunity for multi-modal fusion to yield further gains.
- Decision-Level Fusion Strategy: The authors demonstrate that a decision-level fusion approach, learned from the class probabilities of the text and image networks, surpasses both individual models in classification accuracy. The policy network achieves a 1.6% improvement in top-1 accuracy, showing that it can harness the complementary strengths of the text and image inputs (a rough training sketch follows this list).
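As one way to picture how such a policy could be trained, the sketch below builds on the hypothetical `PolicyNetwork` above. It assumes the two CNNs are already trained and frozen, treats prediction as single-label top-1 for simplicity (the real task is multi-label), and uses one plausible supervision rule: prefer the image CNN only where it is right and the text CNN is wrong. None of these choices is confirmed as the authors' exact objective.

```python
import torch
import torch.nn.functional as F


def policy_targets(text_probs, image_probs, true_labels):
    """0 = defer to the text CNN, 1 = defer to the image CNN.
    Prefers the image CNN only where it is correct and the text CNN is not."""
    text_correct = text_probs.argmax(-1) == true_labels
    image_correct = image_probs.argmax(-1) == true_labels
    return (image_correct & ~text_correct).long()


def policy_train_step(policy, optimizer, text_probs, image_probs, true_labels):
    """One gradient step on the policy network; both base CNNs stay frozen."""
    logits = policy(text_probs, image_probs)
    loss = F.cross_entropy(logits, policy_targets(text_probs, image_probs, true_labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the policy would be fit on a held-out split so it learns where each classifier tends to fail, rather than memorizing the base models' training-set errors.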
This research has notable implications for product classification on e-commerce platforms. Its emphasis on decision-level fusion argues for a shift away from traditional feature-level fusion, pointing to gains in both computational efficiency and classification accuracy. While the results are promising, the authors also acknowledge room to extend the work, for example through more sophisticated policy models and ensembles that combine multiple image and text classifiers.
In conclusion, the authors demonstrate that while text-based classification remains a powerful tool, integrating multi-modal signals can further improve classification in complex e-commerce environments. As large-scale, multi-class, multi-label datasets continue to challenge conventional classification algorithms, this paper provides a solid foundation for applying multi-modal deep learning architectures in practice.