TPDR: A Novel Two-Step Transformer-based Product and Class Description Match and Retrieval Method

Published 5 Oct 2023 in cs.IR, cs.LG, and cs.SE | (2310.03491v1)

Abstract: There is a niche of companies responsible for intermediating the purchase of large batches of varied products for other companies, for which the main challenge is to perform product description standardization, i.e., matching an item described by a client with a product described in a catalog. The problem is complex since the client's product description may be: (1) potentially noisy; (2) short and uninformative (e.g., missing information about model and size); and (3) cross-language. In this paper, we formalize this problem as a ranking task: given an initial client product specification, or IS (the query), return the most appropriate standardized descriptions, or SDs (the response). We propose TPDR, a two-step Transformer-based Product and Class Description Retrieval method that explores the semantic correspondence between IS and SD by exploiting attention mechanisms and contrastive learning. First, TPDR employs transformers as two encoders sharing the same embedding vector space: one for encoding the IS and another for the SD, so that corresponding (IS, SD) pairs lie close in that space. Closeness is further enforced by a contrastive learning mechanism leveraging a specialized loss function. TPDR also exploits a (second) re-ranking step based on syntactic features that are very important for the exact matching of certain products (model, dimension) and that may have been neglected by the transformers. To evaluate our proposal, we consider 11 datasets from a real company, covering different application contexts. Our solution retrieved the correct standardized product before the 5th ranking position in 71% of the cases and its correct category in the first position in 80% of the situations. Moreover, the effectiveness gains over purely syntactic or semantic baselines reach up to 3.7 times, solving cases that neither approach can handle in isolation.

Summary

  • The paper introduces TPDR, a two-step transformer-based method that integrates semantic and syntactic retrieval to standardize product descriptions.
  • It leverages dual encoders with TaG-Training and contrastive learning, optimizing shared embedding spaces via an N-pair loss function.
  • Experimental evaluations show up to 3.76x accuracy improvements over traditional methods, highlighting its effectiveness in multilingual retrieval.

TPDR: A Two-Step Transformer-Based Product and Class Description Retrieval Method

Introduction

Product description standardization in industry requires matching a client's product descriptions, which are often noisy, short, and written in a different language, against standardized catalog descriptions. The proposed approach, TPDR, models this problem as a ranking task and employs a two-step retrieval strategy that balances semantic and syntactic matching using transformer architectures.

Model Architecture and Approach

TPDR leverages transformers as dual encoders that share an embedding vector space for client specifications (IS) and standardized descriptions (SD). By applying attention and contrastive learning, TPDR encodes these descriptions so that relevant (IS, SD) pairs lie close together in the vector space (Figure 1). This is achieved through a novel training scheme, called TaG-Training, that alternates encoder optimizations in a manner analogous to a cooperative turn-based strategy, improving both effectiveness and efficiency.

Figure 1: Proposed Model.
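
As a rough sketch of this dual-encoder setup (not the authors' exact implementation), the snippet below encodes an IS and an SD with two separate transformer encoders that project into the same embedding space; the checkpoint name, mean pooling, and example strings are assumptions made purely for illustration.

```python
# Minimal dual-encoder sketch (assumptions: HuggingFace Transformers backbone, mean pooling).
import torch
from transformers import AutoTokenizer, AutoModel

CHECKPOINT = "bert-base-multilingual-cased"  # hypothetical backbone; the paper's choice may differ

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
is_encoder = AutoModel.from_pretrained(CHECKPOINT)   # encodes client specifications (IS)
sd_encoder = AutoModel.from_pretrained(CHECKPOINT)   # encodes standardized descriptions (SD)

def embed(encoder, texts):
    """Mean-pool token embeddings into one L2-normalized vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state              # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (batch, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)     # shared unit-norm space

# Corresponding (IS, SD) pairs should end up close in the shared space after training.
is_vecs = embed(is_encoder, ["parafuso sextavado inox 10mm"])           # hypothetical client query
sd_vecs = embed(sd_encoder, ["Hex head stainless steel screw, 10 mm"])  # hypothetical catalog entry
print(is_vecs @ sd_vecs.T)  # cosine similarity of the pair
```

Under this reading, a turn-based scheme such as TaG-Training could, for example, alternate which of the two encoders is updated while the other is frozen; the paper should be consulted for the exact schedule.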

In the retrieval phase, candidate products are first selected by their semantic proximity to the query, using Hierarchical Navigable Small World (HNSW) graphs to fetch nearest neighbors efficiently. Because purely semantic retrieval may overlook specific product features (e.g., model and dimension tokens), a re-ranking phase employs syntactic metrics, including cosine similarity, the Jaccard index, and BM25 scoring.
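
A first-stage retrieval step of this kind can be sketched with an off-the-shelf HNSW index; the snippet below uses the hnswlib library over placeholder SD embeddings, with the dimensionality, index parameters, and candidate count chosen arbitrarily for illustration rather than taken from the paper.

```python
# Sketch of first-stage semantic retrieval with an HNSW index (assumption: hnswlib).
import numpy as np
import hnswlib

dim, num_sd = 768, 10_000                                        # illustrative sizes
sd_embeddings = np.random.rand(num_sd, dim).astype(np.float32)   # stand-in for SD encoder output

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_sd, ef_construction=200, M=16)
index.add_items(sd_embeddings, np.arange(num_sd))
index.set_ef(100)                                                # query-time recall/speed trade-off

query_embedding = np.random.rand(1, dim).astype(np.float32)      # stand-in for the embedded IS
labels, distances = index.knn_query(query_embedding, k=100)      # candidates passed to re-ranking
```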

TPDR Methodology and Implementation

The training process optimizes an N-pair loss function, refining embeddings to increase the distance between non-relevant items while shrinking the distance between relevant (IS, SD) pairs (Figure 2). This loss lets TPDR train on batches in which every other item in the batch serves as a negative sample, improving convergence and retrieval precision.

Figure 2: Embeddings of a model trained to satisfy the constraints of triplet loss (bottom) and N-pair loss (top).
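
The N-pair objective with in-batch negatives can be written as a softmax cross-entropy over the pairwise similarity matrix of a batch; the sketch below is a generic PyTorch formulation, and the temperature value is an assumption rather than the paper's setting.

```python
# Sketch of an N-pair / in-batch-negatives contrastive loss (PyTorch).
import torch
import torch.nn.functional as F

def n_pair_loss(is_vecs, sd_vecs, temperature=0.05):
    """is_vecs[i] and sd_vecs[i] form a positive pair; every other SD in the batch
    acts as a negative for that IS. Inputs are (batch, dim) and L2-normalized."""
    logits = is_vecs @ sd_vecs.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(is_vecs.size(0), device=is_vecs.device)
    return F.cross_entropy(logits, targets)               # pull diagonal pairs together, push the rest apart

# Toy usage with random unit vectors standing in for encoder outputs.
a = F.normalize(torch.randn(8, 768), dim=-1)
b = F.normalize(torch.randn(8, 768), dim=-1)
print(n_pair_loss(a, b))
```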

Re-ranking combines contextual scores from the transformers with traditional IR metrics, producing a hybrid score that better aligns the final ranking with the true product descriptions. The experiments were run on high-capacity GPUs over corpora of hundreds of thousands of product descriptions.
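
One way such a hybrid re-ranking score might be assembled is sketched below: the first-stage semantic similarity is fused with Jaccard overlap and a BM25 score computed over the retrieved candidates. The rank_bm25 library, whitespace tokenization, and the fusion weights are assumptions for illustration, not the paper's formula.

```python
# Sketch of hybrid re-ranking: semantic score fused with syntactic signals.
# Assumptions: rank_bm25 for BM25, whitespace tokenization, hand-picked weights.
from rank_bm25 import BM25Okapi

def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def rerank(query, candidates, semantic_scores, w_sem=0.5, w_jac=0.25, w_bm25=0.25):
    """candidates: SD strings returned by the first stage; semantic_scores: their
    cosine similarities to the query. Returns candidates sorted by the fused score."""
    q_tokens = query.lower().split()
    cand_tokens = [c.lower().split() for c in candidates]
    bm25_scores = BM25Okapi(cand_tokens).get_scores(q_tokens)
    max_bm25 = max(bm25_scores.max(), 1e-9)               # normalize BM25 into [0, 1]
    fused = [
        w_sem * sem + w_jac * jaccard(q_tokens, ct) + w_bm25 * (b / max_bm25)
        for sem, ct, b in zip(semantic_scores, cand_tokens, bm25_scores)
    ]
    return sorted(zip(candidates, fused), key=lambda x: x[1], reverse=True)

# Toy usage: exact model/dimension tokens boost the matching catalog entry.
print(rerank("hex screw 10mm inox",
             ["Hex head stainless steel screw, 10 mm", "Phillips screw 8 mm zinc"],
             semantic_scores=[0.82, 0.79]))
```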

Evaluation and Results

Empirical evaluation across eleven datasets (D1 to D11) showed significant improvements over baseline models. The combined semantic and syntactic components ensured that relevant descriptions appeared in the upper tiers of the ranking. Notably, accuracy improvements of up to 3.76 times over syntactic-only baselines and 2.3 times over semantic-only baselines showcase the efficacy of TPDR's integrated approach.

The ranking distribution in Figure 3 shows that TPDR retrieves the relevant SD by the 5th position in 71% of cases and within the top 100 in 85%. The DP association step, critical for manual disambiguation, also exhibited high accuracy, with the correct DPs retrieved in top positions for the majority of queries.

Figure 3: Ranking Distribution of Product Description Position.

Conclusion

TPDR presents a compelling method for tackling the intricate problem of product description standardization, combining advanced language modeling techniques with practical retrieval strategies. Its two-tiered approach, grounded in strong empirical results, makes it an attractive solution in contexts where precise product matching is pivotal.

Applying this method could significantly streamline processes in procurement and logistics. Future work will explore other transformer variants, refine the re-ranking functions, and improve initial retrieval recall, further extending TPDR's applicability across diverse multilingual datasets.
