LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction (2403.00863v2)

Published 29 Feb 2024 in cs.IR, cs.AI, and cs.CL

Abstract: Product attribute value extraction is a pivotal component in NLP and the contemporary e-commerce industry. The provision of precise product attribute values is fundamental in ensuring high-quality recommendations and enhancing customer satisfaction. The recently emerging LLMs have demonstrated state-of-the-art performance in numerous attribute extraction tasks, without the need for domain-specific training data. Nevertheless, varying strengths and weaknesses are exhibited by different LLMs due to the diversity in data, architectures, and hyperparameters. This variation makes them complementary to each other, with no single LLM dominating all others. Considering the diverse strengths and weaknesses of LLMs, it becomes necessary to develop an ensemble method that leverages their complementary potentials. In this paper, we propose a novel algorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute value extraction. We iteratively learn the weights for different LLMs to aggregate the labels with weights to predict the final attribute value. Not only can our proposed method be proven theoretically optimal, but it also ensures efficient computation, fast convergence, and safe deployment. We have also conducted extensive experiments with various state-of-the-art LLMs, including Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms all the state-of-the-art single LLMs on Walmart's internal dataset. This method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).

Summary

  • The paper presents a novel LLM-ensemble method that iteratively learns weights to combine diverse LLM outputs for enhanced product attribute extraction.
  • It builds on the Dawid-Skene model to optimize prediction accuracy, achieving 2.36%-2.76% improvements over leading standalone models on Walmart datasets.
  • Online A/B tests demonstrate statistically significant lifts in GMV, CTR, CVR, and ATC, underscoring its practical impact in e-commerce.

The paper introduces a novel LLM-ensemble method for product attribute value extraction in e-commerce applications, addressing the challenge of leveraging the diverse strengths of multiple LLMs. The LLM-ensemble algorithm iteratively learns weights for different LLMs to aggregate their outputs, aiming for optimal attribute value prediction. The method is validated through experiments on Walmart's internal datasets, demonstrating superior performance compared to individual state-of-the-art LLMs. The successful deployment of this method in production models has led to improvements in key e-commerce metrics.

The paper highlights the importance of accurate product attribute value extraction for enhancing recommendation quality and customer satisfaction in the e-commerce sector. While LLMs have shown promise in attribute extraction, their varying performance due to differences in data, architectures, and hyperparameters necessitates an ensemble approach. The LLM-ensemble method is inspired by crowdsourcing techniques, treating each LLM as an individual worker and iteratively refining predictions to assign weights based on demonstrated accuracy.

The LLM-ensemble algorithm builds upon the Dawid-Skene model, a structured latent variable model. The problem is defined as extracting attribute values $\mathcal{V}_p = \{v_{p,1}, \cdots, v_{p,q}\}$ from unstructured text data $\mathcal{T} = \{t_1, \cdots, t_p : p \in \mathcal{P}\}$ for a set of products $\mathcal{P}$ and attributes $\mathcal{Q} = \{q_1, \cdots, q_m\}$, where $v_{p,q}$ is selected from a set of pre-defined labels $\mathcal{L}_q$ (for a "gender" attribute, for instance, $\mathcal{L}_q$ might be a small set such as {men, women, unisex}).

Key notations include:

  • $y_{qp}$: ground-truth label of the $p$-th product for attribute $q$.
  • $\hat{y}_{qp}$: label predicted for the $p$-th product and attribute $q$ by an LLM.
  • $W \in \bar{\mathcal{L}}_q^{N \times P}$: input data matrix, where $N = |\mathcal{N}|$ is the number of LLMs, $P = |\mathcal{P}|$ is the number of products, and $\bar{\mathcal{L}}_q = \mathcal{L}_q \cup \{0\}$ is the label set extended with $0$ for missing values. $W_{qij}$ is the label provided by the $i$-th LLM for the $j$-th product on attribute $q$.
  • $T = (T_{qij})_{N \times P}$: indicator matrix for attribute $q$, where $T_{qij} = 1$ if entry $(i, j)$ is observed and $T_{qij} = 0$ otherwise (the sketch below shows one way to assemble $W$ and $T$).
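
To make the notation concrete, here is a minimal sketch (hypothetical labels and outputs for illustration, not the paper's data or code) of assembling $W$ and $T$ for a single attribute from raw LLM answers:

```python
import numpy as np

# Hypothetical encoded label set for one attribute; 0 is reserved for "missing".
LABELS = {"men": 1, "women": 2, "unisex": 3}

# Hypothetical raw answers: one row per LLM, one column per product (None = no answer).
raw_outputs = [
    ["men", "women", None, "unisex"],    # LLM 1
    ["men", None, "women", "unisex"],    # LLM 2
    ["women", "women", "women", None],   # LLM 3
]

N, P = len(raw_outputs), len(raw_outputs[0])
W = np.zeros((N, P), dtype=int)          # W[i, j]: label from LLM i for product j
for i, row in enumerate(raw_outputs):
    for j, answer in enumerate(row):
        W[i, j] = LABELS.get(answer, 0)  # unmapped or missing answers become 0

T = (W != 0).astype(int)                 # T[i, j] = 1 iff entry (i, j) is observed
```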

The LLM-ensemble algorithm (Algorithm 1) iteratively learns a weight $v_{qi}$ for each LLM and predicts the attribute values $\{\hat{y}_{q1}, \cdots, \hat{y}_{qP}\}$ for attribute $q$. The predicted label is $\hat{y}_{qj} = \arg\max_{k \in \mathcal{L}_q} \sum_{i=1}^{N} v_{qi} \, I(W_{qij} = k)$, where $I$ is the indicator function. The accuracy of each LLM is estimated as $\hat{\alpha}_{qi} = \frac{\sum_{j=1}^{P} I(W_{qij} = \hat{y}_{qj})}{\sum_{j=1}^{P} T_{qij}}$, and the weight $v_{qi}$ is then updated from $\hat{\alpha}_{qi}$. After convergence, the final predictions are again obtained by $\arg\max_{k \in \mathcal{L}_q} \sum_{i=1}^{N} v_{qi} \, I(W_{qij} = k)$.
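
A minimal NumPy sketch of this iteration, reconstructed from the description above (not the authors' released code; the uniform initialization, the log-odds accuracy-to-weight mapping, and the stopping rule are assumptions):

```python
import numpy as np

def llm_ensemble(W: np.ndarray, labels: list[int], n_iters: int = 50) -> np.ndarray:
    """Iterative weighted majority voting over LLM outputs for one attribute.

    W: (N, P) matrix of labels drawn from `labels`, with 0 marking a missing entry.
    Returns the (P,) array of predicted labels.
    """
    N, P = W.shape
    T = (W != 0)                      # indicator of observed entries
    v = np.ones(N)                    # start from uniform weights (assumption)

    for _ in range(n_iters):
        # Weighted vote: scores[k, j] = sum_i v[i] * I(W[i, j] == k)
        scores = np.stack([(v[:, None] * (W == k)).sum(axis=0) for k in labels])
        y_hat = np.array(labels)[scores.argmax(axis=0)]

        # Per-LLM accuracy against the current consensus, over observed entries only.
        alpha = ((W == y_hat) & T).sum(axis=1) / np.maximum(T.sum(axis=1), 1)

        # Weight update from accuracy; the exact mapping is an assumption here
        # (a log-odds form is standard in weighted majority voting).
        eps = 1e-6
        v_new = np.log(np.clip(alpha, eps, 1 - eps) / np.clip(1 - alpha, eps, 1 - eps))
        if np.allclose(v_new, v):
            break
        v = v_new

    scores = np.stack([(v[:, None] * (W == k)).sum(axis=0) for k in labels])
    return np.array(labels)[scores.argmax(axis=0)]
```

Here `labels` is the encoded label set $\mathcal{L}_q$ for one attribute and 0 marks a missing answer; the log-odds mapping is one standard choice from the crowdsourcing literature, while the paper derives its own update as part of its optimality analysis.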

Theoretically, the algorithm approximates the oracle Maximum A Posteriori (MAP) rule, minimizing the mean error rate of weighted majority voting over the LLMs. This lets the method assign higher weights to more accurate LLMs and damp the influence of less accurate ones.
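
For intuition, the crowdsourcing literature this method draws on (iterative weighted majority voting) gives the MAP-style weight for a voter with accuracy $\alpha_{qi}$ over $|\mathcal{L}_q|$ labels a log-odds form; the expression below follows that literature rather than being quoted from the paper:

$$v_{qi} = \log\frac{(|\mathcal{L}_q| - 1)\,\alpha_{qi}}{1 - \alpha_{qi}}$$

Under this mapping an LLM at chance accuracy ($\alpha_{qi} = 1/|\mathcal{L}_q|$) gets weight zero, increasingly accurate LLMs get larger positive weights, and LLMs worse than chance are weighted negatively, matching the behavior described above.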

The paper presents comparison experiments using datasets Walmart-Age and Walmart-Gender, containing 20M items each. The LLM-ensemble is compared against individual LLMs such as Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, as well as logistic regression and a rule-based method. The LLM-ensemble achieves the best performance, with a 2.36% improvement on Walmart-Age and a 2.76% improvement on Walmart-Gender compared to the second-best model, GPT-4.

Online A/B tests on Walmart's similar-item recommendation model demonstrate statistically significant improvements across key e-commerce metrics (a sketch of one standard significance test follows the list):

  • Gross Merchandise Volume (GMV): 0.38% lift ($p < 0.05$)
  • Click-Through Rate (CTR): 2.16% lift ($p < 0.05$)
  • Conversion Rate (CVR): 0.26% lift ($p < 0.05$)
  • Add-to-Cart Rate (ATC): 1.42% lift ($p < 0.05$)
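
The paper does not specify which test produced these p-values; as an illustrative assumption, lifts in rate metrics such as CTR are commonly checked with a two-proportion z-test:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(x_ctrl: int, n_ctrl: int, x_trt: int, n_trt: int) -> float:
    """Two-sided p-value for a difference in rates (e.g., CTR) between control and treatment."""
    p_pool = (x_ctrl + x_trt) / (n_ctrl + n_trt)            # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_trt))
    z = (x_trt / n_trt - x_ctrl / n_ctrl) / se
    return 2 * norm.sf(abs(z))                              # two-sided tail probability
```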

These results validate the effectiveness of the LLM-ensemble methodology, leading to its deployment on Walmart's online platform.
