LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction (2403.00863v2)

Published 29 Feb 2024 in cs.IR, cs.AI, and cs.CL

Abstract: Product attribute value extraction is a pivotal component in NLP and the contemporary e-commerce industry. The provision of precise product attribute values is fundamental in ensuring high-quality recommendations and enhancing customer satisfaction. The recently emerging LLMs have demonstrated state-of-the-art performance in numerous attribute extraction tasks, without the need for domain-specific training data. Nevertheless, varying strengths and weaknesses are exhibited by different LLMs due to the diversity in data, architectures, and hyperparameters. This variation makes them complementary to each other, with no single LLM dominating all others. Considering the diverse strengths and weaknesses of LLMs, it becomes necessary to develop an ensemble method that leverages their complementary potentials. In this paper, we propose a novel algorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute value extraction. We iteratively learn the weights for different LLMs to aggregate the labels with weights to predict the final attribute value. Not only can our proposed method be proven theoretically optimal, but it also ensures efficient computation, fast convergence, and safe deployment. We have also conducted extensive experiments with various state-of-the-art LLMs, including Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms all the state-of-the-art single LLMs on Walmart's internal dataset. This method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).

Summary

  • The paper presents a novel LLM-ensemble method that iteratively learns weights to combine diverse LLM outputs for enhanced product attribute extraction.
  • It builds on the Dawid-Skene model to optimize prediction accuracy, achieving 2.36%-2.76% improvements over leading standalone models on Walmart datasets.
  • Online A/B tests demonstrate statistically significant lifts in GMV, CTR, CVR, and ATC, underscoring its practical impact in e-commerce.

The paper introduces a novel LLM-ensemble method for product attribute value extraction in e-commerce applications, addressing the challenge of leveraging the diverse strengths of multiple LLMs. The LLM-ensemble algorithm iteratively learns weights for different LLMs to aggregate their outputs, aiming for optimal attribute value prediction. The method is validated through experiments on Walmart's internal datasets, demonstrating superior performance compared to individual state-of-the-art LLMs. The successful deployment of this method in production models has led to improvements in key e-commerce metrics.

The paper highlights the importance of accurate product attribute value extraction for enhancing recommendation quality and customer satisfaction in the e-commerce sector. While LLMs have shown promise in attribute extraction, their varying performance due to differences in data, architectures, and hyperparameters necessitates an ensemble approach. The LLM-ensemble method is inspired by crowdsourcing techniques, treating each LLM as an individual worker and iteratively refining predictions to assign weights based on demonstrated accuracy.

The LLM-ensemble algorithm builds upon the Dawid-Skene model, a structured latent variable model. The problem is defined as extracting attribute values $\mathcal{V}_p = \{v_{p,1}, \cdots, v_{p,q}\}$ from unstructured text data $\mathcal{T} = \{t_1, \cdots, t_p : p \in \mathcal{P}\}$ for a set of products $\mathcal{P}$ and attributes $\mathcal{Q} = \{q_1, \cdots, q_m\}$, where $v_{p,q}$ is selected from a set of pre-defined labels $\mathcal{L}_q$ (for a "gender" attribute, for instance, $\mathcal{L}_q$ might be a small set such as {men, women, unisex}).

Key notations include:

  • $y_{qp}$: ground-truth label of the $p$-th product for attribute $q$.
  • $\hat{y}_{qp}$: label predicted for the $p$-th product and attribute $q$ by an LLM.
  • $W \in \bar{\mathcal{L}}_q^{N \times P}$: input data matrix, where $N = |\mathcal{N}|$ is the number of LLMs, $P = |\mathcal{P}|$ is the number of products, and $\bar{\mathcal{L}}_q = \mathcal{L}_q \cup \{0\}$ is the label set extended with $0$ for missing values. $W_{qij}$ is the label provided by the $i$-th LLM for the $j$-th product on attribute $q$.
  • $T = (T_{qij})_{N \times P}$: indicator matrix for attribute $q$, where $T_{qij} = 1$ if entry $(i, j)$ is observed and $T_{qij} = 0$ otherwise (the sketch below shows one way to assemble $W$ and $T$).
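
To make the notation concrete, here is a minimal sketch (hypothetical labels and outputs for illustration, not the paper's data or code) of assembling $W$ and $T$ for a single attribute from raw LLM answers:

```python
import numpy as np

# Hypothetical encoded label set for one attribute; 0 is reserved for "missing".
LABELS = {"men": 1, "women": 2, "unisex": 3}

# Hypothetical raw answers: one row per LLM, one column per product (None = no answer).
raw_outputs = [
    ["men", "women", None, "unisex"],    # LLM 1
    ["men", None, "women", "unisex"],    # LLM 2
    ["women", "women", "women", None],   # LLM 3
]

N, P = len(raw_outputs), len(raw_outputs[0])
W = np.zeros((N, P), dtype=int)          # W[i, j]: label from LLM i for product j
for i, row in enumerate(raw_outputs):
    for j, answer in enumerate(row):
        W[i, j] = LABELS.get(answer, 0)  # unmapped or missing answers become 0

T = (W != 0).astype(int)                 # T[i, j] = 1 iff entry (i, j) is observed
```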

The LLM-ensemble algorithm (Algorithm 1) iteratively learns a weight $v_{qi}$ for each LLM and predicts the attribute values $\{\hat{y}_{q1}, \cdots, \hat{y}_{qP}\}$ for attribute $q$. The predicted label is $\hat{y}_{qj} = \arg\max_{k \in \mathcal{L}_q} \sum_{i=1}^{N} v_{qi} \, I(W_{qij} = k)$, where $I$ is the indicator function. The accuracy of each LLM is estimated as $\hat{\alpha}_{qi} = \frac{\sum_{j=1}^{P} I(W_{qij} = \hat{y}_{qj})}{\sum_{j=1}^{P} T_{qij}}$, and the weight $v_{qi}$ is then updated from $\hat{\alpha}_{qi}$. After convergence, the final predictions are again obtained by $\arg\max_{k \in \mathcal{L}_q} \sum_{i=1}^{N} v_{qi} \, I(W_{qij} = k)$.
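
A minimal NumPy sketch of this iteration, reconstructed from the description above (not the authors' released code; the uniform initialization, the log-odds accuracy-to-weight mapping, and the stopping rule are assumptions):

```python
import numpy as np

def llm_ensemble(W: np.ndarray, labels: list[int], n_iters: int = 50) -> np.ndarray:
    """Iterative weighted majority voting over LLM outputs for one attribute.

    W: (N, P) matrix of labels drawn from `labels`, with 0 marking a missing entry.
    Returns the (P,) array of predicted labels.
    """
    N, P = W.shape
    T = (W != 0)                      # indicator of observed entries
    v = np.ones(N)                    # start from uniform weights (assumption)

    for _ in range(n_iters):
        # Weighted vote: scores[k, j] = sum_i v[i] * I(W[i, j] == k)
        scores = np.stack([(v[:, None] * (W == k)).sum(axis=0) for k in labels])
        y_hat = np.array(labels)[scores.argmax(axis=0)]

        # Per-LLM accuracy against the current consensus, over observed entries only.
        alpha = ((W == y_hat) & T).sum(axis=1) / np.maximum(T.sum(axis=1), 1)

        # Weight update from accuracy; the exact mapping is an assumption here
        # (a log-odds form is standard in weighted majority voting).
        eps = 1e-6
        v_new = np.log(np.clip(alpha, eps, 1 - eps) / np.clip(1 - alpha, eps, 1 - eps))
        if np.allclose(v_new, v):
            break
        v = v_new

    scores = np.stack([(v[:, None] * (W == k)).sum(axis=0) for k in labels])
    return np.array(labels)[scores.argmax(axis=0)]
```

Here `labels` is the encoded label set $\mathcal{L}_q$ for one attribute and 0 marks a missing answer; the log-odds mapping is one standard choice from the crowdsourcing literature, while the paper derives its own update as part of its optimality analysis.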

Theoretically, the algorithm approximates the oracle Maximum A Posteriori (MAP) rule, minimizing the mean error rate of weighted majority voting over the LLMs. This lets the method assign higher weights to more accurate LLMs and damp the influence of less accurate ones.
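
For intuition, the crowdsourcing literature this method draws on (iterative weighted majority voting) gives the MAP-style weight for a voter with accuracy $\alpha_{qi}$ over $|\mathcal{L}_q|$ labels a log-odds form; the expression below follows that literature rather than being quoted from the paper:

$$v_{qi} = \log\frac{(|\mathcal{L}_q| - 1)\,\alpha_{qi}}{1 - \alpha_{qi}}$$

Under this mapping an LLM at chance accuracy ($\alpha_{qi} = 1/|\mathcal{L}_q|$) gets weight zero, increasingly accurate LLMs get larger positive weights, and LLMs worse than chance are weighted negatively, matching the behavior described above.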

The paper presents comparison experiments using datasets Walmart-Age and Walmart-Gender, containing 20M items each. The LLM-ensemble is compared against individual LLMs such as Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, as well as logistic regression and a rule-based method. The LLM-ensemble achieves the best performance, with a 2.36% improvement on Walmart-Age and a 2.76% improvement on Walmart-Gender compared to the second-best model, GPT-4.

Online A/B tests on Walmart's similar-item recommendation model demonstrate statistically significant improvements across key e-commerce metrics (a sketch of one standard significance test follows the list):

  • Gross Merchandise Volume (GMV): 0.38% lift ($p < 0.05$)
  • Click-Through Rate (CTR): 2.16% lift ($p < 0.05$)
  • Conversion Rate (CVR): 0.26% lift ($p < 0.05$)
  • Add-to-Cart Rate (ATC): 1.42% lift ($p < 0.05$)
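
The paper does not specify which test produced these p-values; as an illustrative assumption, lifts in rate metrics such as CTR are commonly checked with a two-proportion z-test:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(x_ctrl: int, n_ctrl: int, x_trt: int, n_trt: int) -> float:
    """Two-sided p-value for a difference in rates (e.g., CTR) between control and treatment."""
    p_pool = (x_ctrl + x_trt) / (n_ctrl + n_trt)            # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_trt))
    z = (x_trt / n_trt - x_ctrl / n_ctrl) / se
    return 2 * norm.sf(abs(z))                              # two-sided tail probability
```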

These results validate the effectiveness of the LLM-ensemble methodology, leading to its deployment on Walmart's online platform.
