eCeLLM: Generalizing Large Language Models for E-commerce from Large-scale, High-quality Instruction Data (2402.08831v2)

Published 13 Feb 2024 in cs.CL, cs.AI, and cs.IR

Abstract: With tremendous efforts on developing effective e-commerce models, conventional e-commerce models show limited success in generalist e-commerce modeling, and suffer from unsatisfactory performance on new users and new products - a typical out-of-domain generalization challenge. Meanwhile, LLMs demonstrate outstanding performance in generalist modeling and out-of-domain generalizability in many fields. Toward fully unleashing their power for e-commerce, in this paper, we construct ECInstruct, the first open-sourced, large-scale, and high-quality benchmark instruction dataset for e-commerce. Leveraging ECInstruct, we develop eCeLLM, a series of e-commerce LLMs, by instruction-tuning general-purpose LLMs. Our comprehensive experiments and evaluation demonstrate that eCeLLM models substantially outperform baseline models, including the most advanced GPT-4, and the state-of-the-art task-specific models in in-domain evaluation. Moreover, eCeLLM exhibits excellent generalizability to out-of-domain settings, including unseen products and unseen instructions, highlighting its superiority as a generalist e-commerce model. Both the ECInstruct dataset and the eCeLLM models show great potential in empowering versatile and effective LLMs for e-commerce. ECInstruct and eCeLLM models are publicly accessible through https://ninglab.github.io/eCeLLM.

An Exploration into E-commerce Task Instruction Tuning

The paper discusses the development and effectiveness of instruction-tuned models for addressing task-specific challenges in e-commerce environments. The authors introduce ECInstruct, a comprehensive instruction dataset whose tasks fall into four categories: Product Understanding, User Understanding, Query Product Matching, and Product Question Answering. Each category comprises well-defined subtasks designed to enhance the performance of LLMs in specific e-commerce scenarios.

Methodology

The methodology characterizes tasks with structured data and instruction-tunes LLMs, assessing them under both in-domain (IND) and out-of-domain (OOD) evaluations. Tasks such as attribute value extraction, sentiment analysis, and product matching form the benchmarks, with metrics including precision, recall, F1, and NDCG used to assess model performance.
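As a concrete reference for the ranking metric mentioned above, NDCG@k can be computed as in the generic sketch below; this is the standard definition, not the paper's evaluation code.

```python
import numpy as np

def dcg(relevances: np.ndarray, k: int) -> float:
    """Discounted cumulative gain over the top-k positions."""
    rel = relevances[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # positions 1..k -> log2(2)..log2(k+1)
    return float(np.sum(rel / discounts))

def ndcg(relevances, k=10):
    """NDCG@k: DCG of the predicted order divided by the ideal (sorted) DCG."""
    rel = np.asarray(relevances, dtype=float)
    ideal = dcg(np.sort(rel)[::-1], k)
    return dcg(rel, k) / ideal if ideal > 0 else 0.0

# Relevance grades of retrieved products, in the order the model ranked them.
print(round(ndcg([3, 2, 3, 0, 1, 2], k=5), 3))  # 0.861
```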

Data Processing

The dataset includes extensive preprocessing for both IND and OOD evaluations, drawing on the Amazon Review dataset, Amazon-Google Product data, and the Shopping Queries dataset. The raw datasets were split with an 8:1:1 ratio into training, validation, and test sets, respectively. A critical aspect of the data processing was downsampling for efficiency, allowing the models to be trained and evaluated within the constraints of available computational resources.
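The 8:1:1 split with downsampling can be reproduced with a simple two-stage procedure; the snippet below is a generic sketch (the downsampling cap and input format are placeholders, not the paper's actual pipeline).

```python
import random

def split_and_downsample(samples, seed=42, max_train=10_000):
    """Shuffle, split 8:1:1 into train/val/test, and cap the training set."""
    rng = random.Random(seed)
    data = list(samples)
    rng.shuffle(data)
    n = len(data)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    # Downsample training data for efficiency, as described above.
    if len(train) > max_train:
        train = rng.sample(train, max_train)
    return train, val, test

train, val, test = split_and_downsample(range(100_000))
print(len(train), len(val), len(test))  # 10000 10000 10000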

Instruction Design

A notable element is the use of multiple instruction templates per task, some of which are generated and deliberately held out as unseen during training. This comprehensive design ensures that the LLMs are evaluated not only on instructions encountered during tuning but also on novel phrasings, testing their generalization capabilities.
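A minimal sketch of this design follows: each task carries several instruction templates, with some withheld from training so the model is later tested on phrasings it has never seen. The templates and fields below are hypothetical, not taken from ECInstruct.

```python
import random

# Hypothetical instruction templates for a sentiment-analysis task; ECInstruct
# pairs each task with multiple instructions and holds some out as "unseen".
SENTIMENT_INSTRUCTIONS = [
    "Classify the sentiment of the following product review.",
    "Given a customer review, decide whether it is positive, negative, or neutral.",
    "What sentiment does this review express about the product?",
]
seen, unseen = SENTIMENT_INSTRUCTIONS[:-1], SENTIMENT_INSTRUCTIONS[-1:]

def build_sample(instruction: str, review: str, label: str) -> dict:
    """Format one example in the common (instruction, input, output) layout."""
    return {"instruction": instruction, "input": review, "output": label}

train_sample = build_sample(random.choice(seen), "Great battery life, sturdy build.", "positive")
eval_sample = build_sample(unseen[0], "Great battery life, sturdy build.", "positive")
```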

Results and Analysis

The models exhibited superior performance when tuned on individual and combined task datasets. Notably, Llama-2 13B-chat proved a robust base model for instruction tuning, while Mistral-7B-Instruct-v0.2 and Phi-2 were effective for particular tasks, underscoring the importance of selecting an appropriate base model for domain-specific challenges.
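One common recipe for instruction-tuning such base models is parameter-efficient fine-tuning with LoRA via Hugging Face transformers and peft; the sketch below illustrates that general recipe and is not the paper's actual training code or hyperparameters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"  # one of the base models discussed
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=16,                                  # rank of the low-rank update (illustrative)
    lora_alpha=32,                         # scaling factor (illustrative)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of weights are trained
```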

In-domain Evaluation:

  • Attribute Value Extraction: models achieved an F1 score of up to 0.595.
  • Product Relation Prediction: the macro F1 score reached approximately 0.502.
  • Sentiment Analysis: macro F1 improved as more comprehensive tuning datasets were incorporated (a minimal macro-F1 sketch follows this list).
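As referenced above, macro F1 averages the per-class F1 scores so that minority classes count equally; a minimal sketch with scikit-learn, using hypothetical labels:

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted labels for a sentiment-analysis task.
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "neutral"]

# Macro F1 is the unweighted mean of per-class F1, so rare classes count equally.
print(round(f1_score(y_true, y_pred, average="macro"), 3))  # 0.489
```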

Out-of-domain Evaluation:

Both task-specific fine-tuning and general instruction tuning displayed commendable results, with task-specific fine-tuning performing slightly better on average. The instruction-tuned LLMs exceeded the capabilities of some state-of-the-art task-specific models, notably in generalization to unseen domains.
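One way to construct the unseen-product setting is to split by product rather than by example, so that no test product ever appears in training; the sketch below illustrates that idea under assumed field names (`product_id`, `text` are hypothetical).

```python
import random

def ood_split_by_product(samples, test_frac=0.1, seed=0):
    """Hold out whole products so test items never appear in training."""
    rng = random.Random(seed)
    product_ids = sorted({s["product_id"] for s in samples})
    rng.shuffle(product_ids)
    n_test = max(1, int(test_frac * len(product_ids)))
    test_ids = set(product_ids[:n_test])
    train = [s for s in samples if s["product_id"] not in test_ids]
    test = [s for s in samples if s["product_id"] in test_ids]
    return train, test

samples = [{"product_id": f"P{i % 20}", "text": f"review {i}"} for i in range(100)]
train, test = ood_split_by_product(samples)
print(len(train), len(test))  # 90 10: two of the twenty products are held out
```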

Implications and Future Directions

This work has significant implications for the application of LLMs in practical e-commerce settings. The refined models can enhance user interaction by accurately understanding products and user sentiment and by optimizing query-product matching. The structured approach offers a blueprint for task specialization within AI models.

Further exploration could involve enhancing dataset diversity and refining tuning processes to mitigate model biases. Improving fine-grained control in tasks such as query substitution could also be pursued. Moreover, future work could investigate the interplay between different task categories to further synergize LLM capabilities, enriching the models' understanding of multi-faceted e-commerce scenarios.

In conclusion, this work exemplifies the promise of instruction tuning in specialized domains, showing marked improvements over general-purpose models across a variety of tasks. As more datasets and refined LLM architectures emerge, increasingly sophisticated models can be anticipated that could redefine interaction in e-commerce and beyond.

Authors (5)
  1. Bo Peng (304 papers)
  2. Xinyi Ling (2 papers)
  3. Ziru Chen (20 papers)
  4. Huan Sun (88 papers)
  5. Xia Ning (48 papers)