
EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data (2312.15696v1)

Published 25 Dec 2023 in cs.CL

Abstract: LLMs pre-trained on massive corpora have exhibited remarkable performance on various NLP tasks. However, applying these models to specific domains still poses significant challenges, such as a lack of domain knowledge, limited capacity to leverage domain knowledge, and inadequate adaptation to domain-specific data formats. Considering the exorbitant cost of training LLMs from scratch and the scarcity of annotated data within particular domains, in this work we focus on domain-specific continual pre-training of LLMs, using the E-commerce domain as an exemplar. Specifically, we explore the impact of continual pre-training on LLMs using unlabeled general and E-commerce corpora. Furthermore, we design a mixing strategy across different data sources to better leverage E-commerce semi-structured data. We construct multiple tasks to assess LLMs' few-shot in-context learning ability and their zero-shot performance after instruction tuning in the E-commerce domain. Experimental results demonstrate the effectiveness of continual pre-training of E-commerce LLMs and the efficacy of our devised data mixing strategy.
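The abstract describes continual pre-training on a mixture of general text and semi-structured E-commerce data. As a rough illustration of what such a data-mixing step might look like, the Python sketch below samples pre-training examples from two sources at a fixed ratio and linearizes product records (title plus attribute pairs) into plain text. The mixing weights, field names, and linearization format are assumptions for illustration only, not the strategy specified in the paper.

```python
import random

# Hypothetical mixing ratios; the paper studies mixing strategies, but the
# exact proportions here are illustrative, not taken from the paper.
MIX_WEIGHTS = {"general": 0.5, "ecommerce_semi_structured": 0.5}

def linearize_product(record):
    """Flatten a semi-structured product record (title + attribute pairs)
    into a single pre-training text line. Field names are made up for
    illustration."""
    attrs = "; ".join(f"{k}: {v}" for k, v in record.get("attributes", {}).items())
    return f"{record.get('title', '')}. {attrs}"

def mix_streams(general_docs, product_records, num_samples, seed=0):
    """Sample a mixed pre-training stream from a general-text corpus and a
    semi-structured E-commerce corpus according to MIX_WEIGHTS."""
    rng = random.Random(seed)
    sources = list(MIX_WEIGHTS)
    weights = [MIX_WEIGHTS[s] for s in sources]
    mixed = []
    for _ in range(num_samples):
        source = rng.choices(sources, weights=weights, k=1)[0]
        if source == "general":
            mixed.append(rng.choice(general_docs))
        else:
            mixed.append(linearize_product(rng.choice(product_records)))
    return mixed

# Toy usage: prints a small mixed stream of general documents and
# linearized product records.
general = ["A short general-domain paragraph.", "Another web document."]
products = [{"title": "Wireless earbuds",
             "attributes": {"color": "black", "battery": "24h"}}]
print(mix_streams(general, products, num_samples=4))
```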

Authors (9)
  1. Shirong Ma (23 papers)
  2. Shen Huang (25 papers)
  3. Shulin Huang (12 papers)
  4. Xiaobin Wang (39 papers)
  5. Yangning Li (49 papers)
  6. Hai-Tao Zheng (94 papers)
  7. Pengjun Xie (85 papers)
  8. Fei Huang (408 papers)
  9. Yong Jiang (194 papers)