Pretraining and Updates of Domain-Specific LLM: A Case Study in the Japanese Business Domain (2404.08262v3)

Published 12 Apr 2024 in cs.CL and cs.AI

Abstract: The development of LLMs in various languages has been advancing, but the combination of non-English languages with domain-specific contexts remains underexplored. This paper presents our findings from training and evaluating a Japanese business domain-specific LLM designed to better understand business-related documents, such as news articles on current affairs, technical reports, and patents. Additionally, LLMs in this domain require regular updates to incorporate the most recent knowledge. We therefore also report findings from the first experiments and evaluations on updating this LLM with the latest article data, an important problem setting that previous research has not addressed. From our experiments on a newly created benchmark dataset for question answering in the target domain, we found that (1) our pretrained model improves QA accuracy without losing general knowledge, and (2) a proper mixture of the latest and older texts in the training data for the update is necessary. Our pretrained model and business domain benchmark are publicly available to support further studies.
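
The second finding above, that the update requires a proper mixture of the latest and older texts, amounts to a data-mixing step before continued pretraining. The sketch below is a minimal illustration of that idea only, assuming a simple document-level sampling scheme; the function name `build_update_corpus`, the mixing ratio, and the document counts are placeholders for this example, not the mixture actually used in the paper.

```python
import random

def build_update_corpus(latest_docs, older_docs, latest_ratio=0.5,
                        total_docs=10_000, seed=0):
    """Sample a continued-pretraining corpus that mixes the latest articles
    with older domain text at a fixed ratio (illustrative values only)."""
    rng = random.Random(seed)
    n_latest = int(total_docs * latest_ratio)
    n_older = total_docs - n_latest

    def sample(pool, k):
        # Sample without replacement when the pool is large enough,
        # otherwise fall back to sampling with replacement.
        return rng.sample(pool, k) if len(pool) >= k else rng.choices(pool, k=k)

    corpus = sample(latest_docs, n_latest) + sample(older_docs, n_older)
    rng.shuffle(corpus)
    return corpus

# Toy usage: mix 30% recent articles with 70% older business documents.
latest = [f"2024 news article {i}" for i in range(100)]
older = [f"archived business report {i}" for i in range(1000)]
update_corpus = build_update_corpus(latest, older, latest_ratio=0.3, total_docs=500)
print(len(update_corpus))  # 500
```

The ratio itself is the quantity the paper's update experiments evaluate, so in practice it would be tuned against both domain QA accuracy and retention of general knowledge.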

Authors (4)
  1. Kosuke Takahashi (6 papers)
  2. Takahiro Omi (7 papers)
  3. Kosuke Arima (2 papers)
  4. Tatsuya Ishigaki (4 papers)