- The paper presents the LLM-jp project that creates fully open Japanese LLMs by integrating contributions from academia and industry.
- It details the project's corpus building, model construction, and fine-tuning methodologies, which rely on extensive datasets and cloud computing.
- It describes safety and evaluation frameworks that incorporate culturally specific measures and benchmarking tools for Japanese NLP.
LLM-jp: A Cross-Organizational Project for the Research and Development of Fully Open Japanese LLMs
This paper delineates an extensive cross-organizational project, LLM-jp, aimed at the development and research of open-source Japanese LLMs. The project is noted for its large-scale participation, with over 1,500 contributors from both academia and industry. The collaborative nature and transparency in methodologies and resources make LLM-jp a significant milestone in the field of Japanese NLP.
Introduction
The primary impetus for LLM-jp is to address the underrepresentation of Japanese in prominent LLMs such as GPT-3 and GPT-4, where Japanese constitutes only about 0.11% of the training corpus. This underrepresentation degrades comprehension and generation capabilities in Japanese. Furthermore, reliance on foreign models raises concerns about intellectual asset security and cultural overshadowing. LLM-jp therefore seeks to develop robust Japanese LLMs with full transparency and open access for both academic and commercial use.
Organizational Structure and Model Development
LLM-jp is organized into several working groups (WGs), each focused on a different aspect of LLM development. Three WGs were established initially: the Corpus Building WG, the Model Building WG, and the Fine-tuning and Evaluation WG; the Computational Infrastructure WG, the Academic Domain WG, and the Safety WG were added later.
Corpus Building WG
The Corpus Building WG is responsible for the creation of pre-training corpora and tokenizers. Key contributions include:
- Corpus v1: Over 260B tokens drawn from Japanese, English, and code corpora, with the token budget guided by the Chinchilla scaling law's token-to-parameter ratio (see the sketch after this list).
- Corpus v2: Focused on higher-quality data; it incorporates 285.5B Japanese tokens from Common Crawl and applies stringent filtering criteria.
- Tokenizers: Multilingual tokenizers (v2.1 and v2.2) optimized for efficient tokenization of Japanese, English, and code text.
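As a rough illustration of the token budgeting mentioned above, the Chinchilla scaling law suggests on the order of 20 training tokens per model parameter. The sketch below applies that rule of thumb to a 13B-parameter model; the helper function and the 20:1 ratio are simplifying assumptions, not the project's exact budgeting procedure.

```python
# Chinchilla-style rule of thumb: roughly 20 training tokens per parameter.
# chinchilla_optimal_tokens is a hypothetical helper for this sketch only.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens for a given model size."""
    return n_params * tokens_per_param

if __name__ == "__main__":
    params_13b = 13e9
    print(f"~{chinchilla_optimal_tokens(params_13b) / 1e9:.0f}B tokens")  # ~260B tokens
```

For a 13B-parameter model this yields roughly 260B tokens, consistent with the size of Corpus v1.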
The pre-training datasets and tokenizers were made publicly available, demonstrating a commitment to transparency.
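As a rough sketch of how a multilingual tokenizer of this kind can be built, the following trains a SentencePiece unigram model on a mixed Japanese/English/code text file; the file name, vocabulary size, and training options are assumptions for illustration, not the released tokenizer's configuration.

```python
# Minimal SentencePiece sketch for a multilingual (Japanese/English/code) tokenizer.
# All settings below are illustrative assumptions, not LLM-jp's actual recipe.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="mixed_ja_en_code.txt",   # hypothetical mixed-language training file
    model_prefix="tokenizer_sketch",
    model_type="unigram",
    vocab_size=50000,               # assumed vocabulary size
    character_coverage=0.9995,      # retain most Japanese characters
    byte_fallback=True,             # fall back to bytes for rare symbols
)

sp = spm.SentencePieceProcessor(model_file="tokenizer_sketch.model")
print(sp.encode("日本語と English と def main(): を混ぜた文", out_type=str))
```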
Computational Infrastructure WG
This WG tackled the substantial computational challenges inherent in pre-training LLMs. Utilizing the mdx cloud computing environment comprising NVIDIA A100 GPUs, the team optimized data communication and resolved network issues to ensure efficient large-scale model training.
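The multi-node communication this WG had to tune can be pictured with a minimal PyTorch/NCCL initialization like the one below; the environment variables and launch method are assumptions typical of torchrun-style jobs, not the actual mdx setup.

```python
# Minimal multi-node NCCL setup sketch; not the actual mdx launch configuration.
import os

import torch
import torch.distributed as dist

def init_distributed() -> None:
    # NCCL handles GPU-to-GPU communication; rendezvous info (MASTER_ADDR, RANK,
    # WORLD_SIZE) is assumed to be provided by the launcher, e.g. torchrun.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)   # bind this process to a single GPU

if __name__ == "__main__":
    init_distributed()
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)                  # collectives like this stress inter-node links,
                                        # which is where network tuning pays off at scale
    if dist.get_rank() == 0:
        print(f"world size: {dist.get_world_size()}, sum: {x.item()}")
```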
Model Building WG
Although the team started with limited large-scale pre-training experience, it built on existing frameworks such as Megatron-DeepSpeed and achieved significant milestones:
- Pre-trained model v1.0: A 13B-parameter model using a GPT-2 based architecture.
- Pre-trained model v2.0: Transitioned to a Llama-based architecture, refined through exploratory experiments comparing various configurations, such as vocabulary size and corpus type.
Key achievements include optimizing model training settings for computational efficiency and handling loss divergence effectively during training.
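One common way to handle loss divergence, sketched below, is to monitor the gradient norm and skip updates (or roll back to a checkpoint) when it spikes; the threshold and overall logic here are a generic illustration, not the Model Building WG's specific recipe.

```python
# Generic guard against loss spikes: skip an optimizer step when the gradient
# norm is non-finite or unusually large. The threshold value is an assumption.
import torch

GRAD_NORM_THRESHOLD = 10.0

def guarded_step(model: torch.nn.Module,
                 optimizer: torch.optim.Optimizer,
                 loss: torch.Tensor) -> bool:
    """Backpropagate and apply the update only if the gradient norm looks healthy."""
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # clip_grad_norm_ returns the total norm computed before clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    if not torch.isfinite(grad_norm) or grad_norm > GRAD_NORM_THRESHOLD:
        return False  # skip this update; a real run might also reload a checkpoint
    optimizer.step()
    return True
```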
Fine-Tuning and Evaluation
The Fine-tuning and Evaluation WG has developed and released multiple iterations of fine-tuned models:
- v1.0: Focused on Japanese instruction data and machine-translated datasets.
- v1.1: Improved instruction-following abilities using a more extensive dataset and Direct Preference Optimization (DPO; sketched after this list).
- v2.0: Emphasized safety in responses and incorporated comprehensive instruction datasets, including AnswerCarefully for safety alignment.
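The DPO objective mentioned for v1.1 can be written compactly as below; the beta value and the way log-probabilities are obtained are assumptions for illustration, not the exact training setup of the released models.

```python
# Minimal DPO loss sketch: encourage the policy to widen the log-probability
# margin between chosen and rejected responses relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:  # beta is an assumed value
    """Return the mean DPO loss over a batch of preference pairs."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen margin - rejected margin))
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```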
Evaluation frameworks developed by the WG include LLM-jp-eval, a multifaceted benchmarking tool for diverse NLP tasks, and extended evaluations using the Japanese Vicuna QA benchmark and Japanese MT-Bench.
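The kind of multi-task evaluation these tools perform can be pictured with a generic exact-match harness like the sketch below; the task format, the generate callable, and the metric are illustrative assumptions, not the actual LLM-jp-eval interface.

```python
# Generic multi-task, exact-match evaluation sketch; not the LLM-jp-eval API.
from typing import Callable, Dict, List

def evaluate(generate: Callable[[str], str],
             tasks: Dict[str, List[dict]]) -> Dict[str, float]:
    """Return per-task exact-match accuracy over prompt/answer examples."""
    scores: Dict[str, float] = {}
    for task_name, examples in tasks.items():
        correct = sum(
            generate(ex["prompt"]).strip() == ex["answer"].strip()
            for ex in examples
        )
        scores[task_name] = correct / len(examples)
    return scores

if __name__ == "__main__":
    # Dummy model and a single toy example, for demonstration only.
    dummy_tasks = {"toy_nli": [{"prompt": "premise ... hypothesis ...", "answer": "entailment"}]}
    print(evaluate(lambda prompt: "entailment", dummy_tasks))
```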
Safety
The Safety WG highlights the necessity of culturally specific safety measures. The WG developed AnswerCarefully and the LLM-jp Toxicity Dataset to align models with standards appropriate to the Japanese language and culture. Collaboration with external researchers is instrumental in understanding regional nuances of safety and bias.
Conclusion
LLM-jp exemplifies a collaborative, open, and transparent initiative aimed at advancing Japanese NLP. The project has established a significant operational framework and produced multiple high-quality LLMs with comprehensive public resources. Future endeavors will focus on scaling with larger models and extended datasets, while continuously refining safety and evaluation benchmarks.
The project serves as a model for cross-organizational collaboration, facilitating advancements in the field of Japanese LLMs and promoting international cooperation.