PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency (2410.07563v2)

Published 10 Oct 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce PLaMo-100B, a large-scale LLM designed for Japanese proficiency. The model was trained from scratch on 2 trillion tokens, with architectural features such as QK Normalization and Z-Loss used to ensure stability during training. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model's performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly on Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4. The base model is available at https://huggingface.co/pfnet/plamo-100b.
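
Since the base checkpoint is published on Hugging Face, a minimal loading sketch with the transformers library might look like the following. The generation settings are illustrative, and the assumption that the repository ships custom modeling code (hence trust_remote_code=True) should be verified against the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pfnet/plamo-100b"
# trust_remote_code=True is assumed because PLaMo repositories typically ship
# custom modeling code; check the model card before relying on this.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",   # a 100B-parameter model needs multiple GPUs or quantization
    device_map="auto",    # requires the accelerate package
)

prompt = "日本で一番高い山は"  # "The tallest mountain in Japan is" (illustrative prompt)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```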

Summary

  • The paper presents PLaMo-100B, a 100-billion-parameter model trained from scratch on 2 trillion tokens to strengthen Japanese NLP performance.
  • Training stability and efficiency rely on QK Normalization, Z-Loss, 3D parallelism, the Zero Bubble schedule, and FP8 training on NVIDIA H100 GPUs.
  • Evaluations on benchmarks such as Jaster and the Japanese MT-Bench show strong results on Japanese-specific tasks, competitive with frontier models like GPT-4.

An Expert Overview of PLaMo-100B: An LLM for Japanese Proficiency

The paper "PLaMo-100B: A Ground-Up LLM Designed for Japanese Proficiency" presents the development and evaluation of PLaMo-100B, a 100 billion parameter LLM specifically tailored for Japanese tasks while also demonstrating considerable proficiency in English. Developed by Preferred Elements, Inc., PLaMo-100B underscores several methodological innovations and performance evaluations pertinent to natural language processing in Japanese.

Model Development and Architecture

PLaMo-100B is not a derivative of existing model weights; it was trained from the ground up on a corpus of 2 trillion tokens. The corpus draws on diverse sources such as RefinedWeb and CommonCrawl, with particular emphasis on acquiring Japanese-language data. Because high-quality Japanese datasets are scarce, the authors relied on an extensive web-crawling and preprocessing pipeline to assemble this portion of the corpus.
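
The paper does not spell out the preprocessing pipeline in this summary, but the sketch below illustrates the kind of filtering such a pipeline typically includes: a Japanese-character-ratio heuristic plus exact deduplication by hashing. The thresholds and heuristics are assumptions for illustration, not the authors' method.

```python
# Illustrative sketch only: simple filters of the kind used when assembling a
# Japanese web corpus. The heuristics and thresholds are assumptions, not the
# preprocessing pipeline described in the paper.
import hashlib
import re

# Hiragana, katakana, and common CJK ideograph ranges.
_JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def japanese_ratio(text: str) -> float:
    """Fraction of characters that fall in Japanese-relevant Unicode ranges."""
    if not text:
        return 0.0
    return len(_JA_CHARS.findall(text)) / len(text)

def filter_and_dedup(docs, min_ja_ratio=0.3, min_chars=200):
    """Yield documents that look Japanese and have not been seen before (exact dedup)."""
    seen = set()
    for doc in docs:
        if len(doc) < min_chars or japanese_ratio(doc) < min_ja_ratio:
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc
```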

The model's architecture is a decoder-only transformer enhanced by techniques like QK Normalization and Z-Loss to stabilize training—crucial for handling large-scale models. The training leveraged modern methodologies, including 3D parallelism, the Zero Bubble technique, and FP8 training, using NVIDIA’s H100 GPUs to manage computational demands efficiently.
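
As a rough illustration of the two stabilization techniques named above, the following PyTorch sketch applies per-head RMS normalization to queries and keys before the attention dot product (QK Normalization) and adds an auxiliary z-loss on the output logits. The shapes, module layout, and z-loss coefficient are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Minimal RMSNorm (kept local so the sketch runs on older PyTorch versions)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class QKNormAttention(nn.Module):
    """Causal self-attention with RMS normalization of queries and keys (QK Normalization)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Normalizing q and k per head bounds the attention logits,
        # which helps prevent loss spikes in large-scale training.
        self.q_norm = RMSNorm(self.d_head)
        self.k_norm = RMSNorm(self.d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # QK Normalization
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))


def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Auxiliary z-loss: penalize large log-partition values so output logits stay small."""
    log_z = torch.logsumexp(logits, dim=-1)  # log of the softmax normalizer Z
    return coeff * (log_z ** 2).mean()
```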

Post-Training Enhancements

Post-training of PLaMo-100B involved Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to strengthen its instruction-following capabilities. Model merging further refined performance, and synthetic data was used to shore up areas where high-quality datasets were lacking.
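
For reference, the standard DPO objective that this stage builds on can be written compactly. The sketch below assumes the per-response summed token log-probabilities under the trainable policy and a frozen reference model have already been computed; the beta value is illustrative, and this is the generic loss rather than the paper's exact training code.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.

    Each input is a 1-D tensor of summed token log-probabilities for the chosen
    or rejected response under the trainable policy or the frozen reference model.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: the policy is pushed to prefer
    # the chosen response more strongly than the reference model does.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```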

Benchmarking and Evaluation

PLaMo-100B's performance was evaluated against benchmarks such as Jaster, the Japanese MT-Bench, and the Rakuda Benchmark. The results indicate strong performance on Japanese-specific tasks, with PLaMo-100B-Instruct surpassing GPT-4-0125-Preview in several categories. Mathematical reasoning, by contrast, remains a comparatively weak area and a target for further refinement.

Implications and Future Directions

The strong performance of PLaMo-100B on Japanese tasks has clear implications for applications that require Japanese NLP proficiency, including translation services, information retrieval, and conversational AI systems tailored to Japanese users. The methodologies employed, particularly around dataset generation and training stabilization, offer useful insights for the ongoing development of LLMs.

Looking forward, there is potential to enhance its scientific and mathematical reasoning capabilities further. As AI continues to develop, the methodologies and insights from PLaMo-100B can inform future model architectures and training processes, especially for languages with underrepresented data.

In conclusion, PLaMo-100B represents a significant advance in Japanese language modeling, backed by sound engineering choices and thorough evaluation. It marks a concrete step toward large-scale LLM development for non-English languages and sets a precedent for future work in the field.
