Comparative Study of LLM Architectures on Frontier
The paper presents a systematic examination of two Generative Pre-trained Transformer (GPT) architecture variants, GPT-NeoX and LLaMA, under a computationally intensive pre-training regime on the Frontier supercomputer, the world's first exascale system. It describes a methodology for training LLMs tailored to materials science, implementing a comprehensive end-to-end training pipeline aimed at state-of-the-art outcomes.
Key Contributions
- Controlled Training Setup: The paper identifies significant variability in reported GPT performance attributable to differing experimental conditions such as data preprocessing and training hyperparameters. By standardizing these factors, the authors conduct a controlled study that permits direct comparison of the training and downstream performance of the GPT-NeoX and LLaMA architectures (a configuration sketch follows this list).
- Harnessing Frontier's Computational Power: The research benefits from Frontier's robust computational capabilities, allowing the authors to pre-train models with large parameter counts on an extensive input corpus, which is vital for advancing LLM efficacy in domain-specific applications.
- Evaluation of Model Architectures: The authors evaluate two prominent open-source GPT architectures on a materials science text corpus, examining zero-shot and few-shot performance on established language benchmarks (see the evaluation sketch after this list). They also analyze computational and energy efficiency, offering insights into architecture designs well suited to high-performance computing platforms.
- Introduction of MatGPT: The paper introduces MatGPT, a newly pre-trained set of foundation models for materials science. The models are posited as the largest publicly available for this domain, offering practical utility in scientific knowledge extraction, as evidenced by strong performance on a challenging materials science benchmark.
- Enabling Efficient Scientific Applications: The authors propose a novel downstream regression task that uses knowledge distilled from scientific publications to improve material property prediction. By injecting embeddings from the pre-trained models into graph neural networks, they achieve superior performance on band gap prediction, underscoring the scientific modeling these foundation models can support (a fusion sketch follows this list).
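To make the controlled-setup idea concrete, here is a minimal sketch of a shared configuration that pins down everything except the architecture under test. All field names and values are illustrative assumptions, not the paper's actual settings.

```python
# Hypothetical shared configuration: every field except `architecture`
# is held fixed so that GPT-NeoX and LLaMA runs differ only in the model.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    architecture: str          # the only factor that varies between runs
    tokenizer: str = "bpe"     # identical tokenizer for both runs
    vocab_size: int = 50_304
    seq_len: int = 2_048
    global_batch_tokens: int = 4_194_304
    lr: float = 3e-4
    lr_schedule: str = "cosine"
    warmup_steps: int = 2_000
    seed: int = 1234           # same seed, hence same data order

base = TrainConfig(architecture="gpt-neox")
runs = [base, replace(base, architecture="llama")]

for cfg in runs:
    print(cfg.architecture, "trains with identical data and hyperparameters")
```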
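The zero-shot evaluation mentioned above can be illustrated with a simple log-likelihood scoring loop: each candidate answer is scored by the probability the model assigns to it, and the highest-scoring option wins. This is a minimal sketch, not the paper's evaluation harness; the checkpoint, prompt, and options are placeholders.

```python
# Minimal zero-shot multiple-choice scoring sketch. Simplification: assumes
# the tokenizer splits cleanly at the prompt/option boundary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # stand-in checkpoint, not MatGPT
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to `option` given `prompt`."""
    ids = tok(prompt + option, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = ids[0, 1:]
    per_token = logprobs[torch.arange(targets.shape[0]), targets]
    return per_token[prompt_len - 1:].sum().item()  # keep only the option tokens

prompt = "Silicon is a"
options = [" semiconductor.", " noble gas."]
print("model picks:", max(options, key=lambda o: option_logprob(prompt, o)))
```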
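For the band gap regression task, the fusion idea can be sketched as concatenating a frozen LLM embedding of a material's textual description with a graph-level feature vector before a regression head. The dimensions and the MLP below are illustrative assumptions; the paper's actual GNN pipeline is more involved.

```python
# Hedged sketch: fuse a frozen LLM text embedding with a pooled graph
# embedding, then regress the band gap. Shapes are arbitrary placeholders.
import torch
import torch.nn as nn

class BandGapHead(nn.Module):
    def __init__(self, llm_dim: int = 768, graph_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim + graph_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # predicted band gap in eV
        )

    def forward(self, llm_emb: torch.Tensor, graph_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([llm_emb, graph_emb], dim=-1)).squeeze(-1)

# llm_emb could be, e.g., a mean-pooled hidden state of the material's
# description from the pre-trained model; graph_emb the pooled GNN output.
head = BandGapHead()
pred = head(torch.randn(4, 768), torch.randn(4, 128))
print(pred.shape)  # torch.Size([4])
```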
Implications and Future Directions
- Realizing an Optimal Data-to-Parameter Ratio: Aligning with findings from other scaling studies, the authors explore the data-to-parameter ratio that optimizes model performance. They establish that while increasing model size and pre-training data predictably improves accuracy, the specifics of loss reduction suggest that careful tuning of the token budget and architectural components (such as tokenizer type and vocabulary size) significantly influences outcomes (see the scaling sketch after this list).
- Practical Guidelines for HPC-Based LLM Training: The paper offers pragmatic guidance grounded in a nuanced understanding of communication overheads in distributed training. The efficiencies gained in reducing time-to-solution can inform future efforts to minimize energy consumption and maximize computational throughput when training LLMs on exascale systems (a back-of-the-envelope communication estimate follows this list).
- Advancements in Domain-Specific LLM Applications: The effectiveness of MatGPT in materials science illustrates the potential for larger LLMs in other scientific domains. The methodology presents a blueprint for researchers aiming to deploy LLMs tailored to niche scientific fields, which can lead to improvements in efficiency and innovation in data-driven scientific research.
- Energy Efficiency in Large-Scale Model Training: Another significant aspect of the research is its attention to energy consumption during pre-training. Methodologies that jointly optimize computational resource use and power draw chart a future research trajectory toward sustainable AI practices (see the energy-per-token sketch after this list).
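As a rough illustration of the data-to-parameter trade-off, the widely cited Chinchilla result (Hoffmann et al., 2022) suggests roughly 20 training tokens per parameter under a compute budget of about C ≈ 6ND FLOPs. The snippet below applies that heuristic; the paper's own optimum and model sizes may differ.

```python
# Chinchilla-style heuristic: compute budget C ≈ 6 * N * D FLOPs, with the
# compute-optimal token count D ≈ 20 * N. Purely illustrative numbers.
def token_budget(n_params: float, tokens_per_param: float = 20.0):
    d_tokens = tokens_per_param * n_params
    flops = 6 * n_params * d_tokens
    return d_tokens, flops

for n in (1.4e9, 7e9, 13e9):  # example model sizes, not the paper's
    d, c = token_budget(n)
    print(f"{n/1e9:.1f}B params -> ~{d/1e9:.0f}B tokens, ~{c:.2e} FLOPs")
```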
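The communication-overhead point can be made concrete with a standard estimate: a ring all-reduce of data-parallel gradients moves about 2(p-1)/p times the gradient size per rank per step. The numbers below are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope data-parallel communication cost: a ring all-reduce
# moves about 2*(p-1)/p times the gradient size per rank per step.
def allreduce_time_s(n_params: float, p: int,
                     bytes_per_grad: int = 2,        # fp16/bf16 gradients
                     link_gbps: float = 200.0) -> float:
    grad_bytes = n_params * bytes_per_grad
    moved = 2 * (p - 1) / p * grad_bytes             # per-rank traffic in bytes
    return moved * 8 / (link_gbps * 1e9)             # seconds at line rate

# E.g., a 7B-parameter model across 512 ranks (idealized, ignores overlap):
print(f"{allreduce_time_s(7e9, p=512):.3f} s per step")
```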
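Finally, a simple energy-per-token accounting connects power draw to training throughput: joules per token is aggregate power divided by tokens processed per second. All figures below are assumptions for illustration, not Frontier measurements reported in the paper.

```python
# Illustrative energy accounting: joules per trained token from node power
# draw and aggregate throughput.
def joules_per_token(nodes: int, watts_per_node: float,
                     tokens_per_sec: float) -> float:
    return nodes * watts_per_node / tokens_per_sec

jpt = joules_per_token(nodes=128, watts_per_node=2_500, tokens_per_sec=1.5e6)
print(f"~{jpt:.3f} J/token")  # lower is better; compare across architectures
```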
The research sheds light on best practices for deploying LLMs on shared HPC platforms under tight computational and energy constraints. It also underscores the transformative potential of such models in driving advances within specialized scientific domains.