Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations (2410.11381v1)

Published 15 Oct 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The advent of the Attention mechanism and the Transformer architecture enables contextually natural text generation and compresses the burden of processing entire source information into singular vectors. Based on these two main ideas, model sizes have gradually increased to accommodate more precise and comprehensive information, leading to current state-of-the-art LLMs that are very large, with around 70 billion parameters. As model sizes grow, the demand for substantial storage and computational capacity increases. This leads to the development of high-bandwidth memory and accelerators, as well as a variety of model architectures designed to meet these requirements. We note that LLM architectures have increasingly converged. This paper analyzes how these converged architectures perform in terms of layer configurations, operational mechanisms, and model sizes, considering various hyperparameter settings. In this paper, we conduct a concise survey of the history of LLMs by tracing the evolution of their operational improvements. Furthermore, we summarize the performance trends of LLMs under various hyperparameter settings using the RTX 6000, which features the state-of-the-art Ada Lovelace architecture. We conclude that even the same model can exhibit different behaviors depending on the hyperparameters or whether it is deployed in server or edge environments.

Authors (5)
  1. Seongho Kim (4 papers)
  2. Jihyun Moon (1 paper)
  3. Juntaek Oh (2 papers)
  4. Insu Choi (3 papers)
  5. Joon-Sung Yang (4 papers)

Summary

Survey and Evaluation of Converging Architecture in LLMs

The paper "Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations" presents a meticulous analysis of the evolving architectures of LLMs, which have become integral tools in NLP. As LLMs continue to expand in their capability and complexity, this paper comprehensively evaluates their architectural convergence, operational mechanisms, and performance dynamics across various hyperparameter settings.

The authors trace the progression from early rule-based systems and RNNs to today's transformer-based architectures. The central theme is the advent of the attention mechanism and the Transformer, which significantly improved processing efficiency and contextual understanding. As LLMs have grown to colossal sizes, often around 70 billion parameters, the paper emphasizes the resulting burdens on storage and computational resources, which have necessitated advances in memory and processing accelerators.

The survey compares the evolution from RNNs, highlighting the limitations addressed by LSTMs, and follows the transition through encoder-decoder architectures to transformers, which eschew sequential computation in favor of parallelizable operations. This transition allows modern LLMs to leverage multi-head attention and feed-forward networks for improved model performance.
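
To make the parallelism point concrete, the sketch below implements a generic multi-head attention layer in NumPy: all heads and all positions are processed with batched matrix products rather than a sequential recurrence. The dimensions (d_model=64, 4 heads) are arbitrary illustrative choices, not values taken from the paper.

```python
import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project and split into heads: (n_heads, seq_len, d_head)
    def split(t):
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product attention for all heads and positions at once
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)      # (heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                 # row-wise softmax
    heads = weights @ v                                        # (heads, seq, d_head)

    # Concatenate heads and apply the output projection
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 4, 10
w = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
x = rng.standard_normal((seq_len, d_model))
print(multi_head_attention(x, *w, n_heads=n_heads).shape)  # (10, 64)
```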

Key architectures such as Meta's Llama and Google's Gemma are analyzed in depth, illustrating design choices like grouped-query attention (GQA) and multi-query attention (MQA), which aim to reduce computational cost and memory usage.
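
The memory benefit of these variants comes from sharing key/value heads across groups of query heads, which shrinks the KV cache. The back-of-the-envelope sketch below uses Llama-2-70B-like shapes (80 layers, 64 query heads, head dimension 128, fp16) purely as an illustrative assumption rather than figures reported in the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values, cached per layer for every generated/prompt token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

cfg = dict(n_layers=80, head_dim=128, seq_len=4096, batch=1)
for name, kv_heads in [("MHA (64 KV heads)", 64),
                       ("GQA  (8 KV heads)", 8),
                       ("MQA  (1 KV head)", 1)]:
    gib = kv_cache_bytes(n_kv_heads=kv_heads, **cfg) / 2**30
    print(f"{name:<20} ~{gib:.2f} GiB of KV cache")
```

With these assumed shapes, moving from full multi-head attention to GQA with 8 KV heads cuts the per-request KV cache from roughly 10 GiB to about 1.25 GiB, which is the kind of saving that motivates the convergence on GQA in recent models.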

The paper presents detailed empirical evaluations, using an RTX 6000 GPU built on NVIDIA's Ada Lovelace architecture, to assess performance variations across models. The results indicate substantial differences in behavior depending on hyperparameter settings and deployment environment, underscoring the need for careful configuration to optimize model performance in diverse contexts.

The paper also examines the kernel operations that dominate LLM execution, contrasting GEMM- and GEMV-intensive workloads across operational stages. It finds that in edge environments with small batch sizes, GEMV operations dominate the generation stage, whereas GEMM takes precedence in server environments with larger batch sizes.
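
A simple way to see this distinction: during generation each request contributes one new activation row per step, so at batch size 1 every linear layer reduces to a matrix-vector product (GEMV, typically memory-bandwidth bound), while batching many requests turns the same layer into a matrix-matrix product (GEMM, typically compute bound). The shapes below are illustrative assumptions (Llama-7B-like hidden sizes), not measurements from the paper.

```python
import numpy as np

d_model, d_ff = 4096, 11008                      # hypothetical Llama-7B-like sizes
w = np.random.randn(d_model, d_ff).astype(np.float32)

# Edge-style decode: batch size 1 -> one activation vector per step (GEMV)
x_edge = np.random.randn(1, d_model).astype(np.float32)
y_edge = x_edge @ w                              # (1, d_ff): bandwidth-bound

# Server-style decode: many requests batched together (GEMM)
x_server = np.random.randn(64, d_model).astype(np.float32)
y_server = x_server @ w                          # (64, d_ff): compute-bound

print(y_edge.shape, y_server.shape)
```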

The implications of this research are multifaceted. Practically, it guides the selection and tuning of LLMs based on specific use cases, whether edge or server-based. Theoretically, it highlights the architectural convergences and computational strategies that are likely to influence future AI developments. As LLMs continue to scale, the paper anticipates further innovations in hardware and model design to manage the increasing demands.

Overall, this survey provides a robust foundation for understanding the current landscape of LLM architectures, shedding light on both the challenges and opportunities within the domain of advanced natural language processing.
