LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models (2405.18377v1)

Published 28 May 2024 in cs.AI

Abstract: The abilities of modern LLMs in solving natural language processing, complex reasoning, sentiment analysis and other tasks have been extraordinary which has prompted their extensive adoption. Unfortunately, these abilities come with very high memory and computational costs which precludes the use of LLMs on most hardware platforms. To mitigate this, we propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS. In particular, we fine-tune LLaMA2-7B only once and then apply genetic algorithm-based search to find smaller, less computationally complex network architectures. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex. More specifically, we demonstrate a 1.5x reduction in model size and 1.3x speedup in throughput for certain tasks with negligible drop in accuracy. In addition to finding smaller, higher-performing network architectures, our method does so more effectively and efficiently than certain pruning or sparsification techniques. Finally, we demonstrate how quantization is complementary to our method and that the size and complexity of the networks we find can be further decreased using quantization. We believe that our work provides a way to automatically create LLMs which can be used on less expensive and more readily available hardware platforms.

References (18)

Authors (4)

Anthony Sarah (10 papers)
Sharath Nittur Sridhar (16 papers)
Maciej Szankin (7 papers)
Sairam Sundaresan (17 papers)

Citations (1)

View on Semantic Scholar

Summary

A Method for Pareto-Optimal Network Architectures in LLMs

The paper by Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, and Sairam Sundaresan addresses the computational and memory challenges associated with deploying LLMs, such as LLaMA2-7B, on hardware platforms other than high-end GPUs. The authors propose a novel method utilizing one-shot Neural Architecture Search (NAS) to find Pareto-optimal network architectures that maintain performance while reducing size and complexity. This method leverages evolutionary search techniques to discover efficient sub-networks without requiring extensive re-training, thus optimizing both computational efficiency and model performance.

Key Contributions

The authors list several significant contributions:

Application of One-Shot NAS for LLMs: To the authors' knowledge, this is the first application of one-shot NAS to efficiently reduce the size and computational complexity of LLMs. They demonstrate that for certain standard benchmark tasks, LLaMA2-7B is larger than necessary.
Outperformance of Pruning and Sparsification: The proposed method outperforms traditional pruning and sparsification techniques without needing additional recovery fine-tuning.
Parameter Analysis: A thorough analysis of network parameters reveals no single set of architectural heuristics can be universally applied across multiple standard benchmark tasks.
Generalizability: The framework produces compressed LLMs usable "out-of-the-box" without specialized software kernels or hardware. These networks can be further compressed via standard quantization techniques.

Methods

The proposed method adapts the InstaTune paradigm for NAS, which integrates architecture search within the fine-tuning phase to conserve computational resources. Specifically, the authors fine-tuned the LLaMA2-7B on the Alpaca dataset using techniques from InstaTune, but unlike InstaTune, they did not employ strong teacher or knowledge distillation, favoring an evolutionary search framework (LINAS) to refine the architecture in a multi-objective setting—optimizing for model size and accuracy.

Evaluation

Hyper-Parameters: The model was fine-tuned for six epochs with an initial learning rate of $10^{-5}$ and a global batch size of 128. The LINAS algorithm was used with a population size of 50 and 250 iterations per task.

Tasks: The method was evaluated on several common LLM benchmarks:

AI2 Reasoning Challenge (ARC): Tasks with different complexity levels (Easy and Challenge).
Massive Multitask Language Understanding (MMLU): To measure knowledge acquisition across a variety of subjects.
TruthfulQA: To assess the truthfulness of model-generated responses.
WinoGrande: For commonsense reasoning, focusing on pronoun resolution challenges.

Results

The findings highlighted several Pareto-optimal sub-networks that exhibit significant size reductions and throughput improvements with minimal or no drop in accuracy. Notably:

For ARC, specific sub-networks demonstrated a 1.1x smaller size while maintaining the same accuracy as the pre-trained LLaMA2-7B model.
In the MMLU benchmark, certain sub-networks showed improvements with a 1.5x reduction in model size and a 1.3x speedup in inference throughput.
For TruthfulQA, sub-networks achieved a 3.6\% increase in accuracy while being 1.6x smaller.
For WinoGrande, sub-networks maintained accuracy with a 1.1x reduction in size.

Comparative Performance

The method demonstrated superior performance over contemporary pruning and sparsification techniques (LLM-Pruner and SliceGPT), both in computational efficiency and the resultant accuracy of reduced models. Importantly, the approach did not require the recovery fine-tuning generally necessary for those techniques.

Quantization

The authors further explored quantization, applying INT8 fixed-point quantization to the Pareto-optimal sub-networks. This resulted in considerable additional reductions in model size without compromising accuracy, thus enabling deployment on more modest hardware configurations.

Implications and Future Work

This research offers a compelling approach to making LLMs more accessible by facilitating the deployment on less expensive and more available hardware platforms. The method promises practical applications in deploying robust LLMs in environments with limited computational resources.

Future work could extend the exploration of automated NAS methodologies to other types of LLMs and tasks, aiming to generalize the findings further. Additionally, integrating these approaches with other model compression techniques could offer even more significant reductions in computational demands.

Conclusion

Overall, the paper presents a noteworthy advancement in the field of LLM optimization, significantly reducing the barriers to deploying powerful LLMs across a broader range of hardware platforms. This method stands out due to its efficacy, efficiency, and the comprehensive analysis of network architectures perfectly suited for specific benchmark tasks.

Related Papers

Find Related Papers

Tweets

https://twitter.com/rasbt/status/1795811383512088818

https://twitter.com/_akhaliq/status/1795654894021468312

https://twitter.com/TheTuringPost/status/1798507169924702675

https://twitter.com/fly51fly/status/1795932573987991895

https://twitter.com/IAmACatAI/status/1795726779757560225

https://twitter.com/arxivsanitybot/status/1796001882399510661