LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models (2405.18377v1)

Published 28 May 2024 in cs.AI

Abstract: The abilities of modern LLMs in solving natural language processing, complex reasoning, sentiment analysis and other tasks have been extraordinary which has prompted their extensive adoption. Unfortunately, these abilities come with very high memory and computational costs which precludes the use of LLMs on most hardware platforms. To mitigate this, we propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS. In particular, we fine-tune LLaMA2-7B only once and then apply genetic algorithm-based search to find smaller, less computationally complex network architectures. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex. More specifically, we demonstrate a 1.5x reduction in model size and 1.3x speedup in throughput for certain tasks with negligible drop in accuracy. In addition to finding smaller, higher-performing network architectures, our method does so more effectively and efficiently than certain pruning or sparsification techniques. Finally, we demonstrate how quantization is complementary to our method and that the size and complexity of the networks we find can be further decreased using quantization. We believe that our work provides a way to automatically create LLMs which can be used on less expensive and more readily available hardware platforms.

A Method for Pareto-Optimal Network Architectures in LLMs

The paper by Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, and Sairam Sundaresan addresses the computational and memory challenges associated with deploying LLMs, such as LLaMA2-7B, on hardware platforms other than high-end GPUs. The authors propose a novel method utilizing one-shot Neural Architecture Search (NAS) to find Pareto-optimal network architectures that maintain performance while reducing size and complexity. This method leverages evolutionary search techniques to discover efficient sub-networks without requiring extensive re-training, thus optimizing both computational efficiency and model performance.
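
To make the setup concrete, the sketch below shows one way a one-shot NAS search space over a LLaMA2-7B super-network could be encoded: each candidate is a choice of depth and of per-layer MLP width, later scored on model size and task accuracy. The field names, the depth range, and all width choices except 11008 (LLaMA2-7B's full MLP width) are illustrative assumptions, not the paper's exact search space.

```python
from dataclasses import dataclass
import random

@dataclass
class SubnetConfig:
    """Hypothetical encoding of one candidate architecture drawn from the super-network."""
    num_layers: int            # elastic depth (LLaMA2-7B has 32 decoder layers)
    intermediate_sizes: list   # elastic MLP width for each retained layer

def random_subnet(max_layers=32, width_choices=(5504, 6912, 8192, 9632, 11008)):
    """Sample a candidate; 11008 is the full LLaMA2-7B MLP width, the rest are illustrative."""
    depth = random.randint(24, max_layers)  # lower bound of 24 is an arbitrary example
    return SubnetConfig(
        num_layers=depth,
        intermediate_sizes=[random.choice(width_choices) for _ in range(depth)],
    )

print(random_subnet())
```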

Key Contributions

The authors list several significant contributions:

  1. Application of One-Shot NAS for LLMs: To the authors' knowledge, this is the first application of one-shot NAS to efficiently reduce the size and computational complexity of LLMs. They demonstrate that for certain standard benchmark tasks, LLaMA2-7B is larger than necessary.
  2. Outperformance of Pruning and Sparsification: The proposed method outperforms traditional pruning and sparsification techniques without needing additional recovery fine-tuning.
  3. Parameter Analysis: A thorough analysis of network parameters reveals that no single set of architectural heuristics can be applied universally across multiple standard benchmark tasks.
  4. Generalizability: The framework produces compressed LLMs usable "out-of-the-box" without specialized software kernels or hardware. These networks can be further compressed via standard quantization techniques.

Methods

The proposed method adapts the InstaTune paradigm, which folds the architecture search into the fine-tuning phase to conserve computational resources. Following InstaTune, the authors fine-tuned LLaMA2-7B on the Alpaca dataset, but unlike InstaTune they did not employ a strong teacher model or knowledge distillation. They then applied the LINAS evolutionary search framework to explore sub-network architectures in a multi-objective setting, optimizing jointly for model size and accuracy.
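
The paper relies on LINAS, a predictor-guided evolutionary algorithm from prior work, and does not spell out its internals here; the sketch below is a plain genetic-algorithm stand-in that only illustrates the multi-objective selection over (model size, accuracy). The `evaluate` and `mutate` callables are hypothetical placeholders supplied by the caller.

```python
import random

def dominates(score_a, score_b):
    """score = (model_size, accuracy); smaller size and higher accuracy are better."""
    size_a, acc_a = score_a
    size_b, acc_b = score_b
    return size_a <= size_b and acc_a >= acc_b and (size_a < size_b or acc_a > acc_b)

def evolutionary_search(initial_population, evaluate, mutate, iterations=250):
    """Toy multi-objective loop: keep the non-dominated set, refill by mutation."""
    population = list(initial_population)
    for _ in range(iterations):
        scored = [(cfg, evaluate(cfg)) for cfg in population]  # evaluate -> (size, accuracy)
        pareto = [cfg for i, (cfg, s) in enumerate(scored)
                  if not any(dominates(t, s) for j, (_, t) in enumerate(scored) if j != i)]
        population = pareto + [mutate(random.choice(pareto))
                               for _ in range(max(0, len(initial_population) - len(pareto)))]
    return pareto
```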

Evaluation

Hyper-Parameters: The model was fine-tuned for six epochs with an initial learning rate of 10^-5 and a global batch size of 128. The LINAS algorithm was used with a population size of 50 and 250 iterations per task.
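
For quick reference, the reported settings can be collected into a single config object; values not stated in the summary (optimizer, scheduler, search-space bounds) are intentionally left out rather than guessed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchSettings:
    # Super-network fine-tuning on Alpaca
    epochs: int = 6
    learning_rate: float = 1e-5
    global_batch_size: int = 128
    # LINAS evolutionary search, run separately for each benchmark task
    population_size: int = 50
    search_iterations: int = 250

print(SearchSettings())
```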

Tasks: The method was evaluated on several common LLM benchmarks (see the evaluation sketch after this list):

  • AI2 Reasoning Challenge (ARC): Tasks with different complexity levels (Easy and Challenge).
  • Massive Multitask Language Understanding (MMLU): To measure knowledge acquisition across a variety of subjects.
  • TruthfulQA: To assess the truthfulness of model-generated responses.
  • WinoGrande: For commonsense reasoning, focusing on pronoun resolution challenges.
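
The paper does not name its evaluation tooling; a common choice for exactly these benchmarks is EleutherAI's lm-evaluation-harness, so the sketch below assumes it (lm_eval 0.4.x). The task identifiers and the use of the base LLaMA2-7B checkpoint are illustrative assumptions.

```python
import lm_eval  # EleutherAI lm-evaluation-harness (assumed tooling; not specified by the paper)

# Score a candidate checkpoint on the benchmarks listed above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # swap in a searched sub-network checkpoint
    tasks=["arc_easy", "arc_challenge", "mmlu", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```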

Results

The findings highlighted several Pareto-optimal sub-networks that exhibit significant size reductions and throughput improvements with minimal or no drop in accuracy. Notably:

  • For ARC, specific sub-networks were 1.1x smaller while matching the accuracy of the pre-trained LLaMA2-7B model.
  • On MMLU, certain sub-networks achieved a 1.5x reduction in model size together with a 1.3x speedup in inference throughput.
  • On TruthfulQA, sub-networks achieved a 3.6% increase in accuracy while being 1.6x smaller.
  • On WinoGrande, sub-networks maintained accuracy with a 1.1x reduction in size.

Comparative Performance

The method demonstrated superior performance over contemporary pruning and sparsification techniques (LLM-Pruner and SliceGPT), both in computational efficiency and the resultant accuracy of reduced models. Importantly, the approach did not require the recovery fine-tuning generally necessary for those techniques.

Quantization

The authors further explored quantization, applying INT8 fixed-point quantization to the Pareto-optimal sub-networks. This resulted in considerable additional reductions in model size without compromising accuracy, thus enabling deployment on more modest hardware configurations.
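
The summary does not spell out the quantization tooling beyond the paper's citation of bitsandbytes; the sketch below shows one conventional route, loading a placeholder sub-network checkpoint with INT8 weights through Hugging Face Transformers. The checkpoint path and prompt are hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder path standing in for one of the searched Pareto-optimal sub-networks.
checkpoint = "path/to/pareto-optimal-subnetwork"

# 8-bit weight quantization via bitsandbytes, the library cited by the paper.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

prompt = "Question: Which planet is known as the Red Planet?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```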

Implications and Future Work

This research offers a compelling approach to making LLMs more accessible by facilitating deployment on less expensive and more readily available hardware platforms. The method promises practical applications for deploying robust LLMs in environments with limited computational resources.

Future work could extend the exploration of automated NAS methodologies to other types of LLMs and tasks, aiming to generalize the findings further. Additionally, integrating these approaches with other model compression techniques could offer even more significant reductions in computational demands.

Conclusion

Overall, the paper presents a noteworthy advancement in LLM optimization, significantly lowering the barriers to deploying powerful LLMs across a broader range of hardware platforms. The method stands out for its efficacy, its efficiency, and its analysis of which network architectures are best suited to specific benchmark tasks.

References (18)
  1. LLaMA 2: Open foundation and fine-tuned chat models, 2023.
  2. LLaMA: Open and efficient foundation language models, 2023.
  3. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  4. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  5. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2023.
  6. AWQ: Activation-aware weight quantization for LLM compression and acceleration, 2024.
  7. GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2023.
  8. LLM-Pruner: On the structural pruning of large language models, 2023.
  9. SliceGPT: Compress large language models by deleting rows and columns, 2024.
  10. InstaTune: Instantaneous neural architecture search during fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1523–1527, 2023.
  11. A hardware-aware framework for accelerating neural architecture search across modalities, 2022.
  12. LoRA: Low-rank adaptation of large language models, 2021.
  13. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.
  14. Measuring massive multitask language understanding, 2021.
  15. TruthfulQA: Measuring how models mimic human falsehoods, 2022.
  16. WinoGrande: An adversarial Winograd Schema Challenge at scale, 2019.
  17. Adversarial filters of dataset biases. In International Conference on Machine Learning, pages 1078–1088. PMLR, 2020.
  18. Tim Dettmers. bitsandbytes [computer software]. https://github.com/TimDettmers/bitsandbytes, 2024.