A Method for Pareto-Optimal Network Architectures in LLMs
The paper by Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, and Sairam Sundaresan addresses the computational and memory challenges associated with deploying LLMs, such as LLaMA2-7B, on hardware platforms other than high-end GPUs. The authors propose a novel method utilizing one-shot Neural Architecture Search (NAS) to find Pareto-optimal network architectures that maintain performance while reducing size and complexity. This method leverages evolutionary search techniques to discover efficient sub-networks without requiring extensive re-training, thus optimizing both computational efficiency and model performance.
Key Contributions
The authors list several significant contributions:
- Application of One-Shot NAS for LLMs: To the authors' knowledge, this is the first application of one-shot NAS to efficiently reduce the size and computational complexity of LLMs. They demonstrate that for certain standard benchmark tasks, LLaMA2-7B is larger than necessary.
- Superiority over Pruning and Sparsification: The proposed method outperforms traditional pruning and sparsification techniques without needing additional recovery fine-tuning.
- Parameter Analysis: A thorough analysis of network parameters reveals that no single set of architectural heuristics applies universally across multiple standard benchmark tasks.
- Generalizability: The framework produces compressed LLMs usable "out-of-the-box" without specialized software kernels or hardware. These networks can be further compressed via standard quantization techniques.
Methods
The proposed method adapts the InstaTune paradigm for NAS, which integrates the architecture search into the fine-tuning phase to conserve computational resources. Specifically, the authors fine-tuned LLaMA2-7B on the Alpaca dataset following InstaTune, but unlike InstaTune they did not employ a strong teacher or knowledge distillation, instead relying on an evolutionary search framework (LINAS) to refine the architecture in a multi-objective setting that optimizes for model size and accuracy.
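The LINAS internals and the sub-network encoding are not spelled out above, so the sketch below only illustrates the general idea: an evolutionary loop that mutates candidate sub-network configurations and keeps the non-dominated (Pareto-optimal) points of the size/accuracy trade-off. The search space, mutation rule, and scoring function are toy placeholders that loosely echo LLaMA2-7B dimensions; they are not the paper's actual search space or LINAS implementation.

```python
# Toy multi-objective evolutionary search over sub-network configurations.
# Everything below (search space, mutation, scoring) is a placeholder sketch.
import random

SEARCH_SPACE = {
    "num_layers": list(range(24, 33)),         # hypothetical: keep 24-32 of 32 layers
    "intermediate_size": [5504, 8192, 11008],  # hypothetical FFN width choices
}

def random_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(cfg):
    child = dict(cfg)
    key = random.choice(list(SEARCH_SPACE))
    child[key] = random.choice(SEARCH_SPACE[key])
    return child

def evaluate(cfg):
    # Placeholder objectives; a real run would slice this sub-network out of the
    # fine-tuned super-network and score it on the target benchmark.
    size = cfg["num_layers"] * cfg["intermediate_size"]                  # crude size proxy
    accuracy = 0.5 + 0.2 * size / (32 * 11008) + 0.02 * random.random()  # fake score
    return size, accuracy

def dominates(a, b):
    # a dominates b: no larger, no less accurate, and strictly better in one objective
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

def pareto_front(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

population = [random_config() for _ in range(50)]   # population size reported in the paper
archive = []                                        # (size, accuracy, config) tuples
for _ in range(250):                                # iteration count reported in the paper
    scored = [(*evaluate(cfg), cfg) for cfg in population]
    archive = pareto_front(archive + scored)
    parents = [cfg for _, _, cfg in archive]        # breed from Pareto-optimal survivors
    population = [mutate(random.choice(parents)) for _ in range(50)]

for size, acc, cfg in sorted(archive, key=lambda t: t[0]):
    print(f"size proxy {size:>7}  score {acc:.3f}  {cfg}")
```

In a real run, the `evaluate` step is the expensive part: the candidate sub-network is extracted from the fine-tuned super-network and scored on the benchmark of interest, which is exactly the per-candidate training cost the one-shot setup avoids.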
Evaluation
Hyper-Parameters: The model was fine-tuned for six epochs with an initial learning rate of and a global batch size of 128. The LINAS algorithm was used with a population size of 50 and 250 iterations per task.
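As a rough illustration of how such a fine-tuning run might be configured, the sketch below mirrors the two reported values (six epochs, global batch size 128) using Hugging Face TrainingArguments. The output directory, per-device batch size, and accumulation steps are assumptions, the learning rate is left at the library default because the exact value is not reproduced above, and the Alpaca data pipeline and model loading are omitted.

```python
# Hypothetical fine-tuning configuration sketch; not the authors' training stack.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama2-7b-alpaca-supernet",  # hypothetical path
    num_train_epochs=6,                      # reported value
    per_device_train_batch_size=8,           # assumption
    gradient_accumulation_steps=16,          # 8 * 16 = 128 effective batch size on one device
)

print(args.num_train_epochs,
      args.per_device_train_batch_size * args.gradient_accumulation_steps)
```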
Tasks: The method was evaluated on several common LLM benchmarks (an illustrative way to run them is sketched after the list):
- AI2 Reasoning Challenge (ARC): Tasks with different complexity levels (Easy and Challenge).
- Massive Multitask Language Understanding (MMLU): To measure knowledge acquisition across a variety of subjects.
- TruthfulQA: To assess the truthfulness of model-generated responses.
- WinoGrande: For commonsense reasoning, focusing on pronoun resolution challenges.
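The summary does not say which evaluation tooling the authors used. A common way to score a Hugging Face checkpoint on exactly these benchmarks is EleutherAI's lm-evaluation-harness; the snippet below is a sketch assuming its v0.4+ Python API, with an illustrative model identifier standing in for a searched sub-network.

```python
# Hypothetical benchmark evaluation via lm-evaluation-harness (v0.4+ API assumed);
# the paper's exact evaluation tooling is not specified in this summary.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # or a searched sub-network checkpoint
    tasks=["arc_easy", "arc_challenge", "mmlu", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```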
Results
The findings highlighted several Pareto-optimal sub-networks that exhibit significant size reductions and throughput improvements with minimal or no drop in accuracy. Notably:
- For ARC, specific sub-networks were 1.1x smaller while matching the accuracy of the pre-trained LLaMA2-7B model.
- On the MMLU benchmark, certain sub-networks achieved a 1.5x reduction in model size together with a 1.3x speedup in inference throughput.
- For TruthfulQA, sub-networks achieved a 3.6% increase in accuracy while being 1.6x smaller.
- For WinoGrande, sub-networks maintained accuracy with a 1.1x reduction in size.
Comparative Performance
The method outperformed contemporary pruning and sparsification techniques (LLM-Pruner and SliceGPT) in both computational efficiency and the accuracy of the resulting compressed models. Importantly, it did not require the recovery fine-tuning those techniques generally need.
Quantization
The authors further explored quantization, applying INT8 fixed-point quantization to the Pareto-optimal sub-networks. This resulted in considerable additional reductions in model size without compromising accuracy, thus enabling deployment on more modest hardware configurations.
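The summary does not name the quantization toolchain. As a generic illustration of INT8 weight quantization, the sketch below applies PyTorch's dynamic quantization to a small stand-in module whose dimensions loosely echo a LLaMA2-7B feed-forward block; it is not the authors' pipeline or a real sub-network.

```python
# Generic INT8 dynamic-quantization sketch with PyTorch; runs self-contained on CPU.
import io
import torch
import torch.nn as nn

model = nn.Sequential(            # stand-in for a searched sub-network
    nn.Linear(4096, 11008),
    nn.GELU(),
    nn.Linear(11008, 4096),
)

# Store Linear weights in INT8; activations are quantized dynamically at inference time.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def serialized_mb(m: nn.Module) -> float:
    """Approximate on-disk size by serializing the state dict."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 : {serialized_mb(model):.1f} MB")
print(f"int8 : {serialized_mb(quantized):.1f} MB")
print(quantized(torch.randn(1, 4096)).shape)  # forward pass still works after quantization
```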
Implications and Future Work
This research offers a compelling approach to making LLMs more accessible by facilitating deployment on less expensive and more widely available hardware platforms. The method has clear practical value for deploying capable LLMs in environments with limited computational resources.
Future work could extend the exploration of automated NAS methodologies to other types of LLMs and tasks, aiming to generalize the findings further. Additionally, integrating these approaches with other model compression techniques could offer even more significant reductions in computational demands.
Conclusion
Overall, the paper presents a noteworthy advancement in LLM optimization, significantly lowering the barriers to deploying powerful LLMs across a broader range of hardware platforms. The method stands out for its efficacy, its efficiency, and its thorough analysis of which network architectures suit specific benchmark tasks.