- The paper introduces Elastic LoRA adapters, which dynamically adjust low-rank configurations via NAS to optimize model compression.
- It demonstrates empirical gains of up to 80% parameter reduction and up to 1.4x faster inference.
- The study offers scalable, resource-efficient solutions for fine-tuning large language models with minimal accuracy tradeoffs.
Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
The paper provides a comprehensive exploration of the synergies between Low-Rank Adaptation (LoRA) and Neural Architecture Search (NAS) for compressing large language models (LLMs) efficiently. The motivation stems from the growing computational demands of LLMs, which call for new strategies for fine-tuning and deploying them in resource-constrained environments.
Background
Structured low-rank representations have gained prominence in AI for their role in parameter-efficient fine-tuning of large pre-trained models. Among these techniques, Low-Rank Adaptation (LoRA) stands out: it augments frozen linear layers with small trainable low-rank adapters. NAS, in contrast, explores vast spaces of architectural configurations to identify high-performing ones, an approach that becomes computationally prohibitive as model sizes grow.
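To ground the discussion, here is a minimal sketch of a LoRA-style linear layer in PyTorch. The class name, initialization, and hyperparameters are illustrative assumptions rather than the paper's implementation; the point is only that the pretrained weight stays frozen while the small factors A and B are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank adapter.

    Output is W x + (alpha / r) * B (A x), where A is (r x in_features)
    and B is (out_features x r). Only A and B receive gradients.
    """

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)   # pretrained weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T
```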
This paper posits a bi-directional enhancement through cross-pollination between low-rank methods and NAS techniques. By combining these strategies, the research highlights the potential for developing robust, efficient solutions for fine-tuning large models.
Elastic LoRA Adapters
Central to this work is the introduction of Elastic LoRA Adapters, whose configuration can be adjusted dynamically. This adaptability broadens the search space of sub-adapter configurations, improving model compression without significantly sacrificing performance. The paper presents two modes of elastic adapters (sketched in code after the list):
- Mode A focuses on elasticity by varying the rank within the low-rank matrices themselves.
- Mode B achieves elasticity through adjustable input or output channels.
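A minimal sketch of how both modes might be realized on top of a standard LoRA layer is shown below. The class, the attribute names (`active_rank`, `active_out`), and the slicing strategy are assumptions made for illustration, not the paper's code: Mode A slices the rank dimension of the low-rank factors, while Mode B masks output channels of the low-rank update.

```python
import torch
import torch.nn as nn

class ElasticLoRALinear(nn.Module):
    """Illustrative elastic LoRA adapter (names and API are assumptions)."""

    def __init__(self, in_features, out_features, max_rank=32):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(max_rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, max_rank))
        self.active_rank = max_rank          # Mode A knob: effective rank
        self.active_out = out_features       # Mode B knob: active output channels

    def forward(self, x):
        r, c = self.active_rank, self.active_out
        A = self.lora_A[:r, :]               # Mode A: use only the first r rank components
        B = self.lora_B[:, :r]
        delta = (x @ A.T) @ B.T              # low-rank update, shape (..., out_features)
        if c < delta.shape[-1]:              # Mode B: zero out inactive output channels
            mask = torch.zeros_like(delta)
            mask[..., :c] = 1.0
            delta = delta * mask
        return self.base(x) + delta

# Example: activate a sub-adapter configuration during search
layer = ElasticLoRALinear(512, 512, max_rank=32)
layer.active_rank = 8      # Mode A: rank 8 instead of 32
layer.active_out = 384     # Mode B: restrict the update to the first 384 output channels
y = layer(torch.randn(4, 512))
```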
Synergistic Solutions
LoNAS
The LoNAS (Low-rank Adapter Search via NAS) algorithm aligns elastic adapters with the structure of the underlying model, reducing the total number of parameters while maintaining accuracy. The reported results show inference speedups of up to 1.4x and parameter reductions approaching 80%. This efficiency is enabled by heuristic sub-network evaluation, which allows candidate configurations to be assessed quickly without retraining each one; a sketch of such a search loop follows.
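One way to picture this heuristic evaluation is a weight-sharing search loop that samples sub-adapter configurations from a trained super-adapter and scores each on a small validation slice. The sketch below is an assumption about how such a loop could look; the `configure` and `evaluate` callables are hypothetical placeholders, not the paper's interface.

```python
import random

def search_subadapters(configure, evaluate, rank_choices, layer_names,
                       n_trials=50, tolerance=0.01):
    """Random search over per-layer adapter ranks with weight sharing.

    configure(config) activates a sub-adapter inside an already-trained
    super-adapter; evaluate() returns (proxy_score, active_param_count)
    on a small validation slice. Both are caller-supplied placeholders.
    """
    configure({name: max(rank_choices) for name in layer_names})
    full_score, _ = evaluate()                      # reference: maximal configuration
    best = None
    for _ in range(n_trials):
        config = {name: random.choice(rank_choices) for name in layer_names}
        configure(config)                           # activate the sampled sub-adapter
        score, n_params = evaluate()
        # keep the smallest sub-adapter whose proxy score stays near the reference
        if score >= full_score - tolerance and (best is None or n_params < best[1]):
            best = (config, n_params, score)
    return best

# Toy usage with a fake evaluator that just exercises the loop
state = {}
best = search_subadapters(
    configure=lambda cfg: state.update(cfg),
    evaluate=lambda: (1.0 - 0.001 * sum(state.values()), sum(state.values())),
    rank_choices=[4, 8, 16, 32],
    layer_names=["q_proj", "v_proj"],
)
print(best)
```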
Shears and SQFT
Shears refines LoNAS by restricting elasticity to the low-rank adapters, which is particularly beneficial for sparse models. It uses sparsity-aware techniques such as Wanda to prune the base model, with the elastic adapters then recovering much of the accuracy lost to sparsification. SQFT extends this line of work by addressing the limitations of merging adapters into base weights of differing precision or density. It employs the SparsePEFT and QA-SparsePEFT methodologies to keep sparsity or precision congruent between model weights and adapters during fine-tuning, so the merged model retains those properties; the idea behind sparsity-preserving merging is sketched below.
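The intuition behind sparsity-preserving merging can be illustrated with a short sketch: mask the dense low-rank update with the base weight's sparsity pattern so the merged matrix keeps the same zeros. This is a simplified illustration of the idea, not SQFT's actual SparsePEFT implementation, and the function name is hypothetical.

```python
import torch

def merge_adapter_preserving_sparsity(W_sparse, lora_A, lora_B, scaling=1.0):
    """Merge a low-rank update into a pruned base weight without densifying it.

    W_sparse: (out, in) pruned base weight containing exact zeros.
    lora_A:   (r, in) adapter down-projection.
    lora_B:   (out, r) adapter up-projection.
    """
    mask = (W_sparse != 0).to(W_sparse.dtype)      # 1 where the base weight survived pruning
    delta = scaling * (lora_B @ lora_A)            # dense low-rank update, shape (out, in)
    return W_sparse + delta * mask                 # merged weight keeps the zero pattern

# Example: 50%-sparse base weight merged with a rank-4 adapter
W = torch.randn(64, 64)
W[torch.rand_like(W) < 0.5] = 0.0
A, B = torch.randn(4, 64) * 0.01, torch.randn(64, 4) * 0.01
W_merged = merge_adapter_preserving_sparsity(W, A, B)
assert torch.all(W_merged[W == 0] == 0)            # pruned positions remain zero
```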
Experimental Validation
The paper supports these strategies with extensive performance evaluations. Notably, LoNAS and its successor methods demonstrate substantial parameter reductions and speedups over standard LoRA with minimal accuracy tradeoffs. Tables and comparative metrics substantiate these claims and highlight the practical value of integrating NAS with low-rank adapters.
Implications and Future Work
The research offers substantial contributions to the field of model compression and fine-tuning for LLMs, making these models more accessible outside high-computation environments. By strategically uniting low-rank methods and NAS, the paper emphasizes a path forward for designing scalable AI systems capable of operating efficiently on limited hardware.
Looking ahead, more sophisticated evolutionary search algorithms and their integration into NAS workflows present a promising avenue for future work. Researchers might also further refine adapter merging across diverse model architectures and sparsity formats, potentially broadening the applicability of these methods to a wider range of AI applications.