DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models (2410.01497v1)

Published 2 Oct 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Recent advancements in LLMs have achieved robust performance across diverse tasks, but fine-tuning these models for specific domains remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) address this challenge by fine-tuning a small subset of parameters. However, existing methods for fusing multiple LoRAs lack dynamic fusion based on contextual inputs and often increase inference time due to token-level operations. We propose DLP-LoRA, a Dynamic Lightweight Plugin that employs a mini-MLP module with only 5M parameters to dynamically fuse multiple LoRAs at the sentence level using top-p sampling strategies. This approach reduces inference time to less than twice that of single LoRA inference by leveraging parallel computation. Evaluations across 26 tasks, including multiple-choice questions and question answering, demonstrate that DLP-LoRA achieves an average accuracy of 92.34% on multiple-choice datasets and significant improvements in BLEU and ROUGE scores on QA datasets, outperforming different LLM backbones under composite task settings. DLP-LoRA effectively balances performance and efficiency, making it a practical solution for dynamic multi-task adaptation in LLMs. Our code is available at https://github.com/MeCuping/DLP-LoRA.

Efficient Task-Specific LoRA Fusion with DLP-LoRA: A Formal Overview

Introduction

Recent advancements in LLMs such as LLaMA 3.1, Qwen 2.5, and Gemma 2 have shown remarkable performance across various domains. These models excel in tasks including code generation, mathematical reasoning, and question answering. However, fine-tuning these models for specific tasks remains a challenging and resource-intensive process. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a promising solution by permitting modifications to a smaller subset of parameters, thus reducing the computational burden.

The DLP-LoRA Approach

The paper introduces DLP-LoRA, a novel approach that employs a Dynamic Lightweight Plugin to enhance multi-task learning by dynamically fusing multiple LoRAs. The objective is to maintain high performance without significantly increasing inference time. The proposed DLP-LoRA framework leverages a mini-MLP module with only 5M parameters to dynamically fuse multiple LoRAs at the sentence level using top-p sampling strategies. This design choice allows for reduced inference time due to effective parallel computation while maintaining robust task-specific adaptations.

Methodology

DLP-LoRA integrates three main components: a lightweight mini-MLP plugin, a base LLM backbone, and a set of fine-tuned LoRA modules. Initially, the mini-MLP classifier is trained on selected tasks to achieve high classification accuracy. This classifier then acts as a plugin for dynamically fusing fine-tuned LoRAs based on the contextual inputs at the sentence level.

Lightweight Multi-task Classification Plugin

The use of a 4-layer mini-MLP plugin for sentence-level task detection improves efficiency over token-level gating methods. Because the plugin is trained on the same samples used to fine-tune the individual LoRAs, it provides accurate task classification with minimal computational overhead.
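A minimal sketch of such a sentence-level router is given below; the pooling scheme, layer widths, and the name `MiniMLPRouter` are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MiniMLPRouter(nn.Module):
    """Sentence-level task classifier: maps a pooled sentence embedding
    to a probability distribution over the available LoRA adapters."""

    def __init__(self, hidden_size: int, num_loras: int, mlp_dim: int = 256):
        super().__init__()
        # A small 4-layer MLP; widths are illustrative, chosen to keep
        # the parameter count in the low millions.
        self.net = nn.Sequential(
            nn.Linear(hidden_size, mlp_dim), nn.ReLU(),
            nn.Linear(mlp_dim, mlp_dim), nn.ReLU(),
            nn.Linear(mlp_dim, mlp_dim), nn.ReLU(),
            nn.Linear(mlp_dim, num_loras),
        )

    def forward(self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings into one sentence vector (assumed pooling scheme).
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        logits = self.net(pooled)             # (batch, num_loras)
        return torch.softmax(logits, dim=-1)  # per-task probabilities
```

The classifier is invoked once per sentence, so its cost is amortized over the whole generation rather than paid at every decoding step.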

Dynamic LoRA Fusion

The framework employs top-p sampling for selecting and fusing the LoRAs based on the initial token and the previous context. This method significantly accelerates the inference process compared to token-level gating networks by avoiding unnecessary per-token classifications.
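One way to realize this selection step is sketched below; it assumes the router's probabilities are thresholded by cumulative mass and then renormalized into fusion weights, which is a plausible reading of the description rather than the paper's verbatim implementation.

```python
import torch

def top_p_select(task_probs: torch.Tensor, p: float = 0.9):
    """Keep the smallest set of LoRAs whose cumulative probability exceeds p,
    then renormalize their weights. task_probs: (num_loras,), summing to 1."""
    sorted_probs, sorted_idx = torch.sort(task_probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    # Always keep the top adapter; cut off once the cumulative mass passes p.
    keep = cumulative - sorted_probs < p
    selected_idx = sorted_idx[keep]
    weights = task_probs[selected_idx]
    return selected_idx, weights / weights.sum()
```

Because selection happens once per sentence rather than once per token, the decoding loop itself is left untouched.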

Parallel Multi-LoRA Acceleration

The parallel computation capabilities of DLP-LoRA are enhanced by leveraging contiguous memory allocations and general matrix multiplication (GEMM) optimizations. This approach ensures that the added computational complexity of handling multiple LoRAs does not translate to proportional increases in inference time.
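The sketch below illustrates how several selected LoRA updates can be applied together through batched matrix multiplications over contiguous weight tensors; the shapes and function name are assumptions for illustration, not the repository's actual API.

```python
import torch

def fused_lora_delta(x: torch.Tensor,
                     lora_A: torch.Tensor,        # (num_loras, in_dim, rank)
                     lora_B: torch.Tensor,        # (num_loras, rank, out_dim)
                     selected_idx: torch.Tensor,  # (k,) adapters chosen by top-p
                     weights: torch.Tensor) -> torch.Tensor:
    """Apply the weighted sum of the selected LoRA updates to activations
    x of shape (batch, seq, in_dim) using batched matmuls, avoiding a
    Python-level loop over adapters."""
    A = lora_A[selected_idx]  # (k, in_dim, rank), contiguous slice
    B = lora_B[selected_idx]  # (k, rank, out_dim)
    # einsum lowers to GEMM-style kernels; all k selected adapters run together.
    low_rank = torch.einsum('bsi,kir->kbsr', x, A)
    delta = torch.einsum('kbsr,kro->kbso', low_rank, B)
    return torch.einsum('k,kbso->bso', weights, delta)
```

Keeping all adapter matrices stacked in contiguous tensors lets the slice for the active adapters feed directly into GEMM kernels, which is one way the overhead can grow well below linearly in the number of fused LoRAs.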

Experimental Evaluation

Detailed evaluations were conducted across 26 tasks, including 17 multiple-choice question (MCQ) datasets and 9 question-answering (QA) datasets. The results demonstrate that DLP-LoRA achieves an average accuracy of 92.34% on multiple-choice datasets and significant improvements in BLEU and ROUGE scores on QA datasets. Evaluations using Qwen-2 1.5B, Qwen-2 7B, LLaMA-2 7B, and LLaMA-3 8B backbones show that DLP-LoRA often matches or exceeds the performance of single-task LoRA models.

Results

Multi-task Composite Performance

Under composite task settings, DLP-LoRA achieves substantial performance improvements over baseline models, with an average relative accuracy improvement of 92.95% for MCQ tasks and notable enhancements in BLEU, ROUGE-1, and ROUGE-L scores for QA tasks.

Inference Time Efficiency

DLP-LoRA using the mini-MLP plugin achieves an inference time increase of just 18.19% on average over single LoRA models, validating its efficiency. Additionally, a smaller LLM backbone equipped with DLP-LoRA demonstrates the potential to outperform much larger, unadapted LLM backbones in both performance and inference speed.

Discussion

The use of top-p sampling for LoRA selection addresses the limitations of manual top-k selection methods. This adaptive approach not only enhances performance but also provides a flexible and efficient mechanism for multi-task learning.

Conclusion

DLP-LoRA represents an effective and efficient solution for dynamic multi-task adaptation in LLMs. By combining an easily trainable mini-MLP plugin with parallel multi-LoRA fusion strategies, it balances performance and efficiency, paving the way for practical applications in resource-constrained environments.

The paper presents a compelling case for the utility of DLP-LoRA in fine-tuning LLMs dynamically. Future research could extend this approach to larger models and further explore the trade-offs between model size, performance, and computational efficiency.

Authors (2)
  1. Yuxuan Zhang
  2. Ruizhe Li