Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection (2506.00743v1)

Published 31 May 2025 in cs.CL, cs.AI, and cs.DC

Abstract: Parameter Efficient Fine-Tuning (PEFT) has become the de-facto approach in adapting LLMs for downstream tasks in Natural Language Processing. However, its adoption in privacy-preserving distributed learning frameworks, such as Federated Learning (FL), remains relatively limited. This is mainly due to challenges specific to FL, such as resource-constrained devices and diverse data distributions among clients. In this paper, we propose an efficient method to perform PEFT within the FL framework for Multi-Head Attention (MHA) based LLMs. We address the challenges through head pruning, a novel head-specific weighted aggregation mechanism, and a client selection strategy. Head pruning minimizes training complexity within the clients, guided by the importance score computed based on the confidence of the attention head. Weighted aggregation of heads ensures the global model captures crucial updates from diverse clients complementing our client selection strategy. We show results on the MultiNLI benchmark along with 20 Newsgroups, XL-Sum, and E2E NLG datasets. We use the MultiNLI dataset and T5-small model with LoRA as our PEFT method, attaining sparsity levels of up to 90%, resulting in a communication advantage of up to 1.8x and a reduction in training OPs of 3.9x while maintaining the accuracy drop under 2%.

Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection

This paper addresses the challenges associated with Parameter Efficient Fine-Tuning (PEFT) of LLMs within the context of Federated Learning (FL). It specifically focuses on the limitations posed by resource-constrained devices and diverse data distributions among clients, which impede the effective implementation of PEFT in privacy-preserving distributed frameworks. The authors propose a novel method for enhancing PEFT performance within FL by leveraging head pruning, weighted head-specific aggregation, and strategic client selection techniques.

Multi-Head Attention (MHA) mechanisms are integral to transformer-based architectures, offering a structured means of processing intricate textual details. However, redundancy among attention heads means many of them can be pruned without hurting model accuracy. The authors exploit this property by pruning heads according to importance scores derived from attention confidence. This substantially reduces training complexity on individual clients while keeping the accuracy drop under 2%, as demonstrated on the MultiNLI benchmark.
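The paper's exact scoring rule is not reproduced here, but a minimal sketch of the idea, assuming confidence is measured by how peaked each head's attention distribution is, could look like the following (the function names and the sparsity interface are illustrative, not the authors' implementation):

```python
import torch

def head_importance_from_confidence(attn_probs: torch.Tensor) -> torch.Tensor:
    """Score each head by its attention confidence.

    attn_probs: (batch, num_heads, query_len, key_len) softmax attention weights.
    Confidence is taken here as the peak attention probability per query,
    averaged over queries and the batch (an assumption, not the paper's exact rule).
    """
    peak = attn_probs.max(dim=-1).values      # (batch, num_heads, query_len)
    return peak.mean(dim=(0, 2))              # (num_heads,)

def prune_heads_by_importance(importance: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the top (1 - sparsity) fraction of heads; return a boolean keep-mask."""
    num_heads = importance.numel()
    num_keep = max(1, int(round((1.0 - sparsity) * num_heads)))
    mask = torch.zeros(num_heads, dtype=torch.bool)
    mask[importance.topk(num_keep).indices] = True
    return mask
```

At 90% sparsity, a client would train and communicate adapters only for the heads where this mask is True, which is where the reported communication and compute savings come from.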

The numerical results underline the success of the head pruning strategy: it achieves sparsity levels of up to 90%, which translates to a communication advantage of up to 1.8x and a 3.9x reduction in training operations compared to training fully dense models with standard FedAvg.

Another key contribution is the work's approach to model aggregation in FL. Using a head-specific weighted aggregation mechanism, the server emphasizes the most significant updates by weighting each head's contribution by its importance score across diverse client data distributions. In addition, a client selection strategy driven by the loss difference between the global and local models prioritizes the clients whose updates are expected to have the greatest impact, further streamlining training.
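The sketch below illustrates how importance-weighted, per-head aggregation and loss-gap-based client selection could fit together. The dictionary-based update format, the weight normalization, and the sign convention of the loss gap are assumptions for illustration, not the paper's exact procedure:

```python
import torch

def aggregate_head_updates(client_updates, client_head_scores):
    """Importance-weighted aggregation of per-head adapter updates.

    client_updates:     list of dicts mapping head_id -> update tensor
    client_head_scores: list of dicts mapping head_id -> importance score
    """
    aggregated = {}
    for head_id in {h for upd in client_updates for h in upd}:
        updates, weights = [], []
        for upd, scores in zip(client_updates, client_head_scores):
            if head_id in upd:                        # this client trained this head
                updates.append(upd[head_id])
                weights.append(scores[head_id])
        w = torch.tensor(weights, dtype=torch.float32)
        w = w / w.sum()                               # normalize importance weights
        aggregated[head_id] = sum(wi * ui for wi, ui in zip(w, updates))
    return aggregated

def select_clients(global_losses, local_losses, num_select):
    """Rank clients by the gap between global-model and local-model loss on their
    own data and pick the largest gaps (assumed here as the impact proxy)."""
    gaps = [g - l for g, l in zip(global_losses, local_losses)]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:num_select]
```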

The paper demonstrates the robustness of the proposed method across multiple datasets, including 20 Newsgroups, XL-Sum, and E2E NLG. The authors employ Low-Rank Adapters (LoRA) as the PEFT strategy, and the approach generalizes to other PEFT methods provided their trainable parameters can be aligned with specific MHA heads.
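To make that alignment concrete, one possible arrangement, assuming the LoRA pairs are partitioned per attention head so that a pruned head contributes no trainable or communicated parameters, is sketched below (the class name, shapes, and initialization are illustrative):

```python
import torch
import torch.nn as nn

class HeadwiseLoRA(nn.Module):
    """LoRA adapters split per attention head: pruning a head drops its (A, B)
    pair from local training and from the update sent to the server."""

    def __init__(self, d_model: int, num_heads: int, rank: int, head_mask: torch.Tensor):
        super().__init__()
        self.head_dim = d_model // num_heads
        self.head_mask = head_mask                    # boolean keep-mask from pruning
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d_model) * 0.01) for _ in range(num_heads)]
        )
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(self.head_dim, rank)) for _ in range(num_heads)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> per-head low-rank updates, concatenated
        outs = []
        for h, (A, B) in enumerate(zip(self.A, self.B)):
            if self.head_mask[h]:
                outs.append(x @ A.t() @ B.t())        # (batch, seq, head_dim)
            else:
                outs.append(x.new_zeros(*x.shape[:-1], self.head_dim))
        return torch.cat(outs, dim=-1)                # (batch, seq, d_model)
```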

This research offers practical insights into efficiently tuning LLMs in federated settings and a foundation for further work on resource-efficient model training. Possible future directions include extending the pruning methodology to other transformer components and improving efficiency in FL settings beyond language models, for example with vision transformers.

In summary, this paper offers a compelling strategy for scaling PEFT within FL frameworks under tight resource constraints and contributes meaningfully to the discussion on optimizing distributed learning systems.

Authors (3)
  1. Yeshwanth Venkatesha (15 papers)
  2. Souvik Kundu (76 papers)
  3. Priyadarshini Panda (104 papers)