- The paper introduces Fed-MedLoRA and Fed-MedLoRA+ frameworks that use LoRA modules for efficient federated LLM training, achieving up to 65% F1 improvement.
- It demonstrates that adaptive client reweighting effectively enhances cross-site generalization and robustness in heterogeneous, non-IID clinical data.
- The approach reduces communication overhead by 98.5% and supports scalable deployment even on resource-constrained devices.
Federated and Parameter-Efficient LLM Training for Clinical NLP
Introduction and Motivation
Clinical NLP faces unique challenges due to the sensitive and heterogeneous nature of medical data. LLMs show strong performance on standard medical NLP tasks, but real-world deployment is constrained by institutional privacy requirements, regulatory barriers, and site-specific data heterogeneity. Most medical LLMs are fine-tuned on single-institution data, leaving critical deficits in cross-site generalization and safety. Federated learning (FL) enables collaborative model development without raw data exchange, but its application to multi-billion-parameter LLMs in medicine has been limited by prohibitive communication overhead and instability under data heterogeneity. Existing FL methods often require full-model exchange, which is infeasible for clinical institutions with limited compute and bandwidth.
Methodological Advances: Fed-MedLoRA and Fed-MedLoRA+
The authors introduce Fed-MedLoRA and Fed-MedLoRA+, two model-agnostic, parameter-efficient FL frameworks for LLM adaptation to medical NLP. Both frameworks utilize low-rank adaptation (LoRA) modules—a parameter-efficient fine-tuning technique where only trainable low-dimensional adapters are updated—substantially decreasing communication cost relative to classical full-model updates.
Fed-MedLoRA employs a direct extension of federated averaging (FedAvg), aggregating LoRA modules across clients (institutions). Fed-MedLoRA+, however, incorporates adaptive, data-aware aggregation: client contributions are reweighted based on local validation performance to mitigate instability due to divergent local data distributions. This adaptive weighting is critical for robust convergence in the presence of non-IID, heterogeneous clinical corpora.
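The adaptive aggregation step can be sketched as a validation-weighted average of the clients' LoRA matrices. This is a minimal illustration of the idea, not the authors' implementation: the exact weighting function and validation metric are assumptions here.

```python
import numpy as np

def aggregate_lora(client_updates, val_scores):
    """Validation-weighted averaging of LoRA adapter matrices.

    A hypothetical sketch of data-aware aggregation: each client's
    contribution is scaled by its normalized local validation score,
    so clients whose updates generalize poorly are down-weighted.
    client_updates: list of dicts mapping layer name -> (A, B) arrays,
    where A is d x r and B is r x d for a rank-r adapter.
    val_scores: per-client validation metrics (e.g. micro F1).
    """
    scores = np.asarray(val_scores, dtype=float)
    weights = scores / scores.sum()  # normalize to a convex combination
    aggregated = {}
    for name in client_updates[0]:
        A = sum(w * upd[name][0] for w, upd in zip(weights, client_updates))
        B = sum(w * upd[name][1] for w, upd in zip(weights, client_updates))
        aggregated[name] = (A, B)
    return aggregated
```

With uniform validation scores this reduces to plain FedAvg over the adapters, i.e. the Fed-MedLoRA baseline; the reweighting only takes effect when clients' local distributions (and hence validation scores) diverge.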
These methods are backbone-agnostic and were evaluated with recent open-weight LLMs (LLaMA-3 and DeepSeek-R1) across multiple clinical NLP datasets, providing rigorous head-to-head comparisons with zero-shot LLMs, fine-tuned models, domain-specific BERTs, GPT-4o, and state-of-the-art general-domain FL algorithms.
Evaluation Protocols and Experimental Setup
The evaluation pipeline targets two canonical clinical information extraction (IE) tasks: named entity recognition (NER) and relation extraction (RE). The authors draw on five independent cohorts (MIMIC-III, MTSamples, UTP, i2b2, Yale New Haven Health), encompassing substantial entity and relation diversity, and simulate a range of real-world cross-institutional validation scenarios:
- In-domain: Models trained and tested within the same institution—reflects standard practice but ignores realistic data shifts.
- External validation: Training on multiple sites and testing on unseen external cohorts—directly quantifies cross-site generalization.
- Low-resource adaptation: Minimal labeled data at a new site (YNHH), simulating a cold-start deployment scenario—tests the ability to bootstrap models in label-sparse environments.
Performance is quantified by strict/lenient micro F1 for NER/RE. The framework's robustness to incomplete label coverage, communication/memory requirements, and horizontal scalability (up to 10 clients) are also reported, reflecting operational clinical constraints.
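The strict variant of the micro-F1 metric can be made concrete with a short sketch. This is an illustrative implementation, not the paper's evaluation code: under strict matching, a predicted mention counts as a true positive only if its span boundaries and type exactly match a gold mention (lenient matching would instead credit overlapping spans).

```python
def micro_f1_strict(gold, pred):
    """Strict micro F1 over entity mentions, pooled across documents.

    Illustrative sketch: gold and pred are sets of
    (doc_id, start, end, entity_type) tuples, so a boundary error of
    even one token turns a match into both a false positive and a
    false negative -- the dominant error mode the paper reports.
    """
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```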
Strong Empirical Results and Counterintuitive Findings
Accuracy and Generalization
- Fed-MedLoRA+ improves zero-shot LLM performance by up to 65% absolute F1 and outperforms single-site fine-tuning by ~25%.
- For relation extraction—a challenging cross-entity task—Fed-MedLoRA+ exceeds domain-specific BERT by >40% absolute F1.
- Both federated methods deliver 10–70% F1 gains over baselines on external cohorts and achieve strict/lenient F1 of 73%/85% for new-site adaptation with limited labeled data.
- Accuracy remains competitive with hypothetical centralized training (upper bound)—Fed-MedLoRA+ typically incurs <2% F1 loss.
Efficiency and Feasibility
- Communication overhead is reduced by 98.5% versus full-model fine-tuning. For an 8B backbone, LoRA-only updates require ~1.25GB per round (vs. ~30GB for full model exchange).
- Training is feasible on a single RTX 4090 (24GB) for 8B models and on an RTX 3060 Ti for 1B models, with inference supported on commodity laptops.
- Scaling to 10 federated sites produces only moderate degradation (≤4% NER, ≤8% RE), unlike single-site models which degrade sharply as sites increase.
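The source of the communication savings is simple arithmetic: a rank-r LoRA adapter on a d x d weight adds only 2*d*r trainable parameters, so per-round uploads shrink from the full parameter count to the adapter parameters alone. The sketch below is a back-of-envelope illustration; the rank, number of adapted projections, and precision are assumptions, not figures from the paper.

```python
def lora_payload_gb(d_model, n_layers, n_proj, rank, bytes_per_param=2):
    """Per-round upload size for LoRA adapters, in GB.

    Illustrative arithmetic (hyperparameters assumed, not taken from
    the paper): each adapted square projection W gains A (d x r) and
    B (r x d), i.e. 2 * d_model * rank parameters per projection.
    """
    params = n_layers * n_proj * 2 * d_model * rank
    return params * bytes_per_param / 1e9

# Full-model exchange for an 8B backbone in fp16 is ~16 GB per round;
# rank-16 adapters on 4 projections per layer are orders of magnitude smaller.
full_gb = 8e9 * 2 / 1e9
lora_gb = lora_payload_gb(d_model=4096, n_layers=32, n_proj=4, rank=16)
```

Under these assumed settings the adapter payload is a small fraction of a gigabyte, consistent in spirit with the >98% reduction the paper reports (the paper's exact per-round figures depend on its own rank and precision choices).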
Robustness to Realistic Clinical Scenarios
- Fed-MedLoRA+ maintains strong performance even when sites supply uneven or incomplete task annotations. NER accuracy drops only 1–2%, and RE 3–6% compared to full annotation settings.
- The parameter-efficient framework allows for effective deployment even in resource-constrained environments.
Error Analysis
Manual review identifies boundary errors and false negatives as the dominant error modes, consistent with task complexity and strict evaluation protocols. Type confusions and merged/split entities also account for a significant fraction of errors, highlighting ongoing challenges in entity boundary detection and annotation consistency.
Implications, Practical and Theoretical
Federated, parameter-efficient LLM adaptation enables development of clinically accurate, privacy-preserving, and resource-feasible models deployable at scale. This work demonstrates that meaningful generalization and state-of-the-art performance are achievable without raw data sharing or massive compute infrastructure, if proper aggregation and communication-reduction techniques are adopted.
Contrary to clinical NLP orthodoxy, federated LLMs with LoRA fine-tuning consistently surpass both BERT-based and non-federated LLM pipelines in external validation and RE settings. The results also challenge the prior assumption that BERT-scale models are sufficient for clinical IE, especially in cross-site applications.
Fed-MedLoRA+ specifically shows that adaptive aggregation, grounded in per-client validation, is crucial for stable multi-institutional learning under realistic, non-IID medical data regimes.
When to Prefer LLMs Over BERT Models
Although LLMs require more compute and parameter storage than BERT-style models, the federated, LoRA-adapted approach renders them practical for distributed settings. The advantages of LLMs are most pronounced for:
- Cross-site generalization: BERTs exhibit poor out-of-domain performance, whereas LLMs with federated fine-tuning retain robustness.
- Complex extractive tasks (RE): LLMs close the performance gap for RE, which is notably challenging for BERTs.
- Multi-task configurations with incomplete label coverage: LLMs naturally support instruction-tuned multitasking, which is cumbersome for modular BERT pipelines.
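The multi-task point rests on instruction formatting: a single instruction-tuned backbone can serve NER and RE by switching prompts, whereas a BERT pipeline needs a separate task head per label schema. The templates below are hypothetical wording for illustration, not the paper's prompts.

```python
def build_prompt(task, text):
    """Hypothetical instruction templates showing how one LLM backbone
    can serve both IE tasks; wording is illustrative only.

    A site with only NER annotations simply never issues "re" prompts
    during local training -- no architectural change is needed, which is
    how incomplete label coverage is tolerated.
    """
    templates = {
        "ner": ("Extract all clinical entities (problem, treatment, test) "
                "from the note below.\nNote: {t}"),
        "re": ("List the relations between entity pairs "
               "(e.g. treatment-improves-problem) in the note below.\nNote: {t}"),
    }
    return templates[task].format(t=text)
```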
Limitations and Future Directions
Limitations include absence of explicit formal privacy mechanisms (e.g., DP, secure aggregation), focus on a single downstream clinical NLP task (IE), and lack of real-world prospective clinical deployment. Systematic evaluation of federated LLMs integrated with differential privacy or secure aggregation—and across a broader range of clinical and generative tasks—remains open. Prospective trials in operational health systems are necessary to validate real-world applicability.
Conclusion
This work provides the first systematic, practically viable framework for federated, parameter-efficient LLM adaptation to clinical NLP, resolving longstanding barriers in cross-institutional medical AI. Through rigorous benchmarks and practical deployment analysis, it demonstrates that privacy-preserving, resource-efficient, and generalizable medical LLMs are achievable by combining LoRA-based adaptation and adaptive federated optimization. Future directions include integrating explicit privacy guarantees, automated client-task allocation, and expansion to diverse clinical downstream tasks.
Reference:
"A Federated and Parameter-Efficient Framework for LLM Training in Medicine" (2601.22124)