Hierarchical Federated Foundation Models
- HF-FMs are a modular and hierarchical framework that combines large-scale foundation models with decentralized federated learning for resource-aware, multi-modal intelligence.
- They employ vertical aggregation at device, edge, and cloud levels along with horizontal device-to-device (D2D) relaying to address heterogeneity in modalities and tasks.
- Prototype evaluations show that HF-FMs reduce latency and energy consumption while maintaining competitive test accuracy in diverse, real-world networks.
Hierarchical Federated Foundation Models (HF-FMs) unify the generalization strength of large-scale foundation models (FMs) with the privacy and scalability benefits of decentralized, multi-level federated learning. Addressing the increasing heterogeneity of wireless, edge, and fog-based environments, HF-FMs strategically map the modular components of modern multi-modal, multi-task FMs to the hierarchical structure of these networks, enabling adaptive, resource-aware, and communication-efficient collaborative intelligence across widely distributed and diverse nodes (Abdisarabshali et al., 3 Sep 2025).
1. Modular and Hierarchical Architecture
HF-FMs build upon the compositional design of M3T (multi-modal multi-task) FMs. The models comprise distinct modules, including modality-specific encoders (text, image, audio), prompts (learnable tokens injected at designated network positions), Mixture-of-Experts (MoE) blocks (specialist subnetworks activated on demand), lightweight adapters (often parameter-efficient fine-tuning layers such as LoRA), and task heads for downstream inference (Abdisarabshali et al., 3 Sep 2025).
The mapping of these modules onto a hierarchical fog/edge infrastructure is central:
- Device/Edge Level: Lightweight modules (e.g., adapters, task heads, prompts) are trained and aggregated locally at decentralized edge devices for rapid personalization and efficient local adaptation.
- Edge/Fog/Cloud Servers: Computationally expensive modules (e.g., backbone networks, MoE experts) are shared and aggregated at higher-tier servers, enabling coordinated global or group-level learning without overburdening resource-constrained devices.
- Device-to-Device (D2D) Relaying: Modules may be relayed horizontally across co-located devices for localized cooperative adaptation, mitigating global communication bottlenecks.
This aggregation process can be recursively defined. Letting $\theta_{m,n}^{(\ell)}$ denote the parameters of module $m$ at node $n$ of tier $\ell$, the module parameters at tier $\ell+1$ are

$$\theta_m^{(\ell+1)} = \mathcal{A}_\ell\big(\{\theta_{m,n}^{(\ell)}\}_{n \in \mathcal{N}_\ell}\big),$$

where $\mathcal{A}_\ell$ denotes a local aggregation operator such as weighted averaging (e.g., FedAvg) over the participating node set $\mathcal{N}_\ell$ at tier $\ell$ (Abdisarabshali et al., 3 Sep 2025).
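To make the recursion concrete, the following minimal Python sketch applies module-wise FedAvg bottom-up over a three-tier hierarchy; it assumes PyTorch-style state dicts, and the helper names and data layout are illustrative rather than taken from the released code.

```python
from typing import Dict, List

import torch

def fedavg(states: List[Dict[str, torch.Tensor]],
           weights: List[float]) -> Dict[str, torch.Tensor]:
    """Weighted average of per-node module state dicts (classical FedAvg)."""
    total = sum(weights)
    return {
        key: sum(w * s[key] for w, s in zip(weights, states)) / total
        for key in states[0]
    }

def hierarchical_aggregate(edge_groups: List[List[dict]]) -> Dict[str, torch.Tensor]:
    """Bottom-up pass: device -> edge server -> cloud.

    Each member dict holds a module 'state' and its 'num_samples';
    each inner list is the set of devices under one edge server.
    """
    edge_states, edge_weights = [], []
    for members in edge_groups:                      # tier 1 -> tier 2
        states = [m["state"] for m in members]
        weights = [float(m["num_samples"]) for m in members]
        edge_states.append(fedavg(states, weights))
        edge_weights.append(sum(weights))
    return fedavg(edge_states, edge_weights)         # tier 2 -> tier 3
```

Because each module is aggregated independently, the same operator applies whether the unit is a prompt, an adapter, or a full encoder.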
2. Managing Heterogeneity and Module Selection
HF-FMs explicitly address two overlooked heterogeneity dimensions in fog/edge networks:
- Modality Heterogeneity: Nodes differ in sensing capabilities and locally available input types; one node may possess only an image encoder and a corresponding downstream task, while another may process audio or multi-modal inputs.
- Task Heterogeneity: Nodes execute distinct tasks (e.g., video analytics, anomaly detection, speech recognition) with varying data distributions and task-specific heads.
Modules are selectively trained and aggregated based on locally available modalities and deployed tasks, supporting non-uniform, dynamic module activation and aggregation throughout the hierarchy. Personalized fine-tuning is thus context-aware: nodes may ignore irrelevant modules, reducing unnecessary computation and communication (Abdisarabshali et al., 3 Sep 2025).
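One plausible way to realize this selectivity is a capability-driven module filter, sketched below; the profile fields and module keys are hypothetical examples, not identifiers from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class NodeProfile:
    modalities: Set[str]   # e.g., {"image"} or {"image", "audio"}
    tasks: Set[str]        # e.g., {"vqa"} or {"speech_recognition"}

# Illustrative mapping from FM module keys to the capability each requires.
MODULE_REQUIREMENTS = {
    "encoder.image": ("modality", "image"),
    "encoder.audio": ("modality", "audio"),
    "head.vqa": ("task", "vqa"),
    "head.speech": ("task", "speech_recognition"),
    "adapter.lora": None,   # lightweight shared module, always active
}

def active_modules(profile: NodeProfile) -> List[str]:
    """Select only the modules this node can train, given its sensors/tasks."""
    selected = []
    for name, req in MODULE_REQUIREMENTS.items():
        if req is None:
            selected.append(name)
        elif req[0] == "modality" and req[1] in profile.modalities:
            selected.append(name)
        elif req[0] == "task" and req[1] in profile.tasks:
            selected.append(name)
    return selected

# An image-only VQA node trains and uploads just three modules:
print(active_modules(NodeProfile({"image"}, {"vqa"})))
# -> ['encoder.image', 'head.vqa', 'adapter.lora']
```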
3. Horizontal Relaying and Cooperative Local Training
In addition to the classical hierarchical (vertical) aggregation, HF-FMs enable device-to-device (D2D) communication for horizontal relaying:
- Cluster Formation: Devices in proximity form ad hoc clusters.
- Intra-Cluster Aggregation: A randomly selected cluster head receives model modules from cluster members via low-latency D2D links, aggregates them (e.g., by averaging updated parameters), and relays the aggregated module upward to the next-higher tier.
- Benefits: This reduces communication latency and overall energy consumption compared to a star topology, enhances convergence by sharing localized knowledge, and permits “localized” module evolution in location-sensitive environments (Abdisarabshali et al., 3 Sep 2025).
Horizontal aggregation thus complements vertical aggregation in the overall HF-FM system, enabling both rapid adaptation to local conditions and synchronization of specialized module knowledge across the network.
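A minimal sketch of this two-stage relay follows, reusing the `fedavg` helper from the Section 1 sketch; the function and field names are again illustrative, not the paper's API.

```python
import random

def d2d_relay_round(clusters, upload_to_edge):
    """One horizontal round: intra-cluster D2D merge, then a single uplink.

    `clusters` is a list of clusters; each member dict carries a module
    'state' and its 'num_samples'. `upload_to_edge` stands in for the
    tier-2 uplink; `fedavg` is the helper defined in the Section 1 sketch.
    """
    for members in clusters:
        head = random.choice(members)            # randomly selected cluster head
        states = [m["state"] for m in members]
        weights = [float(m["num_samples"]) for m in members]
        merged = fedavg(states, weights)         # low-latency D2D merge
        upload_to_edge(head, merged)             # one uplink instead of N
```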
4. Prototype Evaluation and Empirical Results
A prototype three-tier fog/edge network consisting of 40 edge devices (tier 1), 10 edge servers (tier 2), and one cloud server (tier 3) was implemented. Edge nodes were grouped into clusters and trained on heterogeneous visual question answering tasks involving diverse modalities and labels (Abdisarabshali et al., 3 Sep 2025).
Key findings include:
- Latency and Energy Efficiency: Compared to “star” topology federated foundation models, HF-FMs (with edge-level aggregation and D2D cluster relaying) achieve lower communication latency and improved energy efficiency.
- Aggregation Frequency: An intermediate edge aggregation frequency (e.g., every two local rounds) yields the best trade-off between rapid local adaptation and global synchronization (see the schedule sketch after this list).
- Test Accuracy: The HF-FM model delivers competitive or superior accuracy compared to conventional federated models, despite increased local and global variability in modalities, tasks, and datasets.
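The multi-timescale schedule behind the aggregation-frequency finding can be expressed as a short loop; the periods below and the node interfaces (`local_update`, `aggregate_members`) are illustrative stand-ins, not the paper's tuned configuration.

```python
K_EDGE = 2    # edge aggregation every 2 local rounds (the reported sweet spot)
K_CLOUD = 5   # cloud aggregation every 5 edge aggregations (illustrative)

def train(num_rounds, devices, edge_servers, cloud):
    """Nested timescales: local training -> edge sync -> cloud sync."""
    for t in range(1, num_rounds + 1):
        for dev in devices:
            dev.local_update()                   # one local training round
        if t % K_EDGE == 0:
            for es in edge_servers:
                es.aggregate_members()           # tier-1 -> tier-2 sync
        if t % (K_EDGE * K_CLOUD) == 0:
            cloud.aggregate(edge_servers)        # tier-2 -> tier-3 sync
```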
The prototype and open-source code released at https://github.com/payamsiabd/M3T-FFM enable further exploration and benchmarking.
5. Theoretical Modeling and Communication-Efficient Aggregation
Each module's aggregation at tier $\ell$ over node set $\mathcal{N}_\ell$ can be represented as

$$\theta_m^{(\ell+1)} = \sum_{n \in \mathcal{N}_\ell} w_n\, \theta_{m,n}^{(\ell)},$$

where the weight $w_n$ is proportional to local dataset size or communication reliability. Asynchronous, selective, and module-wise aggregation is critical to the scalability of HF-FMs. The approach generalizes classical hierarchical federated averaging to arbitrary module graphs and supports both vertical (across tiers) and horizontal (within clusters) aggregation (Abdisarabshali et al., 3 Sep 2025).
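For concreteness, a worked instance with dataset-size weighting (the sample counts are invented for illustration):

```latex
% Two devices hold 300 and 100 local samples, so
% w_1 = 300/400 = 0.75 and w_2 = 100/400 = 0.25, giving
\theta_m^{(\ell+1)} = 0.75\,\theta_{m,1}^{(\ell)} + 0.25\,\theta_{m,2}^{(\ell)}
```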
Parameter-efficient tuning (LoRA, adapters, head updates) combined with blockwise module selection permits scalable personalization and reduces both bandwidth and compute costs, especially where not all devices can store full model weights for every modality and task (Abdisarabshali et al., 3 Sep 2025).
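To illustrate the bandwidth argument, the sketch below implements a generic LoRA-augmented linear layer in PyTorch (the standard technique, not the paper's specific implementation) and counts what would actually be communicated per round.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense layer plus a trainable low-rank update: Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "trainable values vs", 768 * 768, "for the full weight")
# -> 12288 trainable values vs 589824 for the full weight
```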
6. Open Research Directions and Future Extensions
HF-FMs expose several future challenges and research opportunities:
- Adaptive Aggregation and Scheduling: Algorithms must dynamically determine aggregation frequency, module selection, and routing, balancing adaptation to local context, energy budgets, and network topology.
- Module Relaying for Novel Modalities/Tasks: Nodes may request or relay specialist modules to address cold start or evolving requirements.
- Node Specialization and Role Governance: Nodes may dynamically specialize in certain modules (e.g., image encoders) and protocols are needed for role negotiation and trust.
- Collaborative and Distributed Inference: Combined vertical and horizontal computation offloading may be required for resource-limited devices unable to run large FMs locally.
- Formal Convergence and Robustness Analyses: Understanding how hierarchical, non-uniform, and asynchronous module updates affect convergence and performance remains an open problem.
Applications extend across domains requiring geo-distributed, privacy-respecting, multi-modal, multi-task intelligence, such as autonomous vehicles, smart cities, industrial IoT, and embodied AI.
7. Broader Context and Comparative Significance
HF-FMs generalize prior hierarchical federated learning to the domain of multi-modal, multi-task foundation models by leveraging their modularity and task-adaptive structure. The architectural innovations discussed (modular vertical/horizontal aggregation, module-level selection, D2D relaying) distinguish HF-FMs from simpler tree-structured FL, enabling efficient collaboration among resource-constrained and highly diverse devices in real-world wireless and distributed AI networks (Abdisarabshali et al., 3 Sep 2025).
The composition of edge-driven, cloud-assisted, and peer-to-peer learning protocols within the same hierarchy facilitates adaptive, robust, and scalable distributed intelligence, aligned with the complex heterogeneity and communication constraints of next-generation AI-driven networks.