FLAME: Federated Benchmark for Robotic Manipulation
- The paper introduces FLAME, a federated learning benchmark for robotic manipulation that decentralizes policy learning, enhancing privacy and scalability through distributed model updates.
- FLAME evaluates canonical algorithms (FedAvg, FedAvgM, FedOpt, and Krum) across 420 unique environments per task, leveraging 42,000 expert demonstrations per task.
- Empirical results highlight significant challenges on precision tasks and show that offline metrics such as RMSE do not always predict online success, motivating research into personalized FL.
Federated learning (FL) for robotic manipulation addresses the limitations of centralized policy learning in robotics, including data privacy concerns, scaling bottlenecks, and lack of adaptability. The FLAME benchmark—Federated Learning Across Manipulation Environments—provides a standardized, large-scale platform to evaluate decentralized, privacy-preserving policy learning for robotic manipulation (Betran et al., 3 Mar 2025). FLAME offers a comprehensive suite of datasets, a modular evaluation protocol, and critical empirical insights into the performance and limitations of current FL algorithms in manipulation domains.
1. Motivation and Central Tenets
Centralized training in robotic manipulation involves transferring all collected demonstration data to a single server, facilitating large-scale policy learning but introducing significant challenges: privacy concerns when data originates from heterogeneous users (e.g., in-home robots), scalability limits due to growing bandwidth and storage requirements, and poor adaptability when new robots/environments are introduced. Federated Learning mitigates these issues by enabling each robot (client) to retain private demonstration data locally—sharing only model updates with a central server, which aggregates the updates and broadcasts global model parameters.
FL confers distinct advantages:
- Privacy preservation: Raw trajectory data never leaves the client.
- Scalability: Computation is distributed; only model parameters are aggregated.
- Continual/adaptive learning: Clients can join or leave dynamically, obviating the need for data re-centralization and re-training.
2. Composition of the FLAME Benchmark
Tasks, Environments, and Data
FLAME's construction rests on four representative tasks from the RLBench suite:
- Slide Block to Target
- Close Box
- Insert Onto Square Peg
- Scoop With Spatula
These tasks are systematically perturbed along five factor categories: color, texture, object variations, physical properties, and camera view. Each task comprises 420 unique environments (400 for training, 10 for validation, 10 for testing). For every environment, 100 expert demonstrations are collected, yielding 42,000 demonstrations per task (168,000 in total).
Each FLAME client corresponds to one unique environment, characterized by fixed perturbation parameters across all local demonstrations. Initial object poses for each demonstration are randomized to mitigate overfitting. Notably, each client's data is non-I.I.D.: visual and physical attributes, as well as object configurations, differ across environments. Task difficulties vary, with precision-demanding tasks (Peg-in-Square, Scoop) being more challenging than others (Close Box, Slide Block).
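Schematically, a client's environment can be thought of as a fixed combination of the five perturbation factors plus a split assignment. The sketch below is illustrative only; the field names are hypothetical and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ClientEnvironment:
    """One FLAME client: a unique perturbed environment. The perturbation
    factors are fixed across all of the client's demonstrations; only the
    initial object poses are randomized per demonstration."""
    task: str              # e.g. "slide_block_to_target"
    color: str             # color perturbation
    texture: str           # texture perturbation
    object_variant: str    # object variation
    physics: dict          # physical properties, e.g. {"friction": 0.6}
    camera_view: str       # camera viewpoint
    split: str             # "train" (400), "val" (10), or "test" (10)
```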
Training and Client Distribution
Robotic manipulation data in FLAME is partitioned such that each client is responsible for policy learning within its unique environment. This explicit heterogeneity is foundational to FLAME’s benchmark structure.
3. Federated Learning Formulation in FLAME
FLAME formalizes policy learning under FL as the minimization of a global behavior cloning loss:

$$\min_{\theta} \; F(\theta) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(\theta),$$

where

$$F_k(\theta) = \frac{1}{n_k} \sum_{(o_i, a_i) \in \mathcal{D}_k} \lVert \pi_{\theta}(o_i) - a_i \rVert^2,$$

with $\pi_{\theta}$ representing the policy parameterized by $\theta$, $\mathcal{D}_k$ the local demonstration set of client $k$, $n_k = |\mathcal{D}_k|$, and $n = \sum_{k=1}^{K} n_k$.
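To ground the notation, here is a minimal PyTorch sketch of the per-client loss $F_k$ and the data-weighted global objective; all names are illustrative and not taken from the FLAME codebase.

```python
import torch

def local_bc_loss(policy, observations, expert_actions):
    """Per-client behavior cloning loss F_k(theta): mean squared error
    between the policy's predicted actions and the expert actions."""
    predicted = policy(observations)          # pi_theta(o_i)
    return torch.mean((predicted - expert_actions) ** 2)

def global_bc_loss(client_losses, client_sizes):
    """Global objective F(theta): average of client losses weighted
    by local dataset sizes n_k / n."""
    n = sum(client_sizes)
    return sum((n_k / n) * f_k for f_k, n_k in zip(client_losses, client_sizes))
```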
Algorithms Evaluated
FLAME supports and evaluates several canonical FL algorithms:
- FedAvg: At each round $t$, the server disseminates global parameters $\theta^t$ to a sampled subset of clients. Each client $k$ performs multiple local epochs on its local data $\mathcal{D}_k$ via SGD/Adam and returns updated parameters $\theta_k^{t+1}$. The server aggregates these via weighted averaging (a minimal sketch follows the table below).
- FedAvgM: An extension of FedAvg with server-side momentum to stabilize oscillations arising from non-I.I.D. client distributions.
- FedOpt: Employs adaptive optimization (e.g., server-side Adam) for aggregating heterogeneous client updates.
- Krum: Implements a Byzantine-robust aggregation by selecting the update closest (in parameter space) to the majority, filtering outliers.
A concise summary table is provided below:
| Algorithm | Aggregation Strategy | Specialization |
|---|---|---|
| FedAvg | Weighted mean | Standard FL baseline |
| FedAvgM | FedAvg + server momentum | Non-I.I.D. stabilization |
| FedOpt | Adaptive server optimization | Heterogeneous updates |
| Krum | Outlier-robust selection | Byzantine robustness |
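As referenced above, the following is a minimal NumPy sketch of one FedAvg aggregation round, under the assumption that clients return full parameter lists; function and variable names are illustrative, not from the FLAME codebase.

```python
import numpy as np

def fedavg_aggregate(client_updates):
    """One FedAvg aggregation step.

    client_updates: list of (params, n_k) pairs, where `params` is a list
    of np.ndarray holding client k's locally trained parameters and n_k
    is the size of its local demonstration set.
    Returns the data-weighted average theta^{t+1}.
    """
    n = sum(n_k for _, n_k in client_updates)
    template, _ = client_updates[0]
    averaged = [np.zeros_like(p) for p in template]
    for params, n_k in client_updates:
        for acc, p in zip(averaged, params):
            acc += (n_k / n) * p
    return averaged
```

FedAvgM would additionally maintain a server-side momentum buffer over the averaged pseudo-gradient, and FedOpt would replace the plain average with an adaptive server-side optimizer step; both reuse the same weighted-mean building block.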
Federated training proceeds over multiple communication rounds, with a subset of clients randomly sampled per round from the pool of 400 training clients. Each client executes 50 local epochs per round using the Adam optimizer.
4. Evaluation Protocol and Metrics
Protocol Architecture
The FLAME evaluation stack is built atop the Flower federated learning framework, interfaced with RLBench via a custom wrapper—facilitating environment instantiation, dataset loading, and model exchanges.
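To illustrate the Flower integration pattern (a sketch only; FLAME's actual RLBench wrapper is not reproduced here), a client in this setup could look roughly like the following. The policy model and local demonstration pairs are assumed to be provided by the caller.

```python
import flwr as fl
import torch

class ManipulationClient(fl.client.NumPyClient):
    """One client = one perturbed manipulation environment (sketch only)."""

    def __init__(self, model: torch.nn.Module, demos):
        self.model = model    # behavior cloning policy network (assumed given)
        self.demos = demos    # local (observation, expert_action) tensor pairs

    def get_parameters(self, config):
        return [p.detach().cpu().numpy() for p in self.model.parameters()]

    def fit(self, parameters, config):
        # Load the broadcast global weights into the local model.
        with torch.no_grad():
            for p, new in zip(self.model.parameters(), parameters):
                p.copy_(torch.tensor(new))
        optimizer = torch.optim.Adam(self.model.parameters())
        for _ in range(50):                       # 50 local epochs per round
            for obs, act in self.demos:
                optimizer.zero_grad()
                loss = torch.mean((self.model(obs) - act) ** 2)
                loss.backward()
                optimizer.step()
        return self.get_parameters(config), len(self.demos), {}
```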
Metrics
Two principal performance metrics are employed:
- Offline Metric: Root Mean Squared Error (RMSE) between predicted and expert actions on held-out test sets for each environment.
- Online Metric: Normalized Success Rate, computed as the proportion of successful task completions over 50 episodes in each of the 10 test environments.
Final results are reported as averaged values (mean ± standard deviation) across all sampled clients/environments.
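A minimal sketch of the two metrics, assuming predicted/expert actions are given as arrays and online evaluation yields a success count per test environment; names are illustrative.

```python
import numpy as np

def offline_rmse(predicted, expert):
    """Offline metric: RMSE between predicted and expert actions
    on an environment's held-out test set."""
    return float(np.sqrt(np.mean((predicted - expert) ** 2)))

def online_success(successes_per_env, episodes_per_env=50):
    """Online metric: normalized success rate over 50 episodes per test
    environment, reported as mean +/- std across environments."""
    rates = np.asarray(successes_per_env, dtype=float) / episodes_per_env
    return rates.mean(), rates.std()
```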
5. Empirical Findings
Task Difficulty and FL Algorithm Comparison
Distinct variations in task difficulty are observed:
- Close Box: All FL methods achieve low RMSE and high online success (FedAvg SR ≈ 0.84).
- Slide Block: Moderate offline and online performance (RMSE 0.026, SR 0.24 for FedAvg).
- Peg-in-Square and Scoop: All FL methods yield SR = 0, indicating that current FL baselines fail on precision tasks.
FL method ablation reveals:
- FedAvg and FedOpt achieve similar RMSE; FedOpt slightly underperforms on Close Box.
- FedAvgM and Krum often exhibit higher RMSE; Krum, however, attains best online SR in Close Box despite larger RMSE.
- More clients per aggregation round and more demonstrations per client both enhance final performance.
- Increasing local epochs beyond an optimal threshold (≈25) induces local overfitting, which degrades aggregation.
- Offline RMSE does not always predict online SR, particularly for Krum on Close Box.
Representative Results Overview
- Table II(a): Presents task-by-task offline RMSE.
- Table II(b): Summarizes online normalized success rates.
- Figure 1: Depicts ablation studies on the number of clients, demonstrations, local epochs, and rounds for the Slide Block task using FedAvg.
6. Outstanding Challenges and Directions
Several key challenges are evident:
- Non-I.I.D. Demonstrations: Distributional variation across clients leads to client drift and unstable aggregation.
- Task Heterogeneity: Existing FL baselines are inadequate for precision tasks.
- Scalability: Large client pools and extensive communication rounds necessitate efficient sampling and gradient compression.
- Evaluation Gap: Offline losses (RMSE) and online task success can diverge, complicating model selection.
- Privacy–Utility Trade-Off: Stricter privacy measures (e.g., differential privacy) may degrade policy accuracy.
Prospective Advances
Opportunities for advancement include:
- Personalized FL schemes (per-client fine-tuning) to address non-I.I.D. skills.
- Multi-task aggregation mechanisms that incorporate task similarity (clustered aggregation paradigms).
- Algorithmic approaches for gradient compression or quantization to mitigate communication burden.
- Co-design of offline and online metrics to minimize the sim-to-policy gap.
- Incorporation of privacy mechanisms (e.g., DP-FedAvg) and systematic study of privacy–utility trade-offs.
- Extension of FLAME to real-robot datasets and continual learning protocols (clients with dynamic participation).
- Investigation of federated reinforcement learning approaches beyond behavior cloning.
FLAME provides a foundational benchmark for research into scalable, adaptive, and privacy-aware robotic policy learning under federated paradigms, enabling systematic exploration of algorithmic, architectural, and practical dimensions of distributed robotic manipulation (Betran et al., 3 Mar 2025).