
FLAME: Federated Benchmark for Robotic Manipulation

Updated 18 December 2025
  • The paper introduces FLAME, a federated learning benchmark that decentralizes robotic manipulation, enhancing privacy and scalability through distributed policy updates.
  • FLAME evaluates common algorithms like FedAvg, FedAvgM, FedOpt, and Krum across 420 unique environments per task, leveraging 42,000 expert demonstrations per task.
  • Empirical results highlight significant challenges in precision tasks and reveal that offline metrics like RMSE do not always predict online success, prompting advances in personalized FL.

Federated learning (FL) for robotic manipulation addresses the limitations of centralized policy learning in robotics, including data privacy concerns, scaling bottlenecks, and lack of adaptability. The FLAME benchmark—Federated Learning Across Manipulation Environments—provides a standardized, large-scale platform to evaluate decentralized, privacy-preserving policy learning for robotic manipulation (Betran et al., 3 Mar 2025). FLAME offers a comprehensive suite of datasets, a modular evaluation protocol, and critical empirical insights into the performance and limitations of current FL algorithms in manipulation domains.

1. Motivation and Central Tenets

Centralized training in robotic manipulation involves transferring all collected demonstration data to a single server, facilitating large-scale policy learning but introducing significant challenges: privacy concerns when data originates from heterogeneous users (e.g., in-home robots), scalability limits due to growing bandwidth and storage requirements, and poor adaptability when new robots/environments are introduced. Federated Learning mitigates these issues by enabling each robot (client) to retain private demonstration data locally—sharing only model updates with a central server, which aggregates the updates and broadcasts global model parameters.

FL confers distinct advantages:

  • Privacy preservation: Raw trajectory data never leaves the client.
  • Scalability: Computation is distributed; only model parameters are aggregated.
  • Continual/adaptive learning: Clients can join or leave dynamically, obviating the need for data re-centralization and re-training.
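The client/server exchange described above can be sketched in a few lines. This is a minimal illustration using a linear least-squares "policy" and plain NumPy parameter vectors, not the benchmark's actual training code; function names and hyperparameters are illustrative:

```python
import numpy as np

def local_update(theta, data, lr=0.1, epochs=5):
    """Client side: run a few gradient steps on private data; only
    the updated parameters (never the raw data) are returned."""
    X, y = data
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ theta - y) / len(y)  # least-squares gradient
        theta = theta - lr * grad
    return theta

def server_round(theta_global, client_datasets):
    """Server side: broadcast parameters, collect client updates,
    aggregate by data-size-weighted averaging (FedAvg-style)."""
    sizes = [len(y) for _, y in client_datasets]
    updates = [local_update(theta_global.copy(), d) for d in client_datasets]
    weights = np.array(sizes, dtype=float) / sum(sizes)
    return sum(w * u for w, u in zip(weights, updates))

# toy setup: 4 clients, each with its own private regression data
rng = np.random.default_rng(0)
theta = np.zeros(3)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
for _ in range(10):
    theta = server_round(theta, clients)
```

Note that the server only ever sees parameter vectors; each client's `(X, y)` stays local, which is the privacy property the benchmark is built around.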

2. Composition of the FLAME Benchmark

Tasks, Environments, and Data

FLAME's construction rests on four representative tasks from the RLBench suite:

  • Slide Block to Target
  • Close Box
  • Insert Onto Square Peg
  • Scoop With Spatula

These tasks are systematically perturbed along five factor categories: color, texture, object variations, physical properties, and camera view. Each task comprises 420 unique environments (400 for training, 10 for validation, 10 for testing). For every environment, K = 100 expert demonstrations are collected, resulting in 42,000 demonstrations per task (168,000 in total).

Each FLAME client corresponds to one unique environment, characterized by fixed perturbation parameters across all local demonstrations. Initial object poses for each demonstration are randomized to mitigate overfitting. Notably, each client's data is non-I.I.D.: visual and physical attributes, as well as object configurations, differ across environments. Task difficulties vary, with precision-demanding tasks (Peg-in-Square, Scoop) being more challenging than others (Close Box, Slide Block).

Training and Client Distribution

Robotic manipulation data in FLAME is partitioned such that each client is responsible for policy learning within its unique environment. This explicit heterogeneity is foundational to FLAME’s benchmark structure.
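The client-per-environment partition can be made concrete with a small data-structure sketch. The class and field names below (`FlameClient`, `perturbation`) are illustrative, not taken from the benchmark's codebase; only the counts (K = 100 demonstrations, 420 environments) come from the paper:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FlameClient:
    """One client = one environment: perturbation factors are fixed
    across all of the client's demonstrations; only the initial
    object poses vary between demonstrations."""
    env_id: int
    perturbation: dict          # e.g. color/texture/object/physics/camera
    n_demos: int = 100          # K = 100 expert demonstrations per environment

    def sample_demo_seeds(self, rng):
        # randomized initial object poses, one seed per demonstration
        return rng.integers(0, 2**31 - 1, size=self.n_demos)

rng = np.random.default_rng(3)
clients = [FlameClient(env_id=i,
                       perturbation={"color": i % 5, "camera": i % 7})
           for i in range(420)]  # 400 train / 10 val / 10 test per task
```

Because `perturbation` differs across clients while staying fixed within each client, the resulting local datasets are non-I.I.D. by construction.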

3. Federated Learning Formulation in FLAME

FLAME formalizes policy learning under FL as a minimization of a global behavior cloning loss:

$$\min_{\theta} F(\theta) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(\theta),$$

where

$$F_k(\theta) = \mathbb{E}_{(s, a) \sim D_k} [\ell(\pi_\theta(s), a)] = \frac{1}{n_k} \sum_{(s, a) \in D_k} \| \pi_\theta(s) - a \|^2,$$

with $\pi_\theta$ the policy parameterized by $\theta$, $D_k$ the local demonstration set of client $k$, $n_k = |D_k|$, and $n = \sum_k n_k$.
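This objective is straightforward to compute directly. The sketch below evaluates the per-client behavior-cloning loss and its $n_k/n$-weighted global combination on synthetic data, using a toy linear policy; it illustrates the formula above rather than the benchmark's implementation:

```python
import numpy as np

def client_bc_loss(policy, demos):
    """Per-client loss F_k: mean of the squared error
    ||pi_theta(s) - a||^2 over the client's (s, a) pairs."""
    states, actions = demos
    preds = policy(states)
    return float(np.mean(np.sum((preds - actions) ** 2, axis=1)))

def global_loss(policy, all_demos):
    """Global objective F(theta): n_k/n-weighted sum of client losses."""
    sizes = [len(s) for s, _ in all_demos]
    n = sum(sizes)
    return sum(n_k / n * client_bc_loss(policy, d)
               for n_k, d in zip(sizes, all_demos))

# toy linear policy pi_theta(s) = s @ W on synthetic demonstrations
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 2))
policy = lambda s: s @ W
demos = [(rng.normal(size=(10, 4)), rng.normal(size=(10, 2)))
         for _ in range(3)]
loss = global_loss(policy, demos)
```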

Algorithms Evaluated

FLAME supports and evaluates several canonical FL algorithms:

  • FedAvg: At each round $t$, the server disseminates global parameters $\theta_t$ to sampled clients. Each client performs multiple local epochs on its data $D_k$ via SGD/Adam and returns updated parameters $\theta_{t+1}^{k}$. The server aggregates these via weighted averaging.
  • FedAvgM: An extension of FedAvg with server-side momentum to stabilize oscillations arising from non-I.I.D. client distributions.
  • FedOpt: Employs adaptive optimization (e.g., server-side Adam) for aggregating heterogeneous client updates.
  • Krum: Implements a Byzantine-robust aggregation by selecting the update closest (in parameter space) to the majority, filtering outliers.

A concise summary table is provided below:

| Algorithm | Aggregation Strategy | Specialization |
|---|---|---|
| FedAvg | Weighted mean | Standard FL baseline |
| FedAvgM | FedAvg + server momentum | Non-I.I.D. stabilization |
| FedOpt | Adaptive server optimization | Heterogeneous updates |
| Krum | Outlier-robust selection | Byzantine robustness |

Federated training proceeds for $R = 30$ rounds, with $C = 20$ clients randomly selected per round from a pool of 400 training clients. Each client executes 50 local epochs per round using the Adam optimizer.
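The two aggregation rules at the ends of the spectrum, FedAvg's weighted mean and Krum's outlier-robust selection, can be contrasted in a short sketch. The Krum scoring below follows the commonly used formulation (sum of squared distances to the $n - f - 2$ nearest other updates, assuming at most $f$ Byzantine clients); treat it as an assumption about the benchmark's variant:

```python
import numpy as np

def fedavg_aggregate(updates, sizes):
    """FedAvg: data-size-weighted mean of client parameter vectors."""
    w = np.array(sizes, dtype=float) / sum(sizes)
    return np.einsum('c,cd->d', w, np.stack(updates))

def krum_select(updates, f=1):
    """Krum: return the single update whose summed squared distance
    to its n - f - 2 nearest neighbours is smallest (outlier filter)."""
    U = np.stack(updates)
    n = len(U)
    d2 = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=-1)
    scores = []
    for i in range(n):
        nearest = np.sort(np.delete(d2[i], i))[: n - f - 2]
        scores.append(nearest.sum())
    return updates[int(np.argmin(scores))]

# four honest updates near 1.0, plus one Byzantine outlier at 100
honest = [np.ones(5) + 0.01 * np.random.default_rng(i).normal(size=5)
          for i in range(4)]
byzantine = np.full(5, 100.0)
chosen = krum_select(honest + [byzantine], f=1)
```

Averaging the five updates would be dragged toward 100 by the outlier, whereas Krum discards it, which is why Krum can behave very differently from the mean-based aggregators on heterogeneous clients.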

4. Evaluation Protocol and Metrics

Protocol Architecture

The FLAME evaluation stack is built atop the Flower federated learning framework, interfaced with RLBench via a custom wrapper—facilitating environment instantiation, dataset loading, and model exchanges.

Metrics

Two principal performance metrics are employed:

  • Offline Metric: Root Mean Squared Error (RMSE) between predicted and expert actions on held-out test sets for each environment.
  • Online Metric: Normalized Success Rate, computed as the proportion of successful task completions over 50 episodes in each of 10 test environments.

Final results are reported as averaged values (mean ± standard deviation) across all sampled clients/environments.
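Both metrics and the reporting convention reduce to a few lines of NumPy. The sketch below is an illustration of the definitions above, not the benchmark's evaluation code:

```python
import numpy as np

def offline_rmse(pred_actions, expert_actions):
    """Offline metric: RMSE between predicted and expert actions."""
    return float(np.sqrt(np.mean((pred_actions - expert_actions) ** 2)))

def normalized_success_rate(outcomes_per_env):
    """Online metric: per-environment success fraction over the
    episodes, then mean and std across environments."""
    rates = [float(np.mean(o)) for o in outcomes_per_env]
    return float(np.mean(rates)), float(np.std(rates))

# e.g. 10 test environments x 50 binary episode outcomes each
rng = np.random.default_rng(2)
outcomes = [rng.integers(0, 2, size=50) for _ in range(10)]
mean_sr, std_sr = normalized_success_rate(outcomes)
```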

5. Empirical Findings

Task Difficulty and FL Algorithm Comparison

Distinct variations in task difficulty are observed:

  • Close Box: All FL methods achieve low RMSE and high online success (FedAvg SR ≈ 0.84).
  • Slide Block: Moderate offline and online performance (RMSE ≈ 0.026, SR ≈ 0.24 for FedAvg).
  • Peg-in-Square and Scoop: All FL methods yield SR = 0, signifying current FL baselines fail on precision tasks.

FL method ablation reveals:

  • FedAvg and FedOpt achieve similar RMSE; FedOpt slightly underperforms on Close Box.
  • FedAvgM and Krum often exhibit higher RMSE; Krum, however, attains best online SR in Close Box despite larger RMSE.
  • More clients per aggregation round and more demonstrations per client both enhance final performance.
  • Increasing local epochs beyond an optimal threshold (≈25) induces local overfitting, which degrades aggregation.
  • Offline RMSE does not always predict online SR, particularly for Krum on Close Box.

Representative Results Overview

  • Table II(a): Presents task-by-task offline RMSE (×10⁻²).
  • Table II(b): Summarizes online normalized success rates.
  • Figure 1: Depicts ablation studies on the number of clients, demonstrations, local epochs, and rounds for the Slide Block task using FedAvg.

6. Outstanding Challenges and Directions

Several key challenges are evident:

  • Non-I.I.D. Demonstrations: Distributional variation across clients leads to client drift and unstable aggregation.
  • Task Heterogeneity: Existing FL baselines are inadequate for precision tasks.
  • Scalability: Large client pools and extensive communication rounds necessitate efficient sampling and gradient compression.
  • Evaluation Gap: Offline losses (RMSE) and online task success can diverge, complicating model selection.
  • Privacy–Utility Trade-Off: Stricter privacy measures (e.g., differential privacy) may degrade policy accuracy.

Prospective Advances

Opportunities for advancement include:

  • Personalized FL schemes (per-client fine-tuning) to address non-I.I.D. skills.
  • Multi-task aggregation mechanisms that incorporate task similarity (clustered aggregation paradigms).
  • Algorithmic approaches for gradient compression or quantization to mitigate communication burden.
  • Co-design of offline and online metrics to minimize the sim-to-policy gap.
  • Incorporation of privacy mechanisms (e.g., DP-FedAvg) and systematic study of privacy–utility trade-offs.
  • FLAME extension to real-robot datasets and continuous learning protocols (clients with dynamic participation).
  • Investigation of federated reinforcement learning approaches beyond behavior cloning.

FLAME provides a foundational benchmark for research into scalable, adaptive, and privacy-aware robotic policy learning under federated paradigms, enabling systematic exploration of algorithmic, architectural, and practical dimensions of distributed robotic manipulation (Betran et al., 3 Mar 2025).
