CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

Published 10 Jun 2026 in cs.RO and cs.AI | (2606.12352v1)

Abstract: Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference time to overcome partial observability. Our key insight is that the visuomotor priors of pretrained vision-language-action (VLA) models should enable reactive, decentralized collaboration from each robot's local observations alone, without these inference-time assumptions. We propose CHORUS, a framework that adapts a single VLA backbone to control diverse, multi-robot teams. At inference time, each robot runs an independent copy of CHORUS, conditioned only on its own observations and a robot-identifying prompt. In real-world experiments including mobile tape measurement, library book handovers, and laundry basket lifting, CHORUS achieves a 64 percentage point improvement over decentralized, from-scratch models, improves reactivity to teammate behavior by 40 percentage points, and outperforms centralized baselines. Together, these results show that a shared VLA backbone is capable of achieving decentralized multi-robot collaboration, without per-robot policies or inter-robot communication at inference.


Authors (5)

Summary

The paper demonstrates that a single VLA policy, when finetuned with robot-identity prompts, induces emergent, fully decentralized collaboration across heterogeneous robots.
By leveraging LoRA finetuning and cross-embodiment input handling, CHORUS achieves a 64 percentage point gain over decentralized from-scratch policies and outperforms centralized baselines.
The approach maintains constant runtime complexity with team size, eliminates inter-robot communication, and scales effectively to diverse, asynchronous multi-robot tasks.

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

Problem Setting and Motivation

Multi-robot collaboration for physical tasks (e.g., object transport, assembly) is a longstanding target in robotics. Centralized methods, which condition on joint multi-robot observations and output combined action vectors, are encumbered by scalability, communication, and synchronization issues as team size increases. Conversely, most decentralized approaches require per-robot policy training, explicit observation sharing, or post-hoc inference alignment mechanisms to mitigate partial observability—increasing engineering complexity and reducing flexibility in real-world settings.

The CHORUS framework investigates whether the visuomotor priors of large Vision-Language-Action (VLA) models, after finetuning, suffice to enable robust, scalable, fully decentralized multi-robot collaboration. The central hypothesis is that a single VLA policy, conditioned on robot-specific prompts and local sensory streams, can induce emergent collaborative behavior across diverse, asynchronous, and heterogeneous robot teams—without any inference-time information sharing or alignment.

Methodology

Policy Architecture

CHORUS adapts a pretrained π0.5 VLA backbone for decentralized, multi-robot scenarios via low-rank adaptation (LoRA) finetuning. The key innovations are:

Unified Policy with Robot Identity Conditioning: A single parameter set governs all robot embodiments. At both training and deployment, observations are paired with explicit robot-identity prompts encoding morphology and role. The policy is never exposed to joint observations during training.
Cross-Embodiment Input/Output Handling: Padded action and observation formats ensure compatibility with varying robot morphologies, sensors (including frequency variations), and kinematic subspaces.
Decentralized Deployment: Each robot runs a local, independent policy copy, relying solely on its own observation stream and identity prompt. No team-level state is ever exchanged, and asynchronous actuation is natively supported.
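
The three ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `RobotSpec` fields, the prompt format, trailing zero-padding, and the `Policy` callable are all hypothetical. The 32-dimensional padded width is the figure quoted in the Practical Applications section of this page.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Shared action width; 32 is the padded size quoted elsewhere on this page.
PADDED_ACTION_DIM = 32


@dataclass(frozen=True)
class RobotSpec:
    """Hypothetical per-robot description used only for this illustration."""
    name: str        # e.g. "arx_mobile"
    role: str        # e.g. "left lifter"
    action_dim: int  # number of dimensions this robot actually controls


def identity_prompt(spec: RobotSpec, task: str) -> str:
    """Build a robot-identifying prompt (the format is an assumption)."""
    return f"robot: {spec.name}; role: {spec.role}; task: {task}"


def pad_action(action: Sequence[float], width: int = PADDED_ACTION_DIM) -> List[float]:
    """Zero-pad a native action up to the shared width (assumed trailing padding)."""
    if len(action) > width:
        raise ValueError("native action is wider than the padded format")
    return list(action) + [0.0] * (width - len(action))


def unpad_action(padded: Sequence[float], spec: RobotSpec) -> List[float]:
    """Keep only the leading dimensions this robot controls."""
    return list(padded[: spec.action_dim])


# (local observation, identity prompt) -> padded action; stands in for the VLA.
Policy = Callable[[object, str], List[float]]


def local_step(policy: Policy, spec: RobotSpec, observation: object, task: str) -> List[float]:
    """One decentralized control step using only this robot's own inputs."""
    padded = policy(observation, identity_prompt(spec, task))
    return unpad_action(padded, spec)
```

Each robot would call `local_step` in its own control loop with its own `RobotSpec`. Nothing in the signature accepts a teammate's observation, which is the point of the decentralized design.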

Data Collection and Training Paradigm

Multi-robot demonstrations are acquired using teleoperated mobile manipulators, with each robot executing from its own perspective. All per-robot (observation, action, prompt) tuples across all embodiments are pooled. Training minimizes the flow-matching loss between predicted and true action chunks, with LoRA adapters targeting both the VLA encoder and action head.
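
A rough sketch of the pooling step and the flow-matching regression target follows. The episode layout, function names, and the linear interpolation convention (noisy input `tau * chunk + (1 - tau) * noise`, target `chunk - noise`) are assumptions chosen for illustration; the paper's exact convention may differ.

```python
from typing import Dict, List, Tuple

# (identity prompt, observation, action chunk) for a single robot
Sample = Tuple[str, list, list]


def pool_per_robot_samples(
    episodes: List[Dict[str, List[Tuple[list, list]]]],
    prompts: Dict[str, str],
) -> List[Sample]:
    """Flatten multi-robot episodes into one dataset of single-robot samples.

    Each episode maps a robot name to that robot's own
    (observation, action_chunk) pairs. No pooled sample ever contains
    another robot's observation.
    """
    pooled: List[Sample] = []
    for episode in episodes:
        for robot, steps in episode.items():
            for observation, chunk in steps:
                pooled.append((prompts[robot], observation, chunk))
    return pooled


def flow_matching_pair(chunk: List[float], noise: List[float], tau: float):
    """Return the noisy input and velocity target for one training sample.

    Assumed convention: a straight-line path between noise and the true chunk.
    """
    x_tau = [tau * a + (1.0 - tau) * e for a, e in zip(chunk, noise)]
    target = [a - e for a, e in zip(chunk, noise)]
    return x_tau, target


def flow_matching_loss(predicted: List[float], target: List[float]) -> float:
    """Mean squared error between predicted and target velocities."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```

In a real training loop the prediction would come from the LoRA-adapted VLA given the noisy chunk, the robot's observation, and its identity prompt. Here the functions only show how the data and target are formed.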

Experimental Evaluation

Evaluations are conducted on mobile basket-lift, tape-measure, book-handover, and a three-robot transport task using ARX, Kinova, and YAM platform variants, under a suite of real-world conditions (including distractors and minor temporal mismatches). Comparisons include:

Centralized VLA Baseline: Unified policy with team-wide observations/actions requiring runtime synchronization and information sharing.
Decentralized, Per-Robot Imitation Learning: A from-scratch Diffusion Policy [9] and a per-robot finetuned VLA policy.
CHORUS Ablations: With/without cross-robot weight sharing.

Results

Transfer and Reactivity

Success Rate: CHORUS achieves a 64 percentage point gain over decentralized from-scratch diffusion policies and outperforms the centralized VLA baseline (even though the latter has more information at inference).
Teammate Reactivity: Experiments with scripted perturbations on one robot show CHORUS recovers effective collaborative behavior in ~2x more cases than per-robot-finetuned networks, indicating improved behavioral coupling and emergent anticipation via shared representation.
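
Because the results are reported in percentage points rather than relative percent, a short sketch may help keep the two apart. The trial counts below are hypothetical and are not taken from the paper.

```python
def success_rate(successes: int, trials: int) -> float:
    """Success rate in percent."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return 100.0 * successes / trials


def percentage_point_gain(rate_a: float, rate_b: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return rate_a - rate_b


def relative_gain_percent(rate_a: float, rate_b: float) -> float:
    """Relative improvement of rate_a over rate_b, in percent."""
    if rate_b == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (rate_a - rate_b) / rate_b


# Hypothetical counts for illustration only.
shared = success_rate(15, 20)    # 75.0
baseline = success_rate(5, 20)   # 25.0
gain_pp = percentage_point_gain(shared, baseline)    # 50.0 percentage points
gain_rel = relative_gain_percent(shared, baseline)   # 200.0 percent
```

With these made-up counts, the same comparison is a 50 percentage point gain but a 200% relative gain, so the unit matters when reading the headline numbers.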

Scalability and Embodiment Diversity

Parameter and Context Window Complexity: CHORUS’ runtime and memory footprints remain constant with increasing team size, while centralized and non-shared approaches scale at least linearly in context length or parameter count.
Three-Robot Team Generalization: The same policy handles basket transport with three heterogeneous robots, attaining a 90% task success rate with no architectural alteration.
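
The scaling claim can be made concrete with a back-of-the-envelope sketch. The token and parameter counts are placeholders, and the functions model only the linear lower bound described above, not measured costs.

```python
def decentralized_context_tokens(tokens_per_robot: int, team_size: int) -> int:
    """Each robot's copy sees only its own observations, so team size is irrelevant."""
    return tokens_per_robot


def centralized_context_tokens(tokens_per_robot: int, team_size: int) -> int:
    """A centralized policy concatenates every robot's observations."""
    return tokens_per_robot * team_size


def per_robot_policy_params(params_per_policy: int, team_size: int) -> int:
    """Training one policy per robot multiplies the parameter count."""
    return params_per_policy * team_size


def shared_policy_params(params_per_policy: int, team_size: int) -> int:
    """A single shared set of weights serves every robot."""
    return params_per_policy


# Placeholder figure: 1000 observation tokens per robot.
for n in (2, 3, 4):
    print(n, decentralized_context_tokens(1000, n), centralized_context_tokens(1000, n))
```

Under these assumptions the decentralized context stays at 1000 tokens per forward pass while the centralized context grows to 2000, 3000, and 4000 tokens for teams of two, three, and four robots.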

Analysis of Failure Modes

Failure cases in decentralized-from-scratch policies include premature or unsynchronized interaction, task confusion due to distractor objects, and reduced robustness in out-of-distribution settings. CHORUS mitigates these via strong pretrained visual priors and identity-conditioned reactivity.

Centralized baselines, despite their greater information bandwidth at inference, are prone to performance degradation due to increased input dimensionality and misalignment between pretraining and deployment datastreams—a known source of causal confusion in behavioral cloning.

Implications and Future Directions

CHORUS demonstrates that strong visuomotor priors in VLA models, post-finetuning, can enable parameter-efficient, highly scalable decentralized collaboration in heterogeneous multi-robot systems. Practical implications include:

Deployment Flexibility: No inter-robot communication or inference-time alignment protocols are necessary, simplifying system engineering and enhancing resilience to hardware/network heterogeneity.
Greater Reusability and Training Efficiency: Sharing weights across diverse embodiments reduces training cost and data collection burden, facilitating adaptation to new robots and tasks.

However, strictly synchronized low-level behaviors (e.g., simultaneous gripper triggers) remain outside the expressiveness of fully decentralized policies acting on local observations. Additionally, data sufficiency and coverage—especially for larger teams and tasks requiring complex role inference—remain key limitations, underlining the need for collaborative demonstration dataset curation and augmentation.

Conclusion

The CHORUS framework provides compelling empirical evidence that a single, prompt-conditioned VLA policy can induce robust, decentralized collaboration across multi-embodiment robot teams. This approach obviates per-robot policies and inference-time communication, matching or exceeding centralized and from-scratch decentralized baselines in both efficiency and task performance. These findings delineate a promising path towards generalist, scalable, and deployable multi-robot collaboration, and foreground the value of foundational models and shared representations in complex embodied AI systems.

Reference:

"CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy" (2606.12352)

Explain it Like I'm 14

Plain-language summary of “CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy”

What is this paper about?

This paper is about teaching different robots to work together smoothly on shared tasks (like lifting a laundry basket, handing over a book, or taking a measurement with a tape measure) without needing to constantly talk to each other or share the same camera view. The key idea is to use one shared “brain” (a single AI policy) that can run on many types of robots. Each robot looks around with its own cameras, reads a short “who am I” note, and then decides what to do next on its own, staying in sync with teammates simply by watching them.

The authors call this system CHORUS, like a group of singers who follow the same song but sing different parts. Here, the “song” is a single vision-language-action model (VLA) that has been adapted to control many robots at once, each from its own point of view.

What questions did the researchers ask?

The paper explores four easy-to-understand questions:

Does starting from a powerful, already-trained robot model help multi-robot teamwork more than training from scratch?
If every robot shares the exact same “brain” (the same model weights), does that make them better at reacting to each other’s movements?
Is a “centralized” approach (one big model that sees everything from all robots at once) actually better than “decentralized” (each robot uses only its own view)?
Can the same single model handle teams larger than two robots without changing the model’s design?

How did they do it?

Think of CHORUS as one shared playbook that any robot can use. Here’s how it works, explained with simple ideas:

Vision-Language-Action model (VLA): This is an AI that understands pictures (vision), words (language prompts), and can produce robot motions (actions). The authors start from a strong, pre-trained VLA “backbone” that already knows a lot about how robots act from past training. They then fine-tune it on multi-robot teamwork examples.
One model, many robots: Instead of training a separate model for each robot, they train just one. Each robot gives the model two things at each step:
- Its own camera views (what it sees).
- A short identity prompt (like a name tag) that says which robot it is and its role, so the model knows which body it’s controlling and what part it should play.
No robot-to-robot messaging at run time: Robots don’t send each other data during the task. They coordinate by simply watching each other in their own cameras, like soccer players reading teammates’ body language.
Training by demonstration: Humans teleoperated (remote-controlled) the robots to perform teamwork tasks. From these demonstrations, the model learns how each robot should move in response to what it sees. The training examples are split robot-by-robot (each sample contains one robot’s view and the actions it took), and then all robots’ samples are mixed into one big training set.
Works across different robot bodies: Different robots have different arms and speeds. The model handles this by:
- Using a “padded” action format that can fit actions for any robot type.
- Adjusting the planning chunk sizes so faster robots plan slightly longer action sequences, keeping everyone on the same time horizon.
- Balancing training so slower robots still get enough practice examples.
Lightweight fine-tuning: They add small, efficient adapters (LoRA) to the pre-trained model to learn teamwork skills without retraining the whole model from scratch.
Decentralized execution: In the real world, each robot runs its own copy of the model independently and at its own speed. Small timing differences are okay because the model reacts based on what it sees right now.
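
Two of the ideas in the list above, matching time horizons across robots with different speeds and balancing training data, can be sketched as follows. The control rates, the one-second horizon, the sample counts, and the inverse-count weighting rule are all assumptions made for illustration; the summary does not state the exact scheme the authors use.

```python
from typing import Dict


def chunk_size(control_rate_hz: float, horizon_s: float) -> int:
    """Number of actions needed to cover the same time horizon at a given rate."""
    return max(1, round(control_rate_hz * horizon_s))


def balanced_sample_weights(sample_counts: Dict[str, int]) -> Dict[str, float]:
    """Per-sample weights so every robot gets equal total sampling probability.

    A robot with fewer samples (for example, one logged at a lower rate)
    receives a larger weight per sample.
    """
    n_robots = len(sample_counts)
    return {robot: 1.0 / (n_robots * count) for robot, count in sample_counts.items()}


# Hypothetical robots: one controlled at 30 Hz, one at 15 Hz.
fast_chunk = chunk_size(30.0, horizon_s=1.0)  # 30 actions
slow_chunk = chunk_size(15.0, horizon_s=1.0)  # 15 actions

weights = balanced_sample_weights({"fast": 3000, "slow": 1500})
```

With these placeholder numbers, both robots plan one second ahead even though the faster robot predicts twice as many actions, and each robot's samples together account for half of the sampling probability.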

What did they find, and why does it matter?

Here are the main results from real-world tests on tasks like lifting a basket together, measuring with tape, handing over a book, and a three-robot move through a doorway:

Much better than training from scratch: CHORUS improved success rates by 64 percentage points over strong “from-scratch” methods (diffusion policies trained separately per robot). This shows that starting from a powerful pre-trained VLA gives robots helpful “instincts” for reacting to what they see.
More reactive teammates with shared weights: When all robots share the exact same model weights (one brain for all), they adapt to each other’s movements more reliably—about a 40 percentage-point improvement in a test where one robot was moved sideways during a handover. Sharing the brain seems to help robots learn a common understanding of each other’s behavior.
Outperforms centralized control in practice: Even though a centralized model sees more information (everyone’s cameras at once), CHORUS still matched or beat it. Why? The centralized setup pushes the model further from what it was trained on and grows the input size, which can hurt performance. CHORUS keeps inputs simple (one robot’s view), closer to the pre-training style, and handles small timing differences better.
Scales to three robots with no redesign: The same single model worked for a 3-robot task with a 90% success rate, without changing the architecture. That’s promising for larger teams.

What’s the big picture?

Easier to deploy: One model controls all robots, no special models per robot, and no robot-to-robot communication needed during the task. That can make real-world use cheaper and simpler.
More robust teamwork: Because each robot reacts to what it sees, small delays or slightly different speeds are less likely to break the team’s coordination.
Future impact: This approach could help create cleaning crews, moving teams, and warehouse assistants made up of diverse robots that can quickly learn to work together. It also supports higher-level planners (like LLMs that assign roles) by providing a strong low-level controller that carries out the actual motions on each robot.
Limitations and next steps: Some tasks need perfect, split-second synchronization (like opening two grippers at the exact same instant), which still favors centralized control. Also, training demos must show enough of the scene so each robot can see what it needs to coordinate locally. Finally, the community needs larger shared datasets of multi-robot teamwork to scale this approach further.

In short, CHORUS shows that one shared, pre-trained robot “brain” can power many different robots to collaborate well—just by looking, understanding a short role prompt, and reacting in real time.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, written to be concrete and actionable for future work.

Robustness to partial observability: How does CHORUS perform when teammates or key objects are frequently occluded, out of view, or only intermittently visible, without the curated “always-visible” data-collection strategy?
Active perception under decentralization: Can robots learn to reposition sensors/cameras or reorient themselves to re-establish observability of teammates or task-relevant cues when visibility degrades?
Memory and state estimation: What is the impact of adding temporal memory (e.g., recurrent modules) or learned world models to mitigate partial observability beyond the action-chunk horizon?
Minimal communication trade-offs: What performance gains are achievable with tiny, bandwidth-limited signals (e.g., “ready,” “grasped,” or “release” bits) while preserving decentralization? Where is the Pareto frontier between no-communication and minimal-communication?
Strict synchronization tasks: Can a hybrid scheme combine decentralized CHORUS with occasional centralized synchronization for tasks requiring precise simultaneous actions (e.g., simultaneous gripper release)?
Scaling to larger teams: How do performance, latency tolerance, and failure rate scale beyond three robots, especially with highly heterogeneous sensors, morphologies, and control rates?
Dynamic team composition: How robust is the policy to robots joining/leaving mid-task, or to teammate failure/recovery? What mechanisms can enable handovers or role reallocation on the fly?
Generalization to unseen embodiments: To what extent can CHORUS control novel robot types zero/few-shot via only a new identity prompt, and what adaptation (if any) is required to handle new kinematics and sensor suites?
Role specification and adaptation: The paper uses a single, fixed role prompt per robot for the entire task; can roles be adapted online as the task progresses, or learned autonomously from demonstrations without explicit role text?
Prompt sensitivity: How sensitive is performance to the phrasing, length, or structure of the robot-identifying prompt, and can more structured identity/role tokens reduce brittleness?
Beyond vision-only inputs: What is the effect of incorporating proprioception, force/tactile sensing, or depth/3D perception on coordination, especially for force-critical or contact-rich cooperative tasks?
Latency robustness characterization: What are the quantitative tolerance bounds to inter-robot latency, network jitter, and compute delays before coordination degrades? Can adaptive chunking or resynchronization mitigate larger delays?
Control-rate heterogeneity: The paper scales chunk sizes to align horizons; how do different horizon lengths and control-rate ratios affect stability, reactivity, and error accumulation?
Learning decentralized strategies without curated views: Can robots learn collaboration strategies that are not engineered to keep teammates in view (e.g., using anticipation, memory, or environment cues) and still succeed?
Centralized baseline design: Does a stronger centralized architecture (e.g., token-based multi-view fusion, cross-attention across agents, pretraining with multi-robot semantics) close or reverse the performance gap?
Training efficiency vs. negative transfer: When does weight sharing across embodiments help versus hurt (negative transfer), and how can curriculum, adapter designs, or mixture-of-experts reduce interference across robots?
Data scale and diversity: What are the data requirements (demo counts, environment diversity, distractors) to sustain performance as tasks become more complex or as teams grow, and how to efficiently collect multi-robot data at scale?
Simulation-to-real for collaboration: Can large-scale simulated multi-robot data (with domain randomization) pretrain collaborative skills that transfer to real robots, reducing costly multi-operator teleoperation?
Robustness to disturbances: Beyond the handover lateral perturbation, how does CHORUS respond to broader perturbations (object slippage, unexpected human interference, moving obstacles) and can robust training improve recovery?
Failure detection and contingency planning: How can agents detect teammate failure/mis-grasp and trigger recovery behaviors or alternative plans under decentralized execution?
Safety and collision avoidance: What guarantees or runtime checks can ensure safe separation and compliant interactions among multiple robots when all agents act from local observations?
Long-horizon task decomposition: The approach uses a single prompt per robot; how to integrate high-level task decomposition, subgoal tracking, and role-switching over long horizons without centralized planners?
Evaluation breadth and metrics: The tasks are limited and success rate–centric; standardized benchmarks, ablations on occlusion/latency/FOV overlap, and metrics for coordination quality, safety, and resource usage would strengthen conclusions.
Resources and deployment constraints: What compute, memory, and energy budgets are required to run a 3B-class VLA per robot at 15–30 Hz, and how do model size, quantization, and batching affect latency and success?
Asynchronous logging and clock drift: Training assumes synchronous logging; how robust is learning and execution to timestamp noise, missing frames, and clock drift common in distributed robot systems?
Teammate modeling and interpretability: Does the shared backbone internally model teammate actions or roles? Can we interpret or probe the representations to understand how reactivity emerges?
Human–robot collaboration: Can CHORUS collaborate with humans (as teammates) under the same decentralized assumptions, and what changes to prompts, data collection, or safety layers are needed?
Domain shift and distractors: While diffusion baselines struggled with distractors, the paper does not quantify CHORUS’s robustness under heavy clutter or visually similar non-targets; systematic stress tests are needed.
Recovery from perception failures: How does the system handle camera dropouts, blur, lighting changes, or sensor mismatches across robots, and can redundancy or sensor fusion improve resilience?


Practical Applications

Practical Applications of CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

Below are actionable, real-world applications that flow from the paper’s findings and innovations. Each item indicates sectors, concrete use cases, enabling tools/workflows, and feasibility assumptions. Applications are grouped by deployment readiness.

Immediate Applications

Collaborative pick-up, carry, and place for light payloads — Robotics, Warehousing, Facilities
- Use case: Two mobile manipulators jointly lift laundry baskets, totes, or bins and carry them through doorways or to staging areas (mirrors the paper’s basket-lift and door-navigation tasks).
- Tools/products/workflows: “Team Controller” using a single VLA policy plus robot-identity prompts; ROS 2 node for decentralized execution; LoRA fine-tuning service on in-house demos; prompt library for roles (e.g., “left lifter,” “right lifter”).
- Assumptions/dependencies: Each robot has local cameras that keep the teammate and payload visible; tasks tolerate minor asynchrony; adapters map the 32-dim padded actions to each robot’s controller; small set of teleoperated demos available.
Item handover and relay between heterogeneous robots — Robotics, Logistics, Healthcare (intralogistics), Libraries
- Use case: Passing tools, documents, medications in sealed containers, or books from one robot to another, or from stationary to mobile arms (paper’s book-handover).
- Tools/products/workflows: Decentralized “handover skill” prompts per embodiment; on-device execution to reduce reliance on network; role-specific identity prompts (giver/receiver).
- Assumptions/dependencies: Visual observability of the item and partner; handover windows can be reactive, not precise to a control tick.
Reactive “hold-and-assist” operations — Construction, Facilities, Retail Fit-Out
- Use case: One robot holds/positions objects (e.g., tape measure box, panel edge) while another manipulates or measures (paper’s tape measurement).
- Tools/products/workflows: Predefined prompt templates for “holder” and “actor”; LoRA-adapted VLA policy reused across multiple embodiments and tasks; minimal setup without shared cameras.
- Assumptions/dependencies: Reliable wrist or top camera keeping teammate and workpiece in frame; minor role timing offsets are acceptable.
Cross-vendor fleet coordination without runtime communication — Robotics Integrators, Enterprises with mixed fleets
- Use case: Teams composed of different robot brands cooperate using the same shared backbone, deployed independently on each device; avoids cross-vendor comms protocols.
- Tools/products/workflows: Identity-prompt registry mapping vendor/arm type to roles; per-robot runtime container hosting the policy; centralized CI/CD for LoRA adapters; per-embodiment calibration to map action padding to controllers.
- Assumptions/dependencies: Reasonable time sync and latency; sufficient visual overlap so teammates infer states from local views.
Cost- and parameter-efficient multi-robot policy training — Academia, Startups
- Use case: Train one set of weights for multiple robots instead of per-robot policies; maintain constant context length as team size grows (improves scalability and reduces cost).
- Tools/products/workflows: Batch robot-sampler with weighting for heterogeneous control rates; LoRA-based finetuning on pooled multi-robot tuples; evaluation scripts for teammate reactivity.
- Assumptions/dependencies: Access to a pretrained VLA (e.g., a π0.5-like backbone) and small multi-robot demo datasets; adherence to the cross-embodiment action format.
Privacy-preserving, decentralized operation in sensitive environments — Healthcare, Corporate, Defense Facilities
- Use case: Deploy collaborative behaviors without sharing video or proprioception between robots at runtime; reduces compliance and privacy risks.
- Tools/products/workflows: On-prem/on-device inference; audit-ready prompt catalog; per-robot logging without cross-robot data exchange.
- Assumptions/dependencies: Policies must rely solely on each robot’s own sensors; safety and oversight procedures in place.
Teaching labs and coursework in collaborative robotics — Education, Academia
- Use case: Student labs demonstrate multi-robot collaboration using identity prompts and pooled demos across heterogeneous platforms; focus on decentralized control and reactivity.
- Tools/products/workflows: Curriculum modules on collecting multi-robot demonstrations, prompt design, and asynchronous execution; open-source ROS 2 integration templates; grading rubrics for success/reactivity metrics.
- Assumptions/dependencies: Access to two+ mobile manipulators and a dual-teleop interface; basic ML infra for LoRA finetuning.
Benchmarking and evaluation of multi-robot reactivity — Academia, R&D Groups
- Use case: Standardized experiments to quantify teammate reactivity under perturbations (e.g., lateral displacement in handover), comparing shared vs per-robot weights.
- Tools/products/workflows: Perturbation scripts; reactivity metrics and logging; ablations for weight-sharing and centralization.
- Assumptions/dependencies: Reproducible scenes and sensors; comparable backbone initializations across baselines.

Long-Term Applications

Construction crews of robots for cooperative assembly and transport — Construction, Industrial Services
- Use case: Multi-robot teams carry beams, hold drywall sheets, align fixtures, and navigate complex sites.
- Tools/products/workflows: Role-conditioned prompt sets per task phase; integration with building information models (BIM) and site localization; mixed-rate, larger teams using the same policy.
- Assumptions/dependencies: Larger collaborative datasets in construction contexts; stronger perception under occlusion and dust; optional hybrid centralized-decentralized control for strict synchronization steps.
Household chore teams and assisted living support — Consumer Robotics, Eldercare
- Use case: Two or more home robots collaboratively make beds, move furniture, manage laundry, pass objects to people.
- Tools/products/workflows: Consumer-facing “skill store” with role prompts; automatic scene-adaptive prompting; edge inference on low-power devices.
- Assumptions/dependencies: Robustness to clutter/occlusion; safety certification and human-in-the-loop overrides; diverse in-home collaborative training data.
Hospital logistics teams with privacy-by-design coordination — Healthcare
- Use case: Teams that shuttle medications, linens, and devices via handovers and joint transport, operating across wards without streaming inter-robot feeds.
- Tools/products/workflows: IT-approved decentralized runtimes; standardized identity-prompt taxonomy (e.g., “handover nurse-station bot,” “corridor lifter”); integration with hospital scheduling.
- Assumptions/dependencies: Regulatory alignment; sterile and safety protocols; proven reliability in crowded, dynamic hallways.
Multi-vendor warehouse and retail fulfillment — Warehousing, Retail
- Use case: Mixed fleets perform bin transfers, tote handovers, co-carry long items, and doorway/aisle negotiation without shared comms.
- Tools/products/workflows: Vendor-neutral “Team Controller SDK” for ROS 2; prompt/version governance across vendors; telemetry for task-level SLAs.
- Assumptions/dependencies: Visual line-of-sight among collaborators; domain-specific fine-tuning for lighting, shelving, and aisle geometry.
Disaster response teams (ground + aerial) for cooperative manipulation — Public Safety, Emergency Management
- Use case: Aerial drones hold light fixtures/lines while UGVs cut or fasten; robots cooperatively move debris or pass supplies through apertures.
- Tools/products/workflows: Cross-modality identity prompts (UGV/UAV); ruggedized sensing; contingency prompts for degraded comms.
- Assumptions/dependencies: Expanded pretraining to include outdoor/adverse conditions; safety envelopes for multi-robot proximity; partial observability under heavy occlusion.
Agricultural co-manipulation and transfer — Agriculture
- Use case: Robots hand off crates, co-carry harvest bins, or support vine training by holding and placing trellises.
- Tools/products/workflows: Farm-specific prompt sets; seasonal finetunes; GPS/RTK-aware role conditioning for large plots.
- Assumptions/dependencies: Robust visual perception in bright, variable outdoor scenes; terrain-aware control adapters.
Integration with high-level role assignment and planning — Software, Robotics
- Use case: LLM-based task decomposition assigns roles (who holds, who grasps, who opens door), while CHORUS handles low-level decentralized execution.
- Tools/products/workflows: Planner-controller API bridge; prompt auto-generation from plans; monitoring that adapts prompts mid-task.
- Assumptions/dependencies: Reliable interfaces between planners and VLA controllers; guardrails for plan-controller mismatch.
Standardization of robot identity prompts and cross-embodiment action interfaces — Policy, Standards, Robotics
- Use case: Industry-wide schemas for identity prompts, action padding, and control adapters to ensure interoperability of decentralized collaboration.
- Tools/products/workflows: Standards bodies define prompt/adapter specifications; compliance test suites; certification programs.
- Assumptions/dependencies: Multi-stakeholder agreement; evidence from large-scale deployments; open reference implementations.
Hybrid control for strict synchrony steps — Advanced Manufacturing, Robotics
- Use case: Tasks needing exact simultaneous actions (e.g., dual gripper release) combine mostly decentralized execution with momentary centralized synchronization.
- Tools/products/workflows: “Sync points” in prompts; fallback centralized micro-controllers; verification of timing constraints.
- Assumptions/dependencies: Clear identification of steps requiring hard synchrony; reliable clocking and low-latency links.
Large-scale collaborative datasets and simulation-to-real pipelines — Academia, Tooling Vendors
- Use case: Public multi-robot datasets spanning diverse embodiments and tasks; sim-first data generation with real-world fine-tunes to improve generalization.
- Tools/products/workflows: Dual/multi-teleop data capture suites; dataset curation tools for multi-robot tuples; scalable LoRA training services.
- Assumptions/dependencies: Community investment in data collection; standardized logging across teams; sim environments with realistic multi-robot visibility constraints.

Notes on Feasibility and Dependencies

Visual observability is critical: demonstrations and deployments must ensure each robot’s cameras capture both the teammate and the workspace throughout the interaction.
Tasks must tolerate minor asynchrony: CHORUS absorbs small latency/control-rate differences but is not designed for actions requiring exact simultaneous control steps.
Pretrained VLA backbone availability and adaptation: success relies on strong visuomotor priors and efficient LoRA fine-tuning; cross-embodiment action adapters are required.
Safety, compliance, and reliability: especially in healthcare and public spaces, deployments need safety envelopes, fail-safes, and human override mechanisms.
Data requirements: while CHORUS reduces training burden via weight sharing, multi-robot collaborative demonstrations remain necessary; broader public datasets would accelerate adoption.
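The asynchrony note relates to the paper's statement that each robot's chunk size is scaled proportionally to its control rate. A minimal sketch of that arithmetic, assuming every robot's chunk should cover the same wall-clock duration. The function name, the 1-second duration, and the 10 Hz and 30 Hz rates are illustrative assumptions, not values from the paper.

```python
def chunk_size_for(control_rate_hz: float, chunk_duration_s: float) -> int:
    """Actions per chunk so that a chunk spans `chunk_duration_s` seconds.

    Chunk size grows linearly with the control rate, so robots running at
    different rates still predict over the same time window.
    """
    if control_rate_hz <= 0 or chunk_duration_s <= 0:
        raise ValueError("control rate and chunk duration must be positive")
    return max(1, round(control_rate_hz * chunk_duration_s))


# Hypothetical two-robot team with different control rates.
rates_hz = {"robot_a": 10.0, "robot_b": 30.0}
chunk_sizes = {name: chunk_size_for(hz, 1.0) for name, hz in rates_hz.items()}
```

Because each robot only needs its own rate, this computation requires no shared clock, which is consistent with the decentralized setup. Steps that need exact simultaneity are still out of scope.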


Glossary

AdamW: A variant of the Adam optimizer that applies weight decay separately from the gradient-based update. "We optimize with AdamW [60] under a cosine learning rate schedule."
action chunk: A contiguous sequence of future actions predicted or executed over a fixed horizon. "$A_t^r = (a_t^r, \ldots, a_{t+H-1}^r)$ is robot $r$'s action chunk of horizon $H$"
action space: The set of all possible actions a policy can output for control. "the context window and action space grow with team size,"
asynchronous execution: Running agents or control loops without step-wise synchronization across robots. "Local conditioning in Eq. 3 supports asynchronous execution, which we use in our evaluations:"
bimanual manipulation: Coordinated control of two arms (often on one robot) to achieve manipulation tasks. "Much of this data is bimanual, and bimanual manipulation can be viewed as a simplified form of multi-robot collaboration,"
behavior cloning: Learning a policy by supervised imitation of expert demonstrations. "behavior cloning performance can degrade as the input dimension grows."
centralized policy: A single controller that consumes joint observations from all robots and outputs actions for the entire team. "train a single centralized policy that conditions on team-wide observations and produces actions for all robots in a single forward pass"
chunk size: The number of actions included in each predicted action chunk for a robot. "we scale each robot's chunk size proportionally to its control rate;"
context length: The total number of tokens or inputs in the model’s conditioning window. "VLA Centralized scales linearly in context length,"
context window: The fixed-size input window the model conditions on during inference. "CHORUS keeps parameters & context window length constant in team size."
control frequency: The rate (in Hz) at which a robot’s controller outputs actions. "robots can run independently at different control frequencies,"
control rate: The frequency at which a robot’s control loop issues actions; synchronous (centralized) execution needs one rate shared by every robot. "This requires a shared control rate across the team;"
cosine learning rate schedule: A training schedule where the learning rate follows a cosine curve over time. "We optimize with AdamW [60] under a cosine learning rate schedule."
cross-embodiment format: A policy input/output representation that supports multiple robot morphologies uniformly. "We inherit the backbone's cross-embodiment format: padded action vectors of dimension 32 and a variable number of image tokens per observation [57]."
decentralized diffusion: A from-scratch imitation learning approach that trains separate diffusion policies per robot without centralized inputs. "decentralized diffusion, which trains a separate diffusion policy per robot"
decentralized execution: Each robot runs its own controller using only local observations, without runtime communication. "CHORUS targets decentralized execution because it requires no inter-robot communication at runtime"
diffusion policy: A policy parameterized via a diffusion model that generates actions by denoising from noise. "decentralized diffusion, which trains a separate diffusion policy per robot"
distribution shift: A mismatch between training and deployment data distributions that can degrade performance. "distribution shift from pretraining"
flow-matching loss: A training objective that matches model-predicted velocities to the target flow in a noised data space. "We optimize the flow-matching loss inherited from the backbone [57] over the pooled single-robot dataset D:"
horizon H: The number of future timesteps over which actions are predicted or planned. "action chunk of horizon H"
image tokens: Embedding vectors, typically one per image patch, that represent camera observations in the model’s input sequence. "a variable number of image tokens per observation [57]."
imitation learning: Learning to act by mimicking expert demonstrations rather than optimizing a reward. "from-scratch imitation learning approach [9]"
joint distribution: A probability distribution over combined variables, here over all robots’ actions conditioned on joint observations. "centralized formulations model the joint distribution $\pi(A_t^1, \ldots, A_t^N \mid o_t^1, \ldots, o_t^N)$,"
latent theory-of-mind module: A component that infers teammates’ intentions as latent variables to aid coordination. "uses a latent theory-of-mind module involving an online alignment procedure"
LoRA adapters: Low-Rank Adaptation modules that fine-tune large models efficiently by adding trainable low-rank matrices. "We fine-tune the backbone with LoRA adapters [59] of rank 16 and 32"
multi-agent RL (MARL): Reinforcement learning methods where multiple agents learn policies, often with shared training structures. "Multi-agent RL (MARL) methods often share critics or mix networks during training while executing on local observations"
multi-embodiment collaboration: Coordination among robots with differing morphologies, sensors, and action spaces. "a single VLA policy trained for decentralized, multi-embodiment collaboration."
padded action vectors: Fixed-length action representations with padding to accommodate different robot action dimensions. "padded action vectors of dimension 32"
partial observability: Each agent has limited access to the full state, receiving only local observations. "A key tradeoff of decentralization is partial observability:"
proprioceptive state: Internal robot measurements (e.g., joint positions, velocities) describing its body configuration. "such as conditioning on proprioceptive state from teammates [5],"
robot-identifying prompt: A textual prefix specifying which robot the shared policy should control and its role. "We supply this information through a robot-identifying prompt $c^r$"
robot sampler: A data-loading component that composes each training batch from single-robot tuples drawn independently from the pooled dataset. "The robot sampler composes each training batch from single-robot tuples $(o_t^r, A_t^r, c^r)$ drawn independently from D,"
teleoperation interface: A human-in-the-loop control system for collecting demonstrations via remote operation. "via the TidyBot++ teleoperation interface [58]."
Vision-Language-Action (VLA) model: A multimodal model that conditions on visual inputs and language to output actions. "vision-language-action (VLA) models [7, 8]"
visuomotor priors: Learned assumptions linking visual inputs to motor actions that guide reactive behavior. "Our key insight is that strong visuomotor priors may be sufficient to enable decentralized, multi-embodiment collaboration"
weight sharing: Training a single set of parameters used by multiple robots rather than separate per-robot models. "CHORUS (w/o Weight-Sharing) ablates weight sharing (WS) by training a separate policy per robot"
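The "padded action vectors" entry can be made concrete with a small sketch. It assumes zero-padding at the end of the vector and a known native action dimension per robot. Only the dimension of 32 comes from the quoted text; the padding value, the function names, and the example action are assumptions.

```python
PADDED_ACTION_DIM = 32  # dimension quoted in the glossary entry above


def pad_action(action: list[float], dim: int = PADDED_ACTION_DIM) -> list[float]:
    """Right-pad a robot's native action with zeros to a fixed length.

    Zero-padding is an assumption; the backbone may use another convention.
    """
    if len(action) > dim:
        raise ValueError(f"action has {len(action)} dims, more than {dim}")
    return list(action) + [0.0] * (dim - len(action))


def unpad_action(padded: list[float], native_dim: int) -> list[float]:
    """Recover the robot's native action by dropping the padded tail."""
    if not 0 <= native_dim <= len(padded):
        raise ValueError("native_dim out of range")
    return list(padded[:native_dim])


# Hypothetical 3-DoF action for a small robot.
native = [0.1, -0.2, 0.3]
padded = pad_action(native)
restored = unpad_action(padded, len(native))
```

A fixed output length lets one set of weights serve robots with different action dimensions; each robot's control adapter then keeps only the entries it can execute.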

