BEHAVIOR Robot Suite (BRS)
- BRS is an integrated open-source framework for real-world whole-body household manipulation, featuring bimanual coordination, omnidirectional navigation, and extensive reach.
- It employs the JoyLo teleoperation system to capture high-fidelity demonstrations, enabling scalable policy learning via a diffusion–transformer visuomotor model.
- Empirical evaluations show that WB-VIMA, the suite's visuomotor policy, achieves superior success rates and precise manipulation across multi-stage tasks compared to prior visuomotor baselines.
The BEHAVIOR Robot Suite (BRS) is an open-source, comprehensive framework designed to address the challenges of real-world whole-body manipulation in everyday household tasks. It integrates a customized bimanual mobile manipulator, a scalable whole-body teleoperation interface, and a structured diffusion–transformer visuomotor policy learning algorithm. BRS systematically advances three core capabilities essential for household manipulation: bimanual coordination, stable and precise omnidirectional navigation, and extensive whole-body reachability—enabling manipulation tasks such as opening doors, loading dishwashers, and operating in confined or cluttered spaces (Jiang et al., 7 Mar 2025).
1. Robotic Platform: System Architecture and Hardware Design
BRS is embodied in the Galaxea R1 mobile manipulator, whose hardware explicitly prioritizes whole-body control and reach.
- Arms: Two 6-DoF manipulator arms (923 mm span, 128 mm width) each terminate in a parallel-jaw gripper with 5 kg payload. The arms are actuated by joint-level impedance controllers with diagonal stiffness and damping gain matrices.
- Torso: Four revolute joints provide yaw (waist, ±3.05 rad), hip pitch ([–2.09, 1.83] rad), and two “knee-like” joints ([–2.79, 2.53] rad and [–1.13, 1.83] rad). In its upright configuration, the torso offers 1223 mm vertical reach; fully folded, it enables squatting postures for ground-level manipulation.
- Base: The omnidirectional base integrates three wheels with three steering motors and enforces bounds on linear velocity (m/s), angular velocity (rad/s), and the corresponding linear (m/s²) and angular accelerations. Platform ground clearance is 30 mm.
- Workspace and Reach: The full kinematic chain produces an end-effector workspace from floor level up to 2.0 m vertically and 2.06 m horizontally, sufficient for real-world household object height distributions.
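The joint-level impedance control mentioned for the arms can be illustrated with a minimal sketch. It assumes the common PD-style law, torque = K_p (q_des - q) + K_d (dq_des - dq), with diagonal gains stored as per-joint lists. The function name and the gain values are hypothetical and are not taken from BRS.

```python
def impedance_torque(q_des, q, dq_des, dq, kp, kd):
    """Per-joint torque for diagonal stiffness (kp) and damping (kd) gains.

    Assumed control law (not confirmed by the source):
        tau_i = kp_i * (q_des_i - q_i) + kd_i * (dq_des_i - dq_i)
    """
    n = len(q)
    if any(len(v) != n for v in (q_des, dq_des, dq, kp, kd)):
        raise ValueError("all inputs must have one entry per joint")
    return [
        kp[i] * (q_des[i] - q[i]) + kd[i] * (dq_des[i] - dq[i])
        for i in range(n)
    ]


# Example: a 6-DoF arm at rest, 0.1 rad away from its target on joint 0.
# The gains are illustrative placeholders only.
KP = [50.0] * 6
KD = [2.0] * 6
tau = impedance_torque(
    q_des=[0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    q=[0.0] * 6,
    dq_des=[0.0] * 6,
    dq=[0.0] * 6,
    kp=KP,
    kd=KD,
)
```

Because the gains are diagonal, each joint behaves like an independent spring-damper, so only joint 0 receives a nonzero torque in this example.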
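Forward kinematic mapping employs the Denavit–Hartenberg convention, in which each joint $i$ contributes a homogeneous transform $T_i(\theta_i, d_i, a_i, \alpha_i)$ and the per-joint transforms are cascaded as $T = T_1 T_2 \cdots T_n$.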
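A minimal sketch of Denavit–Hartenberg forward kinematics follows. It assumes the classic (distal) DH parameter ordering (theta, d, a, alpha); the source does not state which DH variant is used, and the two-link parameters in the example are hypothetical, not the R1's.

```python
import math


def dh_transform(theta, d, a, alpha):
    """4x4 homogeneous transform for one joint in the classic DH convention."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def forward_kinematics(dh_rows):
    """Cascade per-joint transforms: T = T_1 T_2 ... T_n."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for theta, d, a, alpha in dh_rows:
        T = matmul4(T, dh_transform(theta, d, a, alpha))
    return T


# Example: planar two-link arm with unit link lengths (illustrative only).
T = forward_kinematics([
    (math.pi / 2, 0.0, 1.0, 0.0),
    (-math.pi / 2, 0.0, 1.0, 0.0),
])
x, y, z = T[0][3], T[1][3], T[2][3]
```

With the first link pointing along +y and the second rotated back to point along +x, the end effector of this toy arm sits at approximately (1, 1, 0).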
2. Whole-Body Teleoperation and Demonstration Collection (JoyLo)
Task demonstration and data collection are achieved via JoyLo, a low-cost, puppet-style teleoperation interface designed for scalable, whole-body demonstration.
- Leader-Follower Kinematic Mapping: JoyLo consists of two 3D-printed leader arms (driven by Dynamixel XL330 servos), each topped with a Nintendo Joy-Con. The leader arms are kinematically coupled to the robot’s arms using joint-level bilateral teleoperation, in which the torque applied to each leader joint is computed from the leader and robot joint angles.
- Base and Torso Control: The left Joy-Con’s thumbstick issues base planar velocity commands; the right Joy-Con controls torso yaw/hip velocities. Arrow keys mediate fast posture switches, and triggers actuate the grippers.
- Data Fidelity: Demonstrations stream all sensory and actuation records (RGB-D images, point clouds, joint states, odometry, actions) at 10 Hz, with control at 100 Hz. The interface design prevents infeasible or singular robot states, ensuring high replay success for collected trajectories.
This teleoperation system is both cost-effective and robust, supporting high-quality, whole-body demonstrations used in policy learning.
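The leader-follower mapping and thumbstick control described above can be sketched as follows. This is an illustration under stated assumptions: the proportional feedback law, the gain `kp`, the dead zone, and `max_speed` are all hypothetical, since the source does not give the exact feedback equation or numeric limits.

```python
def follower_targets(leader_q):
    """Joint-level mapping: the robot arm is commanded to the leader's joint angles."""
    return list(leader_q)


def leader_feedback_torque(leader_q, robot_q, kp=0.5):
    """Assumed proportional feedback pulling each leader joint toward the robot's angle."""
    return [kp * (r - l) for l, r in zip(leader_q, robot_q)]


def thumbstick_to_base_velocity(x, y, max_speed, deadzone=0.1):
    """Map a thumbstick reading in [-1, 1]^2 to a planar velocity command.

    Deflections smaller than the dead zone are treated as zero.
    """
    def scale(v):
        return 0.0 if abs(v) < deadzone else v * max_speed

    return scale(x), scale(y)


# Example: the robot lags the leader on joint 0 (for instance, during contact).
leader = [0.2, -0.1, 0.0, 0.3, 0.0, 0.0]
robot = [0.1, -0.1, 0.0, 0.3, 0.0, 0.0]
targets = follower_targets(leader)
feedback = leader_feedback_torque(leader, robot)
vx, vy = thumbstick_to_base_velocity(0.05, -1.0, max_speed=0.5)
```

In this sketch the operator feels a resisting torque only on the joint where the robot has not reached the commanded angle, which is the intuition behind bilateral coupling.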
3. Visuomotor Policy Learning: The WB-VIMA Algorithm
BRS’s core visuomotor policy module is WB-VIMA (Whole-Body VisuoMotor Attention), which uses a diffusion–transformer architecture tailored for high-DoF, hierarchical whole-body action modeling. The framework accommodates 21 degrees of freedom (3 base, 4 torso, 14 for arms/grippers) and is decomposed as follows:
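- MDP and Diffusion Modeling: Manipulation is formalized as a Markov decision process (MDP). Expert action distributions are learned with a denoising diffusion probabilistic model (DDPM), where the forward step, written in standard DDPM notation with noise schedule $\beta_k$, is $q(a^{k} \mid a^{k-1}) = \mathcal{N}\big(a^{k};\ \sqrt{1-\beta_k}\,a^{k-1},\ \beta_k I\big)$, with the reverse process predicting the noise to be removed at each step.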
- Multi-Modal Observation Encoding:
- Visual: Egocentric, colored point clouds (RGB+XYZ) are encoded by a PointNet into visual tokens.
- Proprioceptive: Joint states and gripper widths are encoded by an MLP into proprioceptive tokens.
- Temporal and Hierarchical Action Modeling: Sequential tokens feed a two-layer transformer decoder (embedding size 256, eight attention heads). Causal masking ensures autoregressive action prediction.
- Autoregressive Denoising Hierarchy: Action prediction employs three UNet-based denoisers, one each for the base, torso, and arms. Conditioning is structured as:
- Denoise the base action from its noised version.
- Denoise the torso action conditioned on the predicted base action.
- Denoise the arm actions conditioned on the predicted base and torso actions.
- This mitigates error propagation along the kinematic chain.
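The conditioning order in the list above can be sketched in a few lines. This is a structural illustration only: `toy_denoiser` is a hypothetical stand-in for a learned UNet (it ignores its context and simply shrinks the sample), and running each body part's full denoising loop before moving to the next is an assumption about scheduling, not a confirmed detail of WB-VIMA.

```python
import random

# 21-DoF split stated in the text: 3 base, 4 torso, 14 arms/grippers.
BASE_DIM, TORSO_DIM, ARMS_DIM = 3, 4, 14


def toy_denoiser(noisy, context, step):
    """Stand-in for a learned denoiser; a real model would use context and step."""
    return [0.5 * v for v in noisy]


def denoise(denoiser, dim, context, num_steps, rng):
    """Start from Gaussian noise and apply the denoiser num_steps times."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for step in reversed(range(num_steps)):
        sample = denoiser(sample, context, step)
    return sample


def predict_whole_body_action(obs_tokens, num_steps=16, seed=0):
    """Base first, then torso given base, then arms given base and torso."""
    rng = random.Random(seed)
    base = denoise(toy_denoiser, BASE_DIM, obs_tokens, num_steps, rng)
    torso = denoise(toy_denoiser, TORSO_DIM, obs_tokens + base, num_steps, rng)
    arms = denoise(toy_denoiser, ARMS_DIM, obs_tokens + base + torso, num_steps, rng)
    return base + torso + arms


# Example with a placeholder observation embedding of eight zeros.
action = predict_whole_body_action(obs_tokens=[0.0] * 8)
```

The point of the ordering is that downstream parts of the kinematic chain see the upstream prediction as context, so the arm denoiser can compensate for where the base and torso are about to move.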
WB-VIMA is trained on approximately 560 teleoperated demonstrations (10 Hz sampling) across five tasks, using the AdamW optimizer (weight decay 0.1), 100 diffusion steps for training, and 16 steps at inference. Inference latency is approximately 20 ms on an RTX 4090.
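Using fewer denoising steps at inference than in training requires choosing which training timesteps to visit. The sketch below shows one common choice, an evenly strided subset. Whether WB-VIMA uses exactly this schedule is an assumption; the function name is hypothetical.

```python
def strided_timesteps(num_train_steps, num_inference_steps):
    """Pick evenly spaced timesteps from the training schedule, highest noise first."""
    if not 0 < num_inference_steps <= num_train_steps:
        raise ValueError("need 0 < num_inference_steps <= num_train_steps")
    stride = num_train_steps / num_inference_steps
    steps = [int(i * stride) for i in range(num_inference_steps)]
    return steps[::-1]


# 100 training steps and 16 inference steps, as stated in the text.
timesteps = strided_timesteps(100, 16)
```

Visiting 16 timesteps instead of 100 cuts the number of denoiser calls per action roughly sixfold, which is the kind of saving that makes low-latency inference practical.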
4. Empirical Evaluation on Household Manipulation Tasks
BRS is validated on five long-horizon, multi-stage household tasks in real, unmodified environments. Each task emphasizes distinct whole-body capabilities and real-world complexities.
| Task | Key Capability | Stages | Duration (s) |
|---|---|---|---|
| Clean House After a Wild Party | Navigation | 6 | 210 |
| Clean the Toilet | Reachability | 6 | 120 |
| Take Trash Outside | Navigation | 4 | 130 |
| Put Items onto Shelves | Reachability | 2 | 60 |
| Lay Clothes Out | Bimanual | 4 | 120 |
- Metrics: Sub-task and end-to-end (ET) success rates, safety violations (e.g., collision or motor overload).
- Baselines: DP3 (3D point-cloud diffusion) and RGB-DP (RGB image diffusion). Human teleoperation via JoyLo serves as an upper bound.
- Results: WB-VIMA achieves 58% ET success on average (peak 93%) versus 4% (DP3) and 3% (RGB-DP). Safety violations by WB-VIMA are near-zero and, in specific contact-rich subtasks (e.g., manipulating toilet covers or wardrobe doors), WB-VIMA surpasses human demonstrations.
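The two success metrics can be computed from per-trial stage outcomes as in the sketch below. The trial data is made up purely for illustration and does not come from the BRS experiments. The sketch also assumes an unconditional per-stage rate (successes over all trials); the source does not specify whether sub-task rates are conditioned on reaching the stage.

```python
def success_rates(trials, num_stages):
    """Return (per-stage success rates, end-to-end success rate).

    trials: list of per-trial lists of booleans, one flag per stage in order.
    A trial counts as an end-to-end success only if every stage succeeded.
    """
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    if any(len(t) != num_stages for t in trials):
        raise ValueError("every trial must report all stages")
    stage_rates = [sum(t[s] for t in trials) / n for s in range(num_stages)]
    end_to_end = sum(all(t) for t in trials) / n
    return stage_rates, end_to_end


# Hypothetical outcomes for four trials of a four-stage task.
trials = [
    [True, True, True, True],
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, True],
]
stage_rates, e2e = success_rates(trials, num_stages=4)
```

End-to-end success can never exceed the lowest per-stage rate, which is why long-horizon tasks with many stages are demanding benchmarks.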
Ablation studies show that removing the autoregressive denoising module reduces success by up to 53% in whole-body coordination tasks. Omitting multi-modal attention leads to proprioceptive overfitting and environmental collisions. Observed emergent behaviors include coordinated hip and “knee” bending, use of base inertia for contact-rich operations, and torso tilting for extended reach and error recovery.
5. Limitations and Future Directions
BRS policies are currently trained for a single robot embodiment; cross-embodiment policy transfer has yet to be explored. Scene generalization is limited; integration of large-scale pre-trained vision-language models, such as vision-language-action (VLA) models, is proposed for future work. Leveraging synthetic demonstrations or human motion capture could further extend data scale and generalization.
A plausible implication is that further integration of generalized perception models and expanded demonstration domains could address current generalization limitations.
6. Key Contributions and Significance
BRS provides an open-source, end-to-end platform that brings together: (1) a cost-effective, kinematically-consistent whole-body teleoperation system (JoyLo) validated for data fidelity and ease of use, (2) WB-VIMA, a novel diffusion–transformer policy explicitly structured for the manipulation kinematic hierarchy and multi-modal sensor fusion, and (3) comprehensive, real-robot validation across multi-stage, long-horizon household activities with substantial empirical improvement over prior visuomotor baselines (Jiang et al., 7 Mar 2025).
This unified approach supports robust data collection, scalable policy learning for high-DoF mobile manipulators, and advances empirical benchmarks in real-world household robot assistance. All associated code, hardware designs, and pre-trained models are available at https://behavior-robot-suite.github.io/.