BEHAVIOR Robot Suite (BRS)

Updated 5 February 2026
  • BRS is an integrated open-source framework for real-world whole-body household manipulation, featuring bimanual coordination, omnidirectional navigation, and extensive reach.
  • It employs the JoyLo teleoperation system to capture high-fidelity demonstrations, enabling scalable policy learning via a diffusion–transformer visuomotor model.
  • Empirical evaluations show that WB-VIMA achieves superior success rates and precise manipulation across multi-stage tasks compared to prior visuomotor baselines.

The BEHAVIOR Robot Suite (BRS) is an open-source, comprehensive framework designed to address the challenges of real-world whole-body manipulation in everyday household tasks. It integrates a customized bimanual mobile manipulator, a scalable whole-body teleoperation interface, and a structured diffusion–transformer visuomotor policy learning algorithm. BRS systematically advances three core capabilities essential for household manipulation: bimanual coordination, stable and precise omnidirectional navigation, and extensive whole-body reachability—enabling manipulation tasks such as opening doors, loading dishwashers, and operating in confined or cluttered spaces (Jiang et al., 7 Mar 2025).

1. Robotic Platform: System Architecture and Hardware Design

BRS is embodied in the Galaxea R1 mobile manipulator, whose hardware explicitly prioritizes whole-body control and reach.

  • Arms: Two 6-DoF manipulator arms (923 mm span, 128 mm width) each terminate in a parallel-jaw gripper with a 5 kg payload. The arms are actuated via joint-level impedance controllers with diagonal stiffness and damping gains K_p = [140, 200, 120, 20, 20, 20] and K_d = [10, 50, 5, 1, 1, 0.4].
  • Torso: Four revolute joints provide waist yaw (±3.05 rad), hip pitch ([–2.09, 1.83] rad), and two “knee-like” joints ([–2.79, 2.53] rad; [–1.13, 1.83] rad). In its upright configuration, the torso offers 1223 mm of vertical reach; fully folded, it enables squatting postures for ground-level manipulation.
  • Base: The omnidirectional base integrates three wheels with three steering motors, with velocity bounds v_x, v_y ∈ [–1.5, 1.5] m/s and ω_z ∈ [–3, 3] rad/s, and acceleration bounds a_x ∈ [–2.5, 2.5] m/s² and a_y, a_ω ∈ [–1, 1] in their respective units. Platform ground clearance is 30 mm.
  • Workspace and Reach: The full kinematic chain produces an end-effector workspace from floor level up to 2.0 m vertically and 2.06 m horizontally, sufficient for real-world household object height distributions.
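The base's velocity bounds lend themselves to a simple saturation step before commands reach the low-level controller. The sketch below is illustrative only; the function and variable names are not from the BRS codebase:

```python
def clamp(value, lo, hi):
    """Saturate a scalar to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))

def limit_base_twist(vx, vy, wz):
    """Clamp a commanded planar twist to the platform bounds:
    v_x, v_y in [-1.5, 1.5] m/s and omega_z in [-3, 3] rad/s."""
    return (clamp(vx, -1.5, 1.5),
            clamp(vy, -1.5, 1.5),
            clamp(wz, -3.0, 3.0))
```

A full implementation would additionally rate-limit successive commands to respect the acceleration bounds.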

Forward kinematic mapping employs the Denavit–Hartenberg convention with

T_i^{i-1} = \mathrm{Rot}_z(\theta_i)\,\mathrm{Trans}_z(d_i)\,\mathrm{Trans}_x(a_i)\,\mathrm{Rot}_x(\alpha_i),

cascaded as T_0^6 = \prod_{i=1}^{6} T_i^{i-1}(\theta_i, d_i, a_i, \alpha_i).
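A minimal sketch of this DH cascade follows; the parameter values passed in are placeholders, not the R1's actual link table:

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Single Denavit-Hartenberg link transform:
    Rot_z(theta) @ Trans_z(d) @ Trans_x(a) @ Rot_x(alpha)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(dh_params):
    """Cascade T_0^6 = prod_i T_i^{i-1} over the arm joints.
    dh_params: iterable of (theta, d, a, alpha), one tuple per joint."""
    T = np.eye(4)
    for theta, d, a, alpha in dh_params:
        T = T @ dh_transform(theta, d, a, alpha)
    return T
```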

2. Whole-Body Teleoperation and Demonstration Collection (JoyLo)

Task demonstration and data collection are achieved via JoyLo, a low-cost, puppet-style teleoperation interface designed for scalable, whole-body demonstration.

  • Leader–Follower Kinematic Mapping: JoyLo consists of two 3D-printed leader arms (Dynamixel XL330 servos), each topped with a Nintendo Joy-Con. The leader arms are kinematically coupled to the robot’s arms via joint-level bilateral teleoperation: \tau = K_p (q_{robot} - q_{JoyLo}) + K_d (\dot{q}_{robot} - \dot{q}_{JoyLo}) - K_{dmp}, where \tau is the applied torque and q_{robot}, q_{JoyLo} are the corresponding joint angles.
  • Base and Torso Control: The left Joy-Con’s thumbstick issues base planar velocity commands; the right Joy-Con controls torso yaw/hip velocities. Arrow keys mediate fast posture switches, and triggers actuate the grippers.
  • Data Fidelity: Demonstrations stream all sensory and actuation records (RGB-D images, point clouds, joint states, odometry, actions) at 10 Hz, with control at 100 Hz. The interface design prevents infeasible or singular robot states, ensuring high replay success for collected trajectories.

This teleoperation system is both cost-effective and robust, supporting high-quality, whole-body demonstrations used in policy learning.
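The bilateral coupling law above can be sketched as follows. The gain vectors reuse the impedance values quoted earlier, while the damping term and array shapes are illustrative assumptions:

```python
import numpy as np

# Diagonal per-joint gains from the impedance controller quoted above.
KP = np.array([140.0, 200.0, 120.0, 20.0, 20.0, 20.0])
KD = np.array([10.0, 50.0, 5.0, 1.0, 1.0, 0.4])

def bilateral_torque(q_robot, q_joylo, dq_robot, dq_joylo, tau_damp):
    """Joint-level bilateral coupling:
    tau = Kp (q_robot - q_joylo) + Kd (dq_robot - dq_joylo) - tau_damp.
    All arguments are length-6 arrays (one entry per arm joint)."""
    return KP * (q_robot - q_joylo) + KD * (dq_robot - dq_joylo) - tau_damp
```

The actual firmware-side sign conventions and damping model may differ; this only mirrors the formula as stated.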

3. Visuomotor Policy Learning: The WB-VIMA Algorithm

BRS’s core visuomotor policy module is WB-VIMA (Whole-Body VisuoMotor Attention), which uses a diffusion–transformer architecture tailored for high-DoF, hierarchical whole-body action modeling. The framework accommodates 21 degrees of freedom (3 base, 4 torso, 14 for arms/grippers) and is decomposed as follows:

  • MDP and Diffusion Modeling: Manipulation is formalized as an MDP (S, A, T, R). Expert action distributions p(a \mid s) are learned with a denoising diffusion probabilistic model (DDPM), whose forward step is

a^k = \sqrt{1 - \beta_k}\, a^{k-1} + \sqrt{\beta_k}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),

with the reverse process trained to predict \epsilon.

  • Multi-Modal Observation Encoding:
    • Visual: Ego-centric colored point clouds P \in \mathbb{R}^{N \times 6} (XYZ + RGB) are encoded by a PointNet into tokens E^{pcd}.
    • Proprioceptive: Joint states and gripper widths are encoded via an MLP into tokens E^{prop}.
  • Temporal and Hierarchical Action Modeling: Sequential tokens S = [E^{pcd}_{t-T_o+1}, E^{prop}_{t-T_o+1}, \ldots, E^a_t] feed a two-layer transformer decoder (embedding size 256, eight attention heads). Causal masking ensures autoregressive action prediction.
  • Autoregressive Denoising Hierarchy: Action prediction employs three UNet-based denoisers for the base (\epsilon_{base}), torso (\epsilon_{torso}), and arms (\epsilon_{arms}). Conditioning follows the kinematic chain:
    • Denoise the base action a_{base}^{k-1} from the noised a_{base}^k.
    • Denoise the torso action conditioned on a_{base}^0.
    • Denoise the arm actions conditioned on a_{base}^0 and a_{torso}^0.
    • This ordering mitigates error propagation along the kinematic chain.
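The base→torso→arms conditioning order can be illustrated with stubbed denoisers. The real denoisers are trained UNets running a full reverse-diffusion loop; the callables and shapes below are placeholders for illustration:

```python
import numpy as np

def hierarchical_sample(denoise_base, denoise_torso, denoise_arms, rng):
    """Sample whole-body actions in kinematic order: base -> torso -> arms.
    Each denoise_* maps (noise, cond) -> a clean action; downstream parts
    condition on the already-denoised upstream actions."""
    a_base = denoise_base(rng.standard_normal(3), cond=None)
    a_torso = denoise_torso(rng.standard_normal(4), cond=(a_base,))
    a_arms = denoise_arms(rng.standard_normal(14), cond=(a_base, a_torso))
    return np.concatenate([a_base, a_torso, a_arms])  # 21-DoF action

def _stub(noise, cond):
    """Placeholder for a trained UNet denoiser."""
    return np.zeros_like(noise)

action = hierarchical_sample(_stub, _stub, _stub, np.random.default_rng(0))
```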

WB-VIMA is trained on approximately 560 teleoperated demonstrations (sampled at 10 Hz) across five tasks, using the AdamW optimizer (lr = 7\times10^{-4}, weight decay 0.1), with 100 diffusion steps during training and 16 steps at inference. Inference latency is approximately 20 ms on an RTX 4090.
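For intuition, the forward (noising) step quoted earlier can be sketched with a linear β schedule over 100 steps, matching the training step count; the schedule endpoints themselves are common defaults, not values reported by the paper:

```python
import numpy as np

def forward_diffuse(a0, betas, k, rng):
    """Apply k forward DDPM steps, iterating
    a^k = sqrt(1 - beta_k) a^{k-1} + sqrt(beta_k) eps,  eps ~ N(0, I),
    starting from a clean action vector a0."""
    a = np.asarray(a0, dtype=float)
    for i in range(k):
        eps = rng.standard_normal(a.shape)
        a = np.sqrt(1.0 - betas[i]) * a + np.sqrt(betas[i]) * eps
    return a

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)  # assumed linear schedule, 100 steps
noisy = forward_diffuse(np.zeros(21), betas, k=100, rng=rng)  # 21-DoF action
```

After all 100 steps the action is close to an isotropic Gaussian sample, which is what the reverse process learns to invert.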

4. Empirical Evaluation on Household Manipulation Tasks

BRS is validated on five long-horizon, multi-stage household tasks in real, unmodified environments. Each task emphasizes distinct whole-body capabilities and real-world complexities.

| Task | Key Capability | Stages | Duration (s) |
| --- | --- | --- | --- |
| Clean House After a Wild Party | Navigation | 6 | 210 |
| Clean the Toilet | Reachability | 6 | 120 |
| Take Trash Outside | Navigation | 4 | 130 |
| Put Items onto Shelves | Reachability | 2 | 60 |
| Lay Clothes Out | Bimanual | 4 | 120 |
  • Metrics: Sub-task and end-to-end (ET) success rates, safety violations (e.g., collision or motor overload).
  • Baselines: DP3 (3D point-cloud diffusion) and RGB-DP (RGB image diffusion). Human teleoperation via JoyLo serves as an upper bound.
  • Results: WB-VIMA achieves 58% ET success on average (peak 93%) versus 4% (DP3) and 3% (RGB-DP). Safety violations by WB-VIMA are near-zero and, in specific contact-rich subtasks (e.g., manipulating toilet covers or wardrobe doors), WB-VIMA surpasses human demonstrations.

Ablation studies show that removing the autoregressive denoising module reduces success rates by up to 53% on whole-body coordination tasks, while omitting multi-modal attention leads to proprioceptive overfitting and environmental collisions. Observed emergent behaviors include coordinated hip and “knee” bending, use of base inertia during contact-rich operations, and torso tilting for extended reach and error recovery.

5. Limitations and Future Directions

BRS policies are currently trained for a single robot embodiment; cross-embodiment policy transfer has yet to be explored. Scene generalization is also limited; the authors propose integrating large-scale pre-trained vision-language-action (VLA) models in future work. Leveraging synthetic demonstrations or human motion capture could further extend data scale and generalization.

A plausible implication is that further integration of generalized perception models and expanded demonstration domains could address current generalization limitations.

6. Key Contributions and Significance

BRS provides an open-source, end-to-end platform that brings together: (1) a cost-effective, kinematically-consistent whole-body teleoperation system (JoyLo) validated for data fidelity and ease of use, (2) WB-VIMA, a novel diffusion–transformer policy explicitly structured for the manipulation kinematic hierarchy and multi-modal sensor fusion, and (3) comprehensive, real-robot validation across multi-stage, long-horizon household activities with substantial empirical improvement over prior visuomotor baselines (Jiang et al., 7 Mar 2025).

This unified approach supports robust data collection, scalable policy learning for high-DoF mobile manipulators, and advances empirical benchmarks in real-world household robot assistance. All associated code, hardware designs, and pre-trained models are available at https://behavior-robot-suite.github.io/.
