Unitree G1 Humanoid Robot

Updated 4 July 2026

Unitree G1 is a dexterous humanoid research platform that supports real-hardware studies in navigation, manipulation, energy modeling, and cybersecurity assessment.
Published studies pair it with sim-to-real calibration and diffusion-based control for indoor locomotion and contact-rich manipulation.
These evaluations expose open problems in embodied control and show how subsystem instrumentation supports energy modeling and security analysis.

Unitree G1 is a humanoid robot that appears in recent arXiv literature as a real-hardware platform for language-guided navigation, cross-embodiment manipulation, subsystem-level energy modeling, and cybersecurity assessment. Across these studies, the robot is treated less as a single benchmark artifact than as a convergence point for embodied control, multimodal policy adaptation, power-aware modeling, and physical-cyber risk analysis. The resulting picture is heterogeneous: the same platform is used to evaluate open-loop indoor navigation with diffusion-based control, contact-rich dexterous manipulation with modality-augmented fine-tuning, a physics-based electrical power model for a seven-degree-of-freedom arm, and a layered software and telemetry stack with nontrivial attack surface (Sam et al., 2 May 2026, Park et al., 1 Dec 2025, Deniz et al., 14 Jun 2026, Mayoral-Vilches et al., 17 Sep 2025).

1. Research roles of the platform

Recent work uses the Unitree G1 in four distinct but technically adjacent roles: as a real-hardware locomotion platform, as a target embodiment for policy transfer, as an experimental substrate for arm-level energy identification, and as a networked cyber-physical system subject to security analysis (Sam et al., 2 May 2026, Park et al., 1 Dec 2025, Deniz et al., 14 Jun 2026, Mayoral-Vilches et al., 17 Sep 2025).

Research area	Role of Unitree G1	Reported result
Language-guided navigation	Real-hardware evaluation platform for FlowDiT Stage II	64.7% task completion in 17 real-world trials
Cross-embodiment manipulation	Main testbed for transfer of GR00T N1.5-3B	94.0% success on “Pick Apple to Bowl” with contact-state fusion
Arm energy modeling	Physical platform for seven-DOF left-arm identification	$R^2 = 0.933$ on identification; $R^2 = 0.965$ on unseen-speed validation
Cybersecurity assessment	Security-mature but high-risk humanoid platform	Periodic telemetry every 300 seconds to named MQTT endpoints

This distribution of use suggests that the G1 is being treated as an intermediate-scale research platform: sufficiently capable to support dexterous manipulation and humanoid navigation, yet sufficiently accessible for subsystem instrumentation and reverse engineering. A plausible implication is that the platform’s research value derives from this combination of embodiment complexity and experimental tractability.

2. Language-guided navigation and open-loop velocity control

In "Action Agent" (Sam et al., 2 May 2026), the Unitree G1 is the real-hardware evaluation platform for Stage II of a two-stage navigation stack. Stage I takes an initial image and a language instruction and generates a validated first-person reference navigation video $V_{goal}$, which functions as a Visual Intermediate Representation. Stage II then converts that reference video, together with language, into continuous velocity commands

$(v_x, v_y, \omega),$

where $v_x$ and $v_y$ are robot-frame translational velocities and $\omega$ is yaw rate. The broader framework reports that agentic orchestration raises video-generation success from 35% to 86% across 50 navigation tasks (Sam et al., 2 May 2026).

For the G1, the control model is adapted to the humanoid’s actual velocity dynamics through a sim-to-real calibration step. FlowDiT is pretrained on the RECON outdoor navigation dataset, consisting of 11,830 outdoor navigation episodes on a Clearpath Jackal, and then fine-tuned on 203 Unitree G1 indoor simulation episodes collected in Isaac Sim. Those 203 episodes are split into 162 train and 41 validation episodes and are described as calibrating “the velocity output dynamics to match the G1 humanoid’s motion characteristics.” The indoor Isaac Sim environments are warehouse and hospital-corridor-like.

The action representation is receding-horizon in style: $a_{t:t+H-1}=\{(v_x,v_y,\omega)\}_{t:t+H-1}, \quad H=8.$ Only the first action is executed before replanning. The conditioning vector concatenates four modalities into a 2304-dimensional representation,

$c = [\text{goal}_{vision}^{768} \parallel \text{goal}_{flow}^{256} \parallel \text{obs}_{vision}^{768} \parallel \text{goal}_{lang}^{512}],$

combining DINOv2 goal-video features, learned optical-flow features, an optional live observation frame $I_t$, and a CLIP language embedding. The paper emphasizes that learned optical flow helps disambiguate “looming” depth ambiguities, while CLIP embeddings support semantic stopping.
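The concatenation and the receding-horizon rule can be illustrated with a minimal sketch. Only the dimensions (768, 256, 768, 512) and the horizon $H=8$ come from the text; the feature values are random stand-ins for encoder outputs, and the names `build_conditioning` and `first_action` are hypothetical.

```python
import random

# Dimensions as stated in the text; key order matches the concatenation order.
DIMS = {"goal_vision": 768, "goal_flow": 256, "obs_vision": 768, "goal_lang": 512}
HORIZON = 8  # H in the text


def build_conditioning(features):
    """Concatenate the four modality vectors into one conditioning vector."""
    cond = []
    for name, dim in DIMS.items():
        vec = features[name]
        if len(vec) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(vec)}")
        cond.extend(vec)
    return cond


def first_action(action_block):
    """Receding horizon: execute only the first (v_x, v_y, omega) of the block."""
    if len(action_block) != HORIZON:
        raise ValueError("action block must contain H steps")
    return action_block[0]


rng = random.Random(0)
# Random stand-ins for the DINOv2, optical-flow, and CLIP features.
features = {name: [rng.gauss(0.0, 1.0) for _ in range(dim)] for name, dim in DIMS.items()}
cond = build_conditioning(features)

# Hypothetical policy output: an H-step block of velocity commands.
block = [(0.1 * k, 0.0, 0.01 * k) for k in range(HORIZON)]
print(len(cond), first_action(block))
```

After the first command is executed, a new block would be predicted from the updated conditioning vector and the remaining seven steps discarded.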

FlowDiT models the conditional action distribution with action-space denoising diffusion. For a ground-truth action block $a_0$, the forward process produces a noised block at diffusion step $k$,

$a_k = \sqrt{\bar{\alpha}_k}\, a_0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,$

with $\epsilon \sim \mathcal{N}(0, I)$, and the denoising objective is

$\mathcal{L} = \mathbb{E}_{a_0, k, \epsilon}\left[ \lVert \epsilon - \epsilon_\theta(a_k, k, c) \rVert^2 \right].$

Inference uses DDIM sampling with 10 denoising steps. The reported training configuration includes normalized velocity actions, AdamW optimization, batch size 8 in FP16, a linear $\beta$ noise schedule, and about 43M trainable parameters.
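The noising and sampling steps can be sketched in the standard DDPM/DDIM form. This is an illustration under stated assumptions, not the paper's implementation: the number of training timesteps `T` and the schedule endpoints are placeholder values, and `oracle_eps` is a hypothetical stand-in for the learned denoiser that returns the exact noise, so the 10-step deterministic DDIM loop should recover the clean action.

```python
import math
import random

T = 100                              # assumed number of training timesteps
BETA_START, BETA_END = 1e-4, 2e-2    # assumed endpoints of the linear schedule
DDIM_STEPS = 10                      # from the text

betas = [BETA_START + (BETA_END - BETA_START) * k / (T - 1) for k in range(T)]
alpha_bar = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)           # cumulative product of (1 - beta)


def forward_noise(a0, k, eps):
    """Forward process: a_k = sqrt(ab_k) * a0 + sqrt(1 - ab_k) * eps."""
    ab = alpha_bar[k]
    return [math.sqrt(ab) * x + math.sqrt(1 - ab) * e for x, e in zip(a0, eps)]


def ddim_step(a_k, eps_hat, k, k_prev):
    """Deterministic DDIM update (eta = 0) from step k to step k_prev."""
    ab = alpha_bar[k]
    ab_prev = alpha_bar[k_prev] if k_prev >= 0 else 1.0
    a0_hat = [(x - math.sqrt(1 - ab) * e) / math.sqrt(ab) for x, e in zip(a_k, eps_hat)]
    return [math.sqrt(ab_prev) * x0 + math.sqrt(1 - ab_prev) * e
            for x0, e in zip(a0_hat, eps_hat)]


def oracle_eps(a_k, a0, k):
    """Stand-in for the learned denoiser: the exact noise implied by a known a0."""
    ab = alpha_bar[k]
    return [(x - math.sqrt(ab) * x0) / math.sqrt(1 - ab) for x, x0 in zip(a_k, a0)]


rng = random.Random(0)
a0 = [0.4, 0.0, -0.2]                # one normalized (v_x, v_y, omega) step
eps = [rng.gauss(0.0, 1.0) for _ in a0]
timesteps = [T - 1 - i * (T // DDIM_STEPS) for i in range(DDIM_STEPS)]  # 99, 89, ..., 9

a = forward_noise(a0, timesteps[0], eps)
for i, k in enumerate(timesteps):
    k_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
    a = ddim_step(a, oracle_eps(a, a0, k), k, k_prev)

print([round(x, 6) for x in a])
```

In a trained model the oracle would be replaced by a network evaluated on the noised block, the step index, and the conditioning vector.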

The real-world G1 evaluation uses a head-mounted RGB camera at about 1.2 m height in unseen indoor lab/office environments. Execution is explicitly open-loop: the system captures an initial observation $I_0$, generates a reference video, converts it to velocity commands, and executes without live RGB feedback during motion. In 17 trials, 11 are reported as successful, yielding 64.7% task completion. Success is defined as reaching the intended goal region or completing traversal without obstacle collision. The controller runs at 40–47 Hz, with roughly 20 ms per step on an RTX 5090.

The reported failure modes are video-to-metric scale ambiguity, trajectory divergence from accumulated heading error, and obstacle collision from drift. The authors argue that these failures are primarily consequences of open-loop deployment rather than topologically incorrect Stage I trajectories. This interpretation is consistent with the simulation ablation on the G1 validation split: vision-only DINOv2 yields 58.5% SR, 0.298 ATE, and 73.0% DA; removing flow gives 73.2% SR, 0.309 ATE, and 67.8% DA; removing language gives 65.9% SR, 0.280 ATE, and 80.6% DA; and full FlowDiT gives 73.4% SR, 0.293 ATE, and 76.1% DA. These results suggest that on the G1, semantic scene understanding, explicit motion cues, and language-conditioned termination are complementary rather than interchangeable (Sam et al., 2 May 2026).
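The heading-error failure mode can be made concrete with a small dead-reckoning sketch. It integrates robot-frame $(v_x, v_y, \omega)$ commands into a world-frame pose with and without a constant yaw-rate bias. The walking speed, duration, bias magnitude, and the one-command-per-tick assumption are all illustrative values, not measurements from the paper.

```python
import math


def integrate(commands, dt, yaw_rate_bias=0.0):
    """Integrate robot-frame (v_x, v_y, omega) commands into a world-frame pose."""
    x = y = theta = 0.0
    for vx, vy, omega in commands:
        # Rotate the robot-frame velocity into the world frame, then step forward.
        x += (vx * math.cos(theta) - vy * math.sin(theta)) * dt
        y += (vx * math.sin(theta) + vy * math.cos(theta)) * dt
        theta += (omega + yaw_rate_bias) * dt
    return x, y, theta


DT = 1.0 / 40.0                       # assumes one command per tick at 40 Hz
commands = [(0.5, 0.0, 0.0)] * 400    # hypothetical 10 s straight walk at 0.5 m/s

ideal = integrate(commands, DT)
# Assumed uncorrected heading bias of 1 degree per second.
drifted = integrate(commands, DT, yaw_rate_bias=math.radians(1.0))
lateral_error = abs(drifted[1] - ideal[1])
print(f"ideal end x: {ideal[0]:.2f} m, lateral drift: {lateral_error:.2f} m")
```

With no live feedback to correct the pose, even this small bias moves the endpoint sideways by a few tenths of a metre over a 5 m path, which is enough to miss a goal region or clip an obstacle.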

3. Manipulation, contact, and cross-embodiment transfer

In "Modality-Augmented Fine-Tuning of Foundation Robot Policies for Cross-Embodiment Manipulation on GR1 and G1" (Park et al., 1 Dec 2025), the Unitree G1 serves as the principal target embodiment for testing whether a pretrained foundation policy can transfer beyond its source morphology. The base model is GR00T N1.5-3B, and the benchmark task is "Pick Apple to Bowl." The paper frames the G1 as the more demanding transfer setting because the embodiment differs from GR1 in arm and hand structure: GR1 is described as pairing a 6-DoF Fourier hand with a 7-DoF arm, whereas the G1 pairs a 7-DoF dexterous hand with a 7-DoF arm. The G1 setting also includes ground-truth contact-force measurements.

Because no public G1 dataset exists, the authors construct a custom multi-modal dataset. Demonstrations are generated with cuRobo motion planning, which produces dynamically feasible Cartesian paths satisfying grasp and placement constraints, and then converted into G1 joint commands using analytical inverse kinematics. For each rollout, the recorded signals include joint positions, joint velocities, end-effector pose, and executed joint actions. The dataset also includes RGB-D observations, proprioception, cuRobo-generated reference trajectories, and fingertip and palm contact-force vectors.

The policy remains diffusion-based: a learned denoiser iteratively refines a noisy sample conditioned on the observation embedding, and the result is decoded to an action sequence.

For contact-state fusion, a binary contact variable $c_t \in \{0, 1\}$ is concatenated to the proprioceptive state $s_t$,

$\tilde{s}_t = [s_t \parallel c_t],$

and the denoiser conditions on the embedding of this contact-aware state. A separate contact-modality pathway, in which contact signals pass through their own encoder, is also defined.
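Early fusion of contact into the state can be sketched as follows. The text only says that a binary contact variable is concatenated to proprioception. How that bit is derived from the fingertip and palm force vectors is an assumption here (a simple magnitude threshold), and the function names and the 0.5 N threshold are hypothetical.

```python
def contact_flag(forces, threshold=0.5):
    """Return 1 if any contact-force magnitude exceeds the assumed threshold (N)."""
    for fx, fy, fz in forces:
        if (fx * fx + fy * fy + fz * fz) ** 0.5 > threshold:
            return 1
    return 0


def fuse_state(proprio, forces, threshold=0.5):
    """Early fusion: append the binary contact flag to the proprioceptive vector."""
    return list(proprio) + [float(contact_flag(forces, threshold))]


# Hypothetical two-value proprioceptive state and two force sensors.
free_space = fuse_state([0.1, 0.2], [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1)])
grasping = fuse_state([0.1, 0.2], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)])
print(free_space, grasping)
```

The fused vector is one element longer than the original state, so the state encoder's input width has to grow by one. That is consistent with the paper updating only the embodiment-specific state/action encoders and the projector during fine-tuning.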

The reported ablation on G1 is stark. GR00T N1.5 zero-shot transfer achieves 0.0% success with MSE 0.35719. Standard fine-tuning raises this to 48.0% success with MSE 0.031716. Adding a contact encoder yields 74.0% success and MSE 0.024407. Contact fused directly into the state reaches 94.0% success and MSE 0.018623, the best result. RGB-D augmentation alone produces 82.0% success with MSE 0.022596.

The implementation leaves the vision tower and VLM backbone frozen and updates only the projector and embodiment-specific state/action encoders. Training uses AdamW with weight decay, 5% warmup, batch size 32, 20k training steps, and bf16 precision. The diffusion transformer has 12 layers, 8-head cross-attention, and hidden size 1024, while flow-matching training draws noise levels from a Beta distribution over 1,000 diffusion timesteps.

The paper’s central conclusion for the G1 is that embodiment transfer is constrained not only by morphology mismatch but also by modality mismatch. Contact information is reported as more valuable than depth in this benchmark, and early fusion of contact into the proprioceptive state outperforms treating contact as a side-channel. This suggests that on a dexterous humanoid hand, stable control depends on embedding contact into the core control state rather than appending it as auxiliary context (Park et al., 1 Dec 2025).

4. Electrical power modeling of the left arm

"Identification of a Physics-Based Electrical Power Consumption Model for the Unitree G1 Humanoid Arm" (Deniz et al., 14 Jun 2026) restricts attention to the seven-degree-of-freedom left arm: shoulder pitch, shoulder roll, shoulder yaw, elbow, and wrist roll, pitch, and yaw. The stated motivation is that upper-limb motion can consume a substantial fraction of available energy during manipulation tasks, making accurate prediction relevant to energy-aware motion planning, mission-duration estimation, battery management, and thermal monitoring.

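The model assumes a BLDC motor with negligible electrical transients, so that terminal voltage $v$, current $i$, motor speed $\omega_m$, and motor torque $\tau_m$ satisfy the quasi-static relations

$v = R\,i + k_e\,\omega_m, \qquad \tau_m = k_t\,i,$

joint-level mapping through gear ratio $N$ and efficiency $\eta$,

$\tau_j = N \eta\, \tau_m, \qquad \omega_m = N\, \dot{q}_j,$

and electrical power

$P = v\,i = R\,i^2 + k_e\,\omega_m\,i.$

At the joint level, with friction losses included, this becomes

$P_j = \theta_{j,1}\, \tau_j \dot{q}_j + \theta_{j,2}\, \tau_j^2 + \theta_{j,3}\, \lvert \dot{q}_j \rvert + \theta_{j,4}\, \dot{q}_j^2,$

where $\theta_{j,1}, \dots, \theta_{j,4}$ are lumped per-joint coefficients. The terms are interpreted as mechanical power, copper losses, Coulomb friction, and viscous friction, respectively.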

A key addition is baseline-torque correction. Because the arm consumes power even at rest to maintain gravity compensation at the home posture, the paper defines net power relative to baseline,

$P_{net} = P_{meas} - P_{base},$

with $P_{base}$ the resting power at the home posture. The copper-loss term, which is quadratic in joint torque $\tau_j$, is evaluated relative to the home-posture holding torque $\tau_{j,0}$, so it enters as

$\tau_j^2 - \tau_{j,0}^2.$

This enables prediction of negative net power trajectories when the arm moves to a posture requiring less gravity compensation than the home posture.

To model coordinated motions, pairwise interaction terms between joints are added, and all coefficients are collected in a single parameter vector $\theta$. The formulation is linear in parameters,

$P_{net} = \phi(\dot{q}, \tau)^{\top} \theta,$

where $\phi$ is the vector of torque and velocity features, which permits constrained least-squares identification.
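The linear-in-parameters structure can be illustrated with a minimal sketch. The four per-joint features follow the interpretation given in the text (mechanical power, copper losses, Coulomb friction, viscous friction), with the copper term taken relative to the home-posture holding torque. The exact feature definitions, the coefficient values, and the function names are assumptions, and the pairwise interaction terms are omitted.

```python
def joint_features(tau, qd, tau_home):
    """Per-joint regressors: mechanical, baseline-corrected copper, Coulomb, viscous."""
    return [
        tau * qd,                        # mechanical power
        tau * tau - tau_home * tau_home,  # copper loss relative to the home posture
        abs(qd),                         # Coulomb friction
        qd * qd,                         # viscous friction
    ]


def predict_net_power(theta, taus, qds, taus_home):
    """Net power as a dot product of features and coefficients, summed over joints."""
    total = 0.0
    for coeffs, tau, qd, tau0 in zip(theta, taus, qds, taus_home):
        feats = joint_features(tau, qd, tau0)
        total += sum(c * f for c, f in zip(coeffs, feats))
    return total


# Hypothetical single joint held still in a posture that needs less holding
# torque (1 N*m) than the home posture (2 N*m); coefficients are made up.
theta = [[1.0, 0.5, 0.2, 0.1]]
net = predict_net_power(theta, taus=[1.0], qds=[0.0], taus_home=[2.0])
print(net)
```

Because the joint is stationary, only the baseline-corrected copper term contributes, and the prediction is negative. This is the behaviour the baseline correction is meant to capture.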

The parameters are identified from onboard power measurements on a physical Unitree G1 using the main-board sensor (MBS) rather than the battery management system (BMS). The MBS baseline is about 120 W at rest, compared with 135 W for the BMS. Each trial contains four phases: 2 s pre-idle at the home posture, trajectory execution with 100 Hz references and cubic smooth-step timing, 2 s post-idle, and a smooth return to home. The paper also reports the controller gains used in these trials.
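The reference generation in the trial protocol can be sketched with the usual cubic smooth-step polynomial $s(u) = 3u^2 - 2u^3$, sampled at 100 Hz. The paper's exact timing law is not reproduced in the text, so the polynomial choice, the joint targets, and the duration below are assumptions.

```python
def smoothstep(u):
    """Cubic smooth-step: 0 at u=0, 1 at u=1, zero slope at both ends."""
    u = min(max(u, 0.0), 1.0)
    return u * u * (3.0 - 2.0 * u)


def reference(q_start, q_goal, duration_s, rate_hz=100):
    """Sample a joint reference from q_start to q_goal at the given rate."""
    n = int(round(duration_s * rate_hz))
    return [q_start + (q_goal - q_start) * smoothstep(i / n) for i in range(n + 1)]


# Hypothetical single-joint move of 1 rad over 2 s.
ref = reference(0.0, 1.0, 2.0)
print(len(ref), ref[0], ref[100], ref[-1])
```

The zero end slopes mean the joint starts and stops at rest, which keeps the pre-idle and post-idle phases free of residual motion.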

The identification dataset comprises 1,017 trajectories, of which 897 are retained after filtering. These include single-joint motions for all seven joints, multiple coordinated motions, and five speed levels.

Trajectories contaminated by balance compensation are rejected by a screening criterion, removing 104 trajectories, followed by two passes of residual-based outlier rejection that remove 16 additional trajectories. Because power is measured at about 1 Hz while kinematics are available at 100 Hz, per-sample fitting is ill-conditioned. The paper therefore averages each trajectory to a single regression sample and solves

$\min_{\theta}\ \lVert \Phi\,\theta - y \rVert_2^2$

subject to constraints on $\theta$, where each row of $\Phi$ and each entry of $y$ hold one trajectory's averaged features and net power,

using IPOPT via CasADi.
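The trajectory-averaging step can be sketched as below: features computed from the 100 Hz kinematics and the roughly 1 Hz power readings are each averaged, giving one regression row per trajectory. The feature columns, the sample values, and the function names are illustrative assumptions. The constrained solve itself is not shown, since the text only says it is done with IPOPT via CasADi.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def trajectory_sample(feature_rows, power_samples, baseline_power):
    """Collapse one trajectory into a (mean features, mean net power) pair."""
    n_features = len(feature_rows[0])
    phi = [mean(row[i] for row in feature_rows) for i in range(n_features)]
    y = mean(power_samples) - baseline_power
    return phi, y


# Hypothetical trajectory: five kinematic samples with two feature columns,
# two power readings, and a 120 W resting baseline.
feature_rows = [[0.1 * i, 1.0] for i in range(5)]
phi, y = trajectory_sample(feature_rows, power_samples=[121.0, 123.0], baseline_power=120.0)
print(phi, y)

# Bookkeeping from the text: trajectories kept after both rejection stages.
retained = 1017 - 104 - 16
print(retained)
```

Averaging sidesteps the rate mismatch between the two signals. The trade-off is that within-trajectory dynamics are discarded, so the regression has one equation per retained trajectory.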

The identified model achieves $R^2 = 0.933$, RMSE 1.07 W, and MAE 0.86 W on the 897 filtered training trajectories. Validation on 46 hold-out trajectories executed at unseen speeds yields $R^2 = 0.965$, RMSE 3.58 W, and MAE 2.33 W. The paper reports that viscous friction dominates shoulder pitch and all three wrist joints, copper losses dominate shoulder yaw and the elbow, and Coulomb friction dominates shoulder roll. The largest interaction coefficients are reported for the shoulder roll/elbow and shoulder pitch/elbow pairs. The model’s stated limitations include Dex3-1 hand payload effects at the wrist, absence of temperature dependence, simplified braking asymmetry, and possible feature collinearity.
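For reference, the three reported fit metrics can be computed as in the following sketch, using the standard definitions of $R^2$, RMSE, and MAE. The sample values are invented for illustration and are not data from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired measurements and predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Made-up net power values in watts.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

Because $R^2$ is normalized by the variance of the measured values, a validation set with a wider power range can show a higher $R^2$ than the training set even when its RMSE is larger, as in the reported results.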

5. Software stack, telemetry, and security debates

"Cybersecurity AI: Humanoid Robots as Attack Vectors" (Mayoral-Vilches et al., 17 Sep 2025) presents the Unitree G1 as both a comparatively mature commercial robotics platform and a high-risk cyber-physical system. The paper states that the robot includes encrypted configuration, dynamic credentials, hardware binding, and multi-layer defense, but argues that these properties do not eliminate serious weaknesses.

The reported hardware surface includes a Rockchip RK3588 SoC, 8 GB LPDDR4X, 32 GB eMMC, exposed JST debug connectors, unpopulated JTAG pads, and accessible UART at 115200 baud. The sensor and data surface includes Intel RealSense D435i cameras, dual microphones, IMU, GNSS, and DDS topics carrying sensor and state data. At the software layer, a 9.2 MB master_service supervises 26 daemons, including net-init, ota-box, ota-update, basic_service, ros_bridge, chat_go, vui_service, webrtc_*, ai_sport, motion_switcher, and robot_state_service. The communication surface includes unencrypted DDS/RTPS on the local network, MQTT on port 17883, WebRTC via webrtc_bridge, BLE/Wi‑Fi provisioning and control via upper_bluetooth and chat_go, and a WebSocket to 8.222.78.102:6080 with SSL verification disabled.

A central technical claim concerns the proprietary FMX configuration scheme. The paper describes a dual-layer design with an outer Blowfish-ECB layer and an inner LCG-based masking layer. The outer layer is reported as using 64-bit blocks, no IV, and a static 128-bit fleet-wide key, given in the text as 44c56a97ccf33d585a91c18e1c72382b. The inner layer is modeled as a linear congruential generator,

$s_{n+1} = (a\, s_n + c) \bmod m,$

with the high byte of each state used as an XOR mask over the corresponding data byte. The authors state that the outer layer is fully compromised and that the seed derivation for the inner layer is not fully recovered.
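The masking structure described for the inner layer can be sketched generically. The multiplier, increment, modulus, and seed below are textbook placeholder values (the Numerical Recipes constants), not the FMX parameters, which are not given in the text. The sketch shows only that XOR with an LCG keystream is its own inverse once the seed is known.

```python
# Placeholder LCG constants (Numerical Recipes), NOT the values used by FMX.
A, C, M = 1664525, 1013904223, 2 ** 32


def lcg_mask(data, seed):
    """XOR each byte with the high byte of successive 32-bit LCG states."""
    state = seed
    out = bytearray()
    for byte in data:
        state = (A * state + C) % M
        out.append(byte ^ (state >> 24))   # high byte of the 32-bit state
    return bytes(out)


plaintext = b"example-config-value"     # made-up payload
masked = lcg_mask(plaintext, seed=12345)   # hypothetical seed
recovered = lcg_mask(masked, seed=12345)
print(masked.hex(), recovered)
```

A layer of this kind protects the data only as long as the seed stays secret, which is why the paper singles out the unrecovered seed derivation as the remaining obstacle.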

The runtime observations are centered on telemetry. Using SSL_write instrumentation on 9 September 2025, the authors report a 10-minute capture in which structured JSON payloads of 4.5–4.6 KB were sent every 300 seconds to 43.175.228.18:17883 and 43.175.229.18:17883. Reported contents include battery telemetry, IMU orientation, per-joint torque and temperature, service inventory and service state, CPU load arrays, memory usage, and filesystem statistics. The paper states that robot_state_service and ota_boxed establish TLS 1.3 connections within 5 seconds of initialization, with observed rates of approximately 1.03 Mbps and 0.39 Mbps, and that connections auto-recover within seconds after disruption. Additional observed channels include audio capture via vui_service, visual streaming via RealSense/H.264 and cloud streaming, spatial/LIDAR/GNSS channels, and the previously noted WebSocket endpoint.
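The scale of the periodic telemetry follows from simple arithmetic, sketched below using the reported 300-second period and 4.5–4.6 KB payload size. Treating the figures as per-endpoint and extrapolating to a full day are assumptions. The paper's capture lasted 10 minutes.

```python
PERIOD_S = 300              # reporting interval from the text
PAYLOAD_KB = (4.5, 4.6)     # reported payload size range
CAPTURE_S = 10 * 60         # length of the reported capture

# Full reporting intervals covered by the capture window.
intervals_in_capture = CAPTURE_S // PERIOD_S

# Extrapolation (assumption): payloads per day if the period holds all day.
messages_per_day = 24 * 60 * 60 // PERIOD_S
daily_kb = tuple(size * messages_per_day for size in PAYLOAD_KB)

print(intervals_in_capture, messages_per_day, daily_kb)
```

The periodic JSON payloads therefore amount to a little over a megabyte per day under these assumptions, small next to the separately reported Mbps-scale connection rates.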

The second case study places the Alias Robotics Cybersecurity AI framework on the G1’s RK3588 processor. The workflow is described as reconnaissance, vulnerability analysis, exploitation preparation, and attack-surface mapping. The paper states that CAI enumerated live MQTT, WebSocket, and WebRTC endpoints reachable from inside the robot, attempted broker logins using recovered credentials, prepared certificate-authenticated sessions, and produced an attack matrix covering MQTT topic abuse, WebRTC stream hijack, and OTA manipulation paths.

Several claims in this paper require careful qualification. The abstract reports a BLE provisioning protocol with a critical command injection vulnerability enabling root access via malformed Wi‑Fi credentials, as well as hardcoded AES keys shared across all units. However, the detailed summary accompanying the paper states that these specific items are “not verifiable from the provided excerpt.” By contrast, the Blowfish-ECB and LCG findings, the periodic telemetry behavior, and the communication endpoints are explicitly detailed in the supplied material. The authors further interpret the observed telemetry as a “surveillance trojan horse” and argue that it may violate GDPR Articles 6 and 13. Those GDPR and dual-use conclusions are the authors’ interpretations rather than merely descriptive protocol facts (Mayoral-Vilches et al., 17 Sep 2025).

6. Scope, limitations, and terminological disambiguation

The literature presents the Unitree G1 as a platform whose measured behavior depends strongly on task formulation and instrumentation. In navigation, the reported real-world result is 64.7% task completion under explicitly open-loop execution, and the identified failure modes are heading drift, scale mismatch, and obstacle collision from accumulated error rather than topological path failure (Sam et al., 2 May 2026). In manipulation, zero-shot transfer is reported to fail completely, while targeted fine-tuning with embodiment-aligned modalities raises success to 94.0% (Park et al., 1 Dec 2025). In energy modeling, the identified model is limited to the seven-degree-of-freedom left arm and does not model temperature dependence or full-body coupling (Deniz et al., 14 Jun 2026). In cybersecurity, some claims are direct runtime observations, some are reverse-engineering results, and some are higher-level interpretations about surveillance and offensive potential (Mayoral-Vilches et al., 17 Sep 2025).

A recurrent misconception arises from the label “G1” itself. In unrelated astrophysical literature, “G1” denotes a Galactic Center infrared source near Sgr A* rather than the humanoid robot. That source is described as a very red, extended infrared object with Br-$\gamma$ emission on a highly eccentric orbit and is interpreted as the second known spatially resolved case of tidal interaction with a supermassive black hole (Witzel et al., 2017). This terminological overlap is purely nominal and has no relation to the Unitree platform.

Taken together, the arXiv record depicts the Unitree G1 as a humanoid research platform positioned at the intersection of locomotion, dexterous manipulation, power-aware subsystem modeling, and robot security. The common theme across these otherwise separate lines of work is not a single unified benchmark, but the use of the same embodied system to expose different bottlenecks: sim-to-real calibration for velocity control, modality design for dexterous transfer, identifiability for onboard energy models, and the interaction between robotic autonomy and network security.