
Gallant: Voxel-Grid Humanoid Locomotion

Updated 21 November 2025
  • The paper presents a novel voxel-grid framework that integrates LiDAR-based perception with a z-grouped 2D CNN to map complex 3D environments directly to control actions.
  • It utilizes end-to-end reinforcement learning via PPO and domain randomization to ensure robust policy transfer from simulation to real-world humanoid locomotion.
  • Empirical results demonstrate high success rates across varied terrains, surpassing traditional height-map approaches in environments with overhead and lateral obstacles.

Gallant is a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. The method leverages a compact, robot-centric voxel grid representation of the environment derived from LiDAR sensors and processes it with a z-grouped 2D convolutional neural network (CNN) to directly map perception to control actions. Gallant supports end-to-end policy optimization via deep reinforcement learning, employing high-fidelity LiDAR simulation with extensive domain randomization to ensure transferability from simulation to real-world deployment. The approach enables a single control policy to successfully navigate a diverse set of 3D environments, including scenarios involving lateral clutter, overhead obstacles, multi-level structures, and narrow passages, outperforming baselines that rely solely on height maps or more computationally expensive perception modules (Ben et al., 18 Nov 2025).

1. Voxel Grid Environmental Encoding

Gallant constructs a voxel grid as its core environment representation. The perception volume, denoted $\Omega$, is defined in the robot's local frame as a cuboid:

$$\Omega = [-0.8,\, 0.8]_x \times [-0.8,\, 0.8]_y \times [-1.0,\, 1.0]_z\ \text{meters}$$

This volume is discretized with resolution $\Delta = 0.05$ m, resulting in a grid of size $32 \times 32 \times 40$ along the $(x, y, z)$ axes. At each timestep $t$, point clouds from two torso-mounted LiDAR units are unified in the torso frame and voxelized into a binary occupancy tensor $X \in \{0,1\}^{C \times H \times W}$, where

$$X_{c,v,u}(t) = \begin{cases} 1 & \text{if}\ \exists\, P_i \in \Omega:\ \left\lfloor \frac{P_i \cdot x - x_{\min}}{\Delta}\right\rfloor = u,\ \left\lfloor \frac{P_i \cdot y - y_{\min}}{\Delta}\right\rfloor = v,\ \left\lfloor \frac{P_i \cdot z - z_{\min}}{\Delta}\right\rfloor = c \\ 0 & \text{otherwise} \end{cases}$$

This simple occupied/unoccupied encoding smooths out LiDAR sensor noise while preserving critical multilayer structures required for robust navigation—including overhangs, ceilings, and gaps.
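
The voxelization step can be written compactly. The following is a minimal NumPy sketch, assuming the fused point cloud is already expressed in the torso frame as an (N, 3) array; the array layout and function name are illustrative, not from the paper.

```python
import numpy as np

BOUNDS_MIN = np.array([-0.8, -0.8, -1.0])   # x_min, y_min, z_min (m)
BOUNDS_MAX = np.array([ 0.8,  0.8,  1.0])   # x_max, y_max, z_max (m)
DELTA = 0.05                                # voxel resolution (m)

def voxelize(points: np.ndarray) -> np.ndarray:
    """Convert an (N, 3) torso-frame point cloud into a binary occupancy
    tensor of shape (C, H, W) = (40, 32, 32), with z treated as channels."""
    # Keep only points inside the perception volume Omega.
    inside = np.all((points >= BOUNDS_MIN) & (points < BOUNDS_MAX), axis=1)
    p = points[inside]
    # Floor into integer voxel indices (u, v, c) along (x, y, z).
    idx = np.floor((p - BOUNDS_MIN) / DELTA).astype(int)
    grid = np.zeros((40, 32, 32), dtype=np.uint8)  # (z, y, x) = (C, H, W)
    grid[idx[:, 2], idx[:, 1], idx[:, 0]] = 1      # mark occupied voxels
    return grid
```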

2. Perception Processing via z-Grouped 2D CNN

To map voxel occupancy grids to actionable perceptual features, Gallant employs a z-grouped 2D CNN architecture. Instead of computationally intensive 3D convolutions, the $z$-axis is treated as channels, and standard 2D convolutions are performed across the $xy$ plane. The main convolutional layer applies:

$$Y_{o,v,u} = \sigma\left[ \sum_{c=0}^{C-1} \sum_{\Delta v, \Delta u} W_{o,c,\Delta v,\Delta u} \cdot X_{c,\, v+\Delta v,\, u+\Delta u} + b_o \right]$$

where $W \in \mathbb{R}^{O \times C \times k \times k}$ are kernel weights and $\sigma$ is the Mish activation. The processing pipeline consists of three repetitions of $3 \times 3$ Conv2D (stride 2, padding 1) with 8 output channels each, followed by flattening and two fully-connected layers (256 and 64 units, respectively, both with Mish activations), resulting in a compact feature vector $h_{\mathrm{cnn}} \in \mathbb{R}^{64}$.

This approach reduces the computational load by a factor of approximately $k$ compared to full 3D CNNs, allowing real-time deployment on modern embedded hardware. The structure preserves the 3D nature of the environment necessary for safe foot and head clearance.
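
A minimal PyTorch sketch of this architecture follows; the layer sizes are taken from the text above, while details such as the absence of normalization layers are assumptions.

```python
import torch
import torch.nn as nn

class ZGrouped2DCNN(nn.Module):
    def __init__(self, z_channels: int = 40):
        super().__init__()
        chans = [z_channels, 8, 8, 8]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # 3x3 Conv2D over the xy plane, stride 2, padding 1; the z-axis
            # of the voxel grid enters as the input channel dimension.
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.Mish()]
        self.conv = nn.Sequential(*layers)
        # 32x32 spatial input is halved three times to 4x4, so flat size is 8*4*4.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, 256), nn.Mish(),
            nn.Linear(256, 64), nn.Mish(),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (B, 40, 32, 32) binary occupancy; returns (B, 64) features.
        return self.head(self.conv(voxels.float()))
```

Treating the 40 z-slices as input channels means every 2D kernel mixes information across the full vertical extent at each $xy$ location, which is how vertical structure survives without 3D convolutions.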

3. End-to-End Control Policy Optimization

Locomotion is formalized as a partially observable Markov decision process (POMDP) $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \Omega, \gamma)$. Training is performed via Proximal Policy Optimization (PPO), with the observation $o_t$ comprising the following (assembled as sketched after the list):

  • Target bearing $P_t$
  • Elapsed/remaining time
  • Past actions $a_{t-4:t-1}$
  • Proprioceptive history $[\omega, g, q, \dot{q}]_{t-5:t}$
  • Current voxel grid $\mathrm{Voxel\_Grid}_t$
  • Privileged information $\{\dot{v}_t, \mathrm{Height\_Map}_t\}$ (supplied to the critic only)
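
As referenced above, a hedged sketch of assembling the actor observation $o_t$; the components follow the list, but their exact dimensions and ordering are assumptions.

```python
import numpy as np

def build_observation(target_bearing, time_feats, past_actions,
                      proprio_history, voxel_features):
    """Concatenate the policy inputs into a single flat vector.

    target_bearing:   bearing to the goal P_t
    time_feats:       elapsed and remaining time
    past_actions:     (4, 29) actions a_{t-4:t-1}
    proprio_history:  stacked [omega, g, q, q_dot] over steps t-5..t
    voxel_features:   (64,) output of the z-grouped 2D CNN
    """
    parts = [np.ravel(x) for x in (target_bearing, time_feats, past_actions,
                                   proprio_history, voxel_features)]
    return np.concatenate(parts)  # flat o_t fed to the actor network
```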

The policy (actor network) $\pi_\theta(o_t)$ outputs $a_t \in \mathbb{R}^{29}$ (desired joint position targets), while the critic $V_\phi$ receives the full observation plus privileged inputs. The clipped PPO objective is:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[\min\left(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right)\right]$$

with $r_t(\theta) = \pi_\theta(a_t \mid o_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid o_t)$. The full objective also includes a value loss and an entropy bonus.
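
The clipped surrogate translates directly to code; a minimal PyTorch sketch, with the value and entropy terms omitted for brevity:

```python
import torch

def ppo_clip_loss(log_probs: torch.Tensor, old_log_probs: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # r_t(theta) = pi_theta(a|o) / pi_theta_old(a|o), computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negate because optimizers minimize; L^CLIP is maximized in the paper.
    return -torch.min(unclipped, clipped).mean()
```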

Reward shaping incorporates a sparse reach reward, velocity alignment, head-height maintenance, and foot clearance, all geometry-aware for traversing complex 3D structures.

4. High-Fidelity LiDAR Simulation and Domain Randomization

The training pipeline employs a high-fidelity LiDAR simulation based on NVIDIA Warp to compute ray-mesh intersections, with each mesh queried in its own body frame:

$$\operatorname{raycast}(TM,\, p,\, d) = T^{-1} \cdot \operatorname{raycast}(M,\, T^{-1} p,\, R^{-1} d)$$
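
The identity says that raycasting against a transformed mesh reduces to transforming the ray into the mesh's local frame, querying there, and mapping the hit back. A NumPy sketch under the common convention that $T = (R, t)$ maps mesh-local coordinates to the query frame; the paper's $T$ may denote the inverse mapping, and `raycast_local` stands in for the actual Warp mesh query.

```python
import numpy as np

def raycast_transformed(raycast_local, R: np.ndarray, t: np.ndarray,
                        origin: np.ndarray, direction: np.ndarray):
    # Express the ray in the mesh's local frame: p' = R^T (p - t), d' = R^T d.
    local_origin = R.T @ (origin - t)
    local_direction = R.T @ direction
    hit = raycast_local(local_origin, local_direction)  # hit point, or None
    if hit is None:
        return None
    # Map the hit point back into the query frame.
    return R @ hit + t
```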

Both static terrains and all dynamic robot meshes are included. Domain randomization is applied to sensor and system noise (a short sketch follows the list):

  • LiDAR pose noise $\sim \mathcal{N}(0, 1\,\mathrm{cm})$
  • Orientation jitter $\sim \mathcal{N}(0, (\pi/180)^2\,\mathrm{rad}^2)$
  • Hit-point noise $\sim \mathcal{N}(0, 1\,\mathrm{cm})$
  • Random latency of 100–200 ms (scan rate 10 Hz)
  • 2% random voxel dropout
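
As noted above, a hedged sketch of the perception-side randomizations; the noise magnitudes mirror the list, while the application order and helper structure are assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def randomize_scan(points: np.ndarray, grid: np.ndarray):
    # Hit-point noise: ~N(0, 1 cm) on every returned LiDAR point.
    noisy_points = points + rng.normal(0.0, 0.01, size=points.shape)
    # 2% random voxel dropout on the occupancy grid.
    keep = rng.random(grid.shape) >= 0.02
    return noisy_points, grid * keep
```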

Auxiliary randomizations include robot mass, friction, joint gains, COM offsets, and initial state noise. These perturbations are critical for bridging the sim-to-real gap in both perception and dynamics fidelity.

5. Large-Scale Parallel Training and Curriculum

Policy training uses parallel PPO with $8 \times 1024$ simulated environments over 4,000 iterations, 4 PPO epochs per iteration, and 8 minibatches. Main hyperparameters are $\gamma = 0.99$, $\lambda_{\mathrm{GAE}} = 0.95$, PPO clip $\epsilon = 0.2$, entropy coefficient 0.003, and learning rate $5 \times 10^{-4}$. A curriculum spans eight terrain types: Plane, Ceiling, Forest, Door, Platform, Pile, Up-stair, and Down-stair, with terrain difficulty $s \in [0,1]$ interpolating generation parameters $p_{\tau}(s) = (1-s)\, p_\tau^{\min} + s\, p_\tau^{\max}$.
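
The difficulty interpolation is a direct linear blend; a small Python sketch, where the terrain parameter names and bounds are placeholders rather than values from the paper:

```python
def terrain_params(s: float, p_min: dict, p_max: dict) -> dict:
    """Linearly interpolate terrain-generation parameters at difficulty s in [0, 1]."""
    return {k: (1.0 - s) * p_min[k] + s * p_max[k] for k in p_min}

# Hypothetical 'Platform' terrain with a gap width and step height.
params = terrain_params(0.5, {"gap": 0.1, "step": 0.05},
                             {"gap": 0.5, "step": 0.25})
```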

Reward function structure is as follows:

| Term | Description | Formula (see paper for details) |
|---|---|---|
| Reach | Bonus for reaching the target | $r_{\mathrm{reach}} = \frac{1}{1 + \|P_t\|^2} \cdot \frac{1}{T_r} \cdot \mathbb{1}[t > T - T_r]$ |
| Velocity Align | Direction-aligned velocity | $r_{\mathrm{vel\_dir}}$ |
| Head Height | Maintains head at a safe height | $r_{\mathrm{head}}$ |
| Foot Clearance | Ensures feet avoid collisions/obstacles | $r_{\mathrm{feet}}$ |
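
The reach term in the table translates directly to code; a sketch assuming $T$ is the episode horizon and $T_r$ the terminal reward window, as in the formula:

```python
import numpy as np

def reach_reward(p_t: np.ndarray, t: int, T: int, T_r: int) -> float:
    # Nonzero only in the final T_r steps; larger when the target offset P_t is small.
    if t <= T - T_r:
        return 0.0
    return 1.0 / (1.0 + float(np.dot(p_t, p_t))) / T_r
```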

Symmetry augmentation (mirroring across the $xy$ plane and voxel flipping) is applied to improve policy robustness.
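
A minimal sketch of the voxel-flipping half of this augmentation; which grid axis is mirrored (here the lateral $y$ axis, grid axis 1 in the (C, H, W) = (z, y, x) layout) is an assumption:

```python
import numpy as np

def mirror_voxels(grid: np.ndarray) -> np.ndarray:
    # Mirror the occupancy grid laterally to pair with the mirrored state/action.
    return np.flip(grid, axis=1).copy()
```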

6. Empirical Performance and Real-World Deployment

Simulation Results

Under worst-case terrain difficulty ($p_\tau^{\max}$) and over 1,000 evaluation episodes per terrain (5 random seeds), Gallant achieves the following success rates ($E_{\mathrm{succ}}$):

| Terrain | Plane | Ceiling | Forest | Door | Platform | Pile | Up-stair | Down-stair |
|---|---|---|---|---|---|---|---|---|
| Gallant (% success) | 100.0 | 97.1 | 84.3 | 98.7 | 96.1 | 82.1 | 96.2 | 97.9 |

Ablation experiments indicate:

  • Excluding self-scanning of dynamic meshes reduces ceiling traversal success to approximately 28%.
  • Replacing the z-grouped 2D CNN with full 3D or sparse variants lowers average success or increases inference latency.
  • Restricting perceptual input to height maps yields failures on overhead and multi-layered obstacles.
  • The 0.05 m voxel resolution is optimal for perceptual coverage and detail; finer (0.025 m) sacrifices field-of-view, coarser (0.10 m) loses necessary precision.

Real-World Results

Gallant's policy is deployed on a Unitree G1 robot, executing the full onboard pipeline (JT128 LiDAR to OctoMap to $32 \times 32 \times 40$ voxel grid at 10 Hz, control at 50 Hz) on an NVIDIA Orin NX module. Over 15 trials per terrain, Gallant achieves near-100% success traversing plane, stair, and platform scenarios, and consistently high performance ($>80\%$) on ceiling, door, and pile terrains.
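
The 10 Hz perception / 50 Hz control split can be realized with a simple rate-decoupled loop; the sketch below uses placeholder functions (`read_lidar_grid`, `policy_step`) rather than the real onboard stack.

```python
import time

CONTROL_DT = 1.0 / 50.0      # 50 Hz policy step
PERCEPTION_PERIOD = 5        # refresh grid every 5th control tick (10 Hz)

def control_loop(read_lidar_grid, policy_step, n_steps: int = 1000):
    grid = read_lidar_grid()
    for step in range(n_steps):
        if step % PERCEPTION_PERIOD == 0:
            grid = read_lidar_grid()         # 10 Hz voxel-grid update
        policy_step(grid)                    # 50 Hz joint-target command
        time.sleep(CONTROL_DT)               # stand-in for a real-time scheduler
```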

Gallant surpasses baselines:

  • The HeightMap-only baseline fails in environments with overhead or lateral obstacles.
  • The “NoDR” (no domain randomization) variant exhibits frequent collisions or incorrect gap navigation due to insufficient robustness to sim-to-real discrepancies.

Observed success rates in simulation strongly correlate with those in real-world deployment, supporting the effectiveness of Gallant's domain randomization and modeling fidelity (Ben et al., 18 Nov 2025).

7. Significance and Methodological Implications

Gallant demonstrates that an occupancy-based $32 \times 32 \times 40$ voxel grid, processed efficiently with a z-grouped 2D CNN, suffices for near-lossless encoding of 3D locospatial constraints relevant for humanoid locomotion and local navigation. The single policy learned via end-to-end PPO exhibits robustness across a spectrum of 3D-constrained tasks, including stair climbing and platform stepping, traditionally challenging for methods constrained to ground-plane or flattened representations.

The findings suggest that equipped with sufficiently high-fidelity simulation and perceptual domain randomization, voxel-grid-based methods are practical for real-world robotic control at scale. A plausible implication is the broader applicability of such approaches to other robotic platforms where lightweight yet comprehensive 3D perception is critical (Ben et al., 18 Nov 2025).

References

Ben et al., 18 Nov 2025.
