
Gallant: Voxel-Grid Humanoid Locomotion

Updated 21 November 2025
  • The paper presents a novel voxel-grid framework that integrates LiDAR-based perception with a z-grouped 2D CNN to map complex 3D environments directly to control actions.
  • It utilizes end-to-end reinforcement learning via PPO and domain randomization to ensure robust policy transfer from simulation to real-world humanoid locomotion.
  • Empirical results demonstrate high success rates across varied terrains, surpassing traditional height-map approaches in environments with overhead and lateral obstacles.

Gallant is a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. The method leverages a compact, robot-centric voxel grid representation of the environment derived from LiDAR sensors and processes it with a z-grouped 2D convolutional neural network (CNN) to directly map perception to control actions. Gallant supports end-to-end policy optimization via deep reinforcement learning, employing high-fidelity LiDAR simulation with extensive domain randomization to ensure transferability from simulation to real-world deployment. The approach enables a single control policy to successfully navigate a diverse set of 3D environments, including scenarios involving lateral clutter, overhead obstacles, multi-level structures, and narrow passages, outperforming baselines that rely solely on height maps or more computationally expensive perception modules (Ben et al., 18 Nov 2025).

1. Voxel Grid Environmental Encoding

Gallant constructs a voxel grid as its core environment representation. The perception volume, denoted $\Omega$, is defined in the robot's local frame as a cuboid:

$$\Omega = [-0.8,\, 0.8]_x \times [-0.8,\, 0.8]_y \times [-1.0,\, 1.0]_z\ \text{meters}$$

This volume is discretized with resolution $\Delta = 0.05$ m, resulting in a grid of size $32 \times 32 \times 40$ along the $(x, y, z)$ axes. At each timestep $t$, point clouds from two torso-mounted LiDAR units are unified in the torso frame and voxelized into a binary occupancy tensor $X \in \{0,1\}^{C \times H \times W}$, where

$$X_{c,v,u}(t) = \begin{cases} 1 & \text{if}\ \exists\, P_i \in \Omega:\ \left\lfloor \frac{P_i \cdot x - x_{\min}}{\Delta}\right\rfloor = u,\ \left\lfloor \frac{P_i \cdot y - y_{\min}}{\Delta}\right\rfloor = v,\ \left\lfloor \frac{P_i \cdot z - z_{\min}}{\Delta}\right\rfloor = c \\ 0 & \text{otherwise} \end{cases}$$

This simple occupied/unoccupied encoding smooths out LiDAR sensor noise while preserving critical multilayer structures required for robust navigation—including overhangs, ceilings, and gaps.
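
The voxelization step can be written compactly. The following is a minimal NumPy sketch, assuming the fused point cloud is already expressed in the torso frame as an (N, 3) array; the array layout and function name are illustrative, not from the paper.

```python
import numpy as np

BOUNDS_MIN = np.array([-0.8, -0.8, -1.0])   # x_min, y_min, z_min (m)
BOUNDS_MAX = np.array([ 0.8,  0.8,  1.0])   # x_max, y_max, z_max (m)
DELTA = 0.05                                # voxel resolution (m)

def voxelize(points: np.ndarray) -> np.ndarray:
    """Convert an (N, 3) torso-frame point cloud into a binary occupancy
    tensor of shape (C, H, W) = (40, 32, 32), with z treated as channels."""
    # Keep only points inside the perception volume Omega.
    inside = np.all((points >= BOUNDS_MIN) & (points < BOUNDS_MAX), axis=1)
    p = points[inside]
    # Floor into integer voxel indices (u, v, c) along (x, y, z).
    idx = np.floor((p - BOUNDS_MIN) / DELTA).astype(int)
    grid = np.zeros((40, 32, 32), dtype=np.uint8)  # (z, y, x) = (C, H, W)
    grid[idx[:, 2], idx[:, 1], idx[:, 0]] = 1      # mark occupied voxels
    return grid
```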

2. Perception Processing via z-Grouped 2D CNN

To map voxel occupancy grids to actionable perceptual features, Gallant employs a z-grouped 2D CNN architecture. Instead of computationally intensive 3D convolutions, the $z$-axis is treated as channels, and standard 2D convolutions are performed across the $xy$ plane. The main convolutional layer applies:

$$Y_{o,v,u} = \sigma\left[ \sum_{c=0}^{C-1} \sum_{\Delta v, \Delta u} W_{o,c,\Delta v,\Delta u} \cdot X_{c,\, v+\Delta v,\, u+\Delta u} + b_o \right]$$

where $W \in \mathbb{R}^{O \times C \times k \times k}$ are kernel weights and $\sigma$ is the Mish activation. The processing pipeline consists of three repetitions of $3 \times 3$ Conv2D (stride 2, padding 1) with 8 output channels each, followed by flattening and two fully-connected layers (256 and 64 units, respectively, both with Mish activations), resulting in a compact feature vector $h_{\mathrm{cnn}} \in \mathbb{R}^{64}$.

This approach reduces the computational load by a factor of approximately $k$ compared to full 3D CNNs, allowing real-time deployment on modern embedded hardware. The structure preserves the 3D nature of the environment necessary for safe foot and head clearance.
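
A minimal PyTorch sketch of this architecture follows; the layer sizes are taken from the text above, while details such as the absence of normalization layers are assumptions.

```python
import torch
import torch.nn as nn

class ZGrouped2DCNN(nn.Module):
    def __init__(self, z_channels: int = 40):
        super().__init__()
        chans = [z_channels, 8, 8, 8]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # 3x3 Conv2D over the xy plane, stride 2, padding 1; the z-axis
            # of the voxel grid enters as the input channel dimension.
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.Mish()]
        self.conv = nn.Sequential(*layers)
        # 32x32 spatial input is halved three times to 4x4, so flat size is 8*4*4.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, 256), nn.Mish(),
            nn.Linear(256, 64), nn.Mish(),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (B, 40, 32, 32) binary occupancy; returns (B, 64) features.
        return self.head(self.conv(voxels.float()))
```

Treating the 40 z-slices as input channels means every 2D kernel mixes information across the full vertical extent at each $xy$ location, which is how vertical structure survives without 3D convolutions.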

3. End-to-End Control Policy Optimization

Locomotion is formalized as a partially observable Markov decision process (POMDP) $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \Omega, \gamma)$. Training is performed via Proximal Policy Optimization (PPO), with the observation $o_t$ comprising the following (assembled as sketched after the list):

  • Target bearing $P_t$
  • Elapsed/remaining time
  • Past actions $a_{t-4:t-1}$
  • Proprioceptive history $[\omega, g, q, \dot{q}]_{t-5:t}$
  • Current voxel grid $\mathrm{Voxel\_Grid}_t$
  • Privileged information $\{\dot{v}_t, \mathrm{Height\_Map}_t\}$ (supplied to the critic only)
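
As referenced above, a hedged sketch of assembling the actor observation $o_t$; the components follow the list, but their exact dimensions and ordering are assumptions.

```python
import numpy as np

def build_observation(target_bearing, time_feats, past_actions,
                      proprio_history, voxel_features):
    """Concatenate the policy inputs into a single flat vector.

    target_bearing:   bearing to the goal P_t
    time_feats:       elapsed and remaining time
    past_actions:     (4, 29) actions a_{t-4:t-1}
    proprio_history:  stacked [omega, g, q, q_dot] over steps t-5..t
    voxel_features:   (64,) output of the z-grouped 2D CNN
    """
    parts = [np.ravel(x) for x in (target_bearing, time_feats, past_actions,
                                   proprio_history, voxel_features)]
    return np.concatenate(parts)  # flat o_t fed to the actor network
```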

The policy (actor network) $\pi_\theta(o_t)$ outputs $a_t \in \mathbb{R}^{29}$ (desired joint position targets), while the critic $V_\phi$ receives the full observation plus privileged inputs. The clipped PPO objective is:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[\min\left(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right)\right]$$

with $r_t(\theta) = \pi_\theta(a_t \mid o_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid o_t)$. The full objective also includes a value loss and an entropy bonus.
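
The clipped surrogate translates directly to code; a minimal PyTorch sketch, with the value and entropy terms omitted for brevity:

```python
import torch

def ppo_clip_loss(log_probs: torch.Tensor, old_log_probs: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # r_t(theta) = pi_theta(a|o) / pi_theta_old(a|o), computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negate because optimizers minimize; L^CLIP is maximized in the paper.
    return -torch.min(unclipped, clipped).mean()
```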

Reward shaping incorporates a sparse reach reward, velocity alignment, head-height maintenance, and foot clearance, all geometry-aware for traversing complex 3D structures.

4. High-Fidelity LiDAR Simulation and Domain Randomization

The training pipeline employs a high-fidelity LiDAR simulation based on NVIDIA Warp to compute ray-mesh intersections, with each mesh queried in its own body frame:

$$\operatorname{raycast}(TM,\, p,\, d) = T^{-1} \cdot \operatorname{raycast}(M,\, T^{-1} p,\, R^{-1} d)$$
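
The identity says that raycasting against a transformed mesh reduces to transforming the ray into the mesh's local frame, querying there, and mapping the hit back. A NumPy sketch under the common convention that $T = (R, t)$ maps mesh-local coordinates to the query frame; the paper's $T$ may denote the inverse mapping, and `raycast_local` stands in for the actual Warp mesh query.

```python
import numpy as np

def raycast_transformed(raycast_local, R: np.ndarray, t: np.ndarray,
                        origin: np.ndarray, direction: np.ndarray):
    # Express the ray in the mesh's local frame: p' = R^T (p - t), d' = R^T d.
    local_origin = R.T @ (origin - t)
    local_direction = R.T @ direction
    hit = raycast_local(local_origin, local_direction)  # hit point, or None
    if hit is None:
        return None
    # Map the hit point back into the query frame.
    return R @ hit + t
```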

Both static terrains and all dynamic robot meshes are included. Domain randomization is applied to sensor and system noise (a short sketch follows the list):

  • LiDAR pose noise $\sim \mathcal{N}(0, 1\,\mathrm{cm})$
  • Orientation jitter $\sim \mathcal{N}(0, (\pi/180)^2\,\mathrm{rad}^2)$
  • Hit-point noise $\sim \mathcal{N}(0, 1\,\mathrm{cm})$
  • Random latency of 100–200 ms (scan rate 10 Hz)
  • 2% random voxel dropout
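
As noted above, a hedged sketch of the perception-side randomizations; the noise magnitudes mirror the list, while the application order and helper structure are assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def randomize_scan(points: np.ndarray, grid: np.ndarray):
    # Hit-point noise: ~N(0, 1 cm) on every returned LiDAR point.
    noisy_points = points + rng.normal(0.0, 0.01, size=points.shape)
    # 2% random voxel dropout on the occupancy grid.
    keep = rng.random(grid.shape) >= 0.02
    return noisy_points, grid * keep
```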

Auxiliary randomizations include robot mass, friction, joint gains, COM offsets, and initial state noise. These perturbations are critical for bridging the sim-to-real gap in both perception and dynamics fidelity.

5. Large-Scale Parallel Training and Curriculum

Policy training uses parallel PPO with $8 \times 1024$ simulated environments over 4,000 iterations, 4 PPO epochs per iteration, and 8 minibatches. Main hyperparameters are $\gamma = 0.99$, $\lambda_{\mathrm{GAE}} = 0.95$, PPO clip $\epsilon = 0.2$, entropy coefficient 0.003, and learning rate $5 \times 10^{-4}$. A curriculum spans eight terrain types: Plane, Ceiling, Forest, Door, Platform, Pile, Up-stair, and Down-stair, with terrain difficulty $s \in [0,1]$ interpolating generation parameters $p_{\tau}(s) = (1-s)\, p_\tau^{\min} + s\, p_\tau^{\max}$.
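
The difficulty interpolation is a direct linear blend; a small Python sketch, where the terrain parameter names and bounds are placeholders rather than values from the paper:

```python
def terrain_params(s: float, p_min: dict, p_max: dict) -> dict:
    """Linearly interpolate terrain-generation parameters at difficulty s in [0, 1]."""
    return {k: (1.0 - s) * p_min[k] + s * p_max[k] for k in p_min}

# Hypothetical 'Platform' terrain with a gap width and step height.
params = terrain_params(0.5, {"gap": 0.1, "step": 0.05},
                             {"gap": 0.5, "step": 0.25})
```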

Reward function structure is as follows:

| Term | Description | Formula (see paper for details) |
|---|---|---|
| Reach | Bonus for reaching the target | $r_{\mathrm{reach}} = \frac{1}{1 + \|P_t\|^2} \cdot \frac{1}{T_r} \cdot \mathbb{1}[t > T - T_r]$ |
| Velocity Align | Direction-aligned velocity | $r_{\mathrm{vel\_dir}}$ |
| Head Height | Maintains head at a safe height | $r_{\mathrm{head}}$ |
| Foot Clearance | Ensures feet avoid collisions/obstacles | $r_{\mathrm{feet}}$ |
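
The reach term in the table translates directly to code; a sketch assuming $T$ is the episode horizon and $T_r$ the terminal reward window, as in the formula:

```python
import numpy as np

def reach_reward(p_t: np.ndarray, t: int, T: int, T_r: int) -> float:
    # Nonzero only in the final T_r steps; larger when the target offset P_t is small.
    if t <= T - T_r:
        return 0.0
    return 1.0 / (1.0 + float(np.dot(p_t, p_t))) / T_r
```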

Symmetry augmentation (mirroring across the $xy$ plane and voxel flipping) is applied to improve policy robustness.
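
A minimal sketch of the voxel-flipping half of this augmentation; which grid axis is mirrored (here the lateral $y$ axis, grid axis 1 in the (C, H, W) = (z, y, x) layout) is an assumption:

```python
import numpy as np

def mirror_voxels(grid: np.ndarray) -> np.ndarray:
    # Mirror the occupancy grid laterally to pair with the mirrored state/action.
    return np.flip(grid, axis=1).copy()
```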

6. Empirical Performance and Real-World Deployment

Simulation Results

Under worst-case terrain difficulty ($p_\tau^{\max}$) and over 1,000 evaluation episodes per terrain (5 random seeds), Gallant achieves the following success rates ($E_{\mathrm{succ}}$):

| Terrain | Plane | Ceiling | Forest | Door | Platform | Pile | Up-stair | Down-stair |
|---|---|---|---|---|---|---|---|---|
| Gallant (% success) | 100.0 | 97.1 | 84.3 | 98.7 | 96.1 | 82.1 | 96.2 | 97.9 |

Ablation experiments indicate:

  • Excluding self-scanning of dynamic meshes reduces ceiling traversal success to approximately 28%.
  • Replacing the z-grouped 2D CNN with full 3D or sparse variants lowers average success or increases inference latency.
  • Restricting perceptual input to height maps yields failures on overhead and multi-layered obstacles.
  • The 0.05 m voxel resolution is optimal for perceptual coverage and detail; finer (0.025 m) sacrifices field-of-view, coarser (0.10 m) loses necessary precision.

Real-World Results

Gallant's policy is deployed on a Unitree G1 robot, executing the full onboard pipeline (JT128 LiDAR to OctoMap to $32 \times 32 \times 40$ voxel grid at 10 Hz, control at 50 Hz) on an NVIDIA Orin NX module. Over 15 trials per terrain, Gallant achieves near-100% success traversing plane, stair, and platform scenarios, and consistently high performance ($>80\%$) on ceiling, door, and pile terrains.
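
The 10 Hz perception / 50 Hz control split can be realized with a simple rate-decoupled loop; the sketch below uses placeholder functions (`read_lidar_grid`, `policy_step`) rather than the real onboard stack.

```python
import time

CONTROL_DT = 1.0 / 50.0      # 50 Hz policy step
PERCEPTION_PERIOD = 5        # refresh grid every 5th control tick (10 Hz)

def control_loop(read_lidar_grid, policy_step, n_steps: int = 1000):
    grid = read_lidar_grid()
    for step in range(n_steps):
        if step % PERCEPTION_PERIOD == 0:
            grid = read_lidar_grid()         # 10 Hz voxel-grid update
        policy_step(grid)                    # 50 Hz joint-target command
        time.sleep(CONTROL_DT)               # stand-in for a real-time scheduler
```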

Gallant surpasses baselines:

  • The HeightMap-only baseline fails in environments with overhead or lateral obstacles.
  • The “NoDR” (no domain randomization) variant exhibits frequent collisions or incorrect gap navigation due to insufficient robustness to sim-to-real discrepancies.

Observed success rates in simulation strongly correlate with those in real-world deployment, supporting the effectiveness of Gallant's domain randomization and modeling fidelity (Ben et al., 18 Nov 2025).

7. Significance and Methodological Implications

Gallant demonstrates that an occupancy-based $32 \times 32 \times 40$ voxel grid, processed efficiently with a z-grouped 2D CNN, suffices for near-lossless encoding of 3D locospatial constraints relevant for humanoid locomotion and local navigation. The single policy learned via end-to-end PPO exhibits robustness across a spectrum of 3D-constrained tasks, including stair climbing and platform stepping, traditionally challenging for methods constrained to ground-plane or flattened representations.

The findings suggest that equipped with sufficiently high-fidelity simulation and perceptual domain randomization, voxel-grid-based methods are practical for real-world robotic control at scale. A plausible implication is the broader applicability of such approaches to other robotic platforms where lightweight yet comprehensive 3D perception is critical (Ben et al., 18 Nov 2025).

References

Ben et al., 18 Nov 2025.
