Gallant: Voxel-Grid Humanoid Locomotion
- The paper presents a novel voxel-grid framework that integrates LiDAR-based perception with a z-grouped 2D CNN to map complex 3D environments directly to control actions.
- It utilizes end-to-end reinforcement learning via PPO and domain randomization to ensure robust policy transfer from simulation to real-world humanoid locomotion.
- Empirical results demonstrate high success rates across varied terrains, surpassing traditional height-map approaches in environments with overhead and lateral obstacles.
Gallant is a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. The method leverages a compact, robot-centric voxel grid representation of the environment derived from LiDAR sensors and processes it with a z-grouped 2D convolutional neural network (CNN) to directly map perception to control actions. Gallant supports end-to-end policy optimization via deep reinforcement learning, employing high-fidelity LiDAR simulation with extensive domain randomization to ensure transferability from simulation to real-world deployment. The approach enables a single control policy to successfully navigate a diverse set of 3D environments, including scenarios involving lateral clutter, overhead obstacles, multi-level structures, and narrow passages, outperforming baselines that rely solely on height maps or more computationally expensive perception modules (Ben et al., 18 Nov 2025).
1. Voxel Grid Environmental Encoding
Gallant constructs a voxel grid as its core environment representation. The perception volume is a robot-centric cuboid defined in the torso frame and discretized at a 0.05 m voxel resolution. At each timestep, point clouds from the two torso-mounted LiDAR units are unified in the torso frame and voxelized into a binary occupancy tensor: a voxel is marked 1 if at least one LiDAR return falls inside it, and 0 otherwise.
This simple occupied/unoccupied encoding smooths out LiDAR sensor noise while preserving the critical multilayer structure required for robust navigation, including overhangs, ceilings, and gaps.
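The voxelization step above can be sketched in NumPy as follows (the cuboid bounds and the 0.5 m resolution in the example are illustrative only; the paper's exact perception-volume extents are not reproduced here):

```python
import numpy as np

def voxelize(points, bounds, res=0.05):
    """Binary occupancy grid from a point cloud in the robot's torso frame.

    points: (N, 3) array of LiDAR returns.
    bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)) of the perception cuboid.
    res:    edge length of one voxel in meters.
    """
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    shape = np.round((hi - lo) / res).astype(int)
    grid = np.zeros(shape, dtype=np.uint8)
    # Map each point to its voxel index and discard points outside the cuboid.
    idx = np.floor((points - lo) / res).astype(int)
    inside = np.all((idx >= 0) & (idx < shape), axis=1)
    idx = idx[inside]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1  # occupied if >= 1 return inside
    return grid
```

The binary (rather than count-based) encoding makes the grid insensitive to how many returns happen to land in a voxel, which is one source of the noise robustness noted above.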
2. Perception Processing via z-Grouped 2D CNN
To map voxel occupancy grids to actionable perceptual features, Gallant employs a z-grouped 2D CNN architecture. Instead of computationally intensive 3D convolutions, the z-axis is treated as the channel dimension, and standard 2D convolutions are performed across the x–y plane, each followed by a Mish activation. The processing pipeline consists of three repetitions of Conv2D (stride 2, padding 1) with 8 output channels each, followed by flattening and two fully-connected layers (256 and 64 units, respectively, both with Mish activations), resulting in a compact 64-dimensional feature vector.
This approach substantially reduces the computational load compared to full 3D CNNs, allowing real-time deployment on modern embedded hardware, while the per-channel treatment of the z-axis preserves the 3D structure of the environment necessary for safe foot and head clearance.
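The encoder described above can be sketched in PyTorch as follows. The grid dimensions (`nz`, `nx`, `ny`) and kernel size are illustrative assumptions; the stated design choices (z-axis as channels, three stride-2 Conv2D stages with 8 channels, Mish activations, 256- and 64-unit heads) follow the text:

```python
import torch
import torch.nn as nn

class ZGroupedEncoder(nn.Module):
    """z-grouped 2D CNN: the voxel grid's z-axis becomes the channel axis,
    so only cheap 2D convolutions over the x-y plane are required."""

    def __init__(self, nz=16, nx=32, ny=32, feat_dim=64):
        super().__init__()
        layers, in_ch = [], nz
        for _ in range(3):  # three Conv2D stages, stride 2, padding 1, 8 channels
            layers += [nn.Conv2d(in_ch, 8, kernel_size=3, stride=2, padding=1),
                       nn.Mish()]
            in_ch = 8
        self.conv = nn.Sequential(*layers)
        flat = 8 * (nx // 8) * (ny // 8)  # each stride-2 stage halves x and y
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 256), nn.Mish(),
            nn.Linear(256, feat_dim), nn.Mish(),
        )

    def forward(self, voxels):
        # voxels: (batch, nz, nx, ny) binary occupancy tensor
        return self.head(self.conv(voxels))
```

Treating z as channels means the first convolution still mixes information across all height layers, which is why overhead structure survives the reduction to 2D convolutions.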
3. End-to-End Control Policy Optimization
Locomotion is formalized as a partially observable Markov decision process (POMDP). Training is performed via Proximal Policy Optimization (PPO), with the observation comprising:
- Target bearing
- Elapsed/remaining time
- Past actions
- Proprioceptive history
- Current voxel grid
- Privileged information
The policy (actor network) outputs desired joint position targets, while the critic receives the full observation plus privileged inputs. The clipped PPO objective is:

$$L^{\mathrm{CLIP}}(\theta)=\mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$$

with probability ratio $r_t(\theta)=\pi_\theta(a_t\mid o_t)/\pi_{\theta_{\text{old}}}(a_t\mid o_t)$. The full objective also includes a value loss and an entropy bonus.
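The clipped surrogate can be sketched in NumPy (the clip parameter `eps=0.2` here is the common PPO default, used for illustration, not a value taken from the paper):

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps=0.2):
    """Negated clipped PPO surrogate, averaged over a batch.

    ratio: r_t = pi_theta(a|o) / pi_theta_old(a|o), shape (T,)
    adv:   advantage estimates A_t, shape (T,)
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # min() keeps the pessimistic surrogate; negate for gradient descent.
    return -np.mean(np.minimum(unclipped, clipped))
```

For a positive advantage, the clip caps how far the ratio can push the update (e.g. a ratio of 1.5 is treated as 1.2 under `eps=0.2`), which is the mechanism that stabilizes the large-scale parallel training described below.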
Reward shaping incorporates a sparse reach reward, velocity alignment, head-height maintenance, and foot clearance, all geometry-aware for traversing complex 3D structures.
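As an illustration of what such geometry-aware shaping terms can look like, here is a hypothetical sketch of two of them; the functional forms, thresholds, and names are assumptions for exposition, not the paper's formulas:

```python
import numpy as np

def velocity_align_reward(v_xy, target_dir):
    """Reward the planar velocity component along the unit direction
    to the navigation target (hypothetical form)."""
    return float(np.dot(v_xy, target_dir))

def head_height_reward(head_z, z_safe, ceiling_z, margin=0.1):
    """Penalize a head that drops below a safe height or comes within
    `margin` of an overhead obstacle (hypothetical form)."""
    too_low = max(0.0, z_safe - head_z)
    too_high = max(0.0, head_z - (ceiling_z - margin))
    return -(too_low + too_high)
```

The key property, reflected in both sketches, is that the terms are functions of the surrounding geometry (target direction, ceiling height) rather than of a fixed gait template.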
4. High-Fidelity LiDAR Simulation and Domain Randomization
The training pipeline employs a high-fidelity LiDAR simulation based on NVIDIA Warp that computes ray-mesh intersections directly in the robot's body frames.
Both static terrains and all dynamic robot meshes are included. Domain randomization is applied to sensor and system noise:
- LiDAR pose noise
- Orientation jitter
- Hit-point noise
- Random latency 100–200 ms (scan rate 10 Hz)
- 2% random voxel dropout
Auxiliary randomizations include robot mass, friction, joint gains, COM offsets, and initial state noise. These perturbations are critical for bridging the sim-to-real gap in both perception and dynamics fidelity.
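A minimal sketch of the perception-side randomizations applied per scan (the 2% dropout matches the text; the hit-point noise magnitude is an illustrative assumption):

```python
import numpy as np

def randomize_scan(points, hit_noise=0.01, dropout=0.02, rng=None):
    """Perturb a simulated LiDAR scan before voxelization.

    points:    (N, 3) simulated hit points in the torso frame.
    hit_noise: std of Gaussian hit-point noise in meters (assumed value).
    dropout:   fraction of returns randomly discarded (2% per the text).
    """
    rng = rng or np.random.default_rng()
    noisy = points + rng.normal(0.0, hit_noise, size=points.shape)
    keep = rng.random(len(points)) >= dropout  # random voxel/return dropout
    return noisy[keep]
```

Latency randomization (100–200 ms at a 10 Hz scan rate) would additionally be modeled by feeding the policy a scan from one to two control steps in the past rather than the current one.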
5. Large-Scale Parallel Training and Curriculum
Policy training uses massively parallel PPO over 4,000 iterations, with 4 PPO epochs per iteration and 8 minibatches. Main hyperparameters include an entropy coefficient of 0.003, alongside standard PPO discount, GAE, clip, and learning-rate settings (see the paper for the full list). A curriculum spans eight terrain types: Plane, Ceiling, Forest, Door, Platform, Pile, Up-stair, and Down-stair, with terrain difficulty controlled by interpolating the terrain-generation parameters.
The reward structure is summarized below (see the paper for the exact formulas):

| Term | Description |
|---|---|
| Reach | Sparse bonus for reaching the target |
| Velocity Align | Rewards velocity aligned toward the target |
| Head Height | Maintains the head at a safe height |
| Foot Clearance | Ensures the feet avoid collisions and obstacles |
Symmetry augmentation (mirroring states and actions across the robot's sagittal plane, with the corresponding lateral flip of the voxel grid) is applied to improve policy robustness.
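A sketch of this left-right symmetry augmentation, assuming a voxel grid laid out as (z, x, y) with y as the lateral axis, and a hypothetical joint mirror permutation `mirror_idx` with per-joint sign flips `sign` (both robot-specific assumptions):

```python
import numpy as np

def mirror_sample(voxels, actions, mirror_idx, sign):
    """Generate the mirrored training sample for symmetry augmentation.

    voxels:     (nz, nx, ny) occupancy grid; y is assumed lateral.
    actions:    (J,) joint position targets.
    mirror_idx: (J,) permutation swapping left/right joints.
    sign:       (J,) +1/-1 flips for joints whose sign changes under mirroring.
    """
    v_mirrored = voxels[:, :, ::-1].copy()   # flip the lateral axis
    a_mirrored = sign * actions[mirror_idx]  # permute and sign-flip joints
    return v_mirrored, a_mirrored
```

Training on both the original and mirrored samples discourages the policy from developing an arbitrary left/right preference in symmetric terrain such as doors and piles.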
6. Empirical Performance and Real-World Deployment
Simulation Results
Under worst-case terrain difficulty, over 1,000 evaluation episodes per terrain (5 random seeds), Gallant achieves the following success rates (%):
| Terrain | Plane | Ceiling | Forest | Door | Platform | Pile | Up-stair | Down-stair |
|---|---|---|---|---|---|---|---|---|
| Gallant | 100.0 | 97.1 | 84.3 | 98.7 | 96.1 | 82.1 | 96.2 | 97.9 |
Ablation experiments indicate:
- Excluding self-scanning of dynamic meshes reduces ceiling traversal success to 28%.
- Replacing the z-grouped 2D CNN with full 3D or sparse variants lowers average success or increases inference latency.
- Restricting perceptual input to height maps yields failures on overhead and multi-layered obstacles.
- The 0.05 m voxel resolution is optimal for perceptual coverage and detail; finer (0.025 m) sacrifices field-of-view, coarser (0.10 m) loses necessary precision.
Real-World Results
Gallant's policy is deployed on a Unitree G1 robot, executing the full onboard pipeline (JT128 LiDAR to OctoMap to voxel grid at 10 Hz, control at 50 Hz) on an NVIDIA Orin NX module. Over 15 trials per terrain, Gallant achieves near-100% success traversing plane, stair, and platform scenarios, and consistently high performance on ceiling, door, and pile terrains.
Gallant surpasses baselines:
- The HeightMap-only baseline fails in environments with overhead or lateral obstacles.
- The “NoDR” (no domain randomization) variant exhibits frequent collisions or incorrect gap navigation due to insufficient robustness to sim-to-real discrepancies.
Observed success rates in simulation strongly correlate with those in real-world deployment, supporting the effectiveness of Gallant's domain randomization and modeling fidelity (Ben et al., 18 Nov 2025).
7. Significance and Methodological Implications
Gallant demonstrates that an occupancy-based voxel grid, processed efficiently with a z-grouped 2D CNN, suffices for near-lossless encoding of 3D locospatial constraints relevant for humanoid locomotion and local navigation. The single policy learned via end-to-end PPO exhibits robustness across a spectrum of 3D-constrained tasks, including stair climbing and platform stepping, traditionally challenging for methods constrained to ground-plane or flattened representations.
The findings suggest that equipped with sufficiently high-fidelity simulation and perceptual domain randomization, voxel-grid-based methods are practical for real-world robotic control at scale. A plausible implication is the broader applicability of such approaches to other robotic platforms where lightweight yet comprehensive 3D perception is critical (Ben et al., 18 Nov 2025).