SemLoco: Semantic-Aware Quadruped Locomotion

Updated 4 July 2026

SemLoco is a semantic-aware quadruped locomotion framework that fuses semantic maps with elevation data to safely select footholds in cluttered spaces.
It performs pixel-wise foothold safety inference, integrating semantic costs with Raibert-style nominal step planning to maintain dynamic stability.
A two-stage reinforcement learning curriculum, progressing from soft virtual obstacles to rigid obstacle interaction, underpins its robust performance.

SemLoco is a semantic-aware quadruped locomotion framework for cluttered environments in which the principal hazard is not large terrain discontinuities but low-lying, semantically important objects such as cables, smartphones, small devices, boxes, and debris. It was introduced in "Watch Your Step: Learning Semantically-Guided Locomotion in Cluttered Environment" and is designed to connect semantic scene understanding directly to foothold-level control rather than leaving semantics at a high-level navigation layer. The framework combines a semantic map with an elevation map, performs pixel-wise foothold safety inference around nominal Raibert footholds, and trains a low-level controller by a two-stage reinforcement-learning curriculum with soft and hard constraints (Liang et al., 3 Mar 2026; arXiv:2603.02657).

1. Conceptual scope and target problem

SemLoco addresses a failure mode that geometry-centric perceptive locomotion systems handle poorly: inadvertent stepping on small, low-profile, or fragile objects in cluttered indoor spaces. The paper’s motivating examples are cables, consumer electronics, phones, and scattered debris. In this regime, a robot may remain dynamically stable while still damaging the environment, because objects that are semantically costly can be geometrically subtle (Liang et al., 3 Mar 2026).

The framework is built on three observations. First, low-lying objects are often too small to stand out in depth maps, and may be flattened into ground by noise, smoothing, and interpolation in elevation mapping. Second, pure geometry is semantically ambiguous: height variation may denote a hazard, a benign traversable surface, or an intended foothold. Third, semantics used only at the route-planning layer are too coarse for dense clutter, because a globally reasonable path can still yield a locally unsafe foot placement. SemLoco therefore treats semantics specifically as a foothold-safety signal rather than only a navigation preference.

This makes SemLoco distinct from systems that merely augment high-level planning with semantic costs. Its novelty is to use semantics to shape explicit foothold targets and low-level control. The semantic map is not an auxiliary visualization; it is converted into traversability costs and coupled to the local foothold planner and the RL policy.

2. System assumptions, state representation, and operating regime

The platform is a Unitree Go2 quadruped in both simulation and real deployment. The system assumes access to proprioception, a local elevation map, and a local semantic map centered on the robot. In the real system, semantic perception is produced by a head-mounted spatial memory module called Odin1, and all computation is performed onboard. The controller runs at $50\,\text{Hz}$, perception updates asynchronously at $8$–$10\,\text{Hz}$, forward speed is capped at $0.7\,\text{m/s}$, and the robot can automatically stop under low map confidence (Liang et al., 3 Mar 2026).

The local map covers $1.5\,\text{m}$ in the forward direction and $1.2\,\text{m}$ laterally, with $0.05\,\text{m}$ resolution, producing a $30\times24$ grid, or $720$ cells. The observation is defined as

$\mathbf{O}_t= \left[ \mathbf{C}_t^{\text{cmd}}, \mathbf{B}_t, \mathbf{O}_t^{\text{proprio}}, \mathbf{O}_t^{\text{extero}} \right]\in\mathbb{R}^{1513}.$

The command vector is

[equation missing: definition of $\mathbf{C}_t^{\text{cmd}}$]

and the behavior variables $\mathbf{B}_t$ form a [dimension missing]-dimensional vector containing the nominal base height, foot swing height, timing offsets for foot pairs, contact-state timers, gait frequency, duty factor, stance width, and stance length.

The proprioceptive state is

[equation missing: definition of $\mathbf{O}_t^{\text{proprio}}$]

where the terms denote estimated base linear velocity, base angular velocity, gravity vector in body frame, joint positions, joint velocities, and the previous two actions. The exteroceptive state is

[equation missing: definition of $\mathbf{O}_t^{\text{extero}}$, comprising the elevation map and the semantic map]

with each map in $\mathbb{R}^{30\times24}$.

The action is a $12$-dimensional joint-position offset vector,

[equation missing: definition of the action vector]

tracked by a PD controller with gains [values missing]. In effect, SemLoco operates as a local semantic-contact policy over a robot-centered map rather than as a global semantic planner.
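The stated sizes can be cross-checked with a little arithmetic. The sketch below assumes 12 actuated joints (as on the Go2), three-component velocity and gravity vectors, and exactly two exteroceptive maps. The combined size of the command and behavior vectors is computed as a remainder and is not taken from the paper.

```python
# Bookkeeping for the 1513-dimensional observation.
# Assumptions (not spelled out in the text): 12 joints, 3-D velocity and
# gravity vectors, and two exteroceptive maps (elevation + semantic).

MAP_FORWARD_M = 1.5
MAP_LATERAL_M = 1.2
RESOLUTION_M = 0.05
NUM_JOINTS = 12
TOTAL_OBS_DIM = 1513

# Grid size of one local map.
rows = round(MAP_FORWARD_M / RESOLUTION_M)
cols = round(MAP_LATERAL_M / RESOLUTION_M)
cells_per_map = rows * cols

# Elevation map plus semantic map, each flattened.
extero_dim = 2 * cells_per_map

# Base linear velocity, base angular velocity, gravity vector,
# joint positions, joint velocities, and the previous two actions.
proprio_dim = 3 + 3 + 3 + NUM_JOINTS + NUM_JOINTS + 2 * NUM_JOINTS

# Whatever is left must be shared by the command and behavior vectors.
command_plus_behavior_dim = TOTAL_OBS_DIM - extero_dim - proprio_dim

print(rows, cols, cells_per_map, extero_dim, proprio_dim,
      command_plus_behavior_dim)
```

Under these assumptions, 16 entries remain for the command and behavior vectors together. A three-component velocity command would leave 13 behavior variables, but that split is a guess rather than a figure from the source.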

3. Semantic foothold planning and pixel-wise safety inference

At deployment time, SemLoco fuses a semantic map with a depth-derived elevation map. The semantic map is converted into traversability costs via class-specific weights, so semantic categories are collapsed into values that reflect how undesirable stepping on them would be. The paper’s examples focus on fragile or undesirable-to-step-on objects rather than open-vocabulary semantics or dense scene graphs (Liang et al., 3 Mar 2026).
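The conversion from class labels to traversability costs can be illustrated with a small lookup. This is a minimal sketch: the class IDs, class names, and weights are hypothetical, since the paper's actual label set and weights are not given here.

```python
# Hypothetical class IDs and per-class stepping costs (illustrative only).
CLASS_COST = {
    0: 0.0,  # floor: free to step on
    1: 1.0,  # cable: should not be stepped on
    2: 1.0,  # phone: should not be stepped on
    3: 0.5,  # box: undesirable but less critical
}


def semantic_cost_map(label_map, class_cost, default_cost=1.0):
    """Collapse a 2-D grid of class labels into a grid of scalar costs.

    Labels missing from `class_cost` get `default_cost`, a conservative
    choice made for this sketch (unknown objects are treated as hazards).
    """
    return [[class_cost.get(label, default_cost) for label in row]
            for row in label_map]


labels = [[0, 1],
          [3, 9]]  # 9 is an unknown class in this toy example
costs = semantic_cost_map(labels, CLASS_COST)
```

The result is a single-channel cost grid with the same shape as the label grid, matching the idea that semantic categories are compressed into one traversability value per cell.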

The foothold planner begins from a Raibert-style nominal foothold. For each leg,

[equation missing: nominal foothold]

with stance duration

[equation missing: stance duration]

linear offset

[equation missing: linear offset]

and yaw-induced lateral offset

[equation missing: yaw-induced lateral offset]

Around each nominal foothold, the method constructs a local candidate grid. Each candidate foothold is scored by

[equation missing: candidate foothold cost]

Here one term penalizes deviation from the dynamically desirable Raibert step, while the other is a much larger penalty for semantic collision. Collision is tested against semantically known obstacles with axis-aligned bounding box (AABB) tests, dilated by the foot radius to provide a safety margin. The selected foothold is

[equation missing: the minimum-cost candidate]

This local search is what the paper refers to as pixel-wise foothold safety inference. Operationally, it evaluates map cells or local candidate positions around the nominal step and infers an exact safe support location for each swing leg. The inference is semantic-aware because the unsafe set is induced by semantic classes rather than only by elevation.
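The local search described above can be sketched as a brute-force scan over a small grid. This is an illustration under stated assumptions, not the paper's implementation: the function names, grid size, weights `w_dev` and `w_col`, foot radius, and the Euclidean deviation measure are all hypothetical, and the nominal foothold is taken as given.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a semantically flagged obstacle."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def collides(point, box, foot_radius):
    """AABB test with the box dilated by the foot radius."""
    x, y = point
    return (box.x_min - foot_radius <= x <= box.x_max + foot_radius
            and box.y_min - foot_radius <= y <= box.y_max + foot_radius)


def select_foothold(nominal, obstacles, foot_radius=0.03, half_width=3,
                    step=0.05, w_dev=1.0, w_col=1000.0):
    """Return the lowest-cost candidate around the nominal foothold.

    cost = w_dev * distance_from_nominal + w_col * [candidate collides]
    All default values are illustrative assumptions.
    """
    best, best_cost = None, math.inf
    for i in range(-half_width, half_width + 1):
        for j in range(-half_width, half_width + 1):
            cand = (nominal[0] + i * step, nominal[1] + j * step)
            deviation = math.hypot(cand[0] - nominal[0],
                                   cand[1] - nominal[1])
            hit = any(collides(cand, b, foot_radius) for b in obstacles)
            cost = w_dev * deviation + (w_col if hit else 0.0)
            if cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost


# A small object sits exactly where the nominal step would land.
phone = Box(-0.04, -0.04, 0.04, 0.04)
foothold, cost = select_foothold((0.0, 0.0), [phone])
```

Because the collision weight dwarfs the deviation weight, the search moves the foothold just far enough to clear the dilated box and otherwise stays as close to the nominal step as the grid allows.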

The RL controller is coupled to this planner through an explicit semantic foothold-tracking reward:

[equation missing: semantic foothold-tracking reward]

Thus the policy is not merely given semantic input; it is trained to realize semantically safe foothold targets.

The method also introduces a unilateral minimum-clearance penalty. The swing reference height is

[equation missing: swing reference height]

and the penalty is

[equation missing: one-sided clearance penalty]

Because the penalty is one-sided, the robot is punished only for being too low, not for lifting higher than nominal. This detail is central to the method’s ability to clear clutter.
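The difference between a one-sided penalty and ordinary trajectory tracking is easy to see in code. The sketch below assumes a linear ReLU-style form; the paper's exact functional form and scaling are not reproduced here, and both function names are hypothetical.

```python
def clearance_penalty(foot_height, reference_height):
    """One-sided penalty: positive only when the foot is below the reference."""
    return max(0.0, reference_height - foot_height)


def symmetric_tracking_penalty(foot_height, reference_height):
    """Two-sided alternative: also penalizes lifting higher than the reference."""
    return abs(reference_height - foot_height)


reference = 0.08  # illustrative swing reference height in metres

too_low = clearance_penalty(0.05, reference)       # penalized
high_swing = clearance_penalty(0.12, reference)    # not penalized
high_swing_sym = symmetric_tracking_penalty(0.12, reference)  # penalized
```

With the one-sided form, a swing that rises above the reference to clear an object costs nothing, whereas a symmetric tracking term would pull the foot back down toward the nominal trajectory.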

4. Reinforcement-learning formulation and two-stage curriculum

SemLoco uses a two-stage RL curriculum motivated by the exploration difficulty of cluttered locomotion. If rigid obstacles are present from the start, early errors cause trips, falls, and terminations before the policy has learned stable walking or perception-action coupling. The method therefore decomposes learning into semantic stepping under soft constraints and adaptation under hard contact constraints (Liang et al., 3 Mar 2026).

The total reward has a multiplicative structure,

[equation missing: total reward]

where

[equation missing: definition of the reward components]

The velocity term is

[equation missing: velocity-tracking term]

and the penalty term aggregates costs such as body collisions, torque or velocity regularization, and clearance violations.

Stage 1: virtual obstacles

In the first stage, obstacles exist only in perception. They appear in the elevation and semantic maps, but physical collisions are disabled. The robot still walks on flat ground; legs can pass through obstacles without contact forces or episode termination. The paper characterizes this as learning under soft constraints.

This stage teaches the policy how to read semantic and elevation maps, how to infer collision-free foothold targets through the semantic-aware local search, and how to steer swing feet toward those targets. Because there are no rigid-body consequences, exploration remains dense and the semantic reward provides a usable training signal. Obstacle density is also curriculum-controlled so that the policy does not collapse into standing still.

Stage 2: rigid obstacles

In the second stage, the converged policy is transferred into environments with actual rigid obstacles and increased clutter. This is the hard-constraint stage. The policy now adapts previously learned semantic stepping behavior to true contact dynamics, including frictional interaction, accidental contact, body collision, and the need to physically clear obstacles.

The paper argues that this decomposition is crucial. Without the virtual-obstacle stage, training performance degrades sharply, indicating that semantic foothold selection and stable gait learning are too tightly coupled to learn efficiently from scratch in rigid clutter.
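The two stages differ mainly in whether obstacles have physical consequences. A configuration sketch makes the contrast explicit. Every field name, density value, and the linear density ramp below are hypothetical and chosen for illustration; they do not come from the paper or from any simulator API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    name: str
    obstacles_visible_in_maps: bool   # obstacles rendered into both maps
    rigid_collisions_enabled: bool    # obstacles exert contact forces
    max_obstacle_density: float       # illustrative value and units
    init_from: Optional[str] = None   # stage whose policy is the starting point


STAGE_1 = StageConfig(
    name="virtual_obstacles",
    obstacles_visible_in_maps=True,
    rigid_collisions_enabled=False,   # soft constraints: legs pass through
    max_obstacle_density=1.0,
)

STAGE_2 = StageConfig(
    name="rigid_obstacles",
    obstacles_visible_in_maps=True,
    rigid_collisions_enabled=True,    # hard constraints: real contact
    max_obstacle_density=2.0,         # increased clutter
    init_from=STAGE_1.name,
)


def density_for_level(cfg, level, num_levels):
    """Linear density ramp over curriculum levels (an assumed schedule)."""
    return cfg.max_obstacle_density * (level + 1) / num_levels
```

Ramping density gradually within a stage is one simple way to keep early episodes from being so cluttered that standing still becomes the best strategy.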

Network and training configuration

The network contains a CNN encoder for the exteroceptive maps, actor and critic MLPs, and a separate base-velocity estimator. The map features are concatenated with the remaining observations and passed through an MLP with hidden sizes [values missing] and ELU activations. The base-velocity estimator is an MLP with hidden sizes [values missing], also using ELU, trained with simulator ground-truth velocity.

Training is performed in IsaacLab with [count missing] parallel Unitree Go2 simulations on a single NVIDIA RTX 5090. The reported training times are about [value missing] hours for the soft virtual stage and [value missing] hours for the hard rigid stage. The simulation evaluation environment is Isaac Sim. The paper states that a Symmetric Actor-Critic framework is adopted and later says that PPO is used to train policies; this suggests an actor-critic implementation optimized with PPO, although the explicit clipped PPO objective is not printed in the provided formulation (Liang et al., 3 Mar 2026).

5. Evaluation, ablations, and real-world behavior

Simulation evaluation uses a straight track of length [value missing] at four obstacle densities [values and units missing]. The robot is commanded to move at [speed missing], and each policy is evaluated on the same [count missing] randomly generated environments per density. The reported metrics are success rate, average distance to failure, and step collision rate (Liang et al., 3 Mar 2026).

Obstacle density	Full SemLoco	Blind baseline
[missing]	[missing], [missing], [missing]	[missing], [missing], [missing]
[missing]	[missing], [missing], [missing]	[missing], [missing], [missing]
[missing]	[missing], [missing], [missing]	[missing], [missing], [missing]
[missing]	[missing], [missing], [missing]	[missing], [missing], [missing]
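The three reported metrics can be computed from per-episode logs along the following lines. The definitions used here are assumptions: distance is averaged over all episodes, and step collision rate is taken as collided footsteps divided by total footsteps. The `Episode` record and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    reached_goal: bool
    distance_travelled_m: float
    steps_taken: int
    steps_in_collision: int


def summarize(episodes):
    """Return (success_rate, average_distance_m, step_collision_rate)."""
    n = len(episodes)
    success_rate = sum(e.reached_goal for e in episodes) / n
    average_distance = sum(e.distance_travelled_m for e in episodes) / n
    total_steps = sum(e.steps_taken for e in episodes)
    collided_steps = sum(e.steps_in_collision for e in episodes)
    step_collision_rate = collided_steps / total_steps
    return success_rate, average_distance, step_collision_rate


# Toy data, not results from the paper.
logs = [
    Episode(True, 10.0, 40, 2),
    Episode(False, 4.0, 10, 3),
]
success, distance, collision = summarize(logs)
```

Evaluating every policy on the same set of generated environments, as the text describes, keeps these aggregates comparable across methods.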

Across these densities, the full method reduces step collisions by roughly [range missing] relative to the blind policy. The most important ablation concerns the curriculum: without the virtual-obstacle stage, success rate and average distance to failure both drop and the collision rate rises at the tested obstacle density [values missing]. This supports the claim that soft-constraint pretraining is not an auxiliary trick but a structural part of the method.

The clearance formulation is also consequential. Replacing the unilateral ReLU-style minimum-clearance penalty with stricter trajectory tracking raises the collision rate and reduces the success rate at the tested obstacle density [values missing]. The result indicates that cluttered locomotion requires slack for higher-than-nominal swing trajectories.

Ablation without the semantic map is more nuanced. In simulation, traversal remains relatively strong: at the tested obstacle density the success rate holds steady, but the collision rate worsens [values missing]. The paper interprets this as evidence that idealized synthetic obstacles can sometimes be handled from elevation alone, while real low-profile hazards expose the need for semantics.

Real-world tests are qualitative in the provided description but emphasize three behaviors. First, depth and elevation often fail to show small objects such as power cables and smartphones, whereas the semantic map highlights them clearly as hazards. Second, SemLoco exhibits proactive stride modulation: if a nominal step would land on a phone, the controller may shorten one stride to land safely before it and then execute a later high-clearance swing to pass the object. Third, compared with the blind baseline, SemLoco avoids the bulldozer-like behavior of marching through clutter, kicking and trampling items.

6. Relation to adjacent work, misconceptions, and limitations

SemLoco is positioned against three neighboring lines of work. Relative to geometry-based perceptive locomotion, its contribution is to decouple semantic hazard from geometric height; low-profile objects may be semantically costly while barely visible in elevation maps. Relative to semantic navigation and planning, its contribution is to bring semantics down to foothold selection rather than limiting them to path choice or gait-level adjustment. Relative to optimization-based foothold planners, it couples local refinement to a Raibert-style nominal foothold so that target foot placements remain dynamically coherent with the gait phase and commanded motion (Liang et al., 3 Mar 2026).

Several misconceptions follow from the name. SemLoco is not a global semantic navigation stack; it is a local semantic-contact planner operating over a robot-centered semantic and elevation map. It is not geometry-free; it explicitly fuses semantics with an elevation map and uses both in the policy observation. It is also not an open-vocabulary scene-understanding system; the paper states that semantics are compressed into a single fragility or traversability cost rather than richer open-vocabulary behavior. A plausible implication is that the current semantic representation is intentionally task-specific rather than fully general.

The reported limitations are equally specific. On the dynamics side, extreme asymmetric footholds can induce angular momentum and yaw drift, joints may approach kinematic singularities, aggressive clearance trajectories may exceed actuator bandwidth, and tracking lag can still cause toe-stubbing. On the perception and simulation side, obstacles in simulation are simplified geometric primitives, producing a sim-to-real gap in object-boundary realism. The semantic representation is coarse, and the paper does not claim richer semantic reasoning than traversability-cost assignment.

The name also requires disambiguation. SemLoco refers here to the locomotion framework in "Watch Your Step: Learning Semantically-Guided Locomotion in Cluttered Environment" (Liang et al., 3 Mar 2026). It is distinct from "SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization," which addresses semantic bug localization in Python programs rather than robot locomotion (Yang et al., 31 Mar 2026). It is also distinct from LOCI and Loci-Segmented, which focus on disentangling location and identity in object-centric video tracking and scene segmentation rather than foothold planning for legged robots (Traub et al., 2022, Traub et al., 2023).

Taken together, SemLoco reframes safe indoor legged locomotion as a semantic contact-planning problem. Its core claim is not merely that semantics improve locomotion, but that semantics must influence explicit foothold targets and low-level control if a robot is to avoid stepping on the wrong things.