HumanoidVLM: Retrieval-Augmented Manipulation Control

Updated 4 July 2026

HumanoidVLM is a retrieval-augmented vision-language framework that converts egocentric camera images into task-specific impedance and gripper parameters.
It uses semantic scene understanding with a FAISS-based retrieval module to select experimentally validated control settings for contact-rich manipulation tasks.
Experiments on the Unitree G1 show 93% retrieval accuracy (13 of 14 test images) and stable execution across several tabletop tasks, indicating its potential for adaptive, interpretable robotic control.

HumanoidVLM is a retrieval-augmented vision-language framework for humanoid manipulation that maps an egocentric camera image directly to task-appropriate low-level interaction parameters for the Unitree G1 humanoid. Rather than generating motor torques or end-to-end policies, it uses a vision-LLM as a semantic inference front-end, a FAISS-based Retrieval-Augmented Generation module to select experimentally validated Cartesian impedance and gripper parameters, and a task-space impedance controller for compliant execution. In this sense, HumanoidVLM is best understood as a semantic-to-control bridge for contact-rich tabletop manipulation, especially in scenarios where fixed, hand-tuned impedance gains and manually chosen gripper settings are too rigid for the task and object at hand (Mahmoud et al., 21 Jan 2026).

1. Definition, scope, and problem formulation

HumanoidVLM addresses contact-rich tabletop manipulation by a humanoid robot, with emphasis on tasks in which safe and effective interaction depends on matching compliance to the object and task. The examples studied include following an irregular surface, applying pressure while holding a massage ball, bimanual placement of a sauce bottle and an egg, tool interaction with a fork, and grasping and lifting tabletop objects. The motivating claim is that a single stiffness and damping choice cannot simultaneously support soft contact for fragile objects, stronger interaction for pressing tasks, moderate transient forces for poking with a tool, and stable but not crushing grasp for lifting (Mahmoud et al., 21 Jan 2026).

The system is framed against two limitations of conventional practice: fixed, hand-tuned impedance gains and manually specified gripper configurations. These are presented as inadequate for scene geometry changes, object properties, and task intent. HumanoidVLM therefore uses semantic scene understanding to infer what interaction regime is required, but it does not ask the vision-LLM to predict continuous control gains directly. Instead, the semantic task label inferred from the image becomes a retrieval key into a small database of experimentally validated control parameters, which are then executed through a Cartesian impedance controller (Mahmoud et al., 21 Jan 2026).

This positioning distinguishes HumanoidVLM from several adjacent strands in humanoid vision-language research. HuBo-VLM is a robot-oriented multimodal perception-and-language grounding model whose demonstrated capability is instruction-conditioned localization of a referred object in an image, not low-level control selection (Dong et al., 2023). Trinity combines RL, LLM, and VLM for full humanoid manipulation pipelines, but its VLM is used primarily for operable-part localization and 3D grounding before planning rather than impedance retrieval (Sun et al., 11 Mar 2025). SafeHumanoid similarly links VLM-RAG to safe impedance and velocity scheduling, but focuses on upper-body impedance and human-presence-aware safety rather than the specific contact-rich manipulation framing of HumanoidVLM (Mahmoud et al., 28 Nov 2025).

2. System architecture and inference pipeline

The full HumanoidVLM pipeline begins with a single egocentric RGB image from the humanoid’s head-mounted camera and ends with compliant execution on the robot. The architecture has three stages: vision-language reasoning, retrieval of control parameters, and task-space execution. The image is processed by Molmo-7B-O BnB 4-bit, which infers the manipulation task through structured visual reasoning, specifically sequential yes/no queries. The output of this stage is a semantic task label describing the manipulation scenario (Mahmoud et al., 21 Jan 2026).
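The sequential yes/no querying can be pictured as walking a small decision tree in which each answer narrows the set of candidate task labels. The sketch below is illustrative only: the questions, the tree layout, and the `ask` callable standing in for a Molmo query are all assumptions made for this example, and none of them are taken from the paper.

```python
from typing import Callable, Union

# A node is either a final task label (str) or a (question, yes_branch, no_branch) tuple.
Node = Union[str, tuple]

# Hypothetical question tree; the questions and their order are illustrative assumptions.
TASK_TREE: Node = (
    "Are both grippers holding an object?",
    "dual placement",
    (
        "Is the gripper holding a tool?",
        "tool interaction",
        (
            "Is the gripper holding a ball?",
            "apply pressure",
            (
                "Is there a graspable object on the table?",
                "grasp from table",
                "follow surface",
            ),
        ),
    ),
)


def infer_task_label(ask: Callable[[str], bool], tree: Node = TASK_TREE) -> str:
    """Walk the tree, asking one yes/no question per level, until a label is reached."""
    node = tree
    while not isinstance(node, str):
        question, yes_branch, no_branch = node
        node = yes_branch if ask(question) else no_branch
    return node


# Stand-in for the VLM: answers "yes" only to questions that mention a tool.
def fake_vlm(question: str) -> bool:
    return "tool" in question


label = infer_task_label(fake_vlm)
print(label)  # tool interaction
```

In a real deployment the `ask` callable would wrap one image-plus-question call to the vision-language model and parse its yes/no answer. The tree walk itself is independent of which model answers the questions.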

That task label is embedded with the all-MiniLM-L6-v2 sentence-transformer and passed to a FAISS-based Retrieval-Augmented Generation component. Retrieval is performed over two custom databases: a Cartesian impedance database storing task-specific stiffness and damping coefficients, and a gripper-angle database storing object-specific grasp configurations. The retrieved outputs are stiffness parameters $K$, damping parameters $D$, and a gripper angle or gripper action $\gamma$. More concretely, the paper identifies the translational stiffness coefficients as $K = [K_x, K_y, K_z]$, the damping coefficients as $D = [D_x, D_y, D_z]$, and the gripper angle as $\gamma_a$ for arm $a$ (Mahmoud et al., 21 Jan 2026).

The retrieved parameters are sent to the onboard Unitree G1 computer, where a task-space Cartesian impedance controller generates compliant end-effector trajectories, inverse kinematics converts them to joint targets, and the robot’s built-in position controllers execute them. The system therefore does not use the VLM as a direct policy generator. Its output is an interpretable control parameterization for a conventional controller stack (Mahmoud et al., 21 Jan 2026).

A notable design choice is the two-stage retrieval process. First, the VLM-inferred task label is used to retrieve the matching impedance scenario. Second, the retrieved scenario description is concatenated with the VLM label, and this combined text is used as a new query for gripper retrieval. The paper states that this helps disambiguate visually similar tasks differing in object type or interaction mode. The exact embedding dimensionality, FAISS index type, distance metric, number of nearest neighbors, and ranking function are not provided (Mahmoud et al., 21 Jan 2026).
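The two-stage query construction can be sketched in a few lines. Because the paper does not specify the FAISS index type or distance metric, the example below swaps in a toy bag-of-words embedding with cosine similarity in place of all-MiniLM-L6-v2 vectors and a FAISS index. The database contents, field names, and helper names are hypothetical. Only the control flow (label to impedance scenario, then label plus scenario description to gripper entry) follows the description above.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a sentence-transformer vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest(query: str, entries: list) -> dict:
    """Return the entry whose description is most similar to the query (top-1)."""
    q = embed(query)
    return max(entries, key=lambda e: cosine(q, embed(e["description"])))


# Hypothetical database contents; descriptions and field names are illustrative.
IMPEDANCE_DB = [
    {"description": "place fragile egg and sauce bottle with both arms", "scenario": "dual placement"},
    {"description": "poke food with a fork tool", "scenario": "tool interaction"},
    {"description": "grasp and lift object from table", "scenario": "grasp from table"},
]
GRIPPER_DB = [
    {"description": "fragile egg gentle grasp", "grasp": "fragile"},
    {"description": "rigid fork tool firm grasp", "grasp": "rigid"},
    {"description": "soft ball grasp", "grasp": "soft"},
]


def two_stage_retrieve(vlm_label: str) -> tuple:
    # Stage 1: the VLM label alone selects the impedance scenario.
    impedance = nearest(vlm_label, IMPEDANCE_DB)
    # Stage 2: the retrieved scenario description is concatenated with the label,
    # and the combined text becomes the query for the gripper database.
    combined = f"{vlm_label} {impedance['description']}"
    gripper = nearest(combined, GRIPPER_DB)
    return impedance, gripper


impedance, gripper = two_stage_retrieve("robot holding a fork above a plate")
print(impedance["scenario"], gripper["grasp"])  # tool interaction rigid
```

The second query carries words from the retrieved scenario description ("fork", "tool") that the raw label may lack, which is one way the concatenation could help separate visually similar cases.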

This architectural decomposition places HumanoidVLM in a broader class of modular humanoid systems that separate semantic reasoning from physical execution. Trinity explicitly separates VLM perception, LLM task planning, and RL lower-body stabilization (Sun et al., 11 Mar 2025). Being-0 inserts a lightweight VLM-based Connector between a cloud foundation model and a humanoid skill library to ground language plans into embodied navigation and manipulation decisions (Yuan et al., 16 Mar 2025). OpenHLM, by contrast, studies a whole-body native vision-language-action model that maps language and pixels directly to all humanoid degrees of freedom, making it substantially more end-to-end than HumanoidVLM’s retrieval-mediated control design (Hu et al., 20 Jun 2026).

3. Retrieval-augmented parameter databases

The retrieval module is central to the system’s interpretability. Rather than regressing continuous gains from a neural model, HumanoidVLM retrieves from two JSON databases of experimentally validated parameters. The impedance database contains nine manipulation tasks. Each entry stores a semantic task label together with translational stiffness and damping vectors, and each canonical entry is the most stable and compliant configuration identified through repeated real-world experiments on the Unitree G1. The gripper-angle database stores experimentally determined optimal closing angles for rigid, soft, deformable, and fragile objects; the conclusion also refers to nine object-specific grasp entries (Mahmoud et al., 21 Jan 2026).

The values in these databases are not learned end-to-end. The paper states that each scenario was executed with multiple impedance and gripper settings, the authors manually evaluated which combinations produced the most stable and compliant behavior, and those settings were stored as canonical entries. The stored entries are therefore described as experimentally validated and human-validated. The paper does not provide exact trial counts, search grids, JSON examples, ranking details, top-$k$ selection, confidence scores, or tie-breaking logic (Mahmoud et al., 21 Jan 2026).
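Since no JSON examples are published, the following is only a guess at what an impedance database of this kind might look like and how it could be loaded and sanity-checked. The field names (`task`, `arm`, `K`, `D`), the `ImpedanceEntry` class, and every numeric value are placeholders invented for the example. They are not the paper's entries.

```python
import json
from dataclasses import dataclass

# Hypothetical JSON layout; field names and all numbers are placeholders, not the paper's data.
IMPEDANCE_JSON = """
[
  {"task": "follow surface", "arm": "R", "K": [1.0, 1.0, 1.0], "D": [1.0, 1.0, 1.0]},
  {"task": "apply pressure", "arm": "R", "K": [2.0, 2.0, 2.0], "D": [1.5, 1.5, 1.5]}
]
"""


@dataclass(frozen=True)
class ImpedanceEntry:
    task: str
    arm: str
    K: tuple  # translational stiffness (K_x, K_y, K_z)
    D: tuple  # translational damping (D_x, D_y, D_z)


def load_impedance_db(raw: str) -> dict:
    """Parse the JSON text and index entries by (task, arm), validating vector shapes."""
    db = {}
    for record in json.loads(raw):
        entry = ImpedanceEntry(record["task"], record["arm"], tuple(record["K"]), tuple(record["D"]))
        if len(entry.K) != 3 or len(entry.D) != 3:
            raise ValueError(f"{entry.task}: K and D must each have three components")
        if min(entry.K + entry.D) <= 0:
            raise ValueError(f"{entry.task}: gains must be positive")
        db[(entry.task, entry.arm)] = entry
    return db


db = load_impedance_db(IMPEDANCE_JSON)
print(db[("apply pressure", "R")].K)  # (2.0, 2.0, 2.0)
```

Keeping each record as an explicit, human-readable entry is what makes the retrieved parameters auditable: the controller only ever receives values that someone has looked at and validated.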

This makes retrieval not merely an implementation convenience but the main semantic constraint on the controller. Adaptation breadth is bounded by database support, and the method assumes that the current object or task is semantically close to a known entry. The paper explicitly treats this as a limitation: the current coverage is bounded by the small databases, rotational impedance is fixed rather than task-dependent, no force or visuotactile feedback is integrated, and parameter selection is discrete rather than continuous (Mahmoud et al., 21 Jan 2026).

A plausible implication is that HumanoidVLM prioritizes traceability over expressive control. That tradeoff is consistent with related systems such as SafeHumanoid, which also uses a curated scenario database to restrict runtime decisions to validated safe parameter envelopes rather than allowing a VLM to invent control gains (Mahmoud et al., 28 Nov 2025).

4. Control formulation and execution model

After retrieval, HumanoidVLM executes manipulation through a task-space Cartesian impedance controller. A critical hardware constraint is that the Unitree G1 does not provide wrist force-torque sensing and the two-finger grippers do not provide reaction wrench feedback. To compensate, the paper uses a virtual task-space mass-spring-damper model and interprets the resulting virtual force as a proxy for interaction force (Mahmoud et al., 21 Jan 2026).

For each arm $a \in \{L, R\}$, the controller uses diagonal translational mass, damping, and stiffness matrices: $\begin{aligned} \mathbf{M}_a &= \mathrm{diag}(M_{a,x}, M_{a,y}, M_{a,z}), \\ \mathbf{D}_a &= \mathrm{diag}(D_{a,x}, D_{a,y}, D_{a,z}), \\ \mathbf{K}_a &= \mathrm{diag}(K_{a,x}, K_{a,y}, K_{a,z}). \end{aligned}$

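The paper’s typesetting is corrupted in the error definition, but the intended equations are clear enough. With $\mathbf{x}_a^{d}$ the desired and $\mathbf{x}_a$ the actual end-effector position of arm $a$, the position and velocity errors are: $\mathbf{e}_a = \mathbf{x}_a^{d} - \mathbf{x}_a, \quad \dot{\mathbf{e}}_a = \dot{\mathbf{x}}_a^{d} - \dot{\mathbf{x}}_a$

The second-order task-space impedance dynamics are: $\mathbf{M}_a \ddot{\mathbf{e}}_a + \mathbf{D}_a \dot{\mathbf{e}}_a + \mathbf{K}_a \mathbf{e}_a = \mathbf{F}_a^{\mathrm{ext}}$ where $\mathbf{F}_a^{\mathrm{ext}}$ denotes an external disturbance.

The resulting virtual force is: $\mathbf{F}_a^{\mathrm{virt}} = \mathbf{K}_a \mathbf{e}_a + \mathbf{D}_a \dot{\mathbf{e}}_a$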

The paper states that this virtual force captures the interaction response, acts as a quantitative proxy for physical contact forces, and converges to zero when there is no disturbance and pose error vanishes. Rotational impedance remains fixed for stability, virtual mass is fixed, and desired virtual poses are converted into joint targets through inverse kinematics before execution by built-in position controllers. The paper does not provide rotational error equations, Jacobian-based wrench-to-torque mappings, or joint-torque equations (Mahmoud et al., 21 Jan 2026).
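The convergence behavior described here can be checked numerically on a single axis. The sketch below assumes the standard mass-spring-damper form $M\ddot{e} + D\dot{e} + Ke = F^{\mathrm{ext}}$ with virtual force $F^{\mathrm{virt}} = Ke + D\dot{e}$, a virtual mass of 1.0, and semi-implicit Euler integration. The function name, the initial error, and the disturbance value are also assumptions. The gains are the "Follow surface" stiffness and damping values reported in the results table, and the time step corresponds to the 50 Hz controller rate.

```python
def simulate_virtual_force(K, D, M=1.0, e0=0.0, f_ext=0.0, dt=0.02, steps=500):
    """Integrate M*e'' + D*e' + K*e = f_ext on one axis and return the virtual-force history.

    dt = 0.02 s matches a 50 Hz control loop; M = 1.0 is an assumed virtual mass.
    """
    e, e_dot = e0, 0.0
    forces = []
    for _ in range(steps):
        e_ddot = (f_ext - D * e_dot - K * e) / M
        # Semi-implicit Euler: update velocity first, then position with the new velocity.
        e_dot += e_ddot * dt
        e += e_dot * dt
        forces.append(K * e + D * e_dot)  # virtual force F = K*e + D*e_dot
    return forces


# Case 1: 3 cm initial pose error, no disturbance -> the virtual force decays toward zero.
free = simulate_virtual_force(K=3.0, D=2.0, e0=0.03)
# Case 2: constant disturbance -> the virtual force settles at the disturbance value.
pushed = simulate_virtual_force(K=3.0, D=2.0, f_ext=0.1)

print(f"free response, final force:   {free[-1]:.6f}")
print(f"pushed response, final force: {pushed[-1]:.6f}")
```

Under these assumptions the undisturbed case ends with a virtual force near zero, while the disturbed case settles near the applied 0.1, which is the sense in which the virtual force can serve as a proxy for contact force when no force-torque sensor is available.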

Gripper actuation is represented as a discrete action $\gamma_a$ for each arm $a$, with each action mapped to a predefined joint-angle target retrieved from the database. The paper’s interpretive account of the gains is straightforward: low stiffness yields more compliant motion and smaller virtual forces under displacement; higher stiffness supports stronger contact and larger virtual force responses; damping regulates settling and transients; and asymmetric bimanual gains allow different compliance on the two sides during fragile-object handling (Mahmoud et al., 21 Jan 2026).

This control layer is narrower than the full embodied stacks found in systems such as Trinity, where RL lower-body locomotion, an LLM planner, and VLM grounding are integrated around articulated-object manipulation (Sun et al., 11 Mar 2025). It is also narrower than OpenHLM’s whole-body native VLA formulation, which directly outputs 34-dimensional whole-body action targets rather than retrieved impedance settings (Hu et al., 20 Jun 2026).

5. Experimental setup and empirical results

Experiments were performed on the Unitree G1 humanoid robot, equipped with two 1-DoF grippers and an Intel RealSense RGB-D camera mounted in the head for egocentric perception. The task-space impedance controller ran on the onboard PC at 50 Hz, while the VLM-RAG pipeline ran on an external workstation with an RTX 4090 GPU and Intel i9-13900K CPU to ensure real-time inference. Retrieval was evaluated on 14 egocentric test images, designed as variations of the nine task types in the database with changes in camera viewpoint, object placement, and arm pose (Mahmoud et al., 21 Jan 2026).

A retrieval is counted as correct only if all three stages are correct: the task is classified correctly via sequential yes/no visual queries, the correct impedance entry is retrieved based on the VLM output, and the correct gripper configuration is retrieved conditioned on both the VLM task label and the impedance scenario. Under this criterion, the system correctly retrieved impedance and gripper parameters in 13 out of 14 cases, yielding 93% retrieval accuracy. The only reported failure occurred when the primary object was partially occluded (Mahmoud et al., 21 Jan 2026).
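The all-stages-correct criterion and the resulting figure can be reproduced with a short calculation. In the sketch below the `RetrievalOutcome` class is a hypothetical bookkeeping structure, and the per-stage flags of the single failed case are an assumption, since only the overall outcome (13 of 14 correct) is reported.

```python
from dataclasses import dataclass


@dataclass
class RetrievalOutcome:
    task_correct: bool       # VLM task classification
    impedance_correct: bool  # impedance entry retrieval
    gripper_correct: bool    # gripper configuration retrieval

    @property
    def success(self) -> bool:
        # A case counts only if every stage is correct.
        return self.task_correct and self.impedance_correct and self.gripper_correct


# 13 fully correct cases plus one failure; which stage failed is an assumption.
outcomes = [RetrievalOutcome(True, True, True) for _ in range(13)]
outcomes.append(RetrievalOutcome(False, False, False))

accuracy = sum(o.success for o in outcomes) / len(outcomes)
print(f"{accuracy:.1%}")  # 92.9%
```

The exact ratio is 13/14, about 92.9%, which rounds to the reported 93%. Because the criterion is conjunctive, an error at any one stage is enough to mark the whole case as a failure.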

The core control results are reported for six representative scenarios:

Task	$K_z$ [N/m]	$D_z$ [Ns/m]
Follow surface (R)	3.0	2.0
Apply pressure (R)	5.0	3.0
Dual placement (R)	2.0	1.0
Dual placement (L)	6.0	1.5
Tool interaction (R)	2.0	1.5
Grasp from table (R)	4.0	1.5

For these same tasks, the paper also reports per-task mean absolute $z$-errors and maximum absolute $z$-errors in metres, together with maximum virtual normal-force magnitudes in arbitrary units (Mahmoud et al., 21 Jan 2026).

The abstract summarizes these control results by stating that $z$-axis tracking errors were typically within 1–3.5 cm and that virtual forces were consistent with task-dependent impedance settings. The paper interprets the measurements task by task: soft surface following produces low virtual force and small tracking error; pressure application uses higher stiffness and damping to produce larger virtual forces while preserving similar accuracy; dual-object placement uses asymmetric bimanual gains to reflect object fragility; tool interaction uses moderate gains for controlled poking; and grasp-and-lift uses intermediate stiffness for stable lifting without excessive pressure (Mahmoud et al., 21 Jan 2026).

The experimental evidence remains proof-of-feasibility rather than a full benchmark. The paper does not report fixed-gain baselines, direct VLM prediction baselines, non-retrieval controller baselines, learned-mapping baselines, or one-stage versus two-stage retrieval ablations. It also does not provide latency numbers for the VLM-RAG inference loop, despite stating that the offboard setup was used to ensure real-time inference (Mahmoud et al., 21 Jan 2026).

6. Interpretation, limitations, and relation to the humanoid VLM literature

HumanoidVLM’s primary contribution is an interpretable semantic-to-control interface for contact-rich humanoid manipulation. The framework links semantic scene understanding to task-specific Cartesian impedance and gripper control through retrieval, rather than hiding control generation inside a black-box policy. The inferred task label is human-readable, the retrieved entries are explicit database records, and the final outputs are transparent physical parameters $K$, $D$, and gripper settings (Mahmoud et al., 21 Jan 2026).

Its practical strengths and its main limitations stem from the same design choices. The method is deployable on a real humanoid robot, does not require wrist force-torque sensing, and demonstrates stable execution across several tabletop tasks. At the same time, its adaptation is only as broad as database coverage, its visual inference can fail under occlusion, rotational impedance is fixed, force and visuotactile feedback are absent, and the current results are explicitly positioned as a proof of feasibility rather than a comprehensive comparative study (Mahmoud et al., 21 Jan 2026).

Within the broader humanoid vision-language literature, HumanoidVLM occupies a relatively narrow but technically distinctive niche. HuBo-VLM is closer to instruction-conditioned visual grounding for robot perception than to low-level interaction control (Dong et al., 2023). Being-0 shows how a lightweight onboard VLM can act as a Connector between a cloud foundation model and modular humanoid skills, emphasizing navigation-manipulation coordination and active vision rather than impedance retrieval (Yuan et al., 16 Mar 2025). Cybo-Waiter pushes in another direction, turning VLM plans into typed JSON task programs and verifying execution through 3D multi-object grounding and predicate-based supervision (Ren et al., 11 Mar 2026). BiBo instead builds a compiler-style bridge between off-the-shelf VLMs and humanoid motion execution, producing structured embodied commands for a diffusion-based whole-body motion executor (Jian et al., 28 Oct 2025). OpenHLM goes further toward a whole-body native vision-language-action model that maps language and pixels directly to the full humanoid action space (Hu et al., 20 Jun 2026).

This suggests a useful categorization. HumanoidVLM is neither a general humanoid foundation model nor a whole-body VLA. It is best described as a retrieval-based compliance-selection layer for semantic, contact-rich manipulation. Its significance lies less in whole-body autonomy than in showing that a humanoid can infer manipulation-relevant semantics from egocentric vision and convert them into experimentally validated low-level control parameters in an interpretable manner.