CLAW: Composable Language-Annotated Whole-body Motion Generation

Published 13 Apr 2026 in cs.RO | (2604.11251v2)

Abstract: Training language-conditioned whole-body controllers for humanoid robots requires large-scale datasets pairing motion trajectories with natural-language descriptions. Existing approaches based on motion capture are costly and limited in diversity, while text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible. Therefore, we present CLAW, an interactive web-based pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW treats the motion modes of a kinematic planner as composable building blocks, each parameterized by movement, heading, speed, pelvis height and duration, and provides two browser-based interfaces -- a real-time keyboard mode and a timeline-based sequence editor -- for exploratory and batch data collection. A low-level whole-body controller tracks the planner's kinematic references in MuJoCo simulation, producing physically grounded trajectories recorded at 50Hz. Simultaneously, a deterministic template-based annotation engine generates diverse natural-language descriptions at multiple stylistic registers for every segment and for the full trajectory. We release the system as open source to support scalable generation of language-motion paired data for humanoid robot learning.


Authors (3)

Summary

The paper introduces a modular, planner-agnostic pipeline that synthesizes physically grounded, language-annotated motion for the Unitree G1 robot.
It employs a four-process system integrating a browser-based frontend, a WebSocket–ZMQ bridge, a C++ controller, and the MuJoCo simulator to enable real-time motion sequencing with smooth transitions.
The methodology uses composable motion primitives and a deterministic, template-based language annotation engine to facilitate scalable, diverse data generation for language-conditioned control.

Motivation and Contributions

The paper introduces CLAW, a compositional, browser-based pipeline for large-scale generation of language-annotated, physically plausible whole-body motion data targeting the Unitree G1 humanoid robot (2604.11251). The motivation is three bottlenecks in existing motion data acquisition: the high cost and limited diversity of motion capture (MoCap), the lack of physical-feasibility guarantees in kinematics-only text-to-motion generative models, and the cost of pairing a natural-language annotation with each data sample. CLAW addresses these by sequencing parameterized, composable motion primitives in physics simulation while generating deterministic, multi-style natural-language annotations in tandem.

The primary contributions are:

An open-source, planner-agnostic, interactive pipeline capable of physically grounded, language-annotated whole-body motion data synthesis at arbitrary scale.
Two interactive user interfaces (keyboard mode and sequence editor) for both exploratory and systematic, reproducible data creation.
A template-based annotation engine generating diverse, style-controlled, deterministic natural language descriptions for both segments and long-horizon trajectories.
Modular, four-process system architecture suitable for extension and integration with new controllers and planners.

System Design and Architecture

The CLAW pipeline comprises four loosely coupled processes: a browser-based frontend for human interaction, a protocol translation bridge (WebSocket–ZMQ), a C++ whole-body controller (for kinematic planning and real-time tracking), and the MuJoCo physics simulator (for physically realistic execution).

Figure 2: The CLAW system architecture, detailing the interactions between backend controller, simulation, protocol bridge, and browser frontend.

The architecture is designed to be planner-agnostic: the frontend sends structured, high-level commands to the bridge, which sequences them for the controller. The controller generates kinematic references and feeds joint-level targets to the MuJoCo simulator. State feedback flows back to the frontend, supporting real-time telemetry and visualization. This modularity facilitates both rapid prototyping and systematic, scalable data collection.
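The command flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CLAW bridge itself: the JSON field names, `parse_frontend_command`, and `CommandSequencer` are hypothetical, and the WebSocket and ZMQ transport is omitted so the sketch stays self-contained.

```python
import json
from collections import deque

# Hypothetical field names, chosen to mirror the parameters named in the text.
REQUIRED_FIELDS = ("mode", "movement", "heading", "speed", "pelvis_height", "duration")


def parse_frontend_command(raw: str) -> dict:
    """Validate one JSON command from the browser and return a plain dict."""
    msg = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in msg]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if msg["duration"] <= 0:
        raise ValueError("duration must be positive")
    # Drop any extra keys so the controller only sees the expected fields.
    return {field: msg[field] for field in REQUIRED_FIELDS}


class CommandSequencer:
    """Queues validated commands and computes when each segment starts."""

    def __init__(self):
        self._queue = deque()

    def push(self, command: dict) -> None:
        self._queue.append(command)

    def schedule(self) -> list:
        """Return (start_time_s, command) pairs in playback order."""
        start = 0.0
        plan = []
        for command in self._queue:
            plan.append((start, command))
            start += command["duration"]
        return plan


if __name__ == "__main__":
    raw = json.dumps({"mode": "walk", "movement": "forward", "heading": 0.0,
                      "speed": 0.5, "pelvis_height": 0.75, "duration": 3.0})
    sequencer = CommandSequencer()
    sequencer.push(parse_frontend_command(raw))
    for start, command in sequencer.schedule():
        print(start, command["mode"])
```

In a real deployment the validated dict would be forwarded over the bridge to the controller; keeping validation separate from transport makes the translation step easy to test in isolation.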

Composable Motion Primitives and User Interfaces

CLAW exposes 25 motion modes, grouped into Locomotion, Squat/Ground, Boxing, and Styled Walking, each parameterized by adjustable dimensions (movement, heading, speed, pelvis height, duration). Users sequence these via:

Keyboard mode: Free-form, interactive control suitable for exploration and on-the-fly generation of novel motion combinations.

Figure 3: Screenshot of keyboard mode, emphasizing real-time configuration of movement, posture, and style parameters.

Editor mode: GUI-based sequence editor, enabling reproducible, multi-segment trajectory design, ideal for batch generation and systematic data augmentation.

This compositional abstraction leverages smooth mode switching; transitions between primitives are handled within the planner using current state, with no explicit blending or post-processing required by the pipeline. Most mode pairs yield fluid transitions, though outlier mode switches (e.g., crawl to upright) can produce transient kinematic anomalies.

Figure 5: Examples illustrating both successful and failure cases in motion stitching between distinct behavioral primitives.

The sequence editor enables scalable augmentation through parameter replay—large numbers of semantically similar sequences can be programmatically instantiated by varying continuous parameters.
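Parameter replay can be illustrated with a short sketch. The `Segment` fields follow the five parameters named above, but the class, the units (degrees, metres, seconds), and `replay_with_variations` are assumptions made for illustration, not the editor's actual recipe format. The 50 Hz recording rate is taken from the abstract.

```python
from dataclasses import dataclass, replace
from itertools import product

RECORD_HZ = 50  # recording rate stated in the paper's abstract


@dataclass(frozen=True)
class Segment:
    """One motion primitive with its parameters (hypothetical schema)."""
    mode: str
    movement: str
    heading_deg: float
    speed: float
    pelvis_height: float
    duration_s: float

    def num_frames(self) -> int:
        """Number of recorded frames this segment contributes at RECORD_HZ."""
        return round(self.duration_s * RECORD_HZ)


def replay_with_variations(recipe, speed_scales, heading_offsets_deg):
    """Instantiate one recipe variant per (speed scale, heading offset) pair."""
    variants = []
    for scale, offset in product(speed_scales, heading_offsets_deg):
        variants.append([
            replace(seg, speed=seg.speed * scale, heading_deg=seg.heading_deg + offset)
            for seg in recipe
        ])
    return variants


recipe = [
    Segment("walk", "forward", 0.0, 0.5, 0.75, 3.0),
    Segment("squat", "idle", 0.0, 0.0, 0.45, 2.0),
]
variants = replay_with_variations(recipe, [0.5, 1.0, 1.5], [-30.0, 0.0, 30.0])
print(len(variants))                             # 9 variants from one recipe
print(sum(seg.num_frames() for seg in recipe))   # 250 frames per variant
```

Because the recipe is immutable and the sweep is a plain Cartesian product, the same inputs always yield the same set of variants, which is the property that makes replayed recipes reproducible.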

Language Annotation Engine

The pipeline systematically generates language annotations for every segment and full trajectory, based on the same motion-intent parameters that control the planner. This is achieved via a deterministic, template-driven generator supporting eight stylistic variants:

Four registers: instruction (imperative), natural, narrative (third-person), and concise (keyword).
Temporal grounding: each with/without explicit duration.

Figure 1: Diverse whole-body motion generation capabilities, with each segment and full trajectory automatically annotated in multiple natural language styles.

Annotations exploit synonym banks for verbs, directions, connectives, and tempo adverbs, producing lexical diversity in a reproducible, seed-dependent manner. This method provides consistent, stage-aligned textual labels suitable for supervision of language-conditioned control models.

Figure 4: Example of language annotation for long-horizon, multi-stage trajectories, with per-segment and full-sequence descriptive variants.

The proposed annotation engine balances diversity and determinism, and the architecture allows easy extension to LLM-based generation should paraphrastic richness become a priority.
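A minimal sketch of a seeded, template-based annotator shows how four registers crossed with optional durations give eight variants per segment. The templates, synonym banks, and function names here are illustrative assumptions, not the ones used in CLAW.

```python
import random
from itertools import product

# Hypothetical synonym banks; the paper's actual banks are not reproduced here.
VERBS = {"walk": ["walk", "stride", "step"], "crouch": ["crouch", "squat down"]}
DIRECTIONS = {"forward": ["forward", "ahead"], "left": ["left", "to the left"]}
REGISTERS = ("instruction", "natural", "narrative", "concise")


def annotate(mode, direction, duration_s, register, with_duration, seed=0):
    """Return one description; identical inputs always give identical text."""
    # A string seed keeps the choice reproducible across runs and processes.
    rng = random.Random(f"{seed}|{mode}|{direction}|{register}|{with_duration}")
    verb = rng.choice(VERBS[mode])
    where = rng.choice(DIRECTIONS[direction])
    if register == "concise":
        suffix = f" {duration_s:g}s" if with_duration else ""
        return f"{verb} {where}{suffix}"
    if register == "instruction":
        body = f"{verb.capitalize()} {where}"
    elif register == "natural":
        body = f"Go ahead and {verb} {where}"
    elif register == "narrative":
        body = f"The robot starts to {verb} {where}"
    else:
        raise ValueError(f"unknown register: {register}")
    if with_duration:
        unit = "second" if duration_s == 1 else "seconds"
        body += f" for {duration_s:g} {unit}"
    return body + "."


def all_styles(mode, direction, duration_s, seed=0):
    """Four registers x with/without duration = eight variants."""
    return {
        (register, timed): annotate(mode, direction, duration_s, register, timed, seed)
        for register, timed in product(REGISTERS, (True, False))
    }


if __name__ == "__main__":
    for key, text in all_styles("walk", "forward", 3.0, seed=7).items():
        print(key, "->", text)
```

Changing `seed` changes which synonyms are picked, while a fixed seed reproduces the same text, which is one simple way to combine lexical variety with determinism.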

Experimental Findings

The pipeline supports several strong capabilities:

Smooth composition of long-horizon, multi-behavior trajectories: Most motion primitive pairs transition fluidly, supporting continuous, complex motion design by simply sequencing primitives (Figure 4).
Highly scalable, reproducible data generation: Editor recipes are replayable and parameterizable, so the volume of synthesized data is bounded by compute rather than by real-world capture logistics.
Deterministic, diverse annotation aligned to motion: The annotation process is tightly coupled with trajectory segmentation, so labels are time-aligned with the commanded segments by construction and require no manual effort.

Figure 6: The data generation pipeline: from user intent to time-aligned trajectory and multi-style annotation output.

No numerical benchmarks are reported in the paper. Instead, it highlights the practical utility of compute-bounded, time-aligned trajectory–text data for training downstream language-conditioned control policies, where data scale and coverage directly affect policy generalization and robustness.

Implications and Future Directions

The practical implications extend beyond the specific Unitree G1 platform. CLAW demonstrates a model for tasking high-DOF robots with language-anchored, compositional motion planning without dependence on MoCap or laborious manual annotation. This addresses critical challenges in scaling up RL- and imitation learning-based robotics, where diversity and alignment of supervision are limiting factors.

Conceptually, because trajectories are recorded directly in the robot's configuration space, no retargeting step is needed and no retargeting artifacts are introduced. The modular annotation engine also invites research into the balance between template-based and generative language conditioning for robotic learning.

Future research directions include:

Incorporating LLM-based annotation for richer and less deterministic linguistic supervision.
Expanding the set of motion primitives (including user-defined or learned skills), further increasing behavioral coverage.
Automated filtering or correction of low-quality or physically implausible transitions to improve dataset consistency for large-scale learning.
Extension to real-robot data collection via hardware-in-the-loop, supporting sim-to-real transfer.

Conclusion

CLAW presents a scalable, modular system for generating whole-body humanoid motion data, paired with diverse, deterministic natural language annotation. Its compositional design, web-based interactivity, and planned extensibility position it as an enabling tool for the language-conditioned control community and for research at the intersection of robotics, interactive simulation, and multi-modal learning. The pipeline circumvents critical bottlenecks in motion diversity, physical feasibility, and labeling alignment, with clear potential for extension into both research and applied domains.

Explain it Like I'm 14

What is this paper about?

This paper introduces CLAW, a web tool that helps people quickly create lots of “robot moves” for a human‑shaped robot (the Unitree G1) and automatically writes short, clear sentences that describe each move. Think of it like a video game level editor for robot motions, where every action the robot performs also gets its own caption. The goal is to make big, high‑quality datasets that can teach robots to follow natural‑language instructions with their whole body.

What questions are the researchers trying to answer?

Here are the main ideas they explore:

How can we make large, diverse datasets of robot motions without the cost and hassle of recording human actors with motion‑capture suits?
How can we be sure these motions are physically realistic (not just good‑looking animations)?
How can we automatically pair each motion with clear natural‑language descriptions so robots can learn to understand instructions?

How does CLAW work? (Explained simply)

You can think of CLAW as a “robot motion studio” with four main parts working together:

A browser interface (what you see): Two ways to create motions
- Keyboard mode: Like controlling a game character with keys to make the robot walk, turn, crouch, punch, and more in real time.
- Editor mode: Like a timeline in a video editor where you drag and drop motion clips (walk → turn → crawl, etc.) to build longer sequences you can replay exactly.
A planner (the choreographer): It turns your high‑level choices (like “run forward, turn left, go slow, keep hips low”) into a short plan for how the robot should move next.
A controller + physics simulator (the “body” and the “physics sandbox”):
- Whole‑body controller: Ensures the robot’s joints and balance follow the plan.
- MuJoCo simulator: A physics world that makes sure the movement would make sense in real life—no feet sliding or impossible poses.
A language engine (the caption writer): It uses templates and synonyms to automatically write multiple styles of descriptions for each motion segment and for the whole sequence, like:
- Instruction style: “Walk forward for 3 seconds.”
- Story style: “The robot strides ahead for 3 seconds.”
- Concise style: “walk forward 3s”

Some helpful translations of technical terms:

Motion primitive: A basic move, like “walk,” “run,” “turn,” “squat,” “punch,” or “crawl.” These are building blocks you can stack to make longer routines.
Kinematic planner: A tool that figures out the shapes and paths of motions (like a choreographer), without worrying about forces yet.
Whole‑body controller: A low‑level system that makes the robot actually follow the plan smoothly and safely.
Physics simulation (MuJoCo): A realistic “virtual world” where gravity, friction, and balance are checked—like testing a move in a safe sandbox before trying it for real.

What did they find, and why is it important?

The paper reports three main results:

Smoothly mixing different moves works well most of the time
- You can switch between different motion primitives—like walk → squat → crawl → punch—and the system often creates natural transitions without any manual “blending.” This makes long, interesting motion sequences possible.
- Sometimes, switching between very different poses (for example, crawl → punch) can look awkward. The authors note this as a challenge to improve.
Automatic language descriptions are accurate and varied
- For every segment and entire sequence, the system produces multiple descriptions in different styles, using templates and synonym lists. Because it’s template‑based, it’s consistent and repeatable—great for training AI systems that need reliable labels.
Scales to lots of data, fast
- No motion‑capture sessions are needed, which saves time and money.
- The editor can batch‑generate endless motion sequences with small changes (like different speeds, directions, or styles), creating a big, diverse training set.
- Motions are already in the robot’s own joint format, so there’s no “retargeting” step—another time saver.

Why this matters:

Training robots to understand and follow language commands (like “slowly crouch, then crawl forward, then stand and turn right”) needs many examples of motion paired with matching text. CLAW makes those pairs quickly, cheaply, and with physics realism.

What could this mean for the future?

Faster robot learning: Researchers can build huge, high‑quality datasets to train robots that follow spoken or written instructions with their whole body.
More reliable motions: Because the motions are tested in physics, they’re more likely to work on real robots.
Easy to extend: The system is modular, so you can swap in different motion planners or add more move types later.
Next steps: The authors suggest adding LLMs for even richer descriptions, letting users define new motion primitives, and automatically detecting bad transitions to keep quality high.

In short, CLAW is like a motion “recipe maker” and caption “writer” for humanoid robots. It helps researchers quickly create and label realistic, varied movements, which is exactly what’s needed to teach robots to understand and carry out human instructions.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.

No real-world validation: the pipeline is demonstrated only in MuJoCo; there is no evaluation of sim-to-real transfer on a physical Unitree G1 (e.g., tracking fidelity, failure modes, latency/actuation limits).
Annotation–execution mismatch: language is derived from intended planner parameters rather than realized motion; the system does not verify or adjust text to reflect actual executed speeds, headings, durations, or path deviations.
Motion quality is unquantified: there are no metrics reported for foot slip, contact consistency, joint-limit/torque-limit violations, COM/ZMP stability, jerk, or smoothness across transitions.
Transition failures are not characterized: the frequency, severity, and conditions of abrupt transitions between mode pairs are unmeasured; no automated detector or filter is implemented to flag and remove low-quality transitions at scale.
Dataset coverage is unspecified: the paper provides no statistics on the distribution of modes, parameter ranges (speed, heading, height), sequence lengths, or transition graphs; sampling policies for broad coverage are not defined.
Impact on downstream learning is untested: there is no evidence that training with the generated motion–language data improves language-conditioned control compared to MoCap or text-to-motion baselines, either in simulation or on hardware.
Planner-agnostic claim is unverified: only the SONIC backend is used; integration with alternative planners/controllers is not demonstrated or benchmarked.
Fixed, limited behavior set: only 25 pre-defined modes are available, with no mechanism or API shown for adding user-defined primitives, learned skills, or task-specific behaviors.
Environment impoverishment: motions are generated on flat terrain without obstacles, stairs, uneven friction, or external objects (despite an “Object Carrying” style), limiting ecological validity and contact diversity.
Long-horizon robustness untested: stability over very long sequences (drift, cumulative tracking error, thermal/energy constraints, contact fatigue) is not measured.
Annotation expressivity is narrow: text encodes only mode, movement, heading, speed, height, and duration—omitting richer semantics such as distance traveled, absolute/relative positions, contact events, footfall timing, or task goals.
Language quality is unevaluated: there is no human or automatic assessment of fluency, naturalness, correctness, or stylistic variety; the effect of deterministic synonym banks on perceived diversity is unknown.
Temporal referencing choices are unchecked: the heavy reliance on durations (rather than human-salient distances or landmarks) is not validated with user studies or perception experiments.
No language–motion alignment metric: the paper lacks quantitative measures (e.g., alignment error between described and realized velocities/turn magnitudes, or text consistency via QA models) to ensure annotation fidelity.
Multilingual support is absent: the system is English-only; there is no plan or evaluation for multilingual templates or cross-lingual consistency.
Scaling characteristics are unreported: throughput (trajectories/hour), compute and memory footprint, and I/O bottlenecks for large-batch generation are not benchmarked; distributed/cloud deployments are unexplored.
Data packaging and standards are undefined: the schema, metadata, and interoperability with common datasets/formats (e.g., for contacts, forces, events) are not specified; no public dataset release or statistics are provided.
No automatic curation: there is no pipeline for deduplication, diversity-aware sampling, or balancing across modes/parameters to avoid over-represented behaviors.
Domain randomization is missing: physics and rendering parameters (friction, mass, sensor noise) are not randomized to support sim-to-real robustness.
Safety and feasibility constraints are unspecified: limits on joint velocities, torques, contact forces, and controller saturation are not enforced or reported during generation.
UI usability is unassessed: there is no user study on the keyboard/editor interfaces (speed of authoring, error rates, learnability), and no guidance on how operator bias affects data distributions.
Compositional language complexity is limited: annotations do not include coreference, anaphora, or higher-level task semantics across segments; it’s unclear whether such simplified language suffices for training models that generalize to natural free-form instructions.
Parameter-to-language calibration is ad hoc: mappings from continuous parameters to adverbs/phrases (e.g., “at full speed”) are heuristic and not grounded in human perception; boundary effects and ambiguity are not analyzed.
Generalization across embodiments is untested: although claimed backend-agnostic, there is no evidence the approach transfers to different humanoids or morphologies without re-tuning.
Comparisons to alternative data sources are missing: the pipeline is not benchmarked against MoCap-based or text-to-motion datasets in terms of diversity, realism, or downstream policy performance.
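One of the gaps listed above, the missing language–motion alignment metric, could be approached with a simple check of commanded versus realized speed. The sketch below is an assumption-laden illustration: the function names, the use of planar base positions in metres, and the 0.1 m/s tolerance are all hypothetical and do not come from the paper.

```python
import math


def realized_speed(xy, hz=50.0):
    """Mean planar speed (m/s) from base positions sampled at `hz`."""
    if len(xy) < 2:
        raise ValueError("need at least two samples")
    path_length = sum(math.dist(a, b) for a, b in zip(xy, xy[1:]))
    elapsed_s = (len(xy) - 1) / hz
    return path_length / elapsed_s


def speed_alignment_error(commanded_speed, xy, hz=50.0):
    """Absolute gap between the commanded and the realized mean speed."""
    return abs(realized_speed(xy, hz) - commanded_speed)


def needs_relabel(commanded_speed, xy, tol=0.1, hz=50.0):
    """Flag a segment whose text would misdescribe the executed motion."""
    return speed_alignment_error(commanded_speed, xy, hz) > tol


if __name__ == "__main__":
    # Synthetic segment: 2 s of straight-line motion at 0.4 m/s, sampled at 50 Hz.
    track = [(0.4 * i / 50.0, 0.0) for i in range(101)]
    print(round(realized_speed(track), 3))
    print(needs_relabel(0.4, track))   # commanded speed matches execution
    print(needs_relabel(0.8, track))   # commanded speed was not reached
```

The same pattern extends to headings and durations; segments that fail the check could either be dropped or have their text regenerated from the realized values.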

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that can be implemented with the system as described, leveraging its composable motion primitives, deterministic language annotations, and modular, planner-agnostic architecture.

Scalable generation of language–motion datasets for humanoid control
- Description: Use the browser editor or keyboard mode to synthesize large, reproducible corpora of physically grounded whole-body trajectories (50 Hz) paired with multi-style, deterministic natural-language descriptions.
- Sectors: Robotics, Software/AI tooling, Academia.
- Tools/workflows: CLAW sequence editor and keyboard UI, WebSocket–ZMQ bridge, MuJoCo simulation, template-based annotation engine; “recipe-to-dataset” batch pipelines; seed/version-controlled dataset builds.
- Assumptions/dependencies: MuJoCo license and compute; access to a compatible kinematic planner/whole-body controller (e.g., SONIC); coverage limited to the 25 included motion modes; template-based language may limit linguistic richness.
Pretraining and finetuning of language-conditioned policies
- Description: Train supervised or offline RL policies to map language to kinematic references or joint-space targets using physically simulated, time-aligned data.
- Sectors: Robotics (humanoids), AI research.
- Tools/workflows: Data loaders for joint trajectories + text; curriculum generation by parameter sweeps (speed, heading, pelvis height); reproducible seeds for ablation studies.
- Assumptions/dependencies: Sim-to-real gap remains; controller fidelity and contact modeling quality in MuJoCo affect transfer; annotations cover semantics of motion primitives, not arbitrary tasks.
Rapid motion prototyping and choreography for demos
- Description: Compose and preview long-horizon motion sequences (e.g., walk → turn → kneel → crawl → stand) for trade shows, lab demos, or proof-of-concept videos.
- Sectors: Robotics industry, Media/entertainment (previsualization).
- Tools/workflows: Keyboard mode for exploration, timeline editor for reproducible routines; export of sequences and accompanying language scripts.
- Assumptions/dependencies: On-robot execution requires robust tracking and safety constraints; some inter-mode transitions may be abrupt and should be manually filtered.
Benchmarking and regression testing for planners/controllers
- Description: Establish fixed “recipes” to repeatedly evaluate controller robustness, transition smoothness, gait stability, and failure transitions across software revisions.
- Sectors: Robotics R&D, Software QA.
- Tools/workflows: Deterministic scenario suites, KPI metric harnesses (e.g., foot slip, COM excursions, tracking error), automated pass/fail gating.
- Assumptions/dependencies: Requires metric definitions and logging; failure modes noted in the paper (e.g., crawl → boxing) should be included as stress tests.
Education and training in whole-body control and HRI
- Description: Use the browser-based interfaces in classrooms to teach kinematics, locomotion, whole-body tracking, and language grounding without motion-capture infrastructure.
- Sectors: Education (university courses, bootcamps).
- Tools/workflows: Instructor-provided recipes, parameter variation exercises, reproducible labs; student assignments on data generation and model training.
- Assumptions/dependencies: Classroom access to MuJoCo-compatible machines; faculty-developed teaching materials.
Internal “dataset factory” for humanoid teams
- Description: Run simulator farms that generate language–motion pairs at scale, varying parameters systematically for coverage.
- Sectors: Robotics/AI companies.
- Tools/workflows: Containerized workers; job queues that sample motion modes, seeds, and parameters; provenance tracking (seed, config, version).
- Assumptions/dependencies: Orchestration and storage infrastructure; cost control for compute and licensing.
Controlled HRI studies on language grounding
- Description: Investigate how linguistic register and temporal grounding (with/without durations) influence policy learning and interpretability.
- Sectors: Academia (HRI, NLP + Robotics).
- Tools/workflows: Eight-style deterministic annotations; ablation across registers; evaluation of instruction-following accuracy.
- Assumptions/dependencies: Template-driven phrasing may bias outcomes; content is constrained to supported primitives.
Synthetic data for vision/pose pipelines (optional rendering)
- Description: Render camera feeds from MuJoCo alongside ground-truth kinematics and aligned language to train cross-modal perception or captioning models.
- Sectors: Computer vision, Multimodal AI.
- Tools/workflows: Simulator camera rigs; export of RGB + depth + joint states + text; multimodal dataset packaging.
- Assumptions/dependencies: Rendering integration and domain randomization required; pipeline currently centers on motion/language, not photorealistic visuals.
Audit-ready documentation of robot motion
- Description: Log executed recipes with deterministic language summaries for internal compliance and traceability in development pipelines.
- Sectors: Industrial QA/compliance.
- Tools/workflows: Runbooks that pair seeds, configs, and natural-language summaries; reproducibility checks in CI.
- Assumptions/dependencies: Internal policies and documentation standards; alignment with future regulatory requirements.
Content previsualization for robot-centric experiences
- Description: Quickly generate physically plausible motion beats for robot performances or mixed reality experiences before transferring key motions to on-robot controllers.
- Sectors: Entertainment, Events.
- Tools/workflows: Sequence editor for beatboards; exported timings and story text from annotations.
- Assumptions/dependencies: Export formats to DCC tools may need adapters; fine motion styling limited by available modes.
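The benchmarking and regression-testing use case above mentions tracking error as a KPI with automated pass/fail gating. A minimal sketch of such a gate follows; the data layout (lists of per-frame joint angles in radians), the function names, and the 0.05 rad threshold are assumptions for illustration only.

```python
import math


def tracking_rmse(reference, measured):
    """RMS joint-tracking error over all frames and joints (radians)."""
    if len(reference) != len(measured):
        raise ValueError("trajectories must have the same number of frames")
    squared_sum = 0.0
    count = 0
    for ref_frame, meas_frame in zip(reference, measured):
        if len(ref_frame) != len(meas_frame):
            raise ValueError("frames must have the same number of joints")
        for ref, meas in zip(ref_frame, meas_frame):
            squared_sum += (ref - meas) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty trajectory")
    return math.sqrt(squared_sum / count)


def regression_gate(runs, threshold_rad=0.05):
    """Map each recipe name to True (pass) or False (fail)."""
    return {
        name: tracking_rmse(reference, measured) <= threshold_rad
        for name, (reference, measured) in runs.items()
    }


if __name__ == "__main__":
    reference = [[0.0, 0.1], [0.2, 0.3]]
    good = [[0.01, 0.11], [0.21, 0.31]]   # 0.01 rad offset on every joint
    bad = [[0.1, 0.2], [0.3, 0.4]]        # 0.1 rad offset on every joint
    print(regression_gate({"walk_turn": (reference, good),
                           "crawl_box": (reference, bad)}))
```

Running the same fixed recipes against each software revision and comparing the gate output gives a simple regression signal; other KPIs such as foot slip would need their own metric definitions and logged quantities.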

Long-Term Applications

These opportunities require additional research, scaling, backend extensions, or sim-to-real advances before broad deployment.

Language-driven whole-body control on real humanoids (sim-to-real)
- Description: Train policies that execute verbal instructions end-to-end on physical robots (e.g., Unitree G1) using synthetic pretraining plus on-robot finetuning.
- Sectors: Robotics (industrial, service), Daily life (future consumer robots).
- Potential products/workflows: On-robot instruction-following stacks; safety-verified deployment pipelines; closed-loop synthetic→real training regimes.
- Assumptions/dependencies: High-fidelity contact modeling; robust whole-body controllers; perception and safety systems; rigorous sim-to-real adaptation.
Cross-platform, standardized language–motion corpora and benchmarks
- Description: Multi-vendor datasets and task suites for evaluating language-conditioned humanoid control, supporting regulatory and interoperability goals.
- Sectors: Policy/standards, Industry consortia, Academia.
- Potential products/workflows: Benchmark task libraries, public leaderboards, standardized annotation schemas and seeds for reproducibility.
- Assumptions/dependencies: Broad adoption across robot morphologies; extensible primitive sets; governance for data quality and safety.
Natural-language programming of robots (programming-by-description)
- Description: Non-experts specify tasks in everyday language that compile to sequences of composable whole-body skills.
- Sectors: Industrial automation, Facilities operations, Daily life (assistive robots).
- Potential products/workflows: NL-to-skill compilers; operator UIs; verification layers that simulate and validate before execution.
- Assumptions/dependencies: Expanded primitive libraries (manipulation, tool use); robust semantic parsing; safety and failover policies.
LLM-augmented, multilingual, and richer annotation ecosystems
- Description: Enhance template annotations with LLM paraphrasing, multilingual support, and task-level narratives for broader generalization.
- Sectors: NLP+Robotics, Global deployments.
- Potential products/workflows: Annotation plugins with controllable variation; bias/fairness auditing; language coverage expansion.
- Assumptions/dependencies: Managing nondeterminism; preventing hallucinations and drift; data governance; compute costs.
Automatic transition QA and motion quality scoring
- Description: Learn metrics and detectors for low-quality transitions to auto-filter or penalize problematic mode pairs during dataset generation.
- Sectors: Robotics MLOps, QA tooling.
- Potential products/workflows: Transition scorecards, failure case discovery services, CI gates for motion quality.
- Assumptions/dependencies: Ground-truth labels or proxy metrics; calibration across robot morphologies.
Extensible primitive marketplaces and learned skills
- Description: Integrate user-defined and learned primitives (e.g., manipulation, terrain negotiation) into the compositional framework; shareable skill libraries.
- Sectors: Robotics software ecosystem, App marketplaces.
- Potential products/workflows: Skill packaging and distribution; versioned skill registries; safe composition validators.
- Assumptions/dependencies: Interfaces for skill specification; verification in sim; IP/licensing frameworks.
Facility-scale digital twins for robot readiness
- Description: Combine composable whole-body motion with environment models to stress-test navigation and posture transitions before deployment.
- Sectors: Logistics, Manufacturing, Energy.
- Potential products/workflows: Scenario generators; hazard simulations; readiness reports for task rollouts.
- Assumptions/dependencies: High-fidelity environment and interaction simulation; integration with perception and mapping.
Assistive and healthcare robotics motion training
- Description: Train safe, compliant whole-body behaviors (e.g., careful walking, kneeling, posture transitions) as steps toward assistive applications.
- Sectors: Healthcare, Elder care.
- Potential products/workflows: Safety-constrained policy learning; compliance verification; clinician-in-the-loop validation workflows.
- Assumptions/dependencies: Human–robot contact modeling; stringent safety/regulatory approvals; ethical frameworks.
Disaster response and defense training
- Description: Prepare language-directed agility (e.g., crawling, cautious gait) for complex, unstructured terrains and mission scripting.
- Sectors: Public safety, Defense.
- Potential products/workflows: Scenario packs (rubble, confined spaces); mission-level language planning; robust execution monitors.
- Assumptions/dependencies: Terrain/interaction realism; ruggedized hardware; secure operational protocols.
Synthetic data marketplaces with provenance guarantees
- Description: Commercialize deterministic, seed-reproducible language–motion datasets with traceable provenance for industrial buyers.
- Sectors: AI data markets, Robotics vendors.
- Potential products/workflows: Dataset SKUs by mode coverage and parameter ranges; audit trails; licensing services.
- Assumptions/dependencies: Legal/ethical compliance; curation and QA; alignment with emerging data standards.

Notes on feasibility across applications:

- Core dependencies include a compatible kinematic planner and whole-body controller (SONIC in the paper), MuJoCo-based physics fidelity, and the current library of 25 motion modes.
- Deterministic, template-based annotations ensure reproducibility but may limit linguistic diversity and domain transfer; LLM augmentation can improve breadth at the cost of determinism and curation burden.
- Some inter-mode transitions show artifacts; quality gates and automated QA will be important for scaling to safety-critical deployments.
- Sim-to-real transfer remains the primary long-term constraint for real-world execution, requiring advances in control, perception, contact modeling, and safety engineering.
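
To make the reproducibility point above concrete, here is a minimal sketch of how a deterministic, template-based annotator might select phrasings. The template strings, the `annotate` and `annotate_all` helpers, and the seeding scheme are all assumptions made for illustration; they are not taken from the CLAW code base. Only the four register names (instruction, natural, narrative, concise) come from the paper.

```python
import hashlib

# Illustrative templates only; the real CLAW template set is not reproduced here.
TEMPLATES = {
    "instruction": [
        "Walk {direction} for {duration:.1f} seconds",
        "Move {direction} for {duration:.1f} seconds",
    ],
    "natural": [
        "Stride {direction} for about {duration:.1f} seconds",
        "Head {direction} for roughly {duration:.1f} seconds",
    ],
    "narrative": [
        "The robot walks {direction} for {duration:.1f} seconds",
        "The robot marches {direction} for {duration:.1f} seconds",
    ],
    "concise": [
        "walk {direction} {duration:.1f}s",
        "{direction} walk {duration:.1f}s",
    ],
}


def annotate(direction: str, duration: float, register: str, seed: int = 0) -> str:
    """Pick a template deterministically from (seed, register, parameters)."""
    options = TEMPLATES[register]
    key = f"{seed}|{register}|{direction}|{duration:.3f}".encode("utf-8")
    # A stable digest (unlike Python's salted built-in hash()) yields the same
    # choice on every run and every machine.
    index = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(options)
    return options[index].format(direction=direction, duration=duration)


def annotate_all(direction: str, duration: float, seed: int = 0) -> dict:
    """Return one description per register for the same motion parameters."""
    return {reg: annotate(direction, duration, reg, seed) for reg in TEMPLATES}
```

Calling `annotate_all("forward", 3.0, seed=0)` twice returns identical dictionaries, while changing `seed` may select different phrasings. In this sketch, that is the trade-off the note describes: variety is bounded by the template set, but every output can be regenerated exactly.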


Glossary

AMASS: A large-scale human motion capture dataset providing unified, high-coverage motion data for learning and synthesis. "Large-scale MoCap datasets such as AMASS [mahmood2019amass] have greatly expanded the coverage of human motion,"
Annotation engine: A system that automatically turns structured motion parameters into natural-language descriptions. "A template-based annotation engine simultaneously produces diverse natural-language descriptions from the same motion parameters, yielding time-aligned trajectory and language data."
Black-box motion-primitive engine: A component treated purely by its inputs/outputs that generates reference motions from high-level commands. "CLAW treats the kinematic planner as a black-box motion-primitive engine."
Composable primitives: Modular motion behaviors that can be combined to form longer, diverse trajectories. "CLAW treats the motion modes of a kinematic planner as composable primitives, each governed by parameters like movement, heading, speed, pelvis height and duration."
Configuration space: The space of all possible joint configurations of the robot used to represent motions directly. "the recorded joint trajectories are directly in the robot's configuration space."
Control cycle: A single iteration of the control loop in which commands are processed and applied. "Because every keypress is translated into a planner command within a single control cycle, keyboard mode is well suited for free-form exploration and rapid prototyping of novel motion sequences."
Diffusion models: Generative models that iteratively denoise samples to synthesize data, here used for text-conditioned motion generation. "text-to-motion diffusion models like Kimodo [rempe2026kimodo] are trained to synthesize diverse whole-body movements from language prompts."
Dynamic trajectories: Motions produced by a physics simulator in response to control inputs, capturing physically consistent dynamics. "kinematic reference trajectories (the planner's output) and dynamic trajectories (the MuJoCo-simulated response to those references)."
Embodiments (disparate embodiments): Different physical forms or morphologies that can complicate transferring motions between systems. "retargeting across disparate embodiments introduces kinematic artifacts—foot sliding, self-penetration, and joint-limit violations—"
Foot sliding: An artifact where the feet appear to slip unrealistically during contact due to kinematic inconsistencies. "foot sliding, self-penetration, and joint-limit violations—"
Gait style: The qualitative pattern of locomotion (e.g., walking nuances) as a controllable motion attribute. "pelvis height and gait style—properties that are critical for systematic data augmentation."
Heading: The facing direction of the robot independent of its movement direction. "each parameterized by movement, heading, speed, pelvis height and duration"
Joint-limit violations: Motions that exceed the mechanical range of joints, causing unphysical or unsafe configurations. "foot sliding, self-penetration, and joint-limit violations—"
Joint trajectories: Time series of joint angles/velocities generated by tracking or planning. "to yield physically grounded joint trajectories."
Kinematic artifacts: Undesirable motion defects arising from purely geometric motion processing without dynamics. "introduces kinematic artifacts—foot sliding, self-penetration, and joint-limit violations—"
Kinematic planner: A planner that generates reference motions based on kinematic constraints rather than full dynamics. "CLAW treats the motion modes of a kinematic planner as composable building blocks"
Kinematic reference motion: Short-horizon target motion produced by the planner for the controller to track. "The planner consumes this command together with the current robot state and produces a kinematic reference motion"
Kinematic retargeting: Adapting captured human motion to a different morphology (e.g., a robot) using kinematic mapping. "human motion capture (MoCap) followed by kinematic retargeting to the target morphology [luo2023perpetual, peng2018deepmimic]."
Loco-manipulation: Tasks involving coordinated whole-body locomotion and manipulation. "High-quality reference motions are essential for training humanoid robots to perform whole-body loco-manipulation tasks."
MoCap: Abbreviation for motion capture, the process of recording human movements for animation or robotics. "human motion capture (MoCap)"
Mode switching: Changing the active motion mode during a sequence to achieve transitions between behaviors. "Composability arises from mode switching: when the bridge sends a new mode index mid-session, the resulting motion transitions smoothly from the current state to the new behavior"
Motion primitive: A parameterized, self-contained behavior unit that can be chained with others. "This parameterization renders each mode a self-contained motion primitive that can be composed with any other mode to form longer sequences."
Motion stitching: Combining multiple, distinct motion segments into a continuous trajectory with smooth transitions. "Motion stitching. CLAW enables motion stitching across semantically distinct motion modes."
MuJoCo: A physics engine for detailed, efficient rigid-body simulation widely used in robotics. "A low-level whole-body controller tracks the planner's kinematic references in MuJoCo simulation, producing physically grounded trajectories recorded at 50 Hz."
Orchestration: Coordinating processes and message flows to execute recipes and manage trajectories. "a WebSocket–ZMQ interface, acts as the central coordination layer. It translates communication protocols, maintains the planner state, aggregates telemetry into a unified state representation, and orchestrates command sequencing and trajectory management."
Pelvis height: A controllable parameter adjusting the vertical position of the robot’s pelvis to vary posture. "each parameterized by movement, heading, speed, pelvis height and duration"
Physics-based tracking: Using a physics controller to follow kinematic references, ensuring physical feasibility. "an additional physics-based tracking step [luo2023perpetual] is typically required before deployment on a real robot."
Planner-agnostic: Designed to work with any motion-generation backend without changing the rest of the system. "with a planner-agnostic architecture that decouples data collection from the choice of motion-generation backend."
Retargeting: Mapping motion from one character or robot to another, often across different morphologies. "no retargeting step is needed— the recorded joint trajectories are directly in the robot's configuration space."
Self-penetration: An artifact where body parts intersect unrealistically due to kinematic inconsistencies. "foot sliding, self-penetration, and joint-limit violations—"
Skeleton (Unitree G1 skeleton): The kinematic structure (joints and links) defining the robot’s body for motion generation. "the planner operates natively on the Unitree G1 skeleton"
Strafing: Sideways translational movement without changing facing direction. "the comma and period keys trigger lateral strafing"
Stylistic register: The linguistic style of the generated descriptions (e.g., instruction, natural, narrative, concise). "The first factor controls register: instruction (imperative, e.g. 'Walk forward for 3.0 seconds'), natural (adverbial, e.g. 'Stride ahead briskly for about 3.0 seconds'), narrative (third-person, e.g. 'The robot marches forward for 3.0 seconds'), and concise (keyword-only, e.g. 'walk forward 3.0s')."
Telemetry: Real-time status data and measurements streamed from controllers and simulators. "maintains the planner state, aggregates telemetry into a unified state representation"
Text-to-motion: Generating motion from textual prompts using learned models. "text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible."
Time-aligned: Synchronized in time so motion and language correspond segment-by-segment. "yielding time-aligned trajectory and language data."
Unitree G1: A specific humanoid robot platform used as the target for motion generation and control. "language-annotated whole-body motion data for the Unitree G1 humanoid robot."
WebSocket–ZMQ bridge: A middleware component translating between WebSocket and ZMQ protocols for low-latency control. "both streaming commands at 20 Hz to a kinematic planner via a WebSocket–ZMQ bridge."
Whole-body controller: A controller that coordinates all joints to track desired motions across the entire robot body. "A low-level whole-body controller tracks the planner's kinematic references in MuJoCo simulation"
ZMQ (ZeroMQ): A high-performance asynchronous messaging library used for inter-process communication. "The bridge forwards planner commands to the controller via ZMQ at 20 Hz."
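
Several glossary terms (motion primitive, mode switching, time-aligned) describe one idea: a sequence of parameterized segments is streamed as commands at a fixed rate, and each segment's time span is kept for annotation. The sketch below illustrates that idea under stated assumptions. The `Segment` class, its field names and units, the mode names, and the `expand` helper are hypothetical and are not the CLAW API; only the five parameters and the 20 Hz command rate come from the quoted text.

```python
from dataclasses import dataclass
from typing import List, Tuple

COMMAND_RATE_HZ = 20  # command streaming rate quoted in the glossary


@dataclass(frozen=True)
class Segment:
    """One motion primitive; field names and units are assumptions."""
    mode: str             # hypothetical mode name, e.g. "walk"
    movement_deg: float   # direction of travel
    heading_deg: float    # facing direction, independent of movement
    speed: float          # assumed m/s
    pelvis_height: float  # assumed metres
    duration_s: float


def expand(segments: List[Segment]) -> Tuple[list, list]:
    """Turn a segment sequence into per-tick commands plus time-aligned spans.

    Returns (commands, spans): commands is a list of (time_s, Segment), one
    per 20 Hz tick; spans is a list of (start_s, end_s, mode) per segment.
    """
    commands, spans = [], []
    tick = 0
    for seg in segments:
        n_ticks = round(seg.duration_s * COMMAND_RATE_HZ)
        start_tick = tick
        for _ in range(n_ticks):
            commands.append((tick / COMMAND_RATE_HZ, seg))
            tick += 1
        # A mode switch happens at the tick where the next segment begins.
        spans.append((start_tick / COMMAND_RATE_HZ, tick / COMMAND_RATE_HZ, seg.mode))
    return commands, spans
```

For example, a 3.0 s segment followed by a 1.5 s segment expands to 90 commands, and the recorded spans give the boundaries that per-segment descriptions would be attached to. The smooth transition between modes is the planner's job and is not modeled here.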
