CLAW: Composable Language-Annotated Whole-body Motion Generation
Abstract: Training language-conditioned whole-body controllers for humanoid robots requires large-scale datasets pairing motion trajectories with natural-language descriptions. Existing approaches based on motion capture are costly and limited in diversity, while text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible. Therefore, we present CLAW, an interactive web-based pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW treats the motion modes of a kinematic planner as composable building blocks, each parameterized by movement, heading, speed, pelvis height and duration, and provides two browser-based interfaces – a real-time keyboard mode and a timeline-based sequence editor – for exploratory and batch data collection. A low-level whole-body controller tracks the planner's kinematic references in MuJoCo simulation, producing physically grounded trajectories recorded at 50 Hz. Simultaneously, a deterministic template-based annotation engine generates diverse natural-language descriptions at multiple stylistic registers for every segment and for the full trajectory. We release the system as open source to support scalable generation of language-motion paired data for humanoid robot learning.
Explain it Like I'm 14
What is this paper about?
This paper introduces CLAW, a web tool that helps people quickly create lots of “robot moves” for a human‑shaped robot (the Unitree G1) and automatically writes short, clear sentences that describe each move. Think of it like a video game level editor for robot motions, where every action the robot performs also gets its own caption. The goal is to make big, high‑quality datasets that can teach robots to follow natural‑language instructions with their whole body.
What questions are the researchers trying to answer?
Here are the main ideas they explore:
- How can we make large, diverse datasets of robot motions without the cost and hassle of recording human actors with motion‑capture suits?
- How can we be sure these motions are physically realistic (not just good‑looking animations)?
- How can we automatically pair each motion with clear natural‑language descriptions so robots can learn to understand instructions?
How does CLAW work? (Explained simply)
You can think of CLAW as a “robot motion studio” with four main parts working together:
- A browser interface (what you see): Two ways to create motions
  - Keyboard mode: Like controlling a game character with keys to make the robot walk, turn, crouch, punch, and more in real time.
  - Editor mode: Like a timeline in a video editor where you drag and drop motion clips (walk → turn → crawl, etc.) to build longer sequences you can replay exactly.
- A planner (the choreographer): It turns your high‑level choices (like “run forward, turn left, go slow, keep hips low”) into a short plan for how the robot should move next.
- A controller + physics simulator (the “body” and the “physics sandbox”):
  - Whole‑body controller: Ensures the robot’s joints and balance follow the plan.
  - MuJoCo simulator: A physics world that makes sure the movement would make sense in real life—no feet sliding or impossible poses.
- A language engine (the caption writer): It uses templates and synonyms to automatically write multiple styles of descriptions for each motion segment and for the whole sequence, like:
  - Instruction style: “Walk forward for 3 seconds.”
  - Story style: “The robot strides ahead for 3 seconds.”
  - Concise style: “walk forward 3s”
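To make the "caption writer" idea concrete, here is a minimal Python sketch of a deterministic, template-based annotator. It is only an illustration: the `Segment` fields, the synonym bank, the template strings, and the hash-based `pick` helper are all assumptions made for this example and are not taken from the CLAW code.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One motion segment (hypothetical fields, for illustration only)."""
    mode: str          # e.g. "walk"
    direction: str     # e.g. "forward"
    duration_s: float  # segment length in seconds


# Hypothetical synonym bank: one verb list per mode and register.
SYNONYMS = {
    "walk": {
        "instruction": ["Walk", "Step"],
        "narrative": ["strides", "walks"],
        "concise": ["walk"],
    },
}

# Hypothetical templates, one per register.
TEMPLATES = {
    "instruction": "{verb} {direction} for {duration:g} seconds.",
    "narrative": "The robot {verb} {direction} for {duration:g} seconds.",
    "concise": "{verb} {direction} {duration:g}s",
}


def pick(options, key):
    """Choose an option from a hash of the key, so the same input
    always yields the same wording (no random state involved)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return options[digest[0] % len(options)]


def annotate(seg, register):
    """Render one segment in the requested register."""
    key = f"{seg.mode}|{seg.direction}|{seg.duration_s}|{register}"
    verb = pick(SYNONYMS[seg.mode][register], key)
    return TEMPLATES[register].format(
        verb=verb, direction=seg.direction, duration=seg.duration_s
    )


if __name__ == "__main__":
    seg = Segment("walk", "forward", 3.0)
    for register in TEMPLATES:
        print(register, "->", annotate(seg, register))
```

Because the wording is chosen from a hash of the segment parameters rather than from a random generator, rerunning the sketch on the same segment reproduces the same captions, which is the property the summary highlights for template-based labels.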
Some helpful translations of technical terms:
- Motion primitive: A basic move, like “walk,” “run,” “turn,” “squat,” “punch,” or “crawl.” These are building blocks you can stack to make longer routines.
- Kinematic planner: A tool that figures out the shapes and paths of motions (like a choreographer), without worrying about forces yet.
- Whole‑body controller: A low‑level system that makes the robot actually follow the plan smoothly and safely.
- Physics simulation (MuJoCo): A realistic “virtual world” where gravity, friction, and balance are checked—like testing a move in a safe sandbox before trying it for real.
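The "building blocks" idea can also be sketched as a small data structure. The following Python example assumes a hypothetical `Primitive` record with the five parameters named in the abstract (movement, heading, speed, pelvis height, duration); the field names, units, and helper functions are illustrative assumptions, not the system's actual schema.

```python
from dataclasses import dataclass

RECORD_HZ = 50  # recording rate stated in the abstract


@dataclass(frozen=True)
class Primitive:
    """A parameterized motion primitive (hypothetical field names and units)."""
    mode: str             # e.g. "walk", "crawl"
    movement: str         # movement direction, e.g. "forward"
    heading_deg: float    # facing direction (assumed degrees)
    speed: float          # assumed m/s
    pelvis_height: float  # assumed metres
    duration_s: float     # segment duration in seconds


def build_timeline(primitives):
    """Stack primitives end to end, returning (start_s, end_s, primitive) tuples."""
    timeline, t = [], 0.0
    for p in primitives:
        timeline.append((t, t + p.duration_s, p))
        t += p.duration_s
    return timeline


def total_frames(primitives):
    """Number of recorded frames the sequence would span at RECORD_HZ."""
    return sum(round(p.duration_s * RECORD_HZ) for p in primitives)


if __name__ == "__main__":
    sequence = [
        Primitive("walk", "forward", 0.0, 1.0, 0.75, 3.0),
        Primitive("crawl", "forward", 0.0, 0.3, 0.40, 1.5),
    ]
    for start, end, p in build_timeline(sequence):
        print(f"{start:4.1f}s to {end:4.1f}s  {p.mode}")
    print("frames:", total_frames(sequence))
```

Keeping each primitive immutable and self-describing is one way to make a sequence replayable exactly, as the editor mode is described as doing.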
What did they find, and why is it important?
The paper reports three main results:
- Smoothly mixing different moves works well most of the time
  - You can switch between different motion primitives—like walk → squat → crawl → punch—and the system often creates natural transitions without any manual “blending.” This makes long, interesting motion sequences possible.
  - Sometimes, switching between very different poses (for example, crawl → punch) can look awkward. The authors note this as a challenge to improve.
- Automatic language descriptions are accurate and varied
  - For every segment and entire sequence, the system produces multiple descriptions in different styles, using templates and synonym lists. Because it’s template‑based, it’s consistent and repeatable—great for training AI systems that need reliable labels.
- Scales to lots of data, fast
  - No motion‑capture sessions are needed, which saves time and money.
  - The editor can batch‑generate endless motion sequences with small changes (like different speeds, directions, or styles), creating a big, diverse training set.
  - Motions are already in the robot’s own joint format, so there’s no “retargeting” step—another time saver.
Why this matters:
- Training robots to understand and follow language commands (like “slowly crouch, then crawl forward, then stand and turn right”) needs many examples of motion paired with matching text. CLAW makes those pairs quickly, cheaply, and with physics realism.
What could this mean for the future?
- Faster robot learning: Researchers can build huge, high‑quality datasets to train robots that follow spoken or written instructions with their whole body.
- More reliable motions: Because the motions are tested in physics, they’re more likely to work on real robots.
- Easy to extend: The system is modular, so you can swap in different motion planners or add more move types later.
- Next steps: The authors suggest adding LLMs for even richer descriptions, letting users define new motion primitives, and automatically detecting bad transitions to keep quality high.
In short, CLAW is like a motion “recipe maker” and caption “writer” for humanoid robots. It helps researchers quickly create and label realistic, varied movements—exactly what’s needed to teach robots to understand and carry out human instructions.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.
- No real-world validation: the pipeline is demonstrated only in MuJoCo; there is no evaluation of sim-to-real transfer on a physical Unitree G1 (e.g., tracking fidelity, failure modes, latency/actuation limits).
- Annotation–execution mismatch: language is derived from intended planner parameters rather than realized motion; the system does not verify or adjust text to reflect actual executed speeds, headings, durations, or path deviations.
- Motion quality is unquantified: there are no metrics reported for foot slip, contact consistency, joint-limit/torque-limit violations, COM/ZMP stability, jerk, or smoothness across transitions.
- Transition failures are not characterized: the frequency, severity, and conditions of abrupt transitions between mode pairs are unmeasured; no automated detector or filter is implemented to flag and remove low-quality transitions at scale.
- Dataset coverage is unspecified: the paper provides no statistics on the distribution of modes, parameter ranges (speed, heading, height), sequence lengths, or transition graphs; sampling policies for broad coverage are not defined.
- Impact on downstream learning is untested: there is no evidence that training with the generated motion–language data improves language-conditioned control compared to MoCap or text-to-motion baselines, either in simulation or on hardware.
- Planner-agnostic claim is unverified: only the SONIC backend is used; integration with alternative planners/controllers is not demonstrated or benchmarked.
- Fixed, limited behavior set: only 25 pre-defined modes are available, with no mechanism or API shown for adding user-defined primitives, learned skills, or task-specific behaviors.
- Environment impoverishment: motions are generated on flat terrain without obstacles, stairs, uneven friction, or external objects (despite an “Object Carrying” style), limiting ecological validity and contact diversity.
- Long-horizon robustness untested: stability over very long sequences (drift, cumulative tracking error, thermal/energy constraints, contact fatigue) is not measured.
- Annotation expressivity is narrow: text encodes only mode, movement, heading, speed, height, and duration—omitting richer semantics such as distance traveled, absolute/relative positions, contact events, footfall timing, or task goals.
- Language quality is unevaluated: there is no human or automatic assessment of fluency, naturalness, correctness, or stylistic variety; the effect of deterministic synonym banks on perceived diversity is unknown.
- Temporal referencing choices are unchecked: the heavy reliance on durations (rather than human-salient distances or landmarks) is not validated with user studies or perception experiments.
- No language–motion alignment metric: the paper lacks quantitative measures (e.g., alignment error between described and realized velocities/turn magnitudes, or text consistency via QA models) to ensure annotation fidelity.
- Multilingual support is absent: the system is English-only; there is no plan or evaluation for multilingual templates or cross-lingual consistency.
- Scaling characteristics are unreported: throughput (trajectories/hour), compute and memory footprint, and I/O bottlenecks for large-batch generation are not benchmarked; distributed/cloud deployments are unexplored.
- Data packaging and standards are undefined: the schema, metadata, and interoperability with common datasets/formats (e.g., for contacts, forces, events) are not specified; no public dataset release or statistics are provided.
- No automatic curation: there is no pipeline for deduplication, diversity-aware sampling, or balancing across modes/parameters to avoid over-represented behaviors.
- Domain randomization is missing: physics and rendering parameters (friction, mass, sensor noise) are not randomized to support sim-to-real robustness.
- Safety and feasibility constraints are unspecified: limits on joint velocities, torques, contact forces, and controller saturation are not enforced or reported during generation.
- UI usability is unassessed: there is no user study on the keyboard/editor interfaces (speed of authoring, error rates, learnability), and no guidance on how operator bias affects data distributions.
- Compositional language complexity is limited: annotations do not include coreference, anaphora, or higher-level task semantics across segments; it’s unclear whether such simplified language suffices for training models that generalize to natural free-form instructions.
- Parameter-to-language calibration is ad hoc: mappings from continuous parameters to adverbs/phrases (e.g., “at full speed”) are heuristic and not grounded in human perception; boundary effects and ambiguity are not analyzed.
- Generalization across embodiments is untested: although claimed backend-agnostic, there is no evidence the approach transfers to different humanoids or morphologies without re-tuning.
- Comparisons to alternative data sources are missing: the pipeline is not benchmarked against MoCap-based or text-to-motion datasets in terms of diversity, realism, or downstream policy performance.
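Several of the gaps above (annotation–execution mismatch, the missing language–motion alignment metric) could be probed with a very simple check. The Python sketch below compares the speed a caption was generated from with the speed actually realized in the recorded root trajectory. It is an assumption-laden illustration: the function names, the use of planar root positions, and the choice of mean path speed as the metric are not from the paper.

```python
import math


def realized_speed(xy, rate_hz=50.0):
    """Mean planar path speed (m/s) of a root trajectory.

    xy: list of (x, y) positions in metres, one per recorded frame.
    rate_hz: recording rate (50 Hz per the abstract).
    """
    if len(xy) < 2:
        raise ValueError("need at least two frames")
    path_length = sum(math.dist(a, b) for a, b in zip(xy, xy[1:]))
    duration_s = (len(xy) - 1) / rate_hz
    return path_length / duration_s


def speed_alignment_error(intended_speed, xy, rate_hz=50.0):
    """Absolute gap between the commanded speed and the realized speed."""
    return abs(realized_speed(xy, rate_hz) - intended_speed)


if __name__ == "__main__":
    # Synthetic trajectory: 3 s of straight-line motion, 0.016 m per frame.
    trajectory = [(0.016 * i, 0.0) for i in range(151)]
    print("realized speed:", realized_speed(trajectory))
    print("error vs. commanded 1.0 m/s:", speed_alignment_error(1.0, trajectory))
```

A dataset builder could log this error per segment and either rewrite the caption from the realized value or drop segments whose error exceeds a chosen tolerance; the tolerance itself would need to be calibrated.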
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that can be implemented with the system as described, leveraging its composable motion primitives, deterministic language annotations, and modular, planner-agnostic architecture.
- Scalable generation of language–motion datasets for humanoid control
- Description: Use the browser editor or keyboard mode to synthesize large, reproducible corpora of physically grounded whole-body trajectories (50 Hz) paired with multi-style, deterministic natural-language descriptions.
- Sectors: Robotics, Software/AI tooling, Academia.
- Tools/workflows: CLAW sequence editor and keyboard UI, WebSocket–ZMQ bridge, MuJoCo simulation, template-based annotation engine; “recipe-to-dataset” batch pipelines; seed/version-controlled dataset builds.
- Assumptions/dependencies: MuJoCo license and compute; access to a compatible kinematic planner/whole-body controller (e.g., SONIC); coverage limited to the 25 included motion modes; template-based language may limit linguistic richness.
- Pretraining and finetuning of language-conditioned policies
- Description: Train supervised or offline RL policies to map language to kinematic references or joint-space targets using physically simulated, time-aligned data.
- Sectors: Robotics (humanoids), AI research.
- Tools/workflows: Data loaders for joint trajectories + text; curriculum generation by parameter sweeps (speed, heading, pelvis height); reproducible seeds for ablation studies.
- Assumptions/dependencies: Sim-to-real gap remains; controller fidelity and contact modeling quality in MuJoCo affect transfer; annotations cover semantics of motion primitives, not arbitrary tasks.
- Rapid motion prototyping and choreography for demos
- Description: Compose and preview long-horizon motion sequences (e.g., walk → turn → kneel → crawl → stand) for trade shows, lab demos, or proof-of-concept videos.
- Sectors: Robotics industry, Media/entertainment (previsualization).
- Tools/workflows: Keyboard mode for exploration, timeline editor for reproducible routines; export of sequences and accompanying language scripts.
- Assumptions/dependencies: On-robot execution requires robust tracking and safety constraints; some inter-mode transitions may be abrupt and should be manually filtered.
- Benchmarking and regression testing for planners/controllers
- Description: Establish fixed “recipes” to repeatedly evaluate controller robustness, transition smoothness, gait stability, and failure transitions across software revisions.
- Sectors: Robotics R&D, Software QA.
- Tools/workflows: Deterministic scenario suites, KPI metric harnesses (e.g., foot slip, COM excursions, tracking error), automated pass/fail gating.
- Assumptions/dependencies: Requires metric definitions and logging; failure modes noted in the paper (e.g., crawl → boxing) should be included as stress tests.
- Education and training in whole-body control and HRI
- Description: Use the browser-based interfaces in classrooms to teach kinematics, locomotion, whole-body tracking, and language grounding without motion-capture infrastructure.
- Sectors: Education (university courses, bootcamps).
- Tools/workflows: Instructor-provided recipes, parameter variation exercises, reproducible labs; student assignments on data generation and model training.
- Assumptions/dependencies: Classroom access to MuJoCo-compatible machines; faculty-developed teaching materials.
- Internal “dataset factory” for humanoid teams
- Description: Run simulator farms that generate language–motion pairs at scale, varying parameters systematically for coverage.
- Sectors: Robotics/AI companies.
- Tools/workflows: Containerized workers; job queues that sample motion modes, seeds, and parameters; provenance tracking (seed, config, version).
- Assumptions/dependencies: Orchestration and storage infrastructure; cost control for compute and licensing.
- Controlled HRI studies on language grounding
- Description: Investigate how linguistic register and temporal grounding (with/without durations) influence policy learning and interpretability.
- Sectors: Academia (HRI, NLP + Robotics).
- Tools/workflows: Eight-style deterministic annotations; ablation across registers; evaluation of instruction-following accuracy.
- Assumptions/dependencies: Template-driven phrasing may bias outcomes; content is constrained to supported primitives.
- Synthetic data for vision/pose pipelines (optional rendering)
- Description: Render camera feeds from MuJoCo alongside ground-truth kinematics and aligned language to train cross-modal perception or captioning models.
- Sectors: Computer vision, Multimodal AI.
- Tools/workflows: Simulator camera rigs; export of RGB + depth + joint states + text; multimodal dataset packaging.
- Assumptions/dependencies: Rendering integration and domain randomization required; pipeline currently centers on motion/language, not photorealistic visuals.
- Audit-ready documentation of robot motion
- Description: Log executed recipes with deterministic language summaries for internal compliance and traceability in development pipelines.
- Sectors: Industrial QA/compliance.
- Tools/workflows: Runbooks that pair seeds, configs, and natural-language summaries; reproducibility checks in CI.
- Assumptions/dependencies: Internal policies and documentation standards; alignment with future regulatory requirements.
- Content previsualization for robot-centric experiences
- Description: Quickly generate physically plausible motion beats for robot performances or mixed reality experiences before transferring key motions to on-robot controllers.
- Sectors: Entertainment, Events.
- Tools/workflows: Sequence editor for beatboards; exported timings and story text from annotations.
- Assumptions/dependencies: Export formats to DCC tools may need adapters; fine motion styling limited by available modes.
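Several of these workflows (parameter sweeps, seed-controlled dataset builds, job queues that sample modes and parameters) share one pattern: enumerate a parameter grid, then sample recipes from it reproducibly. A minimal Python sketch of that pattern follows. The mode names, parameter values, and recipe layout are hypothetical placeholders, not the system's actual configuration.

```python
import itertools
import random

# Hypothetical subset of modes and parameter values, for illustration only.
MODES = ["walk", "run", "squat", "crawl"]
SPEEDS = [0.5, 1.0]
HEADINGS_DEG = [-90, 0, 90]
DURATIONS_S = [1.0, 2.0, 3.0]


def parameter_grid():
    """Every (mode, speed, heading) combination as a dict."""
    return [
        {"mode": m, "speed": s, "heading_deg": h}
        for m, s, h in itertools.product(MODES, SPEEDS, HEADINGS_DEG)
    ]


def sample_recipes(n_recipes, segments_per_recipe, seed):
    """Draw recipes from the grid with a private, seeded generator,
    so the same seed always reproduces the same recipes."""
    rng = random.Random(seed)
    grid = parameter_grid()
    recipes = []
    for _ in range(n_recipes):
        segments = []
        for _ in range(segments_per_recipe):
            segment = dict(rng.choice(grid))  # copy so grid entries stay untouched
            segment["duration_s"] = rng.choice(DURATIONS_S)
            segments.append(segment)
        recipes.append(segments)
    return recipes


if __name__ == "__main__":
    print("grid size:", len(parameter_grid()))
    for recipe in sample_recipes(n_recipes=2, segments_per_recipe=3, seed=0):
        print([f"{s['mode']}@{s['speed']}" for s in recipe])
```

Storing the seed alongside the grid definition and code version is what would make such a build traceable in the sense the provenance-tracking bullet describes.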
Long-Term Applications
These opportunities require additional research, scaling, backend extensions, or sim-to-real advances before broad deployment.
- Language-driven whole-body control on real humanoids (sim-to-real)
- Description: Train policies that execute verbal instructions end-to-end on physical robots (e.g., Unitree G1) using synthetic pretraining plus on-robot finetuning.
- Sectors: Robotics (industrial, service), Daily life (future consumer robots).
- Potential products/workflows: On-robot instruction-following stacks; safety-verified deployment pipelines; closed-loop synthetic→real training regimes.
- Assumptions/dependencies: High-fidelity contact modeling; robust whole-body controllers; perception and safety systems; rigorous sim-to-real adaptation.
- Cross-platform, standardized language–motion corpora and benchmarks
- Description: Multi-vendor datasets and task suites for evaluating language-conditioned humanoid control, supporting regulatory and interoperability goals.
- Sectors: Policy/standards, Industry consortia, Academia.
- Potential products/workflows: Benchmark task libraries, public leaderboards, standardized annotation schemas and seeds for reproducibility.
- Assumptions/dependencies: Broad adoption across robot morphologies; extensible primitive sets; governance for data quality and safety.
- Natural-language programming of robots (programming-by-description)
- Description: Non-experts specify tasks in everyday language that compile to sequences of composable whole-body skills.
- Sectors: Industrial automation, Facilities operations, Daily life (assistive robots).
- Potential products/workflows: NL-to-skill compilers; operator UIs; verification layers that simulate and validate before execution.
- Assumptions/dependencies: Expanded primitive libraries (manipulation, tool use); robust semantic parsing; safety and failover policies.
- LLM-augmented, multilingual, and richer annotation ecosystems
- Description: Enhance template annotations with LLM paraphrasing, multilingual support, and task-level narratives for broader generalization.
- Sectors: NLP+Robotics, Global deployments.
- Potential products/workflows: Annotation plugins with controllable variation; bias/fairness auditing; language coverage expansion.
- Assumptions/dependencies: Managing nondeterminism; preventing hallucinations and drift; data governance; compute costs.
- Automatic transition QA and motion quality scoring
- Description: Learn metrics and detectors for low-quality transitions to auto-filter or penalize problematic mode pairs during dataset generation.
- Sectors: Robotics MLOps, QA tooling.
- Potential products/workflows: Transition scorecards, failure case discovery services, CI gates for motion quality.
- Assumptions/dependencies: Ground-truth labels or proxy metrics; calibration across robot morphologies.
- Extensible primitive marketplaces and learned skills
- Description: Integrate user-defined and learned primitives (e.g., manipulation, terrain negotiation) into the compositional framework; shareable skill libraries.
- Sectors: Robotics software ecosystem, App marketplaces.
- Potential products/workflows: Skill packaging and distribution; versioned skill registries; safe composition validators.
- Assumptions/dependencies: Interfaces for skill specification; verification in sim; IP/licensing frameworks.
- Facility-scale digital twins for robot readiness
- Description: Combine composable whole-body motion with environment models to stress-test navigation and posture transitions before deployment.
- Sectors: Logistics, Manufacturing, Energy.
- Potential products/workflows: Scenario generators; hazard simulations; readiness reports for task rollouts.
- Assumptions/dependencies: High-fidelity environment and interaction simulation; integration with perception and mapping.
- Assistive and healthcare robotics motion training
- Description: Train safe, compliant whole-body behaviors (e.g., careful walking, kneeling, posture transitions) as steps toward assistive applications.
- Sectors: Healthcare, Elder care.
- Potential products/workflows: Safety-constrained policy learning; compliance verification; clinician-in-the-loop validation workflows.
- Assumptions/dependencies: Human–robot contact modeling; stringent safety/regulatory approvals; ethical frameworks.
- Disaster response and defense training
- Description: Prepare language-directed agility (e.g., crawling, cautious gait) for complex, unstructured terrains and mission scripting.
- Sectors: Public safety, Defense.
- Potential products/workflows: Scenario packs (rubble, confined spaces); mission-level language planning; robust execution monitors.
- Assumptions/dependencies: Terrain/interaction realism; ruggedized hardware; secure operational protocols.
- Synthetic data marketplaces with provenance guarantees
- Description: Commercialize deterministic, seed-reproducible language–motion datasets with traceable provenance for industrial buyers.
- Sectors: AI data markets, Robotics vendors.
- Potential products/workflows: Dataset SKUs by mode coverage and parameter ranges; audit trails; licensing services.
- Assumptions/dependencies: Legal/ethical compliance; curation and QA; alignment with emerging data standards.
Notes on feasibility across applications:
- Core dependencies include a compatible kinematic planner and whole-body controller (SONIC in the paper), MuJoCo-based physics fidelity, and the current library of 25 motion modes.
- Deterministic, template-based annotations ensure reproducibility but may limit linguistic diversity and domain transfer; LLM augmentation can improve breadth at the cost of determinism and curation burden.
- Some inter-mode transitions show artifacts; quality gates and automated QA will be important for scaling to safety-critical deployments.
- Sim-to-real transfer remains the primary long-term constraint for real-world execution, requiring advances in control, perception, contact modeling, and safety engineering.
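As one possible shape for the quality gates mentioned above, the Python sketch below flags a mode switch as suspicious when the finite-difference joint acceleration near the switch frame exceeds a threshold. This is a simple heuristic chosen for illustration, not a method from the paper; the window size and the threshold value are arbitrary assumptions that would need calibration.

```python
def peak_joint_accel(q, rate_hz=50.0):
    """Largest absolute second finite difference over all joints (rad/s^2).

    q: list of frames, each a list of joint angles in radians.
    """
    dt = 1.0 / rate_hz
    peak = 0.0
    for prev, cur, nxt in zip(q, q[1:], q[2:]):
        for a, b, c in zip(prev, cur, nxt):
            peak = max(peak, abs(a - 2.0 * b + c) / dt**2)
    return peak


def flag_transition(q, switch_frame, window=10, threshold=200.0, rate_hz=50.0):
    """True if the motion around switch_frame looks abrupt.

    window and threshold are hypothetical values, not tuned for any robot.
    """
    lo = max(0, switch_frame - window)
    hi = min(len(q), switch_frame + window + 1)
    return peak_joint_accel(q[lo:hi], rate_hz) > threshold


if __name__ == "__main__":
    # One joint moving on a smooth ramp, and the same ramp with a 0.5 rad jump.
    smooth = [[0.01 * i] for i in range(40)]
    abrupt = [[0.01 * i + (0.5 if i >= 20 else 0.0)] for i in range(40)]
    print("smooth flagged:", flag_transition(smooth, switch_frame=20))
    print("abrupt flagged:", flag_transition(abrupt, switch_frame=20))
```

A fuller gate would likely combine several signals, such as foot slip or tracking error, rather than rely on joint acceleration alone.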
Glossary
- AMASS: A large-scale human motion capture dataset providing unified, high-coverage motion data for learning and synthesis. "Large-scale MoCap datasets such as AMASS~\cite{mahmood2019amass} have greatly expanded the coverage of human motion,"
- Annotation engine: A system that automatically turns structured motion parameters into natural-language descriptions. "A template-based annotation engine simultaneously produces diverse natural-language descriptions from the same motion parameters, yielding time-aligned trajectory and language data."
- Black-box motion-primitive engine: A component treated purely by its inputs/outputs that generates reference motions from high-level commands. "CLAW treats the kinematic planner as a black-box motion-primitive engine."
- Composable primitives: Modular motion behaviors that can be combined to form longer, diverse trajectories. "CLAW treats the motion modes of a kinematic planner as composable primitives, each governed by parameters like movement, heading, speed, pelvis height and duration."
- Configuration space: The space of all possible joint configurations of the robot used to represent motions directly. "the recorded joint trajectories are directly in the robot's configuration space."
- Control cycle: A single iteration of the control loop in which commands are processed and applied. "Because every keypress is translated into a planner command within a single control cycle, keyboard mode is well suited for free-form exploration and rapid prototyping of novel motion sequences."
- Diffusion models: Generative models that iteratively denoise samples to synthesize data, here used for text-conditioned motion generation. "text-to-motion diffusion models like Kimodo~\cite{rempe2026kimodo} are trained to synthesize diverse whole-body movements from language prompts."
- Dynamic trajectories: Motions produced by a physics simulator in response to control inputs, capturing physically consistent dynamics. "kinematic reference trajectories (the planner's output) and dynamic trajectories (the MuJoCo-simulated response to those references)."
- Embodiments (disparate embodiments): Different physical forms or morphologies that can complicate transferring motions between systems. "retargeting across disparate embodiments introduces kinematic artifacts—foot sliding, self-penetration, and joint-limit violations—"
- Foot sliding: An artifact where the feet appear to slip unrealistically during contact due to kinematic inconsistencies. "foot sliding, self-penetration, and joint-limit violations—"
- Gait style: The qualitative pattern of locomotion (e.g., walking nuances) as a controllable motion attribute. "pelvis height and gait style—properties that are critical for systematic data augmentation."
- Heading: The facing direction of the robot independent of its movement direction. "each parameterized by movement, heading, speed, pelvis height and duration"
- Joint-limit violations: Motions that exceed the mechanical range of joints, causing unphysical or unsafe configurations. "foot sliding, self-penetration, and joint-limit violations—"
- Joint trajectories: Time series of joint angles/velocities generated by tracking or planning. "to yield physically grounded joint trajectories."
- Kinematic artifacts: Undesirable motion defects arising from purely geometric motion processing without dynamics. "introduces kinematic artifacts—foot sliding, self-penetration, and joint-limit violations—"
- Kinematic planner: A planner that generates reference motions based on kinematic constraints rather than full dynamics. "CLAW treats the motion modes of a kinematic planner as composable building blocks"
- Kinematic reference motion: Short-horizon target motion produced by the planner for the controller to track. "The planner consumes this command together with the current robot state and produces a kinematic reference motion"
- Kinematic retargeting: Adapting captured human motion to a different morphology (e.g., a robot) using kinematic mapping. "human motion capture (MoCap) followed by kinematic retargeting to the target morphology~\cite{luo2023perpetual,peng2018deepmimic}."
- Loco-manipulation: Tasks involving coordinated whole-body locomotion and manipulation. "High-quality reference motions are essential for training humanoid robots to perform whole-body loco-manipulation tasks."
- MoCap: Abbreviation for motion capture, the process of recording human movements for animation or robotics. "human motion capture (MoCap)"
- Mode switching: Changing the active motion mode during a sequence to achieve transitions between behaviors. "Composability arises from mode switching: when the bridge sends a new mode index mid-session, the resulting motion transitions smoothly from the current state to the new behavior"
- Motion primitive: A parameterized, self-contained behavior unit that can be chained with others. "This parameterization renders each mode a self-contained motion primitive that can be composed with any other mode to form longer sequences."
- Motion stitching: Combining multiple, distinct motion segments into a continuous trajectory with smooth transitions. "Motion stitching. CLAW enables motion stitching across semantically distinct motion modes."
- MuJoCo: A physics engine for detailed, efficient rigid-body simulation widely used in robotics. "A low-level whole-body controller tracks the planner's kinematic references in MuJoCo simulation, producing physically grounded trajectories recorded at 50 Hz."
- Orchestration: Coordinating processes and message flows to execute recipes and manage trajectories. "a WebSocket–ZMQ interface, acts as the central coordination layer. It translates communication protocols, maintains the planner state, aggregates telemetry into a unified state representation, and orchestrates command sequencing and trajectory management."
- Pelvis height: A controllable parameter adjusting the vertical position of the robot’s pelvis to vary posture. "each parameterized by movement, heading, speed, pelvis height and duration"
- Physics-based tracking: Using a physics controller to follow kinematic references, ensuring physical feasibility. "an additional physics-based tracking step~\cite{luo2023perpetual} is typically required before deployment on a real robot."
- Planner-agnostic: Designed to work with any motion-generation backend without changing the rest of the system. "with a planner-agnostic architecture that decouples data collection from the choice of motion-generation backend."
- Retargeting: Mapping motion from one character or robot to another, often across different morphologies. "no retargeting step is needed— the recorded joint trajectories are directly in the robot's configuration space."
- Self-penetration: An artifact where body parts intersect unrealistically due to kinematic inconsistencies. "foot sliding, self-penetration, and joint-limit violations—"
- Skeleton (Unitree G1 skeleton): The kinematic structure (joints and links) defining the robot’s body for motion generation. "the planner operates natively on the Unitree G1 skeleton"
- Strafing: Sideways translational movement without changing facing direction. "the comma and period keys trigger lateral strafing"
- Stylistic register: The linguistic style of the generated descriptions (e.g., instruction, natural, narrative, concise). "The first factor controls register: instruction (imperative, e.g. 'Walk forward for 3.0 seconds'), natural (adverbial, e.g. 'Stride ahead briskly for about 3.0 seconds'), narrative (third-person, e.g. 'The robot marches forward for 3.0 seconds'), and concise (keyword-only, e.g. 'walk forward 3.0s')."
- Telemetry: Real-time status data and measurements streamed from controllers and simulators. "maintains the planner state, aggregates telemetry into a unified state representation"
- Text-to-motion: Generating motion from textual prompts using learned models. "text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible."
- Time-aligned: Synchronized in time so motion and language correspond segment-by-segment. "yielding time-aligned trajectory and language data."
- Unitree G1: A specific humanoid robot platform used as the target for motion generation and control. "language-annotated whole-body motion data for the Unitree G1 humanoid robot."
- WebSocket–ZMQ bridge: A middleware component translating between WebSocket and ZMQ protocols for low-latency control. "both streaming commands at 20 Hz to a kinematic planner via a WebSocket–ZMQ bridge."
- Whole-body controller: A controller that coordinates all joints to track desired motions across the entire robot body. "A low-level whole-body controller tracks the planner's kinematic references in MuJoCo simulation"
- ZMQ (ZeroMQ): A high-performance asynchronous messaging library used for inter-process communication. "The bridge forwards planner commands to the controller via ZMQ at 20 Hz."

