World Grounding with Pretrained Skills
- World grounding is the process of mapping high-level symbolic task logic to raw sensory data so that tasks can be executed in embodied environments.
- Pretrained skills, derived from large language and vision models, are integrated into controllers to enable zero-shot transfer and adaptive behavior in novel contexts.
- These systems are validated through probabilistic verification and real-world benchmarks that test performance under perceptual uncertainty.
World grounding with pretrained skills refers to the process of linking the high-level knowledge, reasoning, or behavior encoded in large pretrained models—such as language, vision, or foundation models—to actionable perception and control in real-world or embodied environments. This paradigm integrates structured knowledge, open-vocabulary semantics, and pretrained skills into executable policies or controllers with direct connections to sensory observations and physical actuation. The resulting systems not only comprehend task objectives at a symbolic level but are also verifiably executable in the presence of perception and actuation uncertainties.
1. Key Principles and Definitions
World grounding denotes the mapping between symbolic or high-level task logic (often specified in natural language or formal task representations) and raw, multimodal observations or action spaces in an embodied environment. Pretrained skills are reusable capabilities or behaviors, often encoded as policies, controllers, or semantic skill abstractions, that are derived from large pretrained models such as LLMs and vision-language models (VLMs) or learned from data sources such as motion capture and imitation-learning demonstrations.
The central challenge is to leverage frozen or minimally adapted pretrained models (language, vision, multimodal) to synthesize, select, or compose such skills, while ensuring their applicability and robustness in diverse, possibly previously unseen, real-world contexts. The grounding mechanism must resolve four complementary requirements:
- Mapping high-level task descriptions to executable skills or controllers.
- Connecting symbolic propositions or action candidates to sensory observations in real time.
- Verifying whether the planned policy satisfies user intent or safety objectives, possibly under uncertainty or domain shift.
- Enabling adaptation, transfer, or composition of skills to novel objects, domains, or environments without retraining.
2. Grounding Architectures and Methodologies
Contemporary frameworks for world grounding deploy a variety of system architectures that combine pretrained models with tailored components for task decomposition, perception, planning, and execution. Representative methodologies include:
- Automaton-based controllers from LLMs: Task descriptions are parsed via generative language models (GLMs) into sequences of symbolic actions, each with associated preconditions and effects, which are assembled into finite-state automata encoding the control logic. Observations are linked to symbolic propositions by querying vision-language models (VLMs) to score the truth values of atomic predicates in real images. This allows controllers to be constructed from natural language, formally verified, and visually grounded, as in the GLMtoFSA pipeline (Yang et al., 2023).
- Grounded task-axis controllers (GTACs): Manipulation skills are decomposed into prioritized stacks of low-level controllers, each parameterized by keypoints and control axes, which are semantically grounded in the geometry of target objects using vision foundation models (e.g., SD-DINO). This enables zero-shot skill transfer by matching reference keypoints and axes to semantically similar structures on novel objects (Seker et al., 16 May 2025).
- Reinforcement and symbolic planning integration: Bidirectional frameworks such as SCALAR generate and iteratively refine skill libraries using LLMs to propose symbolic skills (with preconditions and effects), while deep RL is used to ground these high-level descriptions into executable policies via reward shaping and trajectory analysis, closing the planning–execution loop (Zabounidis et al., 10 Mar 2026).
- Multimodal grounding in navigation and control: Skill libraries parameterized by natural language instructions and latent policy embeddings can be composed or blended in real time, conditioned on perceptual input (via VLM feature extraction) and/or high-level instructions. Real-time inference fuses cross-modal cues for robust, adaptive behavior (Nahrendra et al., 11 Feb 2026, Shen et al., 24 Jan 2026).
- Probabilistic formal verification and specification satisfaction: Controllers constructed from pretrained models are formally verified against task and safety specifications (e.g., in LTL), using automata-theoretic model checking. Probabilistic guarantees are established to account for perceptual uncertainty stemming from imperfect VLM-based grounding; overall satisfaction probabilities can be bounded analytically (Yang et al., 2023).
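The automaton-based approach above can be illustrated with a minimal sketch. It is not the GLMtoFSA implementation: the `Step` class, the `run_controller` function, the linear (non-branching) automaton, the 0.8 threshold, and the proposition strings are all assumptions made for illustration, and `stub_vlm_score` stands in for a real VLM query over an image.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, float]  # hypothetical: proposition text -> VLM confidence


@dataclass
class Step:
    precondition: str  # atomic proposition that must hold before acting
    action: str        # symbolic action issued once the precondition holds


def run_controller(
    steps: List[Step],
    score: Callable[[str, Observation], float],
    observations: List[Observation],
    threshold: float = 0.8,
) -> Tuple[int, List[str]]:
    """Walk a linear automaton: state i waits for steps[i].precondition."""
    state = 0
    trace: List[str] = []
    for obs in observations:
        if state == len(steps):  # accepting state reached
            break
        step = steps[state]
        if score(step.precondition, obs) >= threshold:
            trace.append(step.action)  # transition to the next state
            state += 1
        else:
            trace.append("noop")       # self-loop: stay and re-observe
    return state, trace


def stub_vlm_score(proposition: str, obs: Observation) -> float:
    """Stand-in for a VLM query; a real system would score an image."""
    return obs.get(proposition, 0.0)


steps = [
    Step("pedestrian light is green", "cross_road"),
    Step("robot is on the far sidewalk", "stop"),
]
observations = [
    {"pedestrian light is green": 0.40},
    {"pedestrian light is green": 0.93},
    {"robot is on the far sidewalk": 0.55},
    {"robot is on the far sidewalk": 0.90},
]
final_state, trace = run_controller(steps, stub_vlm_score, observations)
print(final_state, trace)
```

In this sketch a low-confidence proposition never triggers an action; the controller simply waits in place, which is one simple way to keep perceptual ambiguity from advancing the automaton prematurely.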
3. Perception, Affordance, and Symbolic-to-Sensor Mapping
A crucial aspect of world grounding is the connective interface between symbolic task logic or skill representations and sensor data. Several strategies have emerged:
- Vision-language proposition evaluation: Each atomic proposition about the environment is scored by an open-vocabulary VLM for its likelihood given current visual observations. Thresholding or uncertainty handling (e.g., via self-looping “noop” automaton transitions) accommodates perceptual ambiguity (Yang et al., 2023).
- Affordance-based skill selection: In robotic instruction following, each candidate skill is accompanied by an affordance model (often a value function or Q-function) estimating real-world feasibility; combined with LLM-derived sub-task proposals, this enables contextually grounded skill selection (Ahn et al., 2022, Shin et al., 2024).
- Geometry-aware foundation model grounding: Keypoints and axes for controllers are identified on target objects via the embedding similarity between reference and observed image features, enabling precise geometric alignment for skill execution without retraining (Seker et al., 16 May 2025).
- Adaptive cross-modal alignment: Occupancy prediction or visual-language grounding frameworks employ adapters to map 3D or visual embeddings into VLM spaces, closing modality gaps and enabling open-world generalization across known/unknown classes (Li et al., 14 Apr 2025, Zhang et al., 9 Mar 2026, Peng et al., 2023).
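As a rough illustration of affordance-based skill selection from the list above, the sketch below combines a language-model relevance score with an affordance estimate for each candidate skill. Multiplying the two scores is one common choice and is assumed here; the function name `select_skill`, the skill names, and all numeric values are hypothetical.

```python
from typing import Dict


def select_skill(llm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Pick the skill maximizing (task relevance) x (estimated feasibility)."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=combined.get)


# Hypothetical scores: the LLM prefers "pick up sponge", but the affordance
# model judges it infeasible in the current state (e.g., sponge not visible).
llm_scores = {"pick up sponge": 0.6, "open drawer": 0.3, "go to sink": 0.1}
affordances = {"pick up sponge": 0.1, "open drawer": 0.9, "go to sink": 0.8}

chosen = select_skill(llm_scores, affordances)
print(chosen)
```

The product form means a skill must be both relevant to the instruction and feasible in the current state to be selected; a skill with no affordance estimate is treated as infeasible in this sketch.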
4. Skill Transfer, Hierarchical Decomposition, and Composition
Skill transfer and hierarchical decomposition techniques are critical for grounding high-level tasks in complex or dynamic environments:
- Hierarchical skill libraries: Semantic skills are organized in a multi-level hierarchy from low-level atomic skills to high-level semantic composites. Grounding is performed via iterative decomposition, LM-guided subtask generation, and multi-modal feasibility assessment, allowing transfer to new domains with minimal adaptation. In cross-domain embodied instruction following on VirtualHome, this approach reaches a 54% success rate, compared with 30% for SayCan and 22% for LLM-Planner (Shin et al., 2024).
- Zero-shot and few-shot transfer: Controllers grounded using semantic features—rather than category-specific mappings—can be applied to previously unseen object instances or even novel object categories with high success rates (∼90% for pan scraping, screwing, and pouring tasks) and sub-centimeter keypoint accuracy (Seker et al., 16 May 2025). Few-shot bootstrapping further extends out-of-distribution coverage in 3D scene perception (Li et al., 14 Apr 2025).
- Blended skill composition: Motion or manipulation policies may be synthesized by fusing the outputs of multiple pretrained expert policies, where the fusion weights are determined by the semantic planner (VLM) and dynamic selectors (state encoders), thereby supporting robust adaptation during deployment (Shen et al., 24 Jan 2026).
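Blended skill composition can be sketched as a weighted average of expert actions. The sketch below assumes that the planner and the state-dependent selector each emit one logit per expert, that the logits are summed, and that a softmax turns them into fusion weights; these choices, along with the names `softmax` and `blend_actions`, are illustrative assumptions rather than the cited method.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a small list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_actions(
    expert_actions: Sequence[Sequence[float]],
    planner_logits: Sequence[float],
    selector_logits: Sequence[float],
) -> List[float]:
    """Fuse per-expert action vectors using softmax weights."""
    logits = [p + s for p, s in zip(planner_logits, selector_logits)]
    weights = softmax(logits)
    dim = len(expert_actions[0])
    return [
        sum(w * action[i] for w, action in zip(weights, expert_actions))
        for i in range(dim)
    ]


# Two hypothetical experts proposing 2-D actions; equal combined logits
# give equal weights, so the blend is the midpoint of the two actions.
blended = blend_actions(
    expert_actions=[[1.0, 0.0], [0.0, 1.0]],
    planner_logits=[0.5, 0.0],
    selector_logits=[0.0, 0.5],
)
print(blended)
```

Because the weights form a convex combination, the blended action always stays within the range spanned by the experts' proposals.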
5. Formal Guarantees, Verification, and Empirical Validation
Robustness and formal correctness are often addressed through verification and analytical performance bounds:
- Probabilistic correctness guarantees: Combining the reliability of perception (maximum error rate δ) with the intrinsic correctness of the controller (error ε_ctrl), one can analytically bound the probability of task failure as no greater than ε_ctrl + N⋅δ, where N is the number of proposition evaluations per execution (Yang et al., 2023).
- Ablation and evaluation: Empirical studies validate the effectiveness of grounded controllers and skill libraries in both simulation (e.g., Craftax for skill grounding, Humanoid-Bench for motion composition) and real-world settings (e.g., robotic kitchen tasks, RealVLG manipulation benchmarks). Failure mode analyses identify the impact of LLM mis-planning, affordance/model misestimation, and domain shift (Zabounidis et al., 10 Mar 2026, Shen et al., 24 Jan 2026, Li et al., 16 Mar 2026, Ahn et al., 2022).
- Adaptation and specification refinement: Automatic refinement (e.g., controller branching in response to counterexamples, pivotal trajectory analysis for skill correction) can iteratively improve the alignment between planned skills and real-world performance (Yang et al., 2023, Zabounidis et al., 10 Mar 2026).
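One way to read the bound ε_ctrl + N⋅δ is as a union bound: the task can fail only if the controller itself errs or if at least one of the N proposition evaluations is wrong, and the probability of a union of events is at most the sum of their probabilities. The sketch below computes this bound and compares it with the exact failure probability under the additional assumption, made here only for illustration, that all errors are independent and occur at exactly their maximum rates. The function names and numeric values are hypothetical.

```python
def failure_bound(eps_ctrl: float, delta: float, n_evals: int) -> float:
    """Union-bound upper limit on task failure probability, capped at 1."""
    return min(1.0, eps_ctrl + n_evals * delta)


def failure_if_independent(eps_ctrl: float, delta: float, n_evals: int) -> float:
    """Failure probability if every error is independent (extra assumption)."""
    return 1.0 - (1.0 - eps_ctrl) * (1.0 - delta) ** n_evals


# Hypothetical values: 1% controller error, 0.5% per-evaluation perception
# error, and 10 proposition evaluations per execution.
bound = failure_bound(eps_ctrl=0.01, delta=0.005, n_evals=10)
exact = failure_if_independent(eps_ctrl=0.01, delta=0.005, n_evals=10)
print(bound, exact)
```

The bound grows linearly in N, so long-horizon tasks with many perception queries need a proportionally smaller per-query error rate δ to keep the same overall guarantee.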
6. Limitations, Challenges, and Future Directions
The cited works note several open limitations and directions for future work:
- Perceptual gaps: State-of-the-art VLMs remain comparatively weak at dynamic scene understanding and at queries that require active perception or temporal reasoning (e.g., video or temporal relations) (Yang et al., 2023).
- Symbolic/semantic-physical gap: Not all systems fully bridge high-level semantic reasoning with low-level motor primitives; mappings from symbolic actions to motor commands often require further model queries (Shen et al., 24 Jan 2026).
- Computational overhead: Large VLMs and LLMs impose nontrivial inference costs, necessitating exploration of lighter-weight architectures for real-time applications (Li et al., 16 Mar 2026).
- Open-world generalization: Modality adapters, entropy-based selectors, and explicit geometric constraints are employed to mitigate out-of-distribution failures, but extending these methods to richer tasks, online adaptation, and temporal consistency remains open work (Li et al., 14 Apr 2025, Zhang et al., 9 Mar 2026).
- Integration of additional modalities: Extensions to full 3D grounding, affordance reasoning, tactile integration, and task planning inside world models are active research topics (Li et al., 16 Mar 2026, He et al., 1 Dec 2025, Peng et al., 2023).
- Sample efficiency: Frontier checkpointing, formal skill decomposition, and trajectory feedback are reported to improve sample efficiency and robustness for long-horizon tasks (Zabounidis et al., 10 Mar 2026, Shin et al., 2024).
7. Representative Benchmarks and Quantitative Results
The following table summarizes results reported in the cited works:
| Method/Domain | Metric | Performance |
|---|---|---|
| Probabilistic controller (Yang et al., 2023) | Success under perceptual noise | 95% (cross-road), >90% grounding accuracy |
| GTAC zero-shot (Seker et al., 16 May 2025) | Task success (novel objects) | ~90% (scraping, pouring, screwing); <1 cm keypoint error |
| SCALAR (Zabounidis et al., 10 Mar 2026) | Diamond collection (Craftax-Classic) | 88.2% (vs. 46.9% baseline) |
| LocoVLM (Nahrendra et al., 11 Feb 2026) | Instruction-following accuracy | 91% (with VLM+LLM); 88% overall task success |
| RealVLG-R1 (Li et al., 16 Mar 2026) | Grasp success (real robot, cluttered) | 79% (vs. 2% for LGD) |
| SemGro (Shin et al., 2024) | Success rate (VirtualHome cross-domain) | 54% (vs. 30% for SayCan, 22% for LLM-Planner) |
The tight integration of pretrained skill libraries, multimodal grounding, and formal or empirical validation underpins the current state of world grounding with pretrained skills, with open avenues for greater autonomy, expressivity, and robustness in embodied AI systems.