FALCON: Foundation-Model Guided Loco-Manipulation
- The paper introduces a decoupled policy architecture that separates locomotion and manipulation using diffusion-based control guided by semantic embeddings.
- It employs a frozen vision-language model to create a shared semantic consensus channel, enabling robust cross-module coordination between specialized controllers.
- Empirical evaluations demonstrate improved in-distribution and out-of-distribution performance with sample-efficient learning and effective skill arbitration.
FoundAtion-model-guided decoupled LoCO-maNipulation visuomotor policies (FALCON) refer to a class of modular robotic control architectures that combine large-scale foundation models for policy guidance, explicit decoupling of locomotion and manipulation subsystems, and policy learning via diffusion-based techniques. The framework targets loco-manipulation scenarios, in which a robot must coordinate navigation and manipulation, and has been investigated in multiple instantiations spanning both skill-centric and policy-centric paradigms (He et al., 4 Dec 2025, Ingelhag et al., 25 Mar 2024). These systems employ multimodal foundation models to mediate high-level coordination, selection, or spatial reasoning, while retaining separate, highly specialized low-level or mid-level policies.
1. Decoupled Policy Architecture for Loco-Manipulation
FALCON introduces explicit factorization of the robot's policy into distinct modules for locomotion and manipulation. Rather than relying on a monolithic visuomotor policy that must process the union of all proprioceptive and perceptual inputs, the FALCON paradigm delineates separate diffusion-based policies for the base (locomotion) and arm (manipulation), each attending only to its own relevant observations. The justification for this decoupling lies in the inherent heterogeneity of the data distributions from each domain: legged base and dexterous end-effector observations are not statistically commensurate. A fused policy is susceptible to instability, conflicting gradients, and poor out-of-distribution (OOD) generalization, as each subtask may require disparate forms of spatial and sensory reasoning (He et al., 4 Dec 2025).
This architectural separation mitigates the performance degradation observed in centralized approaches. Each policy can be independently optimized and conditioned on the inputs most relevant to its function, yet the two are jointly coordinated through a shared semantic embedding extracted from foundation models.
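To make the factorization concrete, the following is a minimal PyTorch sketch of the decoupled structure, assuming simple MLP heads in place of the diffusion denoisers described in Section 3; all module names, dimensions, and output parameterizations are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the decoupled policy structure (PyTorch).
import torch
import torch.nn as nn

class DecoupledLocoManipPolicy(nn.Module):
    def __init__(self, z_dim=512, loco_obs_dim=48, manip_obs_dim=64):
        super().__init__()
        # Each sub-policy attends only to its own observation stream,
        # plus the shared semantic embedding z (the consensus channel).
        self.loco_head = nn.Sequential(
            nn.Linear(loco_obs_dim + z_dim, 256), nn.ReLU(),
            nn.Linear(256, 4),  # base velocity (vx, vy), heading, height
        )
        self.manip_head = nn.Sequential(
            nn.Linear(manip_obs_dim + z_dim, 256), nn.ReLU(),
            nn.Linear(256, 3),  # desired end-effector position (x, y, z)
        )

    def forward(self, loco_obs, manip_obs, z):
        # No raw observations cross between modules; coordination flows
        # only through the foundation-model embedding z.
        base_cmd = self.loco_head(torch.cat([loco_obs, z], dim=-1))
        ee_target = self.manip_head(torch.cat([manip_obs, z], dim=-1))
        return base_cmd, ee_target
```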
2. Foundation Model-Guided Semantic Coordination
Central to FALCON is the use of a frozen vision-language foundation model (e.g., CLIP) to form a "semantic consensus channel." This global channel computes a shared embedding, denoted $z$ below, from RGB images, the global proprioceptive state, and a natural-language task instruction (He et al., 4 Dec 2025). The CLIP encoders are held fixed; only lightweight trainable adapters are optimized to merge the image and text encodings into a shared latent.
Both the locomotion and manipulation diffusion policies receive their respective local observation streams, augmented with this shared embedding $z$. This arrangement ensures both specialization and semantic consensus, allowing coordination without direct observation sharing. The semantic embedding encodes high-level intent and perceptual context, acting analogously to an interpretable, frozen communication bus between subsystems.
This design contrasts with prior approaches that attempt to fuse raw sensor streams directly, which often results in poor transferability and overfitting to low-level correlations.
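A minimal sketch of such a consensus channel follows, assuming OpenAI's `clip` package and simple linear adapters; the fusion scheme (summing adapted image, text, and proprioceptive features) and all dimensions are assumptions for illustration.

```python
# Sketch of the semantic consensus channel: frozen CLIP + trainable adapters.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

class SemanticConsensusChannel(nn.Module):
    def __init__(self, z_dim=512, proprio_dim=32, device="cpu"):
        super().__init__()
        self.clip_model, _ = clip.load("ViT-B/32", device=device)
        for p in self.clip_model.parameters():
            p.requires_grad_(False)  # CLIP encoders stay frozen
        # Only these lightweight adapters are trained.
        self.img_adapter = nn.Linear(512, z_dim)
        self.txt_adapter = nn.Linear(512, z_dim)
        self.proprio_adapter = nn.Linear(proprio_dim, z_dim)

    def forward(self, image, text_tokens, proprio):
        # image: preprocessed batch (B, 3, 224, 224);
        # text_tokens: output of clip.tokenize(instruction).
        with torch.no_grad():
            img_feat = self.clip_model.encode_image(image).float()
            txt_feat = self.clip_model.encode_text(text_tokens).float()
        # Fuse modalities into the shared latent z (fusion scheme assumed).
        return (self.img_adapter(img_feat)
                + self.txt_adapter(txt_feat)
                + self.proprio_adapter(proprio))
```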
3. Diffusion-Policy Learning and Modular Control
The FALCON framework utilizes diffusion policies for both base and arm control: a form of behavior cloning in which policies are trained to denoise trajectories from Gaussian noise into action sequences closely matching demonstration data (Ingelhag et al., 25 Mar 2024, He et al., 4 Dec 2025). The diffusion loss is of the form:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{a_0,\, k,\, \epsilon \sim \mathcal{N}(0, I)} \Big[ \big\| \epsilon - \epsilon_\theta(a_k, k, c) \big\|^2 \Big],$$

where $a_0$ is the ground-truth action chunk, $k$ indexes the diffusion timestep, $a_k$ is the chunk noised to step $k$, $c$ denotes the conditioning (local observations plus the shared embedding $z$), and $\epsilon_\theta$ is the learned action denoiser.
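A single training step for this loss might look as follows, assuming a DDPM-style noise schedule and a denoiser with signature `denoiser(a_k, k, cond)`; both are assumptions consistent with the equation above rather than the paper's exact implementation.

```python
# One training step for the denoising loss above (DDPM-style schedule).
import torch
import torch.nn.functional as F

def diffusion_bc_loss(denoiser, a0, cond, alphas_cumprod):
    """a0: ground-truth action chunk (B, H, A); cond: conditioning
    (local observations plus the shared embedding z)."""
    B = a0.shape[0]
    k = torch.randint(0, len(alphas_cumprod), (B,), device=a0.device)
    eps = torch.randn_like(a0)                    # injected Gaussian noise
    ac = alphas_cumprod[k].view(B, 1, 1)
    a_k = ac.sqrt() * a0 + (1 - ac).sqrt() * eps  # noise a0 to step k
    eps_hat = denoiser(a_k, k, cond)              # predict the noise
    return F.mse_loss(eps_hat, eps)
```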
The policies are modular:
- Locomotion diffusion policy: predicts base velocity, heading, and height commands.
- Manipulation diffusion policy: predicts the desired end-effector position (processed via inverse kinematics) or, in skill-centric frameworks, a decoupled 6D Cartesian end-effector velocity together with a gripper/contact activation signal (Ingelhag et al., 25 Mar 2024).
In skill-centric instantiations, a skill library is constructed by teleoperated demonstration and per-skill diffusion policy training, with explicit skill selection and precondition checking mediated by large language models and vision-language models (Ingelhag et al., 25 Mar 2024).
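A skill-library entry in such an instantiation could be sketched as a small record pairing a trained policy with its language-based metadata; the field names and example skills below are hypothetical.

```python
# Hypothetical skill-library entry: a trained diffusion policy bundled
# with language metadata for LLM selection and VLM precondition checks.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    name: str
    description: str   # read by the LLM when selecting a skill
    precondition: str  # checked by a VLM against the current image
    policy: Callable   # diffusion policy trained from demonstrations

skill_library = [
    Skill("pick_object", "Pick up an object from a surface",
          "A graspable object is visible and within reach",
          policy=lambda obs: ...),
    Skill("open_drawer", "Open a drawer by its handle",
          "The robot is positioned in front of a closed drawer",
          policy=lambda obs: ...),
]
```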
4. Phase-Progress Head and Latent Space Structuring
A salient extension in the FALCON paradigm is the introduction of a phase-progress head, which imbues the semantic embedding with temporal task structure derived from language. Human-authored textual phase prompts (e.g., for stages such as "move," "align," "place," "close") are encoded into CLIP text-space embeddings. At each timestep, the system computes:
- A per-frame belief over discrete phases, using the dot product between current features and phase embeddings,
- A continuous progress scalar within the most likely phase, using the similarity gap between "ongoing" and "done" prompt embeddings.
This augments $z$ with temporal context, providing a coarse supervisory signal for phase inference without explicit phase labels. This component stabilizes diffusion-policy training and helps structure the latent space to reflect high-level task progression (He et al., 4 Dec 2025).
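A sketch of the phase-belief and progress computation, assuming L2-normalized CLIP embeddings and a sigmoid-squashed similarity gap; the prompt structure and scoring rule are assumptions consistent with the description above.

```python
# Sketch of the phase-progress head's inference-time computation.
import torch

def phase_and_progress(feat, phase_emb, ongoing_emb, done_emb, temp=0.07):
    """feat: current fused features (D,); phase_emb: (P, D) CLIP text
    embeddings of the phase prompts; ongoing_emb/done_emb: (P, D)
    embeddings of per-phase "ongoing"/"done" prompts. All prompt
    embeddings are assumed L2-normalized."""
    feat = feat / feat.norm()
    sims = phase_emb @ feat                     # similarity to each phase
    belief = torch.softmax(sims / temp, dim=0)  # per-frame phase belief
    p = belief.argmax()
    # Progress within the most likely phase: similarity gap between the
    # "done" and "ongoing" prompts, squashed to (0, 1).
    gap = done_emb[p] @ feat - ongoing_emb[p] @ feat
    return belief, torch.sigmoid(gap / temp)
```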
5. Coordination-Aware Contrastive Alignment
FALCON’s coordination-aware contrastive loss enforces that the shared latent embedding $z$ is predictive of compatible joint arm/base actions. The approach constructs hard negatives by selectively shuffling arm or base actions within a mini-batch, and applies an InfoNCE-style loss to ensure alignment between embeddings and corresponding action summaries:

$$\mathcal{L}_{\text{coord}} = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, a_i)/\tau\big)}{\sum_{j} \exp\!\big(\mathrm{sim}(z_i, a_j)/\tau\big)},$$

where the log-softmax ties each embedding $z_i$ to its true joint action summary $a_i$, with shuffled pairings $a_{j \neq i}$ serving as negatives and $\tau$ a temperature. This explicitly shapes the latent space such that $z$ encodes cross-subsystem compatibility, enhancing coordination and robustness, particularly for tasks with tightly coupled navigation and manipulation requirements (He et al., 4 Dec 2025).
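A sketch of this loss, assuming action summaries are produced by some encoder and that arm/base compatibility is broken by shuffling arm summaries within the batch; the summary fusion (a simple sum) and the temperature are illustrative choices.

```python
# Sketch of the coordination-aware contrastive loss with shuffled negatives.
import torch
import torch.nn.functional as F

def coord_contrastive_loss(z, arm_summary, base_summary, temp=0.1):
    """z: (B, D) shared embeddings; arm_summary/base_summary: (B, D)
    encoded action chunks for each subsystem."""
    B = z.shape[0]
    z = F.normalize(z, dim=-1)
    joint = F.normalize(arm_summary + base_summary, dim=-1)
    # Hard negatives: re-pair each base summary with a shuffled arm
    # summary, yielding individually plausible but jointly incompatible
    # action combinations.
    perm = torch.randperm(B, device=z.device)
    neg = F.normalize(arm_summary[perm] + base_summary, dim=-1)
    pos_logit = (z * joint).sum(-1, keepdim=True) / temp  # (B, 1)
    neg_logits = (z @ neg.T) / temp                       # (B, B)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    # InfoNCE: the true pairing sits at index 0 of every row.
    target = torch.zeros(B, dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, target)
```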
6. Foundation Model Integration and Skill Arbitration
Within skill-centric frameworks, FALCON integrates foundation models for both skill arbitration and execution gating (Ingelhag et al., 25 Mar 2024). At runtime, user natural-language requests are processed by a large language model (LLM; e.g., GPT-4 or Gemini) to select the most appropriate skill from the skill library. Each skill is associated with a concise, language-based precondition description. Preconditions are verified by a vision-language model (VLM) using the current observation; only when they are satisfied is the corresponding diffusion policy executed.
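The runtime flow can be sketched as below, reusing the hypothetical `Skill` entries from Section 3; `llm()` and `vlm()` stand in for calls to models such as GPT-4 or Gemini, and their string-based interfaces are assumptions.

```python
# Sketch of the two-layered arbitration loop.
def handle_request(request, skill_library, image, llm, vlm):
    # Layer 1: the LLM selects a skill by name from language descriptions.
    menu = "\n".join(f"- {s.name}: {s.description}" for s in skill_library)
    choice = llm(f"User request: {request}\nAvailable skills:\n{menu}\n"
                 "Reply with the single best skill name.").strip()
    skill = next(s for s in skill_library if s.name == choice)
    # Layer 2: the VLM gates execution on the skill's precondition.
    answer = vlm(image, f"Is this precondition satisfied in the image? "
                        f"{skill.precondition} Answer yes or no.")
    if not answer.strip().lower().startswith("yes"):
        return None  # precondition unmet: do not run the policy
    return skill.policy  # execute the selected diffusion policy
```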
This two-layered arbitration ensures modularity and continual expandability; new skills can be added by demonstration with minimal system-wide retraining. Skill selection via LLMs achieves high match accuracy (GPT-4: 96.3%; Gemini: 93.0%), with VLM precondition verification accuracy up to 77.5%. Combined end-to-end pick rates are reported at 74.6% (Ingelhag et al., 25 Mar 2024).
7. Empirical Performance, Ablations, and Robustness
Empirical evaluations of FALCON span both simulation and physical hardware. In ablations comparing centralized, decentralized, and FALCON-style architectures, the decoupled architecture with semantic coordination achieves superior in-distribution and OOD generalization. For example, in a mobile manipulation task (navigation, drawer opening, object placement), FALCON attains 100% overall phase-wise success, whereas alternatives such as LatentToM, CDP, and ACT achieve significantly lower success rates, especially on manipulation stages and in OOD regions (He et al., 4 Dec 2025). Human-in-the-loop trials demonstrate the continued efficacy of each subsystem when the other is teleoperated.
Ablation studies reveal substantial performance gains from the phase-progress head—removal drops task success from 100% to ~60%—and a further boost from the contrastive loss in coupling arm and base actions. In skill-centric frameworks, FALCON demonstrates sample-efficient learning (≈100 demonstrations per skill), robust execution of contact-rich tasks, and reliable integration of precondition verification and tool arbitration.
8. Significance and Future Directions
The FALCON paradigm establishes a scalable approach to loco-manipulation: modular, foundation-model-guided policies that are robust, interpretable, and sample-efficient. By combining the strengths of specialized low-level controllers with interpretable, semantically informed global coordination, FALCON addresses key challenges in multimodal embodiment, generalization, and human-robot interaction. Ongoing research is poised to refine cross-modal transfer, enhance phase segmentation, and further exploit foundation models for richer forms of grounded reasoning and lifelong learning (He et al., 4 Dec 2025, Ingelhag et al., 25 Mar 2024).