Proprio-MLLM Planner in Dual-Arm Robots
- The paper introduces a novel MLLM-based embodied planning system that integrates proprioceptive signals with visual and linguistic data to significantly improve dual-arm task execution.
- It details a multi-modal architecture combining VQ-VAE, a Cross-Spatial Encoder, and Motion-Based Position Embedding to enhance spatial reasoning and action planning.
- Evaluation on the DualTHOR platform shows substantial performance gains over baselines, reducing both logical errors in arm selection and physical failures in task execution.
Proprio-MLLM Planner is a multimodal LLM (MLLM)-based embodied planning system that explicitly integrates proprioceptive signals for high-fidelity dual-arm humanoid robot task execution. Developed in conjunction with the DualTHOR simulation platform, it advances the state of the art in proprioception-aware embodied planning by fusing visual, linguistic, and proprioceptive modalities through novel architectural components, enabling superior reasoning over dual-arm selection, body posture, and contingency management in complex long-horizon household tasks (Li et al., 9 Oct 2025).
1. Model Architecture
Proprio-MLLM integrates four major architectural modules: a Motion Tokenizer based on a vector-quantized variational autoencoder (VQ-VAE), a Cross-Spatial Encoder (CSE), Motion-Based Position Embedding (MPE), and an LLM core with multi-modal fusion.
- Motion Tokenizer (VQ-VAE): Proprioceptive motion sequences $m_{1:T}$, containing joint states and velocities, are encoded via a neural encoder to latents $z_e$, each quantized to the nearest codebook entry $e_k$:
$$z_q = e_{k^*}, \qquad k^* = \arg\min_k \|z_e - e_k\|_2,$$
with combined reconstruction and commitment losses.
- Cross-Spatial Encoder (CSE): Aggregates 2D vision features (from Qwen2.5-VL) and 3D geometric maps (from CUT3R), producing fused depth and affordance embeddings that provide spatial cues for dual-arm reachability.
- Motion-Based Position Embedding (MPE): Injects an explicit robot-centric encoding into each visual token, expressing each pixel's position relative to the body centroid at timestep $t$, fused with Qwen2.5-VL spatial tokenization. Unlike direct joint/pose embeddings, only the quantized motion code is injected downstream.
- LLM Core with Multi-Modal Fusion: All tokens (text, MPE-enhanced vision, and the proprioceptive motion code) are concatenated into a single sequence $X = [X_{\text{text}}; X_{\text{vis}}; X_{\text{motion}}]$. Processing uses standard cross-attention,
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$
with LoRA adapters for low-rank adaptation of the multi-modal attention weights.
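The VQ-VAE's nearest-codebook quantization step can be sketched as follows. This is a minimal NumPy illustration of the standard vector-quantization lookup; the codebook size and latent dimension are arbitrary, and the paper's actual encoder/decoder networks are not shown.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder latent to its nearest codebook entry (VQ-VAE lookup).

    z_e:      (T, d) encoder latents for a motion sequence of length T
    codebook: (K, d) learned codebook with K entries
    Returns the quantized latents z_q (T, d) and the code indices (T,).
    """
    # Pairwise squared L2 distances between latents and codebook entries
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    idx = dists.argmin(axis=1)   # nearest codebook entry per timestep
    z_q = codebook[idx]          # (T, d) quantized latents
    return z_q, idx
```

The discrete indices are what the LLM core consumes downstream; the continuous `z_q` feeds the decoder during reconstruction training.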
2. Proprioceptive Signal Processing
Proprioceptive input comprises per-joint positions $q_t$, velocities $\dot q_t$, and optional forces/torques $f_t$, all normalized per dimension:
- Positions: $\tilde q_t = (q_t - \mu_q)/\sigma_q$
- Velocities: $\tilde{\dot q}_t = (\dot q_t - \mu_{\dot q})/\sigma_{\dot q}$
- Forces: values clipped to a fixed range, then scaled
Signals are concatenated at each timestep into a single vector $m_t = [\tilde q_t; \tilde{\dot q}_t; \tilde f_t]$. While standard coordinate-frame alignment (via rotation matrices) is possible, Proprio-MLLM applies VQ-VAE quantization in the base frame, and MPE replaces global pixel coordinates with robot-relative sign encoding.
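A minimal sketch of this per-dimension normalization and concatenation, assuming z-score statistics gathered over the training data and a placeholder force-clipping bound (`f_max` is illustrative; the paper does not specify the constant):

```python
import numpy as np

def normalize_proprio(q, dq, f, stats, f_max=50.0):
    """Per-dimension normalization of proprioceptive signals (illustrative).

    q, dq, f : (J,) joint positions, velocities, and forces at one timestep
    stats    : dict of per-dimension means/stds from the training data
    f_max    : assumed clipping bound for forces (placeholder value)
    Returns the concatenated proprioceptive vector m_t.
    """
    q_n  = (q  - stats["q_mean"])  / stats["q_std"]    # z-score positions
    dq_n = (dq - stats["dq_mean"]) / stats["dq_std"]   # z-score velocities
    f_n  = np.clip(f, -f_max, f_max) / f_max           # clip, then scale forces
    return np.concatenate([q_n, dq_n, f_n])            # m_t, fed to the VQ-VAE
```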
3. Training Objectives and Optimization
Training proceeds in three sequential stages, each with dedicated losses:
- Stage 1 (Motion VQ-VAE): Minimizes a compound loss
$$\mathcal{L}_{\text{VQ}} = \|m - \hat m\|_2^2 + \|\mathrm{sg}[z_e] - e\|_2^2 + \beta\,\|z_e - \mathrm{sg}[e]\|_2^2,$$
where $\mathrm{sg}[\cdot]$ is the stop-gradient operator.
- Stage 2 (Multi-Modal Alignment): Optimizes the language-generation cross-entropy $\mathcal{L}_{\text{CE}}$, plus an optional motion-image/text contrastive term.
- Stage 3 (Instruction Tuning): The final planner uses
$$\mathcal{L}_{\text{stage3}} = \mathcal{L}_{\text{plan}} + \lambda_1\,\mathcal{L}_{\text{arm}} + \lambda_2\,\mathcal{L}_{\text{reg}},$$
where $\mathcal{L}_{\text{plan}}$ is the planning-action cross-entropy, $\mathcal{L}_{\text{arm}}$ penalizes implausible arm assignments (e.g., left/right arm indicator misclassifications), and $\mathcal{L}_{\text{reg}}$ regularizes LoRA weights via an L2 penalty.
At each stage, the active loss terms sum into the full objective.
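The Stage 1 compound loss can be evaluated numerically as below. This is a forward-pass sketch only: the stop-gradient operator changes gradient flow, not the loss value, so it is omitted here, and the commitment weight `beta = 0.25` is a common default assumed for illustration.

```python
import numpy as np

def vqvae_loss(m, m_hat, z_e, e, beta=0.25):
    """Compound VQ-VAE training loss (Stage 1), forward value only.

    m, m_hat : original and reconstructed motion sequences
    z_e, e   : encoder latents and their selected codebook entries
    beta     : commitment-loss weight (0.25 assumed; not from the paper)

    Note: sg[.] only affects gradients, so it is dropped in this
    numerical sketch; the value matches the full objective.
    """
    recon    = ((m - m_hat) ** 2).mean()        # reconstruction term
    codebook = ((z_e - e) ** 2).mean()          # codebook term: sg[z_e] vs e
    commit   = beta * ((z_e - e) ** 2).mean()   # commitment term: z_e vs sg[e]
    return recon + codebook + commit
```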
4. Embodied Planning and Execution Pipeline
During inference on DualTHOR, the planning pipeline follows four stages per timestep:
- Input Encoding: Process human instruction into text tokens, encode current RGB image and compute image features and MPEs, VQ-encode the current proprioceptive vector.
- Token Fusion: Concatenate tokens and pass through the LLM, generating autoregressive outputs for high-level actions (e.g., MoveAhead, GripLeft).
- Action Execution: Actions are executed in DualTHOR. The environment returns next observation and status information.
- Contingency Handling: If a failure is detected (e.g., collision, unreachable object), the reason is appended to the language prompt and the planner iteratively replans.
This loop continues with proprio-visual-linguistic fusion at every step; stochastic, environment-triggered failures (contingencies) directly test replanning robustness.
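The four pipeline stages form a plan-act-replan loop that can be sketched as follows. The `planner` and `env` interfaces here are illustrative assumptions, not the DualTHOR API; the key point is the contingency step, which appends the failure reason to the language prompt before replanning.

```python
def run_episode(planner, env, instruction, max_steps=50):
    """Plan-act-replan loop mirroring the four pipeline stages
    (illustrative; planner/env interfaces are assumptions)."""
    prompt = instruction
    obs = env.reset()
    for _ in range(max_steps):
        # Stages 1-2: encode inputs, fuse tokens, generate a high-level action
        action = planner.plan(prompt, obs["rgb"], obs["proprio"])
        if action == "Done":
            return True
        # Stage 3: execute the action in the environment
        obs, status = env.step(action)
        # Stage 4: contingency handling -- append the failure reason, replan
        if not status["success"]:
            prompt += f" Previous action {action} failed: {status['reason']}."
    return False
```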
5. Evaluation in DualTHOR and Performance Analysis
DualTHOR offers 10 household environments with 68 interactable objects, categorized into "Dual-Arm Essential" (tasks intractable for a single arm), "Dual-Arm Optional" (single-arm feasible but slower), and "Single-Arm" baseline tasks.
Success Rate Comparison
| Method | Dual-Arm Essential X1 | Dual-Arm Essential H1 | Dual-Arm Optional X1 | Dual-Arm Optional H1 | Single-Arm X1 | Single-Arm H1 |
|---|---|---|---|---|---|---|
| GPT-4o | 23.3% | 27.1% | 39.8% | 41.0% | 51.7% | 56.7% |
| DAG-Plan | 36.1% | 41.5% | 51.2% | 52.4% | 55.0% | 58.3% |
| Proprio-MLLM | 59.4% | 63.2% | 71.7% | 70.5% | 73.3% | 75.0% |
Proprio-MLLM demonstrates an average +19.75 percentage point improvement over DAG-Plan, with the most pronounced gain (+23.30 pp) in the "Dual-Arm Essential" category on the X1 platform.
Ablations and Statistical Results
- Removing the CSE module decreases overall success by approximately 18 pp.
- Eliminating MPE results in an ≈11 pp drop.
- Removing both components yields performance equivalent to the DAG-Plan baseline.
Failure-mode analysis (navigation, body adjustment, logical errors) shows statistically significant reductions in all categories (paired $t$-test). The integrated proprioceptive signals directly mitigate logical (arm selection) and physical (reachability/posture) failures.
6. Implications, Limitations, and Future Work
Incorporating proprioception enhances spatial awareness, dual-arm logic, and robustness in contingency-rich environments. The MPE grounds image tokens in robot-centric coordinates, crucial for body-frame reasoning. The CSE supplies 3D range and affordance awareness, reducing unreachable or implausible action plans. The reflection/replanning mechanism (prompt augmentation on failure) leverages proprioceptive feedback for more effective high-level correction.
Future extensions proposed by the authors include support for multi-room and multi-agent coordination, explicit failure-mode control for data enrichment, integration with low-level policy learning (e.g., reinforcement learning), and wider support for whole-body and alternate humanoid morphologies (Li et al., 9 Oct 2025). A plausible implication is that increased proprioceptive resolution and more diverse embodiment data could further improve dual-arm logic generalization, especially for non-standard or damaged robots.
7. Context and Significance
Proprio-MLLM situates itself at the intersection of large-scale language modeling, embodied cognitive planning, and high-fidelity robot simulation. By introducing both an open-source dual-arm robotics platform (DualTHOR) and a multimodal learning algorithm explicitly structured for dual-arm, long-horizon tasks, it addresses longstanding gaps in MLLM embodiment awareness and challenging task evaluation, establishing a new practical and empirical benchmark for proprioceptive, language-grounded action planning in humanoid robotics (Li et al., 9 Oct 2025).