
Proprio-MLLM Planner in Dual-Arm Robots

Updated 17 March 2026
  • The paper introduces a novel MLLM-based embodied planning system that integrates proprioceptive signals with visual and linguistic data to significantly improve dual-arm task execution.
  • It details a multi-modal architecture combining VQ-VAE, a Cross-Spatial Encoder, and Motion-Based Position Embedding to enhance spatial reasoning and action planning.
  • Evaluation on the DualTHOR platform shows substantial performance gains over baselines, reducing both logical errors in arm selection and physical failures in task execution.

Proprio-MLLM Planner is a multimodal LLM (MLLM)-based embodied planning system that explicitly integrates proprioceptive signals for high-fidelity dual-arm humanoid robot task execution. Developed in conjunction with the DualTHOR simulation platform, it advances the state of the art in embodied planning by fusing visual, linguistic, and proprioceptive modalities through novel architectural components, enabling superior reasoning over dual-arm selection, body posture, and contingency management in complex long-horizon household tasks (Li et al., 9 Oct 2025).

1. Model Architecture

The Proprio-MLLM planner integrates four major architectural modules: a Motion Tokenizer based on a vector-quantized variational autoencoder (VQ-VAE), a Cross-Spatial Encoder (CSE), Motion-Based Position Embedding (MPE), and an LLM core with multi-modal fusion.

  • Motion Tokenizer (VQ-VAE): Proprioceptive motion sequences $m = \{m_1, \ldots, m_T\} \in \mathbb{R}^{T \times d}$, containing joint states and velocities, are encoded via a neural encoder $E$ into a latent $z(m)$, which is quantized to the nearest codebook entry $c_p$:

$$p = \arg\min_k \|z(m) - c_k\|_2, \qquad e = c_p$$

with combined reconstruction and commitment losses.

  • Cross-Spatial Encoder (CSE): Aggregates 2D vision (from Qwen2.5-VL) and 3D geometric maps (from CUT3R). Fused depth and affordance embeddings are produced as:

$$T_{\text{vision}} = \mathrm{MLP}(F_{\text{2D}} + \Phi(F_{\text{3D}}))$$

providing spatial cues for dual-arm reachability.

  • Motion-Based Position Embedding (MPE): Injects an explicit robot-centric encoding into each visual token, representing pixel $(x, y)$ relative to the body centroid $(x_r, y_r)$ and timestep $t$:

$$\mathrm{MPE}(t, x, y) = [\,t;\ \mathrm{sign}(y - y_r);\ \mathrm{sign}(x - x_r)\,]$$

fused with the Qwen2.5-VL spatial tokenization. Unlike direct joint/pose embeddings, only the quantized motion code is injected downstream.

  • LLM Core with Multi-Modal Fusion: All tokens (text, MPE-enhanced vision, and proprioceptive motion codes) are concatenated into $H \in \mathbb{R}^{L \times D}$. Processing uses standard cross-attention:

$$Q = W_Q H, \quad K = W_K H, \quad V = W_V H, \qquad \mathrm{Attn}(Q, K, V) = \mathrm{softmax}(QK^\top/\sqrt{D})\,V$$

with LoRA adapters for low-rank adaptation of the multi-modal attention (a minimal sketch of this token pipeline follows the list).
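
Below is a minimal PyTorch sketch of this token pipeline: nearest-codebook quantization of a proprioceptive window, the sign-based MPE, and concatenation of the three token streams before attention. Module sizes, the codebook dimensions, and the projection layers are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionTokenizer(nn.Module):
    """VQ-VAE-style tokenizer: encode a motion window, snap it to the nearest codebook entry."""
    def __init__(self, d_motion=32, d_latent=256, codebook_size=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_motion, d_latent), nn.ReLU(),
            nn.Linear(d_latent, d_latent),
        )
        self.codebook = nn.Embedding(codebook_size, d_latent)

    def forward(self, m):                                  # m: (T, d_motion)
        z = self.encoder(m)                                # z(m): (T, d_latent)
        dists = torch.cdist(z, self.codebook.weight)       # p = argmin_k ||z(m) - c_k||_2
        p = dists.argmin(dim=-1)
        e = self.codebook(p)                               # e = c_p
        e_st = z + (e - z).detach()                        # straight-through estimator
        return e_st, p

def motion_position_embedding(t, xy, body_centroid):
    """MPE(t, x, y) = [t; sign(y - y_r); sign(x - x_r)] for each visual token."""
    x, y = xy[..., 0], xy[..., 1]
    x_r, y_r = body_centroid
    return torch.stack(
        [torch.full_like(x, float(t)), torch.sign(y - y_r), torch.sign(x - x_r)],
        dim=-1,
    )

# Fuse modalities: project each stream to the LLM width D and concatenate along the sequence.
D = 1024
text_tokens   = torch.randn(12, D)                         # placeholder text embeddings
vision_feats  = torch.randn(196, D)                        # placeholder Qwen2.5-VL visual features
pixel_coords  = torch.rand(196, 2)                         # normalized (x, y) of each visual token
mpe = motion_position_embedding(t=3, xy=pixel_coords, body_centroid=(0.5, 0.5))
vision_tokens = vision_feats + nn.Linear(3, D)(mpe)        # inject MPE into the visual tokens

tokenizer = MotionTokenizer()
motion_codes, _ = tokenizer(torch.randn(8, 32))            # 8-step proprioceptive window
motion_tokens = nn.Linear(256, D)(motion_codes)

H = torch.cat([text_tokens, vision_tokens, motion_tokens], dim=0)   # H in R^{L x D}
H_b = H.unsqueeze(0)                                       # add a batch dimension
attn_out = F.scaled_dot_product_attention(H_b, H_b, H_b)   # softmax(QK^T / sqrt(D)) V, identity Q/K/V projections for brevity
```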

2. Proprioceptive Signal Processing

Proprioceptive input comprises per-joint positions $\theta_t$, velocities $\dot{\theta}_t$, and optional forces/torques, all normalized per dimension:

  • Positions: $\tilde{\theta} = (\theta - \theta_{\min})/(\theta_{\max} - \theta_{\min})$
  • Velocities: $\tilde{\dot{\theta}} = \dot{\theta}/\dot{\theta}_{\max}$
  • Forces: values clipped to $[-f_{\max}, f_{\max}]$, then scaled

Signals are concatenated at each timestep into $m_t \in \mathbb{R}^d$. While standard coordinate-frame alignment (via rotation matrices) is possible, Proprio-MLLM applies VQ-VAE quantization in the base frame, and MPE replaces global pixel coordinates with robot-relative sign encoding.
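
As a concrete example, a minimal NumPy sketch of this per-dimension normalization and concatenation into $m_t$ is shown below; the joint limits and force clip value are placeholder arguments, not values reported by the authors.

```python
import numpy as np

def normalize_proprio(theta, theta_dot, force,
                      theta_min, theta_max, theta_dot_max, f_max):
    """Build the per-timestep proprioceptive vector m_t from raw joint readings."""
    theta_n     = (theta - theta_min) / (theta_max - theta_min)   # positions -> [0, 1]
    theta_dot_n = theta_dot / theta_dot_max                        # velocities scaled by max speed
    force_n     = np.clip(force, -f_max, f_max) / f_max            # forces clipped, then scaled
    return np.concatenate([theta_n, theta_dot_n, force_n])         # m_t in R^d
```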

3. Training Objectives and Optimization

Training proceeds in three sequential stages, each with dedicated losses:

  • Stage 1 (Motion VQ-VAE): Minimizes a compound loss

$$L_{\text{VQ}} = \|D(z(m)) - m\|^2 + \alpha\,\|\mathrm{sg}[z(m)] - e\|^2 + \beta\,\|z(m) - \mathrm{sg}[e]\|^2$$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operator.

  • Stage 2 (Multi-Modal Alignment): Optimizes the language-generation cross-entropy $L_{\text{lang}}$, plus an optional motion-image/text contrastive term $L_{\text{contra}}$ ($L_{\text{align}} = L_{\text{lang}} + \gamma L_{\text{contra}}$).
  • Stage 3 (Instruction Tuning): The final planner uses

$$L_{\text{instr}} = \lambda_1 L_{\text{plan}} + \lambda_2 L_{\text{coord}} + \lambda_3 L_{\text{emb}}$$

where $L_{\text{plan}}$ is the planning-action cross-entropy, $L_{\text{coord}}$ penalizes implausible arm assignments (e.g., left/right arm indicator misclassifications), and $L_{\text{emb}}$ regularizes the LoRA weights via an L2 penalty.

At each stage, the active loss terms sum into the full objective; a sketch of the Stage 1 and Stage 3 objectives follows.
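
The following PyTorch sketch writes out the Stage 1 and Stage 3 losses for concreteness; the weighting coefficients and the planner/arm-selection heads are illustrative placeholders rather than the reported hyperparameters.

```python
import torch.nn.functional as F

def vq_loss(decoded, m, z, e, alpha=0.25, beta=0.25):
    """Stage 1: L_VQ = ||D(z(m)) - m||^2 + alpha*||sg[z(m)] - e||^2 + beta*||z(m) - sg[e]||^2."""
    recon    = F.mse_loss(decoded, m)
    codebook = F.mse_loss(e, z.detach())     # sg[z(m)]: pulls codebook entries toward encoder outputs
    commit   = F.mse_loss(z, e.detach())     # sg[e]: commits the encoder to its chosen code
    return recon + alpha * codebook + beta * commit

def instruction_loss(plan_logits, plan_targets, arm_logits, arm_targets, lora_params,
                     lam1=1.0, lam2=0.5, lam3=1e-4):
    """Stage 3: L_instr = lambda_1 L_plan + lambda_2 L_coord + lambda_3 L_emb."""
    l_plan  = F.cross_entropy(plan_logits, plan_targets)       # high-level action prediction
    l_coord = F.cross_entropy(arm_logits, arm_targets)         # left/right arm assignment
    l_emb   = sum(p.pow(2).sum() for p in lora_params)         # L2 penalty on LoRA weights
    return lam1 * l_plan + lam2 * l_coord + lam3 * l_emb
```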

4. Embodied Planning and Execution Pipeline

During inference on DualTHOR, the planning pipeline follows four stages per timestep:

  1. Input Encoding: Process the human instruction into text tokens, encode the current RGB image into visual features with MPEs, and VQ-encode the current proprioceptive vector.
  2. Token Fusion: Concatenate tokens and pass through the LLM, generating autoregressive outputs for high-level actions (e.g., MoveAhead, GripLeft).
  3. Action Execution: Actions are executed in DualTHOR. The environment returns next observation and status information.
  4. Contingency Handling: If a failure is detected (e.g., collision, unreachable object), the reason is appended to the language prompt and the planner iteratively replans.

This loop continues with proprio-visual-linguistic fusion at every step; stochastic, environment-triggered failures (contingencies) directly test replanning robustness.
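A simplified driver for this loop might look as follows; the env and planner interfaces (reset, step, plan, and the status fields) are hypothetical wrappers around DualTHOR and the planner, not their actual APIs.

```python
def run_episode(env, planner, instruction, max_steps=50):
    """Closed-loop planning: encode, fuse, act, and replan on failure."""
    obs = env.reset()                                    # RGB image + proprioceptive state
    prompt = instruction
    for _ in range(max_steps):
        # Stages 1-2: input encoding and token fusion inside the planner,
        # autoregressive decoding of a high-level action (e.g., MoveAhead, GripLeft)
        action = planner.plan(prompt, obs["rgb"], obs["proprio"])
        # Stage 3: execute in the simulator and read back the status
        obs, status = env.step(action)
        if status["task_done"]:
            return True
        # Stage 4: contingency handling by prompt augmentation and replanning
        if status["failed"]:
            prompt = f"{prompt}\nPrevious action {action} failed: {status['failure_reason']}"
    return False
```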

5. Evaluation in DualTHOR and Performance Analysis

DualTHOR offers 10 household environments with 68 interactable objects, with tasks categorized as "Dual-Arm Essential" (intractable for a single arm), "Dual-Arm Optional" (feasible with a single arm but slower), and "Single-Arm" baseline tasks.

Success Rate Comparison

| Method | Dual-Arm Essential (X1) | Dual-Arm Essential (H1) | Dual-Arm Optional (X1) | Dual-Arm Optional (H1) | Single-Arm (X1) | Single-Arm (H1) |
|---|---|---|---|---|---|---|
| GPT-4o | 23.3% | 27.1% | 39.8% | 41.0% | 51.7% | 56.7% |
| DAG-Plan | 36.1% | 41.5% | 51.2% | 52.4% | 55.0% | 58.3% |
| Proprio-MLLM | 59.4% | 63.2% | 71.7% | 70.5% | 73.3% | 75.0% |

Proprio-MLLM demonstrates an average +19.75 percentage point improvement over DAG-Plan, with the most pronounced gain (+23.30 pp) in the "Dual-Arm Essential" category on the X1 platform.

Ablations and Statistical Results

  • Removing the CSE module decreases overall success by approximately 18 pp.
  • Eliminating MPE results in an ≈11 pp drop.
  • Removing both components yields performance equivalent to the DAG-Plan baseline.

Failure mode analysis (navigation, body adjustment, logical errors) shows statistically significant reductions in all categories (paired $t$-test, $p < 0.01$). The integrated proprioceptive signals directly mitigate logical (arm selection) and physical (reachability/posture) failures.

6. Implications, Limitations, and Future Work

Incorporating proprioception enhances spatial awareness, dual-arm logic, and robustness in contingency-rich environments. The MPE grounds image tokens in robot-centric coordinates, crucial for body-frame reasoning. The CSE supplies 3D range and affordance awareness, reducing unreachable or implausible action plans. The reflection/replanning mechanism (prompt augmentation on failure) leverages proprioceptive feedback for more effective high-level correction.

Future extensions proposed by the authors include support for multi-room and multi-agent coordination, explicit failure-mode control for data enrichment, integration with low-level policy learning (e.g., reinforcement learning), and wider support for whole-body and alternate humanoid morphologies (Li et al., 9 Oct 2025). A plausible implication is that increased proprioceptive resolution and more diverse embodiment data could further improve dual-arm logic generalization, especially for non-standard or damaged robots.

7. Context and Significance

Proprio-MLLM situates itself at the intersection of large-scale language modeling, embodied cognitive planning, and high-fidelity robot simulation. By introducing both an open-source dual-arm robotics platform (DualTHOR) and a multimodal learning algorithm explicitly structured for dual-arm, long-horizon tasks, it addresses longstanding gaps in MLLM embodiment awareness and challenging task evaluation, establishing a new practical and empirical benchmark for proprioceptive, language-grounded action planning in humanoid robotics (Li et al., 9 Oct 2025).
