MoDE VLA Architectures
- The paper introduces MoDE VLA architectures that combine sparse, context-aware expert modules with pretrained VLA backbones to enhance dexterous robotic manipulation.
- Key methods include expert routing, residual refinement, and late-stage sensory fusion, which together optimize performance and computational efficiency in contact-rich tasks.
- Empirical evaluations demonstrate significant success rate improvements across complex maneuvers such as in-hand reorientation, grasping, and multi-step assembly.
Mixture-of-Dexterous-Experts (MoDE) Vision-Language-Action Architectures
The Mixture-of-Dexterous-Experts (MoDE) paradigm encompasses a family of architectures that combine pretrained vision-language-action (VLA) backbones with sparse, context-aware expert modules to advance dexterous robotic manipulation. MoDE models apply mixture-of-experts methodologies to integrate heterogeneous sensory modalities (e.g., force, tactile, proprioception) and to orchestrate expert sub-policies or residual refinements, optimizing both performance and computational efficiency in multi-modal, contact-rich manipulation tasks. Contemporary MoDE VLA systems leverage objectives such as flow matching, expert specialization, and real-time adaptation to deliver robust, interpretable, and scalable manipulation policies across a diverse array of robotic tasks, including in-hand object reorientation, grasping, and multi-step assembly.
1. Core Architectural Concepts
MoDE-VLA models situate a sparse expert module atop a pretrained VLA backbone, such as OpenPI-0 or PaliGemma-based transformers. The sensory input stream comprises visual (multiple RGB images), linguistic (tokenized instructions), proprioceptive, force, and tactile signals. These are embedded into a unified token sequence, partitioned into prefix (vision/language/proprio) and suffix (action horizon) tokens.
MoDE blocks are instantiated as multi-expert modules. Each expert is realized as a small feedforward network (often an MLP), with a dedicated routing/gating network. Sparse routing is the norm (top-1 or top-k selection), facilitating efficient per-token or per-modality dispatch. For instance, in MoDE-VLA, force and tactile tokens are routed independently through an 8-expert MoE, with the most relevant expert selected for each sub-modality at each horizon step (Tang et al., 9 Mar 2026).
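Top-1 sparse routing as described above can be sketched in a few lines. This is a minimal illustration of per-token dispatch, not the papers' code; the function names and shapes are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_top1(token, router_weights, experts):
    """Dispatch a token to the single highest-scoring expert.

    `router_weights` holds one weight vector per expert; the router score
    for expert i is the dot product of the token with router_weights[i].
    The chosen expert's output is scaled by its gate value, as in
    standard sparse MoE. Hypothetical names -- a sketch of top-1 dispatch.
    """
    logits = [sum(t * w for t, w in zip(token, wv)) for wv in router_weights]
    gates = softmax(logits)
    i = max(range(len(gates)), key=gates.__getitem__)
    return [gates[i] * y for y in experts[i](token)], i
```

With top-k > 1, the same gating weights would instead mix the k highest-scoring experts' outputs.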
Residual injection is a defining trait: expert outputs (force- and tactile-refined features) are added as corrections to the backbone's predicted velocity field. Architecturally, this enables the MoDE block to modulate arm and hand action channels separately via learned projections, yielding contact-aware adjustments that reinforce but do not overwrite pretrained priors.
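The residual-injection pattern, with separate learned projections for arm and hand channels, can be sketched as follows. Shapes and names here are illustrative assumptions, with plain matrices standing in for learned projections.

```python
def apply_mode_residual(v_backbone, feat_force, feat_tactile,
                        proj_arm, proj_hand):
    """Add expert residuals to the backbone's predicted velocity field.

    The backbone proposes a velocity over (arm_dim + hand_dim) action
    channels; force/tactile expert features are mapped through separate
    projections and added as corrections, so pretrained priors are
    adjusted rather than overwritten. Illustrative sketch only.
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    delta_arm = matvec(proj_arm, feat_force)      # arm channels: force-driven
    delta_hand = matvec(proj_hand, feat_tactile)  # hand channels: tactile-driven
    residual = delta_arm + delta_hand             # concatenate over channels
    return [v + r for v, r in zip(v_backbone, residual)]
```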
2. Expert Routing and Decoupling Strategies
Modern MoDE implementations decouple expert selection from expert contribution, targeting both specialization and load balancing. In the AdaMoE mechanism (Shen et al., 16 Oct 2025), two distinct linear adapters operate per token: a router computes softmax-normalized selection weights w_i, while an independent scale adapter determines per-expert scalar adjustments s_i, so the block output can be written as y = Σ_{i ∈ top-k(w)} s_i · E_i(x).
This full decoupling allows the router to enforce uniform expert usage (for training stability and expert utilization) while the scale adapter independently optimizes the task loss. Standard MoE balancing losses ensure equitable routing.
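In the spirit of this decoupling, a minimal sketch separates WHICH experts fire (the router's top-k) from HOW MUCH each contributes (the scale adapter). All names and shapes are illustrative assumptions, not the AdaMoE implementation.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

def adamoe_mix(token, router_w, scale_w, experts, k=2):
    """Decoupled selection and contribution.

    `router_w` parameterizes softmax selection weights used only to pick
    the top-k experts; `scale_w` parameterizes per-expert scalars that
    weight the chosen experts' outputs, so routing can stay load-balanced
    while contribution magnitudes track the task loss.
    """
    dot = lambda w, x: sum(a * b for a, b in zip(w, x))
    sel = softmax([dot(w, token) for w in router_w])   # selection weights
    scales = [dot(w, token) for w in scale_w]          # per-expert scalars
    topk = sorted(range(len(sel)), key=sel.__getitem__, reverse=True)[:k]
    out = [0.0] * len(token)
    for i in topk:
        y = experts[i](token)
        out = [o + scales[i] * yi for o, yi in zip(out, y)]
    return out, topk
```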
DGMoE (Miao et al., 4 Aug 2025) extends this paradigm with a dual-gating mechanism: a token-side router and an expert-side gating threshold jointly decide expert activation. The expert-side "self-aware" gate enables veto if certain tokens are mismatched with desired dexterity levels, supporting hierarchical, skill-level decompositions.
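The dual-gating idea can be illustrated with a toy selector in which the token-side router ranks experts but each expert's self-aware gate may veto a mismatched token. The threshold mechanism below is an assumed simplification of DGMoE's expert-side gate.

```python
def dual_gate_select(router_scores, expert_thresholds):
    """Token-side router ranks experts by score; each expert accepts the
    token only if its score clears that expert's own gating threshold.
    Returns the first accepted expert index, or None if all veto.
    Illustrative assumption, not the paper's exact mechanism."""
    ranked = sorted(range(len(router_scores)),
                    key=router_scores.__getitem__, reverse=True)
    for i in ranked:
        if router_scores[i] >= expert_thresholds[i]:  # expert-side accept
            return i
    return None  # every expert vetoed this token
```

A veto by the top-ranked expert falls through to the next candidate, which is what lets expert-side gates enforce skill-level decompositions.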
3. Sensory Fusion and Input Modalities
MoDE systems tightly integrate sensory signals via embedding and self-attention fusion. For force-aware architectures (e.g., ForceVLA (Yu et al., 28 May 2025)), six-axis force-torque readings are projected via a linear mapping into the model's latent token space and jointly processed with vision/language tokens in an MoE block. Tactile inputs (e.g., from multi-finger sensors) are similarly embedded, encoded, and fused.
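The linear embedding of a six-axis force-torque reading into the latent token space amounts to a single learned affine map. The matrix and bias below are hypothetical stand-ins for learned parameters.

```python
def embed_force_torque(ft_reading, proj, bias):
    """Project a six-axis force-torque reading (Fx, Fy, Fz, Tx, Ty, Tz)
    into the model's latent token dimension with an affine map, as in
    force-aware designs like ForceVLA. `proj` is (latent_dim x 6) and
    `bias` is (latent_dim,); both stand in for learned parameters."""
    assert len(ft_reading) == 6
    return [sum(w * f for w, f in zip(row, ft_reading)) + b
            for row, b in zip(proj, bias)]
```

The resulting token is then appended to the vision/language token sequence and processed jointly by self-attention in the MoE block.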
Fusion often occurs late in the pipeline (after the backbone VLM), as empirical evidence demonstrates that late-stage (post-transformer) MoE fusion preserves pretrained priors and enhances contact-aware adaptation. Early fusion strategies (before or within VLM) are found to degrade performance or fail to leverage MoE specialization (Yu et al., 28 May 2025).
MoDE input representations are differentiated along task axes (vision, language, proprioception, force, tactile), with cross-attention or self-attention layers enabling information flow. Task-specific or context-adaptive routing (scene-based, expertise-based) governs expert activation.
4. Expert Specialization, Residual Refinement, and Policy Integration
Expert specialization emerges along functional lines:
- Free-space motion (low contact force)
- Initial contact detection (force spikes)
- Stable grasp/control (steady force regime)
- Dynamic in-hand manipulation (tactile-slip, high-compliance adjustments)
Router and gating statistics reveal that certain experts consistently encode generalist strategies while others specialize for alignment, insertion, or sustained contact phases (Yu et al., 28 May 2025). For fine-grained control, MoDE systems support hierarchical or stacked expert arrangements: e.g., a first-level router might select between coarse manipulation types (grasp, push, in-hand rotate), with a nested router handling sub-primitives.
Residual refinement realized via MoDE modules allows the backbone to propose global action trajectories (as denoising velocity fields), with expert-corrected residuals injected for force/tactile-guided fine control. This is formalized as:

v̂ = v_base + [δ_arm ; δ_hand],

where [· ; ·] denotes concatenation across arm/hand action channels, and δ_arm, δ_hand are the MoDE expert outputs computed from the force and tactile streams (Tang et al., 9 Mar 2026).
In some systems, e.g., ResDex (Huang et al., 2024), the MoE module linearly combines pre-trained geometry-unaware base policies (experts) with an RL-learned residual, achieving both strong generalization and refinement on unseen geometries.
5. Training Methodologies and Objectives
Supervised learning (imitation, flow-matching) and reinforcement learning (e.g., PPO) are both prevalent in MoDE regimes:
- MoDE-VLA employs a flow-matching loss, optimizing denoising velocity fields for continuous-time action diffusion.
- RL-based primitives ("IMCopilot") are trained with asymmetric actor-critic policies and structured rewards to perform atomic skills for teleoperation assistance or low-level manipulation.
- AdaMoE (Action-Specialized MoE) employs a compound objective combining task loss and load-balancing penalty.
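The load-balancing penalty referenced above typically takes the Switch-Transformer-style form E · Σ_i f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean gate probability for expert i. A minimal sketch with assumed inputs:

```python
def load_balancing_loss(gate_probs, expert_counts, num_tokens):
    """MoE auxiliary loss penalizing concentration on few experts.

    `gate_probs` is a list of per-token gate-probability vectors;
    `expert_counts` counts tokens dispatched to each expert. The loss is
    minimized (value 1.0) when both dispatch fractions and mean gate
    probabilities are uniform. Sketch of the balancing term, with
    assumed input layout."""
    E = len(expert_counts)
    f = [c / num_tokens for c in expert_counts]  # fraction dispatched
    P = [sum(tok[i] for tok in gate_probs) / num_tokens for i in range(E)]
    return E * sum(fi * Pi for fi, Pi in zip(f, P))
```

The compound objective then sums this penalty (weighted by a small coefficient) with the task loss.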
Key training tactics include expert weight initialization from frozen pretrained weights, gradient isolation for router/scale heads, and domain randomization for sim-to-real robustness. Modality dropout (randomly masking force/vision during training) further enhances cross-modal generalization (Yu et al., 28 May 2025).
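Modality dropout as described above can be sketched as masking an entire modality's token stream at random during training. The dict-based data layout below is an assumption for illustration.

```python
import random

def modality_dropout(tokens_by_modality, p_drop=0.2, rng=None):
    """Randomly zero out an entire modality's tokens (e.g. force or
    vision) during training so the policy learns cross-modal redundancy.

    `tokens_by_modality` maps a modality name to its list of token
    vectors; each modality is masked independently with probability
    `p_drop`. Assumed layout, not the papers' code."""
    rng = rng or random.Random()
    out = {}
    for name, toks in tokens_by_modality.items():
        if rng.random() < p_drop:
            out[name] = [[0.0] * len(t) for t in toks]  # masked modality
        else:
            out[name] = toks
    return out
```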
6. Empirical Evaluation and Ablative Analyses
MoDE-VLA architectures consistently surpass dense and monolithic baselines on dexterous manipulation benchmarks:
- On real-world dual-arm tasks, AdaMoE yields SR improvements of +21.5% (71.5% vs. 50.0%) (Shen et al., 16 Oct 2025).
- MoDE-VLA more than doubles average SR (34% vs. 15%) across complex bimanual tasks such as apple peeling, tube rearranging, and gear assembly (Tang et al., 9 Mar 2026).
- ForceVLA achieves up to +23.2% average SR improvement and 80% success on plug insertion, bottle pumping, and continuous contact tasks (Yu et al., 28 May 2025).
- In-hand reorientation with DexReMoE raises the worst-case consecutive-success count from 0.69 (monolithic baseline) to 6.05, with an average of ≈19.5 across 150 diverse objects spanning in-distribution and out-of-distribution categories (Wan et al., 3 Aug 2025).
- ResDex achieves 88.8% vision-based grasp success on 3,200 objects, with no generalization gap to unseen objects (Huang et al., 2024).
Ablations reveal that removal of force or tactile modalities in MoDE-VLA degrades average SR by 11% and 8%, respectively. The most severe failures are observed on contact-rich insertion and slip-prone tasks. Expert-count and routing hyperparameter sweeps consistently point to optimal values (e.g., 4 or 8 experts, top-k=1), balancing specialization and statistical efficiency.
Configuration and activation statistics show that MoDEs maintain robust expert utilization, avoid collapse, and expose interpretable gating correlates for different task phases.
7. Applications, Interpretability, and Design Guidelines
MoDE architectures enable real-time, robust, and adaptive control for:
- Bimanual and high-DOF dexterous manipulation, including in-hand rotation, tool use, and assembly (Tang et al., 9 Mar 2026, Wan et al., 3 Aug 2025).
- Contact-rich tasks involving visual occlusion, dynamic uncertainty, or non-visual modalities (Yu et al., 28 May 2025).
- Universal grasping across broad object distributions through geometry-agnostic expert priors (Huang et al., 2024).
- Hierarchical or federated control where modularity and interpretability are essential (e.g., FedVLA, DGMoE (Miao et al., 4 Aug 2025)).
Design recommendations synthesized from empirical findings:
- Insert sparse MoE blocks (E=4–8, top-k=1) at action expert layers; inherit pretrained weights and fine-tune only routers/scale adapters.
- Fuse force/tactile via late-stage residual injection for stability and preservation of VLM priors.
- Regularize expert routing with load-balancing objectives and, where applicable, expert-side gating for action sensitivity.
- In federated or multi-client regimes, aggregate expert updates proportional to activation statistics for effective knowledge sharing.
The MoDE framework substantiates the thesis that distributed, context-adaptive expertise—implemented via sparse, multimodal, and residual-corrective experts—offers a scalable, interpretable, and computationally efficient pathway to human-like dexterous robotic manipulation in the VLA domain (Tang et al., 9 Mar 2026, Shen et al., 16 Oct 2025, Yu et al., 28 May 2025, Huang et al., 2024, Wan et al., 3 Aug 2025, Miao et al., 4 Aug 2025).