DexCompose: Policy Composition for Dexterous Manipulation

Updated 4 July 2026

DexCompose is a framework for post-hoc composition of pretrained dexterous manipulation policies by partitioning finger control between retention and interaction tasks.
It employs explicit finger-level action ownership and asymmetric residual modules to stabilize the grasp and adapt the downstream controller within a restricted action subspace.
Evaluation on 16 composite tasks in simulation reports a 77.43% success rate, significantly outperforming conventional chaining and baseline residual methods.

DexCompose is a framework for post-hoc composition of pretrained dexterous manipulation policies on a single multi-fingered hand. It addresses composite tasks in which a first skill establishes an object-retention state and a second skill must then be executed without invalidating that state, such as opening a door without dropping a ball. The framework keeps the original task policies frozen, assigns explicit finger-level action ownership between the preservation and downstream roles, and learns two asymmetric residual modules that stabilize the retained object and adapt the downstream controller within a restricted action subspace. In evaluation on 16 composite tasks, DexCompose reports a 77.43% average composite success rate, substantially above conventional chaining and standard residual baselines (Huang et al., 26 Jun 2026).

1. Problem setting and formal objective

DexCompose studies multi-task dexterous manipulation with a single multi-fingered hand. The setting is organized around two primitive tasks. Task A is a hold-and-retain skill, such as grasping a ball, picking up a can, holding a mug in a pouring configuration, or holding a stick. Task B is a downstream interaction, such as opening a door, pushing a button, opening a microwave, or turning on a switch. The composite task succeeds only if Task B is completed while Task A’s outcome remains valid throughout the rollout (Huang et al., 26 Jun 2026).

At time step $t$, the observation is $o_t$ and the action is

$a_t = (p_t, r_t, q_t) \in \mathbb{R}^{28},$

where $p_t \in \mathbb{R}^3$ is wrist translation, $r_t \in \mathbb{R}^3$ is wrist rotation in Euler angles, and $q_t \in \mathbb{R}^{22}$ are Shadow Hand joint targets in the experiments. Two frozen pretrained policies share this action space: a Task A policy $T_A$ and a Task B policy $T_B$. DexCompose does not retrain these base policies; it learns only ownership masks and residual modules.
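The layout of the action vector can be illustrated with a short sketch. It assumes the components are stored in the order given by the tuple above (translation, rotation, joint targets); the function name `split_action` is hypothetical and does not come from the paper.

```python
import numpy as np

ACTION_DIM = 28  # 3 wrist translation + 3 wrist rotation + 22 joint targets


def split_action(a):
    """Split a 28-D action into (p, r, q), assuming the order p, r, q."""
    a = np.asarray(a, dtype=float)
    if a.shape != (ACTION_DIM,):
        raise ValueError(f"expected shape ({ACTION_DIM},), got {a.shape}")
    return a[:3], a[3:6], a[6:]


# Example with a dummy action whose entries equal their own index.
p, r, q = split_action(np.arange(ACTION_DIM))
```

Here `p` and `r` each have three entries and `q` has 22, matching the dimensions stated above.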

The composition problem is defined over rollouts that start from a successful Task-A state. Let $S_A(\mathcal{T}) \in \{0,1\}$ indicate whether Task A’s outcome is preserved throughout the trajectory $\mathcal{T}$, and let $S_B(\mathcal{T}) \in \{0,1\}$ indicate whether Task B succeeds. The objective is to maximize the expected product $\mathbb{E}[S_A(\mathcal{T})\, S_B(\mathcal{T})]$, so partial completion is not counted as success.
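As a minimal sketch of this objective, the expected product of the two indicators can be estimated by averaging their conjunction over rollouts. The function name and the toy outcome lists are illustrative assumptions, not data from the paper.

```python
import numpy as np


def composite_success_rate(s_a, s_b):
    """Monte Carlo estimate of E[S_A * S_B] from per-rollout indicators."""
    s_a = np.asarray(s_a, dtype=bool)
    s_b = np.asarray(s_b, dtype=bool)
    if s_a.shape != s_b.shape:
        raise ValueError("indicator arrays must have the same shape")
    # A rollout counts only if retention AND the downstream task both succeed.
    return float(np.mean(s_a & s_b))


# Four toy rollouts: only the first and last satisfy both conditions.
rate = composite_success_rate([1, 1, 0, 1], [1, 0, 1, 1])
```

In this toy case each task alone succeeds in three of four rollouts, yet the composite rate is lower because the failures fall on different rollouts.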

The central difficulty is destructive interference. Both primitive policies act through the same full-hand action space, so naively chaining $T_A$ and $T_B$ causes the downstream controller to overwrite finger motions that are essential for maintaining the object-retention state. The paper identifies three specific sources of failure: overlapping DoFs, conflicting contact modes, and interference between grasp-preserving and interaction-seeking actions. It also notes that directly training one policy for every composite pair scales as $O(NM)$ for $N$ retention skills and $M$ downstream skills, which is prohibitive. DexCompose therefore reframes embodiment redundancy as an allocation problem: determine which finger DoFs are strictly necessary for preserving Task A, and assign the remaining DoFs to Task B.

2. Primitive skill library and policy assumptions

The framework assumes a library of eight pretrained primitive policies, one per skill. The Task A policies are GraspBall, PickStick, PickCan, and PourMug. The Task B policies are OpenDoor, PushButton, OpenMicrowave, and TurnOnSwitch. These produce 16 composite task pairs.

Skill family	Primitive tasks
Task A: object retention	GraspBall, PickStick, PickCan, PourMug
Task B: downstream interaction	OpenDoor, PushButton, OpenMicrowave, TurnOnSwitch

Each primitive policy is a conditional diffusion model trained via behavior cloning on human demonstrations; no RL is used to train the base policies. The policy maps the last two observations $(o_{t-1}, o_t)$ to a horizon of future actions, but only the first action is executed in receding-horizon fashion. The observation dimension is reported as 32–37, consisting of joint states and task-specific geometric features. The action space is the 28-dimensional wrist-plus-hand vector described above (Huang et al., 26 Jun 2026).
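The receding-horizon execution pattern described above can be sketched as follows. The diffusion model is replaced by a toy stand-in, and `run_receding_horizon`, `toy_policy`, and `toy_env_step` are hypothetical names; padding the observation window at the first step by repetition is also an assumption of this sketch.

```python
from collections import deque

import numpy as np


def run_receding_horizon(policy, env_step, first_obs, n_steps):
    """Execute only the first action of each predicted action horizon.

    `policy(o_prev, o_cur)` must return an array of shape (horizon, action_dim);
    `env_step(action)` must return the next observation. Both are stand-ins.
    """
    # Assumption: at t = 0 the two-observation window is padded by repetition.
    window = deque([first_obs, first_obs], maxlen=2)
    executed = []
    for _ in range(n_steps):
        plan = np.asarray(policy(window[0], window[1]))
        action = plan[0]                 # discard the rest of the horizon
        executed.append(action)
        window.append(env_step(action))  # the oldest observation drops out
    return np.stack(executed)


# Toy stand-ins so the sketch runs end to end.
def toy_policy(o_prev, o_cur, horizon=4, action_dim=28):
    # Plan row h is filled with the value o_cur[0] + h.
    return o_cur[0] + np.arange(horizon)[:, None] * np.ones((1, action_dim))


def toy_env_step(action, obs_dim=32):
    return np.full(obs_dim, action[0] + 1.0)


actions = run_receding_horizon(toy_policy, toy_env_step, np.zeros(32), n_steps=3)
```

Replanning at every step is what makes the scheme closed-loop: the later rows of each plan are never executed, so each new observation can change the next action.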

This design choice is important because DexCompose is not a joint multitask training procedure. Its contribution lies in reusing pretrained full-hand diffusion controllers as black-box policies and adding only lightweight structure around them. A plausible implication is that the method is best understood as a composition layer for an existing dexterous policy library rather than as a replacement for primitive-skill learning.

3. Finger-level action ownership and release tests

DexCompose introduces explicit finger-level ownership. Some fingers are assigned to maintain Task A, while the remaining fingers, together with the wrist, are allocated to execute Task B. Let $m \in \{0,1\}^5$ denote a finger ownership mask over the five fingers of the Shadow Hand. A value $m_i = 1$ means finger $i$ is preserved for Task A; $m_i = 0$ means it is released to Task B. The mask is then expanded to the joint level, partitioning hand joints into Task-A-owned and Task-B-owned subsets. During composition, the Task A side may modify only mask-1 joints, whereas the Task B side may modify wrist DoFs and mask-0 joints (Huang et al., 26 Jun 2026).
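The expansion from a finger-level mask to action-level ownership can be sketched as below. The per-finger joint counts (which sum to the 22 hand joints) and the ordering of joints within the action vector are assumptions made for illustration, as is the function name `expand_finger_mask`; the six wrist dimensions are assigned to Task B as stated above.

```python
import numpy as np

# Assumed joint counts per finger (sum to 22) and assumed finger ordering.
FINGER_JOINTS = {"thumb": 5, "index": 4, "middle": 4, "ring": 4, "little": 5}
WRIST_DOFS = 6  # 3 translation + 3 rotation, always owned by Task B


def expand_finger_mask(finger_mask):
    """Expand a 5-entry finger mask into 28-D Task-A and Task-B masks.

    Finger entries: 1 = preserved for Task A, 0 = released to Task B.
    """
    m = np.asarray(finger_mask, dtype=int)
    if m.shape != (5,) or not np.isin(m, (0, 1)).all():
        raise ValueError("finger_mask must contain five 0/1 entries")
    joint_mask = np.repeat(m, list(FINGER_JOINTS.values()))
    mask_a = np.concatenate([np.zeros(WRIST_DOFS, dtype=int), joint_mask])
    mask_b = 1 - mask_a  # complementary: every dimension has one owner
    return mask_a, mask_b


# Thumb, ring, and little finger preserved; index and middle released.
mask_a, mask_b = expand_finger_mask([1, 0, 0, 1, 1])
```

Because the two masks are complementary by construction, no action dimension can be written by both sides.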

To determine which fingers are actually necessary for retention, DexCompose first collects a library of successful post-task states for each Task-A skill. For each such skill, the offline pipeline stores 4096 held states, each containing a full simulator state after Task A success and the reference joint configuration that achieved the hold. These states support both mask selection and stabilizer training.

For a candidate mask, DexCompose performs a release test. Preserved fingers replay the reference grasp, while released fingers are gradually opened over a fixed test horizon $H$ toward a fully open or neutral pose. For each candidate mask, the framework runs $K$ trials and records three statistics. The first is the retention rate $\rho_{\mathrm{ret}}$, the fraction of trials in which the object remains retained at the end of the test. The second is the clean-release rate $\rho_{\mathrm{rel}}$, the fraction of trials in which the released fingers no longer provide significant contact support. The third is a residual dexterity measure $d_{\mathrm{res}}$, such as the number of released fingers or joints, indicating how much action capacity is freed for Task B.

Mask selection is then delegated to AgentSelect, which takes candidate masks together with $\rho_{\mathrm{ret}}$, $\rho_{\mathrm{rel}}$, and $d_{\mathrm{res}}$, along with task descriptions and instructions about the trade-off between retention and dexterity. In the reported implementation, AgentSelect is an LLM. The paper contrasts this with a heuristic baseline that maximizes retention subject to releasing at least two fingers. The heuristic performs worse, which the authors attribute to the importance of semantic reasoning about finger roles. The representative example is GraspBall + OpenDoor: the heuristic favors a thumb–index grasp that maximizes retention but occupies the index finger, whereas the LLM favors a thumb–ring–little grasp that is slightly less stable yet frees the index finger and improves composite success.
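The release-test statistics and the heuristic baseline described above can be sketched as follows. The class and function names are hypothetical, the trial outcomes are made up for illustration, and the LLM-based selector is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseStats:
    mask: tuple            # per finger: 1 = preserved, 0 = released
    retention: float       # fraction of trials with the object still held
    clean_release: float   # fraction of trials with released fingers clear
    released_fingers: int  # simple residual dexterity measure


def summarize_trials(mask, retained, clean):
    """Aggregate per-trial booleans from a release test into three statistics."""
    n = len(retained)
    if n == 0 or n != len(clean):
        raise ValueError("need matching, non-empty trial records")
    return ReleaseStats(tuple(mask), sum(retained) / n, sum(clean) / n,
                        len(mask) - sum(mask))


def heuristic_select(candidates, min_released=2):
    """Heuristic baseline: maximize retention with at least two released fingers."""
    feasible = [c for c in candidates if c.released_fingers >= min_released]
    if not feasible:
        raise ValueError("no candidate releases enough fingers")
    return max(feasible, key=lambda c: c.retention)


# Illustrative (made-up) outcomes for three candidate masks, four trials each.
thumb_index = summarize_trials((1, 1, 0, 0, 0), [True] * 4, [True, True, True, False])
thumb_ring_little = summarize_trials((1, 0, 0, 1, 1), [True, True, True, False], [True] * 4)
four_fingers = summarize_trials((1, 1, 1, 1, 0), [True] * 4, [True] * 4)

choice = heuristic_select([thumb_index, thumb_ring_little, four_fingers])
```

With these toy numbers the heuristic picks the thumb–index mask because it ranks feasible candidates by retention alone; it has no way to express that a particular released finger matters for the downstream task, which is the gap the LLM selector is reported to address.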

4. Asymmetric residual composition and runtime execution

After selecting the final mask $m^\star$, DexCompose trains two asymmetric residual policies. The first is a Task-A residual stabilizer $\pi_A^{\mathrm{res}}$, a PPO policy that outputs bounded corrections on preserved joints only. The second is a Task-B residual $\pi_B^{\mathrm{res}}$, also learned with PPO, that adapts the frozen Task B policy within the action subspace owned by Task B.

From the joint-level mask, the framework constructs full action-space masks $M_A, M_B \in \{0,1\}^{28}$ with $M_A + M_B = \mathbf{1}$. Wrist DoFs always belong to Task B. The Task-A stabilizer stores the hand configuration $q^{\mathrm{ref}}$ at the moment Task A succeeds and then produces small bounded corrections $\Delta q_t^A$ using a $\tanh$-bounded output scaled by a per-joint vector $\beta$. The resulting preserved-joint command is

$q_t^A = q^{\mathrm{ref}} + \Delta q_t^A.$

The Task-A action vector has zero wrist motion and the preserved-joint command. The stabilizer’s inputs include preserved finger joint positions and velocities, object pose and velocity relative to the palm, fingertip–object distances, binary contacts, torque features for preserved joints, the previous residual output, the executed base action, and a phase indicator. The stabilizer reward and its weighting coefficients are specified in the paper (Huang et al., 26 Jun 2026). Training uses PPO for approximately 24.6M steps per Task-A skill, in isolation, with random disturbances applied to the base and the released fingers.
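A minimal sketch of a bounded stabilizer command is given below: a stored reference configuration plus a small, scaled correction applied only to preserved joints. The tanh squashing is used here as one common way to bound a network output and is an assumption of this sketch, as are the function name and the scale value of 0.1.

```python
import numpy as np


def stabilizer_joint_command(q_ref, raw_output, scale, preserved_mask):
    """Reference pose plus a bounded correction on preserved joints only.

    Entries for released joints are left at q_ref here; in the composed
    controller those dimensions are owned by Task B and would be overwritten.
    """
    q_ref = np.asarray(q_ref, dtype=float)
    preserved_mask = np.asarray(preserved_mask, dtype=float)
    # tanh keeps each correction inside [-scale, +scale] (assumed squashing).
    delta = np.asarray(scale, dtype=float) * np.tanh(np.asarray(raw_output, dtype=float))
    return q_ref + preserved_mask * delta


q_ref = np.zeros(22)
preserved = np.zeros(22)
preserved[:5] = 1.0  # pretend the first five joints are preserved
# A very large raw output saturates the correction at the scale limit.
cmd = stabilizer_joint_command(q_ref, np.full(22, 100.0), scale=0.1,
                               preserved_mask=preserved)
```

Bounding the correction means the stabilizer can nudge the grasp but cannot move far from the configuration that was known to hold the object.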

The Task-B residual starts from the frozen nominal downstream action $a_t^B$ and adds a bounded masked correction

$\tilde{a}_t^B = a_t^B + M_B \odot \Delta a_t^B,$

where $M_B$ is the Task-B action mask and $\Delta a_t^B$ is the bounded residual output.

Its reward augments the environment-defined Task-B reward with an object-to-palm distance penalty and a residual regularizer:

$r_t^B = r_t^{\mathrm{env}} - \lambda_d\, d_t - \lambda_r\, c_t,$

where $d_t$ is the object-to-palm distance, $c_t$ is the residual regularization cost, and $\lambda_d$ and $\lambda_r$ are weighting coefficients. The stabilizer may optionally be fine-tuned jointly during Task-B residual training.
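The structure of this shaped reward can be sketched as below. The squared L2 norm used for the residual regularizer and the weight values are assumptions chosen for illustration; the paper's exact functional form and coefficients are not reproduced here, and `task_b_reward` is a hypothetical name.

```python
import numpy as np


def task_b_reward(env_reward, object_palm_distance, residual, w_dist, w_reg):
    """Environment reward minus a drift penalty and a residual-size penalty."""
    # Assumed regularizer: squared L2 norm of the residual action.
    reg_cost = float(np.sum(np.square(np.asarray(residual, dtype=float))))
    return env_reward - w_dist * object_palm_distance - w_reg * reg_cost


# Placeholder weights and inputs, chosen only to make the arithmetic easy.
r = task_b_reward(env_reward=1.0, object_palm_distance=0.5,
                  residual=[3.0, 4.0], w_dist=0.2, w_reg=0.01)
```

The distance term discourages downstream motions that let the retained object drift from the palm, while the regularizer keeps the learned correction close to the frozen policy's nominal action.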

At runtime, composition proceeds in stages. First, the frozen Task-A policy is executed until Task A succeeds. Second, control is transferred to the composition regime, optionally through a short transition stage. Third, concurrent execution begins. The final action is assembled as

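$a_t = M_A \odot a_t^A + M_B \odot \tilde{a}_t^B,$

where $M_A$ and $M_B$ are the complementary Task-A and Task-B action masks, $a_t^A$ is the stabilizer action, $\tilde{a}_t^B$ is the residual-adapted Task-B action, and $\odot$ denotes the elementwise product. This masked synthesis enforces non-overlapping control responsibilities: Task A cannot write to Task-B-owned dimensions, and Task B cannot overwrite preserved joints.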
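The masked assembly of the final action can be sketched as follows, assuming complementary 0/1 ownership masks over the 28 action dimensions. The function and variable names are hypothetical, and the constant toy vectors exist only to make ownership visible in the output.

```python
import numpy as np


def compose_action(a_stab, a_b_nominal, delta_b, mask_a):
    """Blend the stabilizer and adapted Task-B actions by ownership mask."""
    mask_a = np.asarray(mask_a, dtype=float)
    mask_b = 1.0 - mask_a  # complementary ownership
    # The Task-B residual may only change dimensions that Task B owns.
    a_b = np.asarray(a_b_nominal, dtype=float) + mask_b * np.asarray(delta_b, dtype=float)
    return mask_a * np.asarray(a_stab, dtype=float) + mask_b * a_b


mask_a = np.zeros(28)
mask_a[6:11] = 1.0  # pretend five hand joints are preserved for Task A
action = compose_action(a_stab=np.full(28, 1.0),
                        a_b_nominal=np.full(28, 2.0),
                        delta_b=np.full(28, 0.5),
                        mask_a=mask_a)
```

In the result, preserved joints carry the stabilizer value and every other dimension, including the wrist, carries the nominal Task-B action plus its residual.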

5. Experimental setting and empirical results

Experiments are conducted in NVIDIA Isaac Lab, built on Isaac Sim, using a Shadow Dexterous Hand with 24 DoFs. The system uses 22 finger joints plus a 6-DoF floating-base wrist, yielding the 28-dimensional control vector. The evaluation suite spans four retention skills and four downstream interactions, producing 16 composite tasks. Standalone base-policy success is reported as approximately 100% for grasp skills and 82–98% for interaction skills. Composite success is defined as $S_A(\mathcal{T}) \cdot S_B(\mathcal{T})$, where $S_A(\mathcal{T}) = 1$ only if the object remains retained throughout execution. Each composite pair is evaluated over 32 rollouts for each of 8 seeds, for 256 rollouts per pair (Huang et al., 26 Jun 2026).

The main quantitative comparison is against four baselines: Frozen (naive sequential execution with a frozen grasp), Decomp. (Decomposed Action Space), FullRes. (unmasked residual learning), and Ours-ZS (DexCompose without Task-B residual).

Method	Average composite success
Frozen	3.09%
Decomp.	45.61%
FullRes.	61.60%
Ours-ZS	69.23%
Ours (DexCompose)	77.43%

These results imply an improvement of approximately 74.3 percentage points over direct chaining (Frozen) and 15.8 points over FullRes., the strongest baseline that is not a DexCompose variant. The paper also reports that DexCompose performs particularly well on more disturbance-prone interactions such as OpenMicrowave and TurnOnSwitch.
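The margins quoted above follow directly from the table. The short check below recomputes them from the reported averages; the variable names are illustrative.

```python
# Average composite success (%) as reported in the table above.
results = {
    "Frozen": 3.09,
    "Decomp.": 45.61,
    "FullRes.": 61.60,
    "Ours-ZS": 69.23,
    "Ours": 77.43,
}

# Percentage-point margin of the full method over every other row.
margins = {
    name: round(results["Ours"] - value, 2)
    for name, value in results.items()
    if name != "Ours"
}
```

The same computation gives a margin of about 8.2 points over Ours-ZS, which is the contribution attributable to the Task-B residual alone.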

Preservation analysis separates capability retention on the two sides. Frozen-Grasp yields an A-side preservation ratio of approximately 0.102 and a B-side preservation ratio of approximately 0.489. DexCompose raises these to approximately 0.811 and 0.893, respectively. The reported interpretation is that the method occupies the upper-right region of preservation-versus-composition plots: high composite success without destroying the underlying primitive capabilities.

Failure analysis identifies four failure modes: Init fail, Transition fail, Object dropped during B, and B-fail. Transition failure is described as a major source of error. TurnOnSwitch exhibits more preservation failures, including drops and B-fail events caused by aggressive contact, whereas OpenMicrowave shows more pure interaction failures without many drops. PushButton is comparatively robust.

Ablation results clarify the contribution of each component. The full method averages 77.43%. Removing the Task-B residual (Ours-ZS) reduces this to 69.23%. Removing finger allocation (-Finger) reduces it to 54.52%. Removing the Task-A residual (-Task-A) causes a collapse to 9.13%. Removing action masking (-Mask) reduces performance to 59.13%, and removing the explicit transition stage (-Trans.) yields 69.02%. These figures support the paper’s central claim that bounded residual stabilization and structural action ownership are the dominant sources of robustness.
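To make the relative importance of each component easier to read, the ablation figures above can be converted into percentage-point drops from the full method and ranked. The dictionary keys reuse the ablation labels from the text; everything else is illustrative.

```python
FULL = 77.43  # full DexCompose average composite success (%)

# Average composite success (%) with one component removed.
ablations = {
    "Ours-ZS": 69.23,   # no Task-B residual
    "-Finger": 54.52,   # no finger allocation
    "-Task-A": 9.13,    # no Task-A residual stabilizer
    "-Mask": 59.13,     # no action masking
    "-Trans.": 69.02,   # no explicit transition stage
}

drops = {name: round(FULL - value, 2) for name, value in ablations.items()}
# Largest drop first.
ranking = sorted(drops, key=drops.get, reverse=True)
```

Ranked this way, removing the Task-A stabilizer costs the most, followed by finger allocation and action masking, while the transition stage and the Task-B residual each account for roughly eight points.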

The LLM-based mask selector also outperforms the heuristic baseline on the reported subset of task pairs. Mean success rises from 66.5% for the heuristic to 73.0% for the LLM. On GraspBall + OpenDoor, the numbers are 72.1% versus 82.7%; on GraspBall + TurnOnSwitch, 63.5% versus 71.9%; on PickCan + OpenMicrowave, 68.3% versus 72.4%; and on PourMug + TurnOnSwitch, 62.0% versus 64.9%.
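The reported means can be checked against the four per-pair figures quoted above. The sketch below recomputes both means and the per-pair gains; the variable names are illustrative.

```python
# (heuristic %, LLM %) composite success for the four reported task pairs.
pairs = {
    "GraspBall + OpenDoor": (72.1, 82.7),
    "GraspBall + TurnOnSwitch": (63.5, 71.9),
    "PickCan + OpenMicrowave": (68.3, 72.4),
    "PourMug + TurnOnSwitch": (62.0, 64.9),
}

heuristic_mean = sum(h for h, _ in pairs.values()) / len(pairs)
llm_mean = sum(l for _, l in pairs.values()) / len(pairs)
# Percentage-point gain of the LLM selector on each pair.
gains = {name: round(l - h, 1) for name, (h, l) in pairs.items()}
```

The unrounded means are 66.475% and 72.975%, consistent with the quoted 66.5% and 73.0%, and the gain is largest on GraspBall + OpenDoor, the pair used as the representative example of finger-role reasoning.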

6. Limitations, conceptual significance, and relation to adjacent work

The paper states several explicit limitations. DexCompose composes exactly two sequential skills, one Task A and one Task B; longer chains and overlapping multi-skill compositions are not addressed. All experiments are in simulation, with no real-world transfer study. The method assumes access to many successful Task-A rollouts, simulator state saving and restoration for held-state collection and release tests, and a fixed hand morphology centered on the Shadow Hand. Finger allocation further depends on task-specific LLM prompts and precomputed release statistics (Huang et al., 26 Jun 2026).

Within those limits, DexCompose advances a specific formulation of dexterous compositionality: action-space resource allocation under contact constraints. Its main contribution is not semantic task understanding, zero-demonstration synthesis, or generalist language-conditioned control, but explicit structural partitioning of a single hand’s DoFs and residual adaptation around frozen primitive policies. This suggests a narrower but more mechanically explicit notion of composition than systems that compose through language or high-level planning alone.

This placement becomes clearer in relation to adjacent work. CoDex studies Compositional Dexterous Functional Object Manipulation and derives local and global semantic constraints from VLMs, then combines analytic constrained optimization with RL to produce grasp–move–actuate policies without demonstrations; its emphasis is semantic constraint extraction and function-oriented dexterous behavior, rather than reuse of frozen primitive policies (Jiang et al., 30 Jun 2026). DexVLA studies a language-driven general robot controller built around a plug-in 1B-parameter diffusion action expert and an embodiment curriculum, emphasizing cross-embodiment generality, long-horizon competence, and sub-step reasoning for general robot control (Wen et al., 9 Feb 2025). Relative to those directions, DexCompose isolates a distinct research question: how to preserve one dexterous behavior while introducing another on the same hand without retraining the original policies for every pair.

The paper’s reported future directions follow directly from that framing: extension to longer-horizon compositions with more than two skills, transfer to real hardware, adaptation to different morphologies or bimanual settings, generalization of action ownership beyond fingers, and replacement of LLM-based mask selection with learned allocation policies that retain interpretability. In that sense, DexCompose presents structural action ownership and dual residuals as a concrete pattern for dexterous skill composition beyond conventional policy chaining.