Hierarchical Policy for Cluttered Manipulation

Updated 27 January 2026
  • The paper introduces a dual-level hierarchical RL framework that achieves up to 97% success in cluttered environments by decomposing tasks into high-level planning and low-level manipulation.
  • It employs a novel combination of spatially extended Q-update (SEQ) and a two-stage update scheme (TSUS) to enhance spatial credit assignment and stabilize policy learning.
  • Key implications include improved sample efficiency, robust generalization to varying clutter densities, and a foundation for future end-to-end hierarchical credit propagation.

Hierarchical Policy for Cluttered-scene Long-horizon Manipulation (HCLM) refers to a vision-based framework for solving complex multi-step manipulation tasks in environments characterized by severe object occlusion and clutter. This approach leverages hierarchical reinforcement learning (HRL) and option-based policy decomposition to instantiate and sequence primitives such as push, pick, and place, enabling agents to generalize to diverse clutter densities and long task horizons (Wang et al., 2023).

1. Formal Policy Decomposition and Option Structure

HCLM implements a dual-level policy: a high-level discrete planner and a set of option modules corresponding to parameterized manipulation primitives. The central agent operates on an orthographic RGB-D observation $o_t \in \mathbb{R}^{H \times W \times 4}$ at each step, producing the tuple $a_t = (a_t^H, a_t^L)$ where:

  • $a_t^H \in \{\text{push}, \text{pick+place}\}$ indicates the primitive selection, output via Q-value maximization $a_t^H = \arg\max_{a^H} Q_H(o_t, a^H)$.
  • $a_t^L = (x, y, \theta)$ specifies spatial parameters determined by the corresponding option module.

Push is parameterized by a 2D pixel location and one of 12 discrete directions. Pick uses a Transporter-based Q-map to select location and angle, and place selects among 36 discrete placement orientations.
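The two-level selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `q_map[k][y][x]` layout, and the toy Q-values are assumptions, and treating pick+place as a single Q-map is a simplification of the two separate modules.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: q_map[k][y][x] is the Q-value for angle index k at pixel (x, y).
QMap = List[List[List[float]]]


def select_high_level(q_high: Dict[str, float]) -> str:
    """Return the primitive with the largest high-level Q-value."""
    return max(q_high, key=lambda name: q_high[name])


def select_low_level(q_map: QMap) -> Tuple[int, int, int]:
    """Return (x, y, angle_index) of the largest entry in a dense Q-map."""
    best_value = float("-inf")
    best_action = (0, 0, 0)
    for k, plane in enumerate(q_map):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value > best_value:
                    best_value, best_action = value, (x, y, k)
    return best_action


def select_action(q_high: Dict[str, float],
                  option_q_maps: Dict[str, QMap]) -> Tuple[str, Tuple[int, int, int]]:
    """Pick the primitive first, then query only that option's Q-map."""
    a_high = select_high_level(q_high)
    a_low = select_low_level(option_q_maps[a_high])
    return a_high, a_low


# Toy values for illustration only.
q_high = {"push": 0.2, "pick+place": 0.7}
option_q_maps = {
    "push": [[[0.1, 0.3], [0.0, 0.2]]],
    "pick+place": [[[0.0, 0.1], [0.2, 0.0]], [[0.0, 0.9], [0.1, 0.0]]],
}
print(select_action(q_high, option_q_maps))
```

Only the chosen option's Q-map is searched, which mirrors the idea that the high-level planner gates which low-level module acts.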

2. Training Paradigm: Behavior Cloning and Hierarchical RL

The pick and place sub-policies (options) are trained independently using supervised behavior cloning (BC) on data $D = \{(o_j, a_j)\}$ sampled from oracle demonstrations. Training minimizes the cross-entropy loss

$$L_{\text{BC}} = -\mathbb{E}_{(o,a) \sim D} \left[ Y \cdot \log \text{softmax}\, Q(o) \right],$$

where $Y$ is the one-hot encoding of the ground-truth action.
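As a minimal sketch of this loss, the snippet below treats each Q-map as a flattened list of logits and represents the one-hot label by its index. The function names and the batch format are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a flattened Q-map."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_sum for v in logits]


def bc_loss(batch: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Mean cross-entropy between softmax(Q(o)) and one-hot action labels.

    Each batch item is (flattened_q_values, label_index), where label_index
    marks the position of the 1 in the one-hot label Y.
    """
    total = 0.0
    for q_values, label_index in batch:
        # Y . log softmax Q(o) reduces to the log-probability at the label index.
        total += log_softmax(q_values)[label_index]
    return -total / len(batch)
```

With uniform logits over four actions the loss equals log 4, and it falls as the logit at the demonstrated action grows.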

The high-level policy and push option are then trained jointly via hierarchical RL. Exploration follows an ε-greedy schedule, and transitions are sampled with prioritized experience replay (PER). Two algorithmic mechanisms are introduced:

  • Spatially Extended Q-Update (SEQ): For push, a learned Q-map incorporates an anisotropic Gaussian spatial filter in the local action frame. The target is

$$Y_t^u = R_t \cdot \text{Filter} + \gamma \eta \, Q_u\!\left(o_{t'}, \arg\max_{a_{t'}} Q_u(o_{t'}, a_{t'})\right),$$

leveraging only successful transitions for Q-value bootstrapping.

  • Two-Stage Update Scheme (TSUS): High-level Q-values are updated with adaptive weighting depending on the epoch and the origin (random/policy) of the push action, suppressing non-stationarity in HRL optimization.
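A literal reading of the SEQ target can be sketched as below. Several details are assumptions, since the text does not specify them: the filter is an unnormalized Gaussian with peak 1, the widths `sigma_along` and `sigma_across` are hypothetical parameters, η is treated as a plain scalar, and the bootstrapped term is added at every pixel exactly as the formula is written.

```python
import math
from typing import List

Grid = List[List[float]]


def anisotropic_gaussian(height: int, width: int, x0: int, y0: int,
                         theta: float, sigma_along: float,
                         sigma_across: float) -> Grid:
    """Gaussian weights centered at (x0, y0), elongated along direction theta."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    filt = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - x0, y - y0
            # Express the offset in the local action frame of the push.
            along = cos_t * dx + sin_t * dy
            across = -sin_t * dx + cos_t * dy
            exponent = (along / sigma_along) ** 2 + (across / sigma_across) ** 2
            row.append(math.exp(-0.5 * exponent))
        filt.append(row)
    return filt


def seq_target(reward: float, filt: Grid, gamma: float, eta: float,
               q_next_max: float) -> Grid:
    """Per-pixel target: R_t * Filter + gamma * eta * max_a Q_u(o_t', a)."""
    bootstrap = gamma * eta * q_next_max
    return [[reward * weight + bootstrap for weight in row] for row in filt]
```

For example, `anisotropic_gaussian(5, 5, 2, 2, 0.0, 2.0, 1.0)` gives a pixel one step along the push direction more weight than a pixel one step across it, so reward spreads further along the push than sideways.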

3. Mathematical Agent Specification

The HCLM agent is formally defined as an MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ with:

  • State: RGB-D top-down image,
  • Action: tuple as above,
  • Reward: stepwise task progression (STP) function,

$$R_t = \begin{cases} \Delta_{\text{task\_progress}}, & \text{if } \Delta_{\text{task\_progress}} < 0, \\ \mathcal{W}(a_t^H)\, \mathcal{I}(a_t, s_{t+1}), & \text{otherwise}, \end{cases}$$

where $\mathcal{W}(a_t^H)$ weights by primitive, and $\mathcal{I}(a_t, s_{t+1})$ is a success indicator.

  • High-level and low-level Q-maps are trained with Huber losses on temporal-difference errors.
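The reward and loss in this specification can be sketched as follows. The per-primitive weights and the Huber threshold `delta` are hypothetical values chosen for illustration; the source gives neither.

```python
from typing import Dict

# Hypothetical per-primitive weights W(a^H); the source does not give values.
PRIMITIVE_WEIGHTS: Dict[str, float] = {"push": 0.5, "pick+place": 1.0}


def stp_reward(delta_progress: float, a_high: str, success: bool,
               weights: Dict[str, float] = PRIMITIVE_WEIGHTS) -> float:
    """Stepwise task progression reward.

    A regression in task progress is returned directly as a penalty;
    otherwise the reward is W(a^H) times the success indicator.
    """
    if delta_progress < 0:
        return delta_progress
    return weights[a_high] * (1.0 if success else 0.0)


def huber(td_error: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear for large TD errors."""
    magnitude = abs(td_error)
    if magnitude <= delta:
        return 0.5 * td_error ** 2
    return delta * (magnitude - 0.5 * delta)
```

The linear tail of the Huber loss limits the influence of large temporal-difference errors, which is the usual reason for preferring it over a squared loss in Q-learning.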

4. Network Architectures and Implementation

The "Dual-Level Action Network" (DLAN) serves as the backbone for joint high-level and push option inference:

  • Encoders: Two streams (RGB via frozen CLIP ResNet50; Depth via custom CNN) fuse spatial features for downstream convolutional decoding.
  • Push option: Up-convolution layers reconstruct dense Q-maps.
  • High-level branch: Fused features traverse conv + FC layers to produce Q-values over $\{\text{push}, \text{pick+place}\}$.
  • Pick/place transporters: Share encoder architecture but omit the high-level branch, utilizing key-query attention and dense Q-maps.

All input images are normalized and rotated per action angle before processing, preserving geometric invariance.
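One way to picture the per-angle rotation is the sketch below, which builds one rotated copy of an image per discrete angle. It assumes angles uniformly spaced over 360° and nearest-neighbor resampling about the image center; the source specifies neither, and the function names are illustrative.

```python
import math
from typing import List

Image = List[List[float]]


def rotate_nearest(image: Image, angle_deg: float, fill: float = 0.0) -> Image:
    """Rotate a 2D image about its center using nearest-neighbor sampling."""
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            dx, dy = c - cx, r - cy
            # Inverse-map each output pixel back to its source location.
            src_x = int(round(cos_a * dx + sin_a * dy + cx))
            src_y = int(round(-sin_a * dx + cos_a * dy + cy))
            if 0 <= src_x < width and 0 <= src_y < height:
                out[r][c] = image[src_y][src_x]
    return out


def rotated_stack(image: Image, num_angles: int) -> List[Image]:
    """One rotated copy per discrete angle, e.g. 12 for push, 36 for place."""
    step = 360.0 / num_angles
    return [rotate_nearest(image, k * step) for k in range(num_angles)]
```

Evaluating the same network on each rotated copy lets a single set of weights score every discrete angle, which is one common way to obtain the orientation handling described above.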

5. Experimental Protocol and Results

Evaluation occurs on the ClutteredRavens benchmark (PyBullet, UR5e + suction gripper), across six manipulation scenarios with increasing clutter:

  • Metrics: Success rate (within maximum steps) and episode length.
  • Baselines: Two-stream Transporter (pick+place-only), alternating push policies, RoManNet (RL for GR-ConvNets).
  • Quantitative Outcomes: HCLM consistently outperforms all baselines. For example, in "cluttered-block-insertion," HCLM achieves 97% success versus 56% for the strongest baseline. Efficiency, measured by average episode length, improves by 10–30% in the majority of tasks.

Robustness to increased clutter is demonstrated: in "cluttered-stack-block-pyramid," HCLM maintains ≥70% success even with 16 additional blocking objects, while best baseline performance does not exceed 34%.

Ablation studies confirm contributions:

  • No hierarchical policy: complete failure (0%).
  • No TSUS: 52% success.
  • No SEQ: 79%.

6. Design Innovations: SEQ and TSUS

The SEQ mechanism spreads Q-learning targets for push over a local affine region centered on the selected pixel and direction, improving spatial credit assignment. Adjacent direction slices receive a degraded bootstrapped target via factor $\kappa < 1$, enabling broader reward propagation and sample efficiency.
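The spread across direction slices might look like the following sketch. It assumes circular adjacency among the discrete push directions, that only the two immediate neighbors are updated, and that κ scales the bootstrapped term alone; the text does not confirm these details, and the default `kappa` is a placeholder.

```python
from typing import Dict


def direction_scales(selected: int, num_directions: int = 12,
                     kappa: float = 0.5) -> Dict[int, float]:
    """Scale factors per direction slice: 1 for the selected one, kappa for neighbors."""
    if not 0.0 <= kappa < 1.0:
        raise ValueError("kappa must satisfy 0 <= kappa < 1")
    left = (selected - 1) % num_directions
    right = (selected + 1) % num_directions
    return {left: kappa, selected: 1.0, right: kappa}


def slice_targets(reward_term: float, bootstrap_term: float,
                  scales: Dict[int, float]) -> Dict[int, float]:
    """Target per updated slice, with the bootstrapped term scaled by its factor."""
    return {k: reward_term + s * bootstrap_term for k, s in scales.items()}
```

With 12 directions and the first direction selected, the updated slices are indices 11, 0, and 1, so the wrap-around neighbor is included.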

TSUS addresses instability inherent to HRL by dynamically selecting which training samples contribute to the high-level policy loss, suppressing the detrimental influence of non-policy (random) explorations early in training, then gradually integrating these as training progresses.
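A minimal sketch of such a weighting schedule is shown below. The two-stage shape (zero weight for random pushes, then a linear ramp to full weight) and the epoch counts are assumptions chosen to match the description; the actual schedule is not given in the text.

```python
from typing import Sequence


def tsus_weight(epoch: int, is_random_push: bool,
                stage_one_epochs: int = 10, ramp_epochs: int = 10) -> float:
    """Weight of a sample in the high-level loss.

    Policy-generated pushes always count fully. Random pushes are ignored
    during stage one and then phased in linearly (hypothetical schedule).
    """
    if not is_random_push:
        return 1.0
    if epoch < stage_one_epochs:
        return 0.0
    return min(1.0, (epoch - stage_one_epochs) / float(ramp_epochs))


def weighted_high_level_loss(losses: Sequence[float],
                             weights: Sequence[float]) -> float:
    """Weighted mean of per-sample losses; zero if every weight is zero."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / total_weight
```

Under these defaults a random-push sample contributes nothing at epoch 0, half weight at epoch 15, and full weight from epoch 20 onward.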

7. Generalization, Limitations, and Future Perspectives

HCLM demonstrates strong generalization to unseen clutter densities and longer manipulation horizons. Its architecture supports efficient option re-use and rapid inference. However, current formulations rely on frozen pick/place policies, limiting credit assignment across options. Furthermore, learning remains sensitive to exploration schedules and visual normalization heuristics. Prospective enhancements could include end-to-end hierarchical credit propagation, vision-based domain randomization, and adaptive curriculum strategies.

HCLM sets a precedent for hierarchical visual policy learning in cluttered, long-horizon manipulation, combining option-based policy structure with novel spatial and temporal update mechanisms, and achieves state-of-the-art empirical results across diverse task benchmarks (Wang et al., 2023).
