Hierarchical Policy for Cluttered Manipulation
- The paper introduces a dual-level hierarchical RL framework that achieves up to 97% success in cluttered environments by decomposing tasks into high-level planning and low-level manipulation.
- It employs a novel combination of spatially extended Q-update (SEQ) and a two-stage update scheme (TSUS) to enhance spatial credit assignment and stabilize policy learning.
- Key implications include improved sample efficiency, robust generalization to varying clutter densities, and a foundation for future end-to-end hierarchical credit propagation.
Hierarchical Policy for Cluttered-scene Long-horizon Manipulation (HCLM) refers to a vision-based framework for solving complex multi-step manipulation tasks in environments characterized by severe object occlusion and clutter. This approach leverages hierarchical reinforcement learning (HRL) and option-based policy decomposition to instantiate and sequence primitives such as push, pick, and place, enabling agents to generalize to diverse clutter densities and long task horizons (Wang et al., 2023).
1. Formal Policy Decomposition and Option Structure
HCLM implements a dual-level policy: a high-level discrete planner and a set of option modules corresponding to parameterized manipulation primitives. At each step the agent receives an orthographic RGB-D observation and outputs a two-part action, a primitive together with its parameters, where:
- the primitive (push, pick, or place) is selected by maximizing the high-level Q-values;
- the spatial parameters are determined by the option module of the selected primitive.
Push is parameterized by a 2D pixel location and one of 12 discrete directions; pick uses a Transporter-based Q-map to select location and angle; place chooses among 36 discrete placement orientations.
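The two-level selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the array layouts, the function name `select_push_action`, and the uniform 30-degree spacing of the 12 push directions are assumptions made for the example.

```python
import numpy as np

PRIMITIVES = ("push", "pick", "place")
NUM_PUSH_DIRECTIONS = 12  # from the text; uniform angular spacing is an assumption


def select_push_action(high_level_q, push_q_map):
    """Choose a primitive by argmax, then decode push parameters from a Q-map.

    high_level_q: shape (3,), one Q-value per primitive (hypothetical layout).
    push_q_map:   shape (12, H, W), one slice per push direction (hypothetical layout).
    """
    primitive = PRIMITIVES[int(np.argmax(high_level_q))]
    if primitive != "push":
        # Pick and place parameters would come from their own option modules.
        return primitive, None
    d, row, col = np.unravel_index(int(np.argmax(push_q_map)), push_q_map.shape)
    angle_deg = d * 360.0 / NUM_PUSH_DIRECTIONS
    return primitive, (int(row), int(col), float(angle_deg))


rng = np.random.default_rng(0)
q_map = rng.random((NUM_PUSH_DIRECTIONS, 8, 8))
q_map[3, 5, 2] = 10.0  # plant a clear maximum so the decoded action is known
print(select_push_action(np.array([0.9, 0.2, 0.1]), q_map))
```

The high-level branch only decides which option runs; the spatial argmax is left to the option's own Q-map, which mirrors the decomposition in the text.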
2. Training Paradigm: Behavior Cloning and Hierarchical RL
The pick and place sub-policies (options) are trained independently using supervised behavior cloning (BC) on oracle demonstrations. Training minimizes the cross-entropy between the network's softmax-normalized output map and Y, where Y is a one-hot label map encoding the ground-truth action.
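A cross-entropy objective of this kind can be sketched in NumPy. The example assumes the network output is a dense map that is softmax-normalized over all of its entries, as is common for Transporter-style models; the function name `bc_cross_entropy` and the map shapes are hypothetical.

```python
import numpy as np


def bc_cross_entropy(q_map, label_index):
    """Cross-entropy between softmax(q_map) and a one-hot label at label_index."""
    logits = q_map.reshape(-1).astype(np.float64)
    logits = logits - logits.max()  # shift for a numerically stable softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    flat_label = np.ravel_multi_index(label_index, q_map.shape)
    # With a one-hot Y, the loss reduces to minus the log-probability of the label.
    return float(-log_probs[flat_label])


uniform = np.zeros((4, 4))
loss_uniform = bc_cross_entropy(uniform, (1, 2))  # equals log(16) for a flat map

peaked = np.zeros((4, 4))
peaked[1, 2] = 20.0
loss_peaked = bc_cross_entropy(peaked, (1, 2))  # close to zero
```

A flat map gives the maximum-uncertainty loss, and the loss shrinks toward zero as the map concentrates on the demonstrated action.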
The high-level policy and push option are then trained jointly via hierarchical RL. Exploration follows an ε-greedy schedule, and training transitions are drawn with prioritized experience replay (PER). Two algorithmic innovations are introduced:
- Spatially Extended Q-Update (SEQ): For push, the Q-map update applies an anisotropic Gaussian spatial filter in the local action frame, so the target reaches neighboring pixels and directions as well as the executed action. Only successful transitions are used for Q-value bootstrapping.
- Two-Stage Update Scheme (TSUS): High-level Q-values are updated with adaptive weighting depending on the epoch and the origin (random/policy) of the push action, suppressing non-stationarity in HRL optimization.
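Because TSUS depends on whether an action came from random exploration or from the policy, an ε-greedy step can record that origin alongside the action. The sketch below assumes a linear annealing schedule; the schedule shape, its default values, and the function names are illustrative and not taken from the paper.

```python
import numpy as np


def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=1000):
    """Linearly anneal epsilon; all three defaults are illustrative assumptions."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, step, rng):
    """Return (action_index, origin), where origin is 'random' or 'policy'."""
    if rng.random() < epsilon_at(step):
        return int(rng.integers(len(q_values))), "random"
    return int(np.argmax(q_values)), "policy"


rng = np.random.default_rng(0)
action, origin = epsilon_greedy(np.array([0.1, 0.9, 0.3]), step=0, rng=rng)
```

Storing `origin` with each transition in the replay buffer is what would let a later update stage weight random and policy samples differently.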
3. Mathematical Agent Specification
The HCLM task is formulated as an MDP with:
- State: RGB-D top-down image,
- Action: tuple as above,
- Reward: a stepwise task progression (STP) function that combines a primitive-specific weight with a success indicator.
- High-level and low-level Q-maps are trained with Huber losses on temporal-difference errors.
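A Huber loss on a temporal-difference error can be written in a few lines. The one-step target form, the discount factor `gamma`, and the threshold `delta` below are generic Q-learning assumptions chosen for illustration; the source does not give their values.

```python
import numpy as np


def td_target(reward, next_q_max, done, gamma=0.9):
    """One-step Q-learning target; gamma=0.9 is an assumed discount factor."""
    return reward + (0.0 if done else gamma * next_q_max)


def huber(td_error, delta=1.0):
    """Quadratic for small errors, linear for large ones (delta is assumed)."""
    a = np.abs(td_error)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))


target = td_target(reward=1.0, next_q_max=2.0, done=False, gamma=0.5)
loss = float(huber(target - 1.5))  # predicted Q of 1.5 gives a TD error of 0.5
```

The linear tail keeps a single large TD error from dominating the gradient, which is the usual reason for preferring Huber over squared error in Q-learning.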
4. Network Architectures and Implementation
The "Dual-Level Action Network" (DLAN) serves as the backbone for joint high-level and push option inference:
- Encoders: Two streams (RGB via frozen CLIP ResNet50; Depth via custom CNN) fuse spatial features for downstream convolutional decoding.
- Push option: Up-convolution layers reconstruct dense Q-maps.
- High-level branch: Fused features traverse conv + FC layers to produce Q-values over the primitives (push, pick, place).
- Pick/place transporters: Share encoder architecture but omit the high-level branch, utilizing key-query attention and dense Q-maps.
All input images are normalized and rotated per action angle before processing, preserving geometric invariance.
5. Experimental Protocol and Results
Evaluation occurs on the ClutteredRavens benchmark (PyBullet, UR5e + suction gripper), across six manipulation scenarios with increasing clutter:
- Metrics: Success rate (within maximum steps) and episode length.
- Baselines: Two-stream Transporter (pick+place-only), alternating push policies, RoManNet (RL for GR-ConvNets).
- Quantitative Outcomes: HCLM consistently outperforms all baselines. In "cluttered-block-insertion," for example, HCLM achieves 97% success versus 56% for the strongest baseline. Efficiency (average episode length) improves by 10–30% in the majority of tasks.
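The two metrics listed above can be computed from per-episode records as in the sketch below. The record format and the choice to average episode length over successful episodes only are assumptions for this example; the source does not state how length is aggregated.

```python
def summarize(episodes, max_steps):
    """Success rate within max_steps and mean length of successful episodes.

    episodes: list of dicts with 'success' (bool) and 'steps' (int); this record
    format is hypothetical.
    """
    wins = [e for e in episodes if e["success"] and e["steps"] <= max_steps]
    success_rate = len(wins) / len(episodes)
    mean_length = sum(e["steps"] for e in wins) / len(wins) if wins else float("nan")
    return {"success_rate": success_rate, "mean_length": mean_length}


records = [
    {"success": True, "steps": 5},
    {"success": True, "steps": 12},   # solved, but past the step budget
    {"success": False, "steps": 10},
    {"success": True, "steps": 7},
]
summary = summarize(records, max_steps=10)
```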
Robustness to increased clutter is demonstrated: in "cluttered-stack-block-pyramid," HCLM maintains ≥70% success even with 16 additional blocking objects, while best baseline performance does not exceed 34%.
Ablation studies confirm contributions:
- No hierarchical policy: complete failure (0%).
- No TSUS: 52% success.
- No SEQ: 79%.
6. Design Innovations: SEQ and TSUS
The SEQ mechanism spreads Q-learning targets for push over a local affine region centered on the selected pixel and direction, improving spatial credit assignment. Adjacent direction slices receive a degraded bootstrapped target via a decay factor, enabling broader reward propagation and better sample efficiency.
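One way to picture this spreading is as a weight tensor over the push Q-map: an oriented Gaussian in the executed direction slice, and a scaled copy in the two adjacent slices. The sketch below is an interpretation under stated assumptions, not the paper's update rule: the sigma values, the decay factor of 0.5, the reuse of the same Gaussian for neighboring slices, and the function name are all illustrative, and gating on successful transitions is not shown.

```python
import numpy as np


def seq_target_weights(shape, pixel, direction,
                       sigma_along=2.0, sigma_across=1.0, direction_decay=0.5):
    """Weights in [0, 1] spreading one push target over nearby pixels/directions.

    shape:     (num_directions, H, W) of the push Q-map (hypothetical layout).
    pixel:     (row, col) of the executed push.
    direction: index of the executed push direction.
    """
    n_dir, h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    dy, dx = rows - pixel[0], cols - pixel[1]
    theta = 2.0 * np.pi * direction / n_dir
    # Rotate pixel offsets into the push frame: 'along' follows the push direction.
    along = np.cos(theta) * dx + np.sin(theta) * dy
    across = -np.sin(theta) * dx + np.cos(theta) * dy
    gauss = np.exp(-0.5 * ((along / sigma_along) ** 2 + (across / sigma_across) ** 2))

    weights = np.zeros(shape)
    weights[direction] = gauss
    # Adjacent direction slices get a degraded copy of the same footprint.
    for neighbor in ((direction - 1) % n_dir, (direction + 1) % n_dir):
        weights[neighbor] = direction_decay * gauss
    return weights


w = seq_target_weights((12, 16, 16), pixel=(8, 8), direction=0)
```

Multiplying a scalar target by `w` would yield a dense target for the three affected slices, so a single executed push informs many nearby actions instead of one Q-map entry.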
TSUS addresses instability inherent to HRL by dynamically selecting which training samples contribute to the high-level policy loss, suppressing the detrimental influence of non-policy (random) explorations early in training, then gradually integrating these as training progresses.
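The sample weighting described here can be sketched as a schedule that depends on the epoch and on the action's origin. The warm-up and ramp lengths, the linear ramp, and both function names are assumptions chosen for illustration; the source only states that random-exploration samples are suppressed early and integrated gradually.

```python
def tsus_weight(epoch, origin, warmup_epochs=10, ramp_epochs=10):
    """Loss weight for one high-level sample (illustrative schedule).

    Policy-generated samples always count fully. Random-exploration samples are
    ignored during warm-up, then ramp linearly up to full weight.
    """
    if origin == "policy":
        return 1.0
    progress = (epoch - warmup_epochs) / ramp_epochs
    return min(max(progress, 0.0), 1.0)


def weighted_loss(losses, weights):
    """Weighted mean of per-sample losses; returns 0.0 if all weights are zero."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / total


batch = [(0.8, "policy"), (2.0, "random"), (0.4, "policy")]
early = weighted_loss([l for l, _ in batch], [tsus_weight(0, o) for _, o in batch])
late = weighted_loss([l for l, _ in batch], [tsus_weight(30, o) for _, o in batch])
```

Early in training the random sample is excluded from the high-level loss, while late in training all three samples contribute equally.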
7. Generalization, Limitations, and Future Perspectives
HCLM demonstrates strong generalization to unseen clutter densities and longer manipulation horizons. Its architecture supports efficient option re-use and rapid inference. However, current formulations rely on frozen pick/place policies, limiting credit assignment across options. Furthermore, learning remains sensitive to exploration schedules and visual normalization heuristics. Prospective enhancements could include end-to-end hierarchical credit propagation, vision-based domain randomization, and adaptive curriculum strategies.
HCLM sets a precedent for hierarchical visual policy learning in cluttered, long-horizon manipulation, combining option-based policy structure with novel spatial and temporal update mechanisms, and achieves state-of-the-art empirical results across diverse task benchmarks (Wang et al., 2023).