CycleManip: Imitation Learning for Cyclic Tasks
- The paper introduces CycleManip, which uses a cost-aware historical perception strategy to accurately execute cycle-based tasks with exact repetitions within a deadline.
- CycleManip fuses multimodal encoding of language, visual, and proprioceptive inputs with a diffusion-policy backbone and an auxiliary cycle-stage prediction head for robust action prediction.
- Experimental results in simulation and real-world settings demonstrate significant gains over baselines, confirming its efficiency, versatility, and improved cycle execution accuracy.
CycleManip is a framework for end-to-end imitation learning that addresses cycle-based task manipulation—tasks where a robot must repeat a given motion pattern exactly $N$ times and stop within an expected terminal time $T$. Core contributions include a cost-aware historical perception strategy and multi-task learning to enhance historical understanding, a benchmark suite for cyclic manipulation tasks, and an automated evaluation protocol, all supporting diverse robotic platforms in simulation and real-world settings (Wei et al., 30 Nov 2025).
1. Problem Definition and Formalization
Cycle-based manipulation tasks require the agent to execute a specified action pattern (a "cycle") exactly $N$ times within a deadline $T$. Each demonstration trajectory comprises a tuple $(l, o_t, a_t)$, where $l$ is a natural-language instruction (e.g., "shake bottle three times"), $o_t$ is the observation at time $t$, and $a_t$ is the action.
The policy predicts actions conditioned on the instruction and observation history:

$$a_t \sim \pi_\theta(\cdot \mid l, o_{1:t})$$

Let $c(t)$ be the cumulative cycles completed at time $t$, determined by task-specific detectors. An execution is successful if:

$$c(T_e) = N \quad \text{and} \quad T_e \le T,$$

where $T_e$ is the episode length.
The objective is to maximize the probability that actions both accomplish exactly $N$ cycles and terminate before the deadline $T$. This setting exposes the failure of Markovian, short-horizon policies in distinguishing visually identical cycles, motivating CycleManip’s history-centric design.
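In code, the success predicate reduces to comparing the detector's final cycle count against the target and the deadline. A minimal sketch; the function name and list-based detector output are illustrative, not from the paper:

```python
def is_successful(cycle_counts, target_cycles, deadline):
    """Success predicate for one rollout.

    cycle_counts: per-step cumulative cycle count reported by a
    task-specific detector (cycle_counts[t] = cycles completed by step t).
    target_cycles: required number of repetitions.
    deadline: maximum allowed episode length.
    """
    episode_len = len(cycle_counts)
    final_count = cycle_counts[-1] if cycle_counts else 0
    # Success = exactly the target number of cycles, finished in time.
    return final_count == target_cycles and episode_len <= deadline
```

For example, `is_successful([0, 0, 1, 1, 2, 3], 3, 10)` holds, while a rollout that overshoots to four cycles fails.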
2. Architecture and Key Methodological Components
CycleManip is an end-to-end imitation pipeline on a diffusion policy backbone, augmented to handle non-Markovian cyclic dependencies. The framework consists of the following key modules (§3):
2.1 End-to-End Imitation with Multimodal Encoding
- Language instruction $l$ is encoded into $f_{\text{lang}}$ by a frozen CLIP text encoder.
- High-overhead sensory data (RGB frames or point clouds) are encoded into $f_{\text{high}}$ via CNN or point-cloud encoders.
- Low-overhead proprioceptive signals (pose differences, denoted $\Delta p$) are encoded by a Transformer into $f_{\text{low}}$.
- Fused features $f$ are formed by MLP-merging $f_{\text{high}}$ and $f_{\text{low}}$, then concatenated with $f_{\text{lang}}$ to condition the diffusion policy for action prediction $\hat{a}_t$.
Loss function (Eq. 7):

$$\mathcal{L} = \lambda_1 \, \lVert \hat{a}_t - a_t^{*} \rVert_2^2 + \lambda_2 \, \mathrm{CE}(\hat{s}_t, s_t),$$

where $a_t^{*}$ is the expert action, $s_t$ is the discretized cycle-stage bin, and $\lambda_1, \lambda_2$ are hyperparameters.
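A minimal NumPy sketch of this joint objective; the symbol names and default weights below are illustrative stand-ins, not the paper's exact notation or values:

```python
import numpy as np

def joint_loss(a_hat, a_star, stage_logits, stage_bin,
               lam_act=1.0, lam_stage=0.1):
    """Joint objective: MSE on the predicted action plus cross-entropy
    on the discretized cycle-stage bin (weights lam_act, lam_stage)."""
    mse = np.mean((a_hat - a_star) ** 2)
    # Numerically stable log-softmax over the stage-classification logits.
    z = stage_logits - stage_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[stage_bin]
    return lam_act * mse + lam_stage * ce
```

The cross-entropy term only shapes the shared features; at inference the stage head can be discarded and the policy queried for actions alone.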
2.2 Effective Historical Perception: Cost-Aware Sampling (§3.2)
- Low-overhead features ($f_{\text{low}}$): all historical pose-deltas $\Delta p_{1:t}$ are encoded at each time step, ensuring rich, long-horizon proprioceptive context.
- High-overhead features ($f_{\text{high}}$): sampling is performed under a fixed frame budget $K$. Two heuristics are employed:
- Right-side binary: evenly spread frames obtained via interval bisection.
- Exponential: exponentially spaced recent frames, i.e., indices $t - 2^i$ (clipped at $0$), for $i = 0, \dots, K-1$.
- This split enables the policy to retain extended context without prohibitive computational cost.
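Both heuristics can be sketched as index generators under a fixed frame budget. The "right-side binary" version below (bisect and keep recursing into the more recent half) is one plausible reading of the description, not a verified reimplementation:

```python
def right_binary_indices(t, budget):
    """Bisect [0, t] and keep recursing into the right (recent) half,
    so sampled frames get denser near the current step t."""
    idxs = [t]
    lo, hi = 0, t
    while len(idxs) < budget and hi - lo > 1:
        mid = (lo + hi) // 2
        idxs.append(mid)
        lo = mid  # continue bisecting the more recent half
    return sorted(set(idxs))

def exponential_indices(t, budget):
    """Exponentially spaced recent frames: indices t - 2**i, clipped at 0."""
    return sorted({max(t - 2 ** i, 0) for i in range(budget)})
```

For `t = 100` and a budget of 4 frames, the binary scheme yields `[50, 75, 87, 100]` and the exponential scheme `[92, 96, 98, 99]`, trading coverage of the distant past against density near the present.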
2.3 Effective Historical Understanding: Multi-task Cycle-Stage Prediction (§3.3)
- Cycle progress $s_t \in [0, 1]$ (the fraction of target cycles completed), discretized into ten bins, is predicted from fused features via an auxiliary MLP head.
- Training uses a joint MSE (on action) and cross-entropy (on cycle-stage) loss, promoting features encoding both position-in-cycle and current action.
- This structure facilitates the network's ability to distinguish "which repetition" of the cycle is underway, improving termination accuracy.
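Discretizing normalized progress into ten bins is straightforward; a sketch, with the bin convention assumed since the paper's exact binning is not reproduced here:

```python
def stage_bin(progress, num_bins=10):
    """Map normalized cycle progress in [0, 1] to a bin index 0..num_bins-1.

    progress: fraction of the target cycles completed so far,
    e.g. cycles_done / target_cycles.
    """
    clipped = min(max(progress, 0.0), 1.0)
    # progress == 1.0 falls into the last bin rather than an overflow bin.
    return min(int(clipped * num_bins), num_bins - 1)
```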
2.4 Training Loop (High-level)
For each demonstration trajectory, at each step $t$:
- Sample the low- and high-overhead history via the cost-aware strategy.
- Encode features as above and predict $\hat{a}_t$ (action) and $\hat{s}_t$ (cycle stage).
- Compute total loss and backpropagate.
3. Benchmark Environments and Evaluation Protocol
CycleManip includes a dedicated cyclical manipulation benchmark (§4):
3.1 Simulated Environments
Eight tasks are built on RoboTwin 2.0:
| Task | Description | Cycles Supported |
|---|---|---|
| Block hammering | Contact manipulation | 1–8 |
| Bottle shaking | Non-contact, cyclic | 1–8 |
| Roller rolling | Rotational cycles | 1–8 |
| Carrot cutting | Sequential slicing | 1–8 |
| Dual-knife chopping | Bimanual, cyclic | 1–8 |
| Egg beating | Stirring/rotation | 1–8 |
| Chemical mixing | Agitated mixing | 1–8 |
| Morse tapping | Rhythmic task | 1–8 |
Each has 200 expert demonstrations annotated with the target cycle count $N$, the running count $c(t)$, and object pose.
3.2 Evaluation Metrics
Task-specific detectors count cycles:
- Contact tasks: State-machine detects collision events.
- Non-contact: Peak-detection on pose trajectories.
Metrics:
- Success Rate (Suc): fraction of rollouts with $c(T_e) = N$ and $T_e \le T$.
- Cycle Count Deviation (Cyc): mean $|c(T_e) - N|$.
Simulated tasks use 100 evaluation trials; real-world tasks use 16.
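The evaluation pipeline can be sketched end-to-end: a toy peak-detection counter standing in for the paper's task-specific detectors, plus Suc/Cyc aggregation. Function names and the strict-local-maximum rule are illustrative:

```python
def count_cycles_by_peaks(signal, threshold=0.0):
    """Count strict local maxima above `threshold` in a 1-D pose
    trajectory (e.g. the height of a shaken bottle): one peak = one cycle."""
    peaks = 0
    for i in range(1, len(signal) - 1):
        if signal[i] > threshold and signal[i - 1] < signal[i] > signal[i + 1]:
            peaks += 1
    return peaks

def evaluate(rollouts, target_cycles, deadline):
    """Suc = fraction of rollouts with the exact cycle count in time;
    Cyc = mean absolute cycle-count deviation.

    rollouts: list of (final_cycle_count, episode_length) pairs.
    """
    suc = sum(1 for c, t_e in rollouts
              if c == target_cycles and t_e <= deadline) / len(rollouts)
    cyc = sum(abs(c - target_cycles) for c, _ in rollouts) / len(rollouts)
    return suc, cyc
```

Note that Cyc penalizes over- and under-shooting symmetrically, so a policy that always overshoots by one scores the same deviation as one that always stops one cycle early.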
4. Experimental Results
4.1 Simulation (§5.3, Tab. 1)
- Compared to state-of-the-art imitation baselines (DP, DP3, RDT, among others), CycleManip achieves 86–97% success rates versus 35% for alternatives.
- Mean cycle-count deviation is 0.81, compared to 1.5–8.3 for baselines.
- These results confirm that short-horizon policies lose track of cycle progress due to partial observability (§5, Fig. 2a).
4.2 Real-World Performance (§5.3, Tab. 2)
- On block hammering, bottle shaking, drum beating, tire pumping, knife cutting, and table cleaning across AgileX Piper grippers, BrainCO Revo2 hands, and Unitree G1 humanoid:
- CycleManip: Success 50–100%, Cyc. deviation 0–1.5.
- DP3 baseline: Success 0–37.5%, Cyc. deviation 0.9–3.8.
- Ablations show both cost-aware sampling and multi-task components contribute substantially to performance.
4.3 Generalization (§5.4, Tab. 3)
- On four standard (non-cyclic) tasks—place-cans, handover, bottle-picking, stamp-seal—CycleManip outperforms all baselines (e.g., 91% vs 48% on place-cans).
- This demonstrates that the history-aware feature architecture confers advantages beyond cyclic domains.
4.4 Plug-and-play with VLA Models (§5.6, Tab. 4)
- Integrating only the low-overhead history Transformer from CycleManip into a VLA model increases cyclic-task performance from 1–27% to 41–72%.
- This modularity indicates retrofitting is possible with minimal architectural changes.
4.5 Efficiency (§5.7, Tab. 5)
- Overhead for carrot-cutting is modest: training time rises from 0.073 s to 0.102 s per step, with a small additional GPU-memory cost, for a success-rate increase from 38% to 86%.
5. Principles and Insights
- Non-Markovianity: Cyclic imitation tasks undermine standard Markovian/horizon-limited policies because identical observations at each cycle prevent accurate progress tracking (§5, Fig. 2a).
- Cost-aware history: Combining dense low-overhead sampling with sparse, informative high-overhead sampling allows efficient, rich temporal context (§3.2).
- Cycle-stage prediction: The cycle-stage auxiliary loss enforces learning of a "repetition index," improving both performance and the ability to stop at exactly $N$ cycles (§3.3, Fig. 2b).
- Versatility: CycleManip generalizes to both cyclic and standard manipulation, is compatible with VLA architectures, and supports diverse hardware configurations (§5.6).
6. Limitations and Future Directions
- The provided benchmark supports up to 8 cycles; expanding to longer, variable-length, or hierarchical cycles remains open (§6).
- Evaluation presently utilizes ground-truth pose or clean collision signals; extending evaluation to noisy, fully vision-based scenarios is an area for development.
- Reinforcement learning could be incorporated to adaptively modulate movement speed or force for further performance enhancement in dynamic settings.
- A plausible implication is that CycleManip’s modularity and low computational overhead make it suitable as a foundation for further research into long-horizon cyclic control and robotic skill composition.
7. Summary Table of CycleManip Components
| Module | Purpose | Distinctive Mechanism |
|---|---|---|
| Cost-aware sampling | History perception | Dense low-overhead, sparse high-overhead |
| Cycle-stage auxiliary head | History understanding | Multi-task CE loss, normalized cycle index |
| Diffusion-policy backbone | End-to-end imitation | Multimodal fusion, language grounding |
CycleManip constitutes a lightweight and effective approach to cyclic task imitation in robotics, offering precision history-awareness, adaptability to heterogeneous platforms, and practical utility across simulation and real-world deployments (Wei et al., 30 Nov 2025).