CycleManip: Imitation Learning for Cyclic Tasks

Updated 7 December 2025
  • The paper introduces CycleManip, which uses a cost-aware historical perception strategy to accurately execute cycle-based tasks with exact repetitions within a deadline.
  • CycleManip fuses multimodal encoding of language, visual, and proprioceptive inputs with a diffusion-policy backbone and auxiliary cycle-stage prediction for robust action recognition.
  • Experimental results in simulation and real-world settings demonstrate significant gains over baselines, confirming its efficiency, versatility, and improved cycle execution accuracy.

CycleManip is a framework for end-to-end imitation learning that addresses cycle-based manipulation: tasks where a robot must repeat a given motion pattern exactly $N$ times and stop within an expected terminal time. Core contributions include a cost-aware historical perception strategy and multi-task learning to enhance historical understanding, a benchmark suite for cyclic manipulation tasks, and an automated evaluation protocol, all supporting diverse robotic platforms in simulation and real-world settings (Wei et al., 30 Nov 2025).

1. Problem Definition and Formalization

Cycle-based manipulation tasks require the agent to execute a specified action pattern (a "cycle") exactly $N$ times within a deadline $T_{\max}$. Each demonstration trajectory comprises a tuple $\{\mathtt{lan}, (o_1, a_1), (o_2, a_2), \dots, (o_T, a_T)\}$, where $\mathtt{lan} \in \mathcal{L}$ is a natural language instruction (e.g., "shake bottle three times"), $o_t \in \mathcal{O}$ is the observation at time $t$, and $a_t \in \mathcal{A}$ is the action.

The policy $\pi$ predicts:

$$a_t = \pi\left(\mathtt{lan},\ \{o_i\}_{i=1}^{t}\right).$$

Let $C(t)$ be the cumulative number of cycles completed at time $t$, determined by task-specific detectors. An execution is successful if:

$$C(T) = N \quad \text{and} \quad T \leq T_{\max},$$

where $T$ is the episode length.

The objective is to maximize the probability that the policy's actions both accomplish exactly $N$ cycles and terminate before the deadline. This setting exposes the failure of Markovian, short-horizon policies to distinguish visually identical cycles, motivating CycleManip's history-centric design.
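Concretely, the success criterion can be read as a simple per-episode check (names here are illustrative, not from the paper):

```python
def episode_success(cycle_count: int, n_target: int,
                    episode_len: int, t_max: int) -> bool:
    # Success: exactly N cycles completed (C(T) == N) and the
    # episode terminates no later than the deadline (T <= T_max).
    return cycle_count == n_target and episode_len <= t_max
```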

2. Architecture and Key Methodological Components

CycleManip is an end-to-end imitation pipeline built on a diffusion-policy backbone, augmented to handle non-Markovian cyclic dependencies. The framework comprises the following key modules (§3):

2.1 End-to-End Imitation with Multimodal Encoding

  • Language ($\mathtt{lan}$) is encoded by a frozen CLIP text encoder into $f_{\mathtt{lan}}$.
  • High-overhead sensory data (RGB frames or point clouds) are encoded into $f_h$ via CNN or point-cloud encoders.
  • Low-overhead proprioceptive signals (pose differences, denoted $o^l$) are encoded by a Transformer into $f_l$.
  • Fused features $f_{lh}$ are formed by MLP-merging $f_h$ and $f_l$, then concatenated with $f_{\mathtt{lan}}$ to condition the diffusion policy for action prediction ($a_t$).

Loss function (Eq. 7):

$$\mathcal{L} = \alpha\,\mathrm{MSE}(a_t, a_t^*) + \beta\,\mathrm{CE}(y_t, y_t^*),$$

where $a_t^*$ is the expert action, $y_t^*$ is the discretized cycle-stage bin, and $\alpha = 1$, $\beta = 0.1$ are hyperparameters.
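A minimal PyTorch sketch of this joint objective (function and tensor names are illustrative assumptions; the diffusion denoising target is abstracted here as the predicted action):

```python
import torch
import torch.nn.functional as F

def cyclemanip_loss(a_pred: torch.Tensor, a_expert: torch.Tensor,
                    stage_logits: torch.Tensor, stage_target: torch.Tensor,
                    alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    # Eq. 7: weighted sum of action regression and cycle-stage classification.
    action_loss = F.mse_loss(a_pred, a_expert)                 # MSE(a_t, a_t*)
    stage_loss = F.cross_entropy(stage_logits, stage_target)   # CE(y_t, y_t*)
    return alpha * action_loss + beta * stage_loss
```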

2.2 Effective Historical Perception: Cost-Aware Sampling (§3.2)

  • Low-overhead features ($o^l$): all historical pose deltas are encoded at every time step ($\mathcal{H}_l = \{o_i^l\}_{i=1}^t$), providing rich, long-horizon proprioceptive context.
  • High-overhead features ($o^h$): sampling is performed under a fixed budget $K_{\mathrm{high}}$. Two heuristics are employed (see the sketch after this list):
    • Right-side binary: $K_{\mathrm{high}}/2$ evenly spread frames via interval bisection.
    • Exponential: $K_{\mathrm{high}}/2$ exponentially recent frames, i.e., $t - 2^k$ (clipped at $\geq 0$) for $k = 0, \ldots, K_{\mathrm{high}}/2 - 1$.
  • This split enables the policy to retain extended context without prohibitive computational cost.
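A minimal sketch of the budgeted frame selection; the exact bisection schedule is an assumption, only the even-coverage intent and the exponential recency rule are taken from the description above:

```python
def sample_high_overhead_indices(t: int, k_high: int) -> list[int]:
    """Pick at most k_high frame indices from history [0, t]: half spread
    evenly over the full history (right-side binary), half exponentially
    recent at t - 2^k, clipped at 0."""
    half = k_high // 2
    # Evenly spaced indices over the full history [0, t].
    spread = [round(i * t / max(half - 1, 1)) for i in range(half)]
    # Exponentially recent frames: t-1, t-2, t-4, ...
    recent = [max(t - 2 ** k, 0) for k in range(half)]
    return sorted(set(spread + recent))
```

For example, with `t = 100` and `k_high = 8` this yields {0, 33, 67, 100} from the even spread and {92, 96, 98, 99} from the exponential schedule.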

2.3 Effective Historical Understanding: Multi-task Cycle-Stage Prediction (§3.3)

  • Cycle progress $b_t = t / T_{\max}$, discretized into ten bins $y_t \in \{1, \ldots, 10\}$, is predicted from the fused features via an auxiliary MLP head (a binning sketch follows this list).
  • Training uses a joint MSE (on action) and cross-entropy (on cycle stage) loss, promoting features that encode both position-in-cycle and current action.
  • This structure helps the network distinguish which repetition of the cycle is underway, improving termination accuracy.
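A minimal sketch of the stage binning (the bin-edge convention is an assumption):

```python
def cycle_stage_bin(t: int, t_max: int, num_bins: int = 10) -> int:
    """Discretize cycle progress b_t = t / T_max into a label in
    {1, ..., num_bins}; the clamp places b_t == 1.0 in the last bin."""
    b_t = t / t_max
    return min(int(b_t * num_bins) + 1, num_bins)
```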

2.4 Training Loop (High-level)

For each demonstration trajectory, at each step $t$ (a runnable sketch follows the list):

  1. Sample $\mathcal{H}_h$ and $\mathcal{H}_l$ via the cost-aware strategy.
  2. Encode features as above and predict $a_t$ (action) and $\hat{y}_t$ (cycle stage).
  3. Compute the total loss and backpropagate.
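A sketch of one such step, reusing the loss, binning, and sampling sketches above; `policy`, `traj`, and all attribute names are hypothetical, not the authors' API:

```python
import torch

def training_step(policy, optimizer, traj, t: int, k_high: int) -> float:
    # 1. Cost-aware history: all low-overhead pose deltas plus a
    #    fixed budget of high-overhead frames.
    h_low = traj.pose_deltas[: t + 1]
    h_high = [traj.frames[i] for i in sample_high_overhead_indices(t, k_high)]

    # 2. Multimodal encoding and joint prediction of action and stage logits.
    a_pred, stage_logits = policy(traj.language, h_high, h_low)

    # 3. Joint loss (Eq. 7) and backpropagation.
    stage_target = torch.tensor([cycle_stage_bin(t, traj.t_max) - 1])  # 0-indexed
    loss = cyclemanip_loss(a_pred, traj.actions[t], stage_logits, stage_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```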

3. Benchmark Environments and Evaluation Protocol

CycleManip includes a dedicated cyclical manipulation benchmark (§4):

3.1 Simulated Environments

Eight tasks are built on RoboTwin 2.0:

| Task | Description | Cycles Supported |
|---|---|---|
| Block hammering | Contact manipulation | 1–8 |
| Bottle shaking | Non-contact, cyclic | 1–8 |
| Roller rolling | Rotational cycles | 1–8 |
| Carrot cutting | Sequential slicing | 1–8 |
| Dual-knife chopping | Bimanual, cyclic | 1–8 |
| Egg beating | Stirring/rotation | 1–8 |
| Chemical mixing | Agitated mixing | 1–8 |
| Morse tapping | Rhythmic task | 1–8 |

Each task has 200 expert demonstrations annotated with the target cycle count $N$, the running count $C(t)$, and object poses.

3.2 Evaluation Metrics

Task-specific detectors count cycles:

  • Contact tasks: a state machine detects collision events.
  • Non-contact tasks: peak detection on pose trajectories.

Metrics:

  • Success Rate (Suc): Fraction of rollouts with $C(T) = N$.
  • Cycle Count Deviation (Cyc): Mean $|C(T) - N|$.

Simulated tasks use 100 evaluation trials; real-world tasks use 16.
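As an illustration, both metrics can be computed from per-rollout cycle counts as below; the peak-detection counter for non-contact tasks is sketched with SciPy, and the prominence threshold is an illustrative assumption:

```python
import numpy as np
from scipy.signal import find_peaks

def count_cycles_noncontact(pose_traj: np.ndarray, prominence: float = 0.01) -> int:
    # Each prominent peak in a 1-D pose coordinate (e.g., end-effector
    # height) counts as one completed cycle.
    peaks, _ = find_peaks(pose_traj, prominence=prominence)
    return len(peaks)

def suc_and_cyc(cycle_counts: np.ndarray, n_target: int) -> tuple[float, float]:
    # Suc: fraction of rollouts with C(T) == N; Cyc: mean |C(T) - N|.
    suc = float(np.mean(cycle_counts == n_target))
    cyc = float(np.mean(np.abs(cycle_counts - n_target)))
    return suc, cyc
```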

4. Experimental Results

4.1 Simulation (§5.3, Tab. 1)

  • Compared to state-of-the-art imitation baselines (DP, DP3, RDT, $\pi_0$), CycleManip achieves 86–97% success rates versus <35% for the alternatives.
  • Cycle count deviations are <0.81, compared to 1.5–8.3 for the baselines.
  • These results confirm that short-horizon policies lose track of cycle progress due to partial observability (§5, Fig. 2a).

4.2 Real-World Performance (§5.3, Tab. 2)

  • On block hammering, bottle shaking, drum beating, tire pumping, knife cutting, and table cleaning across AgileX Piper grippers, BrainCO Revo2 hands, and a Unitree G1 humanoid:
    • CycleManip: Success 50–100%, Cyc. deviation 0–1.5.
    • DP3 baseline: Success 0–37.5%, Cyc. deviation 0.9–3.8.
  • Ablations show both cost-aware sampling and multi-task components contribute substantially to performance.

4.3 Generalization (§5.4, Tab. 3)

  • On four standard (non-cyclic) tasks—place-cans, handover, bottle-picking, stamp-seal—CycleManip outperforms all baselines (e.g., 91% vs 48% on place-cans).
  • This demonstrates that the history-aware feature architecture confers advantages beyond cyclic domains.

4.4 Plug-and-play with VLA Models (§5.6, Tab. 4)

  • Integrating only the low-overhead history Transformer from CycleManip into $\pi_0$ increases cyclic task performance from 1–27% to 41–72%.
  • This modularity indicates retrofitting is possible with minimal architectural changes.

4.5 Efficiency (§5.7, Tab. 5)

  • Overhead for carrot cutting is modest: +0.03 s per training step (0.073 → 0.102 s) and +0.006 GB of GPU memory, for a success-rate increase from 38% to 86%.

5. Principles and Insights

  • Non-Markovianity: Cyclic imitation tasks undermine standard Markovian/horizon-limited policies because identical observations at each cycle prevent accurate progress tracking (§5, Fig. 2a).
  • Cost-aware history: Combining dense low-overhead sampling with sparse, informative high-overhead sampling allows efficient, rich temporal context (§3.2).
  • Cycle-stage prediction: The cycle-stage auxiliary loss enforces learning of the "repetition index," improving both performance and the ability to stop at exactly $N$ cycles (§3.3, Fig. 2b).
  • Versatility: CycleManip generalizes to both cyclic and standard manipulation, is compatible with VLA architectures, and supports diverse hardware configurations (§5.6).

6. Limitations and Future Directions

  • The provided benchmark supports up to 8 cycles; expanding to longer, variable-length, or hierarchical cycles remains open (§6).
  • Evaluation presently utilizes ground-truth pose or clean collision signals; extending evaluation to noisy, fully vision-based scenarios is an area for development.
  • Reinforcement learning could be incorporated to adaptively modulate movement speed or force for further performance enhancement in dynamic settings.
  • A plausible implication is that CycleManip’s modularity and low computational overhead make it suitable as a foundation for further research into long-horizon cyclic control and robotic skill composition.

7. Summary Table of CycleManip Components

| Module | Purpose | Distinctive Mechanism |
|---|---|---|
| Cost-aware sampling | History perception | Dense low-overhead, sparse high-overhead |
| Cycle-stage auxiliary head | History understanding | Multi-task CE loss, normalized cycle index |
| Diffusion-policy backbone | End-to-end imitation | Multimodal fusion, language grounding |

CycleManip constitutes a lightweight and effective approach to cyclic task imitation in robotics, offering precise history awareness, adaptability to heterogeneous platforms, and practical utility across simulation and real-world deployments (Wei et al., 30 Nov 2025).
