CATCH-FORM-ACTer: Hybrid Robotic Manipulation

Updated 16 December 2025

CATCH-FORM-ACTer is a hybrid framework that combines tactile compliance control with transformer-driven action chunking for precise viscoelastic object manipulation.
It employs a two-layer control system with an admittance outer-loop and a PDE-stabilized inner-loop to achieve sub-millimeter deformation tracking and robust force regulation.
The system dynamically modulates mechanical and regulatory parameters in real time, demonstrating superior performance in industrial, medical, and household applications.

CATCH-FORM-ACTer is a hybrid control and learning framework for contact-rich, viscoelastic object manipulation with rigid robots. It integrates compliance-aware tactile control with a hybrid deformation regulation system, enhanced by an Action Chunking with Transformer (ACT) architecture that enables dynamic, real-time adaptation of mechanical and regulatory parameters during manipulation. CATCH-FORM-ACTer achieves robust performance in industrial, medical, and household manipulation tasks, leveraging rich multimodal sensory fields to bridge the gap between human-level dexterity and traditional robotic systems (Ma et al., 11 Apr 2025).

1. Compliance-Aware Tactile and Deformation Control

CATCH-FORM-ACTer regulates end-effector deformation $x(t) \in \mathbb{R}^3$ using a two-layer hybrid controller:

1.1. Admittance Outer-Loop Dynamics

The outer loop enforces compliant end-effector motion in response to measured contact force $F_c(t) \in \mathbb{R}^3$ by admittance control:

$M \ddot{x}(t) + B \dot{x}(t) + K ( x(t) - x_{\mathrm{ref}}(t) ) = F_c(t)$

with virtual inertia $M$, damping $B$, and stiffness $K$, all typically set diagonal; $x_{\mathrm{ref}}(t)$ is the setpoint from the high-level policy. The controller tracks a desired deformation $\delta(t) = x(t) - x_{\mathrm{ref}}(t)$, ensuring that contact force deviations induce compliant, low-stress responses.
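The admittance law above can be integrated numerically at the control rate. The following is a minimal sketch, assuming diagonal gains stored as vectors, a semi-implicit Euler integrator, and illustrative gain values that are not taken from the source; the function name `admittance_step` is hypothetical.

```python
import numpy as np

def admittance_step(x, v, x_ref, f_c, M, B, K, dt):
    """One semi-implicit Euler step of M*a + B*v + K*(x - x_ref) = f_c.

    M, B, K hold the diagonals of the virtual inertia, damping and
    stiffness matrices, so every operation is element-wise per axis.
    """
    a = (f_c - B * v - K * (x - x_ref)) / M
    v_next = v + dt * a
    x_next = x + dt * v_next
    return x_next, v_next

# Illustrative gains only (assumed, not from the source).
M = np.array([1.0, 1.0, 1.0])        # kg
B = np.array([40.0, 40.0, 40.0])     # N s/m
K = np.array([200.0, 200.0, 200.0])  # N/m
dt = 0.01                            # 100 Hz control period

x = np.zeros(3)
v = np.zeros(3)
x_ref = np.zeros(3)
f_c = np.array([0.0, 0.0, 2.0])      # constant 2 N contact force along z

for _ in range(500):                 # 5 s of simulated time
    x, v = admittance_step(x, v, x_ref, f_c, M, B, K, dt)

delta = x - x_ref                    # settles near f_c / K
```

Under a constant contact force the deformation settles at $\delta = K^{-1} F_c$, which is 1 cm along z for the values assumed here; a lower stiffness yields a softer, larger deflection for the same force.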

1.2. PDE-Stabilized Inner-Loop Hybrid Deformation Regulation

The inner loop guarantees sub-millimeter tracking of the viscoelastic deformation field $\Phi(x, y, z, t)$. It implements a PDE-based reaction-diffusion boundary control law over the object domain $\Omega \subset \mathbb{R}^3$:

$\rho\,\frac{\partial^2 u}{\partial t^2} = \nabla \cdot [E\,\nabla u + \eta\,\nabla(\partial u/\partial t)] + \ddot{u}_{\mathrm{command}}$

where $u$ is the deformation, $\rho$ the density, $E$ the elastic modulus, and $\eta$ the viscosity. Boundary conditions enforce $u(t)|_\Gamma = u_{\mathrm{ref}}(t)$ and zero-flux Neumann conditions. Lyapunov analysis confirms global convergence ($u \to u_{\mathrm{ref}}$) and damping of oscillations.
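A standard energy argument illustrates why the viscous term damps oscillations. The sketch below rests on simplifying assumptions that are not stated in the source (constant $u_{\mathrm{ref}}$, constant $E > 0$ and $\eta > 0$, and $\ddot{u}_{\mathrm{command}} = 0$), and it is not necessarily the analysis used by the authors. With the tracking error $e = u - u_{\mathrm{ref}}$, which then satisfies the same PDE with homogeneous boundary conditions, consider the candidate Lyapunov functional

$V(t) = \frac{1}{2} \int_\Omega \left( \rho \left(\frac{\partial e}{\partial t}\right)^2 + E\,|\nabla e|^2 \right) d\Omega$

Differentiating along solutions and integrating by parts, the boundary term vanishes because at each boundary point either $e = 0$ (so $\partial e/\partial t = 0$) or the flux is zero, leaving

$\dot{V}(t) = -\eta \int_\Omega \left| \nabla \frac{\partial e}{\partial t} \right|^2 d\Omega \le 0$

so the stored kinetic and elastic energy never increases and is dissipated whenever the strain rate is non-zero.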

2. Action Chunking with Transformer: Architecture and Training

CATCH-FORM-ACTer extends prior CATCH-FORM-3D work by introducing an ACT (Action Chunking Transformer) module that governs both reference trajectories and compliance gain schedules.

2.1. Input Modalities and Encoding

At each timestep, observations include: RGB-D images (two palm, one global), discretized contact force fields $f_t \in \mathbb{R}^{12 \times 10 \times 3}$, surface deformation fields $\Phi_t \in \mathbb{R}^{12 \times 10 \times 3}$, and proprioceptive states (end-effector pose $X_t \in \mathbb{R}^6$ and hand joints $h_t \in \mathbb{R}^{13}$). A CVAE encodes these into $z \sim \mathcal{N}(\mu, \Sigma)$; $z$ is used only during training.

2.2. Transformer Decoder (ACT)

The ACT module consists of 6-layer, 8-head, $d_{\mathrm{model}}=256$ transformers with sinusoidal positional encoding tailored to 0.1 s action chunks (10 steps at 100 Hz). The causal attention mask enforces strict auto-regressive prediction over action chunks.
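The two ingredients named above, sinusoidal positions and a causal mask over a 10-step chunk, can be sketched in NumPy. This assumes the standard sinusoidal formulation with base 10000; how the source tailors the encoding to 0.1 s chunks is not specified, and the helper names are hypothetical.

```python
import numpy as np

def sinusoidal_positions(n_steps, d_model):
    """Standard sinusoidal positional encoding of shape (n_steps, d_model)."""
    pos = np.arange(n_steps)[:, None]            # (n_steps, 1)
    even_idx = np.arange(0, d_model, 2)[None, :] # (1, d_model / 2)
    angles = pos / np.power(10000.0, even_idx / d_model)
    pe = np.zeros((n_steps, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even channels
    pe[:, 1::2] = np.cos(angles)                 # odd channels
    return pe

def causal_mask(n_steps):
    """Boolean mask, True where a query step may attend to a key step."""
    return np.tril(np.ones((n_steps, n_steps), dtype=bool))

CHUNK_STEPS = 10   # 0.1 s at 100 Hz
D_MODEL = 256

pe = sinusoidal_positions(CHUNK_STEPS, D_MODEL)
mask = causal_mask(CHUNK_STEPS)
```

Row `t` of `mask` is true only for columns `0..t`, so step `t` of a chunk cannot attend to later steps.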

Each action embedding (per arm, per timestep) is 22-dimensional (6D pose, 13 joint positions, 3 compliance parameters). For bimanual manipulation, $N=44$ action dimensions.
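The dimension count (6 + 13 + 3 = 22 per arm, 44 for two arms) can be made concrete by slicing a flat action vector. The ordering used here (pose, then joints, then compliance parameters, with the two arms concatenated) is an assumption for illustration; the source gives only the sizes. `split_action` is a hypothetical helper.

```python
import numpy as np

POSE_DIM, JOINT_DIM, GAIN_DIM = 6, 13, 3
ARM_DIM = POSE_DIM + JOINT_DIM + GAIN_DIM   # 22 per arm

def split_action(action, n_arms=2):
    """Split a flat action vector into per-arm pose, joints and gains."""
    action = np.asarray(action)
    if action.shape != (n_arms * ARM_DIM,):
        raise ValueError(f"expected {n_arms * ARM_DIM} values, got {action.shape}")
    arms = []
    for k in range(n_arms):
        a = action[k * ARM_DIM:(k + 1) * ARM_DIM]
        arms.append({
            "pose": a[:POSE_DIM],                          # 6D end-effector pose
            "joints": a[POSE_DIM:POSE_DIM + JOINT_DIM],    # 13 hand joints
            "gains": a[POSE_DIM + JOINT_DIM:],             # 3 compliance parameters
        })
    return arms

parts = split_action(np.arange(44.0))
```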

2.3. Training Regime

The model is trained by learning from demonstration (LfD):

20–30 human demonstrations per task, teleoperated via motion capture.
Data augmentation includes spatial jitter, object rotation ($\pm 15^\circ$), and contrast modulation.
Loss: $L_\text{total} = L_\text{recon} + \beta D_{KL}\left[q(z|o)\,\|\,\mathcal{N}(0,I)\right]$, $L_\text{recon} = \|\text{a}_{\text{pred}} - \text{a}_{\text{demo}}\|_2^2$, $\beta = 0.1$.
Hyperparameters: batch size 64, learning rate $10^{-4}$ (cosine decay, 80k epochs), Adam optimizer.
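The loss in the list above can be written out for a single sample. This is a sketch assuming a diagonal-covariance Gaussian posterior parameterized by `mu` and `log_var`, and a latent size of 32 chosen only for illustration; neither detail is given in the source.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def total_loss(a_pred, a_demo, mu, log_var, beta=0.1):
    """Squared-L2 reconstruction error plus beta-weighted KL term."""
    recon = np.sum((a_pred - a_demo) ** 2)
    return recon + beta * kl_to_standard_normal(mu, log_var)

# One 10-step chunk of 44-dimensional bimanual actions.
a_demo = np.zeros((10, 44))
a_pred = np.full((10, 44), 0.1)
mu = np.zeros(32)        # assumed latent size
log_var = np.zeros(32)   # unit variance, so the KL term is zero

loss = total_loss(a_pred, a_demo, mu, log_var)
```

With $\beta = 0.1$ the KL term acts as a light regularizer, so the reconstruction error dominates unless the posterior drifts far from the standard normal prior.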

3. Real-Time Dynamic Parameter Modulation

CATCH-FORM-ACTer uniquely outputs compliance gains and PDE regularization parameters as part of each action vector, allowing the manipulation policy to shape physical interaction properties online.

3.1. Network Output Squashing

Transformer outputs $o_\lambda \in \mathbb{R}^3$ are mapped into physical gain ranges by:

$\begin{aligned} \lambda_1(t) &= \lambda_1^{\mathrm{min}} + (\lambda_1^{\mathrm{max}}-\lambda_1^{\mathrm{min}})\cdot\sigma(o_{\lambda,1}) \\ \lambda_2(t) &= \lambda_2^{\mathrm{min}} + (\lambda_2^{\mathrm{max}}-\lambda_2^{\mathrm{min}})\cdot\sigma(o_{\lambda,2}) \\ \epsilon(t) &= \epsilon^{\mathrm{min}} + (\epsilon^{\mathrm{max}}-\epsilon^{\mathrm{min}})\cdot\sigma(o_{\lambda,3}) \end{aligned}$

Respective physical ranges: $\lambda_1 \in [50,500]$ N/m, $\lambda_2 \in [0.1,5.0]$ Ns/m, $\epsilon \in [0.01,0.1]$ m$^2$/s. Gains are applied to the outer/inner loops at 10 Hz.
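The squashing step maps unbounded network outputs into the stated ranges with a logistic sigmoid. A minimal sketch, using the ranges quoted above and a hypothetical function name `squash_gains`:

```python
import numpy as np

# Order: lambda_1 (N/m), lambda_2 (N s/m), epsilon (m^2/s).
GAIN_MIN = np.array([50.0, 0.1, 0.01])
GAIN_MAX = np.array([500.0, 5.0, 0.1])

def squash_gains(o_lambda):
    """Map raw outputs o_lambda (3 values) into the physical gain ranges."""
    o = np.asarray(o_lambda, dtype=float)
    sig = 1.0 / (1.0 + np.exp(-o))               # logistic sigmoid in (0, 1)
    return GAIN_MIN + (GAIN_MAX - GAIN_MIN) * sig

mid = squash_gains([0.0, 0.0, 0.0])      # sigmoid(0) = 0.5, so range midpoints
edge = squash_gains([50.0, -50.0, 0.0])  # saturates toward max and min
```

A zero output lands at the midpoint of each range (275 N/m, 2.55 Ns/m, 0.055 m$^2$/s), and because the sigmoid is bounded the policy cannot command a gain outside the allowed interval.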

3.2. Algorithmic Execution

The real-time workflow alternates between sensing (RGB-D, tactile, proprioception), inference (CVAE encoding, transformer decoding), action execution, and compliance update; control loops operate at 100 Hz. Pseudocode and detailed stepwise routines are explicitly provided in (Ma et al., 11 Apr 2025).
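The interplay between the 100 Hz control loop and the 10 Hz gain update can be sketched as a two-rate loop. This is not the pseudocode from the cited paper; it assumes one inference and one gain update per 10-tick chunk and one setpoint executed per control tick. All callables (`sense`, `infer_chunk`, `apply_gains`, `execute`) are hypothetical stand-ins.

```python
CONTROL_HZ = 100
POLICY_HZ = 10
STEPS_PER_CHUNK = CONTROL_HZ // POLICY_HZ   # 10 control ticks per chunk

def run_loop(n_ticks, sense, infer_chunk, apply_gains, execute):
    """Run n_ticks control ticks; re-plan and update gains once per chunk."""
    setpoints = None
    n_inferences = 0
    for tick in range(n_ticks):
        obs = sense()                         # RGB-D, tactile, proprioception
        step = tick % STEPS_PER_CHUNK
        if step == 0:                         # 10 Hz: new chunk and new gains
            setpoints, gains = infer_chunk(obs)
            apply_gains(gains)
            n_inferences += 1
        execute(setpoints[step], obs)         # 100 Hz: track current setpoint
    return n_inferences

# Stub components that only record what they are asked to do.
log = {"gains": [], "setpoints": []}

def sense():
    return {"force": 0.0}

def infer_chunk(obs):
    return [0.0] * STEPS_PER_CHUNK, (275.0, 2.55, 0.055)

def apply_gains(gains):
    log["gains"].append(gains)

def execute(setpoint, obs):
    log["setpoints"].append(setpoint)

n_inferences = run_loop(100, sense, infer_chunk, apply_gains, execute)
```

One simulated second therefore triggers 10 inferences and gain updates against 100 executed setpoints.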

4. Experimental Benchmarking

Validation is performed using two 6-DOF Realman RM65-B arms (PaXini DexH13 hands), a 12×10 tactile sensor grid at 200 Hz, and three RGB-D cameras. Three canonical tasks are evaluated: bimanual cylinder insertion (2 mm clearance), single-arm peg insertion ($\pm 15^\circ$ randomization), and wiping (random mark, $\pm 5$ cm).

Success Rates Across Methods

Method	Pick→Insert	Wiping	Bimanual	Dynamic compliance
ACT	40 %	50 %	40 %	–
Comp-ACT	65 %	70 %	70 %	✓
CATCH-FORM-ACTer	85 %	90 %	80 %	✓
Ours* (no fields)	65 %	75 %	45 %	–

✓: dynamic compliance adaptation.
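To compare methods at a glance, the per-task success rates in the table can be averaged. The unweighted mean across the three tasks is a convenience summary introduced here, not a metric reported by the source.

```python
# Success rates (%) per method: (Pick→Insert, Wiping, Bimanual).
success = {
    "ACT": (40, 50, 40),
    "Comp-ACT": (65, 70, 70),
    "CATCH-FORM-ACTer": (85, 90, 80),
    "Ours* (no fields)": (65, 75, 45),
}

# Unweighted mean over the three tasks.
mean_success = {m: sum(r) / len(r) for m, r in success.items()}

# Percentage-point differences relative to the full method.
gain_over_act = mean_success["CATCH-FORM-ACTer"] - mean_success["ACT"]
ablation_drop = mean_success["CATCH-FORM-ACTer"] - mean_success["Ours* (no fields)"]
```

The full method averages 85 %, about 42 points above plain ACT, and removing the force and deformation fields costs about 23 points, most of it on the bimanual task (80 % versus 45 %).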

Force-field visualizations indicate that successful manipulations correspond to uniform, concentric patterns across the tactile grid; failures manifest as asymmetric force spikes that are undetectable by standard wrist F/T alone.

5. Phase-Aware Adaptation and Policy Structure

CATCH-FORM-ACTer's transformer-based policy modulates compliance parameters (stiffness, damping, diffusion) in a phase-aware manner: it adjusts mechanical regulation as task segments (approach, contact, insertion) evolve, without explicit pre-segmentation or manual tuning.

This allows adaptation to object and scenario heterogeneity (e.g., variable material properties in viscoelastic bodies), and enables the system to robustly avoid parameter misconfiguration that would otherwise lead to unstable oscillatory contacts or excessive deformation. This suggests improved model-agnostic generalization compared to policy architectures that do not integrate compliance tuning (Ma et al., 11 Apr 2025).

6. Limitations and Future Directions

While CATCH-FORM-ACTer outperforms the ACT and Comp-ACT baselines on all three benchmark tasks and adapts its compliance online, several constraints remain:

Dependence on high-resolution tactile sensors and multi-view vision systems increases calibration and hardware cost.
Transformer inference at 10 Hz may not sustain very high-speed manipulations.
The PDE-based deformation controller assumes known continuum model parameters (density, modulus, viscosity), limiting direct application to objects with unknown or highly time-varying properties.

A plausible implication is that integration of onboard or external real-time parameter identification or sensor fusion could further increase robustness. The current architecture allows substitution of centralized state estimation (e.g., via MoCap or teleop) with future onboard vision or distributed sensor networks.

7. Applications and Extensions

CATCH-FORM-ACTer applies to:

Industrial assembly: precise peg-in-hole and flexible part alignment, especially with deformable or tolerance-sensitive materials.
Medical robotics: tasks such as soft tissue palpation or controlled tool-to-tissue interaction requiring adaptive compliance.
Household tasks: safe, robust manipulation of deformable food, textiles, and cleaning surfaces.

Extensions may incorporate decentralized coordination or multi-arm collaboration and replace the central coordinator with distributed, onboard sensing and consensus for increased scalability and fault-tolerance, as suggested by analogous developments in decentralized multi-agent systems (Agrawal et al., 2016).