
Luxical Architecture for Robotic Imitation

Updated 11 December 2025
  • Luxical Architecture is a novel framework that integrates human demonstration capture, spatiotemporal alignment, and generative vision modeling to enable efficient robotic imitation learning.
  • It employs both vision-only and tactile-plus-vision pipelines to bridge the gap between human hand morphology and robotic grippers through careful sensor data synchronization and retargeting.
  • The architecture leverages diffusion-based policy learning to translate processed demonstration data into robust robot joint trajectories, achieving high success rates in contact-rich manipulation tasks.

The Glove2Robot policy pipeline refers to a family of end-to-end systems that transform raw human hand demonstrations—collected via wearable sensing or wrist-mounted camera—into robot-executable policies for dexterous manipulation. These pipelines automate the data capture, preprocessing, representation alignment, and policy learning process, enabling sample-efficient behavioral cloning and diffusion-based imitation learning without requiring human teleoperation of the robot or direct hand-held gripper data collection. Notably, these pipelines address the visual and embodiment gap between human and robot morphologies using generative vision models, spatial alignment techniques, magnetotactile hardware, and retargeting algorithms. State-of-the-art variants include vision-only pipelines (e.g., RwoR) and those leveraging tactile glove hardware (e.g., OSMO), with both demonstrating efficient skill transfer in contact-rich robotic manipulation tasks.

1. Human Demonstration Acquisition and Sensor Modalities

Glove2Robot pipelines employ distinct strategies for capturing human demonstrations, dictated by the targeted embodiment gap and application.

  • Vision-Only (RwoR-style): A GoPro Hero 9 with a Max Lens Mod 1.0 (≈160° FOV) is mounted on the demonstrator’s wrist using a custom 3D-printed adapter. The adjustable mount permits rotation about the wrist axis and fore/aft translation (12–15 cm from the fingertips), producing video streams that mimic the robot’s deployment view. Video is captured at 4K with per-frame timestamps at 30 Hz, and IMU data is recorded at 200 Hz for subsequent pose inference. No instrumented glove or tactile sensing is required; proprioception is estimated post hoc from the camera and IMU (Heng et al., 5 Jul 2025).
  • Tactile+Vision (OSMO): The OSMO pipeline uses a magnetotactile glove with 12 tri-axial magnetic sensors (one Bosch BMM350 per taxel) on the fingertips and palm, allowing direct measurement of normal and shear contact forces (0.3–80 N range, ≈300 µT/N). Sensor data is sampled at 100 Hz, daisy-chained via I²C to an STM32 at the wrist, and aggregated at 25 Hz with RGB/IR camera streams (e.g., Intel RealSense D435) (Yin et al., 9 Dec 2025).

Both variants record synchronized streams—either vision/IMU, or vision/tactile/IMU—comprising hundreds of demonstration trajectories (e.g., 140 wiping tasks for OSMO) with high temporal resolution.
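Because the camera and tactile/IMU streams run at different rates, demonstrations must first be merged onto a common clock. The sketch below is a minimal illustration of nearest-timestamp alignment, assuming timestamped NumPy arrays rather than any specific logging format from the cited pipelines.

```python
import numpy as np

def sync_to_frames(frame_ts, sensor_ts, sensor_vals):
    """Nearest-timestamp alignment of a high-rate sensor stream to camera frames.

    frame_ts:    (F,) camera frame timestamps in seconds (e.g., 25-30 Hz)
    sensor_ts:   (S,) sensor timestamps in seconds (e.g., 100 Hz tactile or 200 Hz IMU)
    sensor_vals: (S, D) sensor readings
    Returns an (F, D) array with the sensor sample closest in time to each frame.
    """
    # Index of the first sensor sample at or after each frame timestamp.
    idx = np.searchsorted(sensor_ts, frame_ts)
    idx = np.clip(idx, 1, len(sensor_ts) - 1)
    # Pick whichever neighbor (before/after) is closer in time.
    before = np.abs(sensor_ts[idx - 1] - frame_ts)
    after = np.abs(sensor_ts[idx] - frame_ts)
    idx = np.where(before <= after, idx - 1, idx)
    return sensor_vals[idx]

# Example: 30 Hz frames, 100 Hz tactile stream with 36 channels (12 taxels x 3 axes).
frames = np.arange(0, 10, 1 / 30)
tactile_t = np.arange(0, 10, 1 / 100)
tactile = np.random.randn(len(tactile_t), 36)
aligned = sync_to_frames(frames, tactile_t, tactile)   # shape (300, 36)
```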

2. Data Preprocessing, Alignment, and Retargeting

To render human demonstrations amenable to robot learning, the pipelines employ a sequence of spatiotemporal alignment and normalization procedures:

  • Temporal and Visual Alignment (Vision-Only): Given asynchronously captured human and robot demonstration sequences, Temporal Cycle-Consistency (TCC) embedding [Dwibedi et al., 2019] is used to learn image embeddings $f: \text{Image} \to \mathbb{R}^d$ and establish optimal nearest-neighbor correspondences between frames. The timestamp-aligned sequence pairs are further refined by foreground-background composition: using SAM2 for segmentation, robot-action foregrounds are inpainted onto background-cropped human frames, producing composite images $\hat r_t = (1 - M_t) \odot h_t + M_t \odot r_t$ with cleanly aligned visual content (see the preprocessing sketch after this list) (Heng et al., 5 Jul 2025).
  • Sensor Data Processing (Tactile+Vision): Per-taxel signals are normalized using percentile-based scaling (Eq. 3 in (Yin et al., 9 Dec 2025)) and differential readings to attenuate common-mode crosstalk. Per-frame hand pose is estimated through segmentation (SAM2), 3D keypoint regression (HaMeR), stereo-based point cloud backprojection (FoundationStereo), and Savitzky-Golay smoothing. Kinematic retargeting maps human finger/wrist targets to 7 DoF Franka arm and 6 DoF Ability Hand joint commands using inverse kinematics solutions (Python IK with MuJoCo) (Yin et al., 9 Dec 2025).
  • Embodiment-Gap Minimization: When humans and robots share identical tactile gloves (OSMO), the mapping $f: g^H \mapsto g^R$ is the identity; thus, no inpainting or force inference is needed.
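Both preprocessing steps above reduce to a few array operations. The sketch below is a minimal illustration, not the cited implementations: `compose_frame` applies the mask-based composition $\hat r_t = (1 - M_t) \odot h_t + M_t \odot r_t$ to frames already segmented (e.g., with SAM2), and `normalize_taxels` shows one plausible percentile-plus-smoothing scheme for tactile channels (the exact Eq. 3 scaling in (Yin et al., 9 Dec 2025) may differ).

```python
import numpy as np
from scipy.signal import savgol_filter

def compose_frame(human_img, robot_img, robot_mask):
    """Composite r_hat = (1 - M) * h + M * r.

    human_img, robot_img: (H, W, 3) float arrays in [0, 1]
    robot_mask:           (H, W) binary foreground mask for the robot/gripper
    """
    m = robot_mask[..., None].astype(human_img.dtype)
    return (1.0 - m) * human_img + m * robot_img

def normalize_taxels(readings, lo_pct=1.0, hi_pct=99.0, window=9, poly=2):
    """Percentile-based scaling and Savitzky-Golay smoothing per tactile channel.

    readings: (T, D) raw taxel signals over time
    Returns a (T, D) array scaled roughly to [0, 1] and temporally smoothed.
    """
    lo = np.percentile(readings, lo_pct, axis=0)
    hi = np.percentile(readings, hi_pct, axis=0)
    scaled = (readings - lo) / np.maximum(hi - lo, 1e-6)
    return savgol_filter(scaled, window_length=window, polyorder=poly, axis=0)
```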

3. Vision-Based Generative Modeling and SE(3) Extraction

To overcome the domain gap between human hands and robot grippers in visual data:

  • Hand-to-Gripper Generative Model (RwoR): An InstructPix2Pix (IP2P) network, built on Stable Diffusion, is fine-tuned to reconstruct aligned robot ("gripper") observations from human hand images and natural-language prompts (e.g., “Turn the hand into a gripper holding a cup”). The model is trained on 15,000 aligned samples for three epochs with the standard diffusion loss. During inference, raw human videos are converted into realistic, robot-centric visual trajectories $\tilde r_t$ (see the inference sketch below) (Heng et al., 5 Jul 2025).
  • SE(3) Action Extraction: ORB-SLAM3 is run on the wrist-mounted fisheye video to recover camera world poses $\{T_{c_t}^w\}$, further refined by IMU. A fixed extrinsic transform $T_f^c$ maps the camera to the fingertip frame, yielding the gripper pose sequence $T_{f_t}^w = T_{c_t}^w T_f^c$. Actions are defined as body-frame deltas $\Delta T_t = (T_{f_t}^w)^{-1} T_{f_{t+1}}^w \in \mathrm{SE}(3)$, with the gripper open/close state $g_t$ determined from mask transitions (see the delta-extraction sketch below) (Heng et al., 5 Jul 2025).
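For the hand-to-gripper translation step, inference can be run with an off-the-shelf InstructPix2Pix pipeline. The sketch below uses the public `timbrooks/instruct-pix2pix` checkpoint from the `diffusers` library as a stand-in; the fine-tuned weights, prompt wording, file names, and guidance settings are assumptions, not the paper's.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Load an InstructPix2Pix pipeline; a fine-tuned hand-to-gripper checkpoint
# would replace the public base model used here for illustration.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

hand_frame = Image.open("hand_frame.png").convert("RGB")   # hypothetical wrist-camera frame
edited = pipe(
    prompt="Turn the hand into a gripper holding a cup",
    image=hand_frame,
    num_inference_steps=20,       # illustrative settings, not the paper's
    image_guidance_scale=1.5,
    guidance_scale=7.5,
).images[0]
edited.save("gripper_frame.png")  # robot-centric observation fed to the policy
```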
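The action extraction itself is a short computation on 4×4 homogeneous transforms. A minimal NumPy sketch, assuming the SLAM poses and the fixed extrinsic are already available as matrices:

```python
import numpy as np

def gripper_deltas(T_cam_world, T_fing_cam):
    """Body-frame SE(3) deltas between consecutive fingertip poses.

    T_cam_world: (N, 4, 4) camera poses in the world frame (e.g., from ORB-SLAM3)
    T_fing_cam:  (4, 4) fixed camera-to-fingertip extrinsic T_f^c
    Returns (N-1, 4, 4) deltas  dT_t = (T_{f_t}^w)^{-1} T_{f_{t+1}}^w.
    """
    T_fing_world = T_cam_world @ T_fing_cam                    # T_{f_t}^w = T_{c_t}^w T_f^c
    return np.linalg.inv(T_fing_world[:-1]) @ T_fing_world[1:]  # body-frame deltas
```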

A plausible implication is that the joint availability of tactile and visual modalities (as in OSMO) could further contextualize policy learning via multimodal fusion.

4. Policy Learning Architectures

The processed human-to-robot demonstration data is compiled into state-action sequences amenable to policy optimization, most commonly using diffusion-based architectures:

  • Policy Structure (RwoR): Each observation $o_t = \{\tilde r_{t-k+1}, \ldots, \tilde r_t,\, T_{f_t}^w,\, g_t\}$ is paired with the action $a_t = (\Delta T_t, g_{t+1})$. Conditional diffusion policies $\pi_\phi(a \mid o)$ are trained using U-Net architectures with cross-attention on $o$, following Chi et al., 2023, over $K = 1000$ diffusion steps; Adam optimization is applied (learning rate $10^{-4}$, batch size 32, 100 epochs). At test time, live wrist-camera frames drive closed-loop execution via sampled $\Delta T_t$ and $g_t$, mapped to robot joints through inverse kinematics (see the training-step sketch after this list) (Heng et al., 5 Jul 2025).
  • Policy Structure (OSMO): Policies $\pi_\theta(a \mid F_k)$ condition on tuples $F_k = (I_{\text{rgb}}, q, g)$, where $a$ is a trajectory chunk of future arm+hand joint targets. Encoders include a frozen DINOv2 for images and MLPs for joint and tactile data, concatenated for FiLM-conditioned U-Net denoising over 100 DDPM steps. The standard DDPM loss is used. Training uses Adam (learning rate $2 \times 10^{-4}$, batch size 128, 2000 epochs), random vision crops, and tactile normalization; DINOv2 remains frozen, and no regularization (dropout, weight decay) is applied (Yin et al., 9 Dec 2025).
  • Robot Execution: Predicted joint trajectories are streamed to Franka/Ability Hand low-level impedance controllers (≈1 kHz inner loop; high-level policy at 2 Hz), providing compliance and robust contact-pressure maintenance (Yin et al., 9 Dec 2025).
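The common core of both policy variants is a conditional denoising objective: noise an action chunk, predict that noise from the noisy chunk, a diffusion timestep, and an observation embedding, and minimize the mean-squared error. The sketch below is a stripped-down PyTorch illustration in which plain MLPs stand in for the U-Net, image encoders, and FiLM conditioning described above; all dimensions, names, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T_DIFF, ACT_DIM, OBS_DIM, HID = 1000, 10, 256, 512

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action chunk, conditioned on obs + timestep."""
    def __init__(self):
        super().__init__()
        self.t_embed = nn.Embedding(T_DIFF, HID)
        self.net = nn.Sequential(
            nn.Linear(ACT_DIM + OBS_DIM + HID, HID), nn.SiLU(),
            nn.Linear(HID, HID), nn.SiLU(),
            nn.Linear(HID, ACT_DIM),
        )

    def forward(self, noisy_action, obs, t):
        x = torch.cat([noisy_action, obs, self.t_embed(t)], dim=-1)
        return self.net(x)

# Linear beta schedule and cumulative alpha products for forward noising.
betas = torch.linspace(1e-4, 0.02, T_DIFF)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

model = NoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(obs, action):
    """One DDPM training step on a batch of (observation embedding, action chunk)."""
    t = torch.randint(0, T_DIFF, (action.shape[0],))
    noise = torch.randn_like(action)
    ab = alpha_bar[t].unsqueeze(-1)
    noisy = ab.sqrt() * action + (1.0 - ab).sqrt() * noise   # forward diffusion
    loss = F.mse_loss(model(noisy, obs, t), noise)           # standard DDPM loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Example batch: 32 observation embeddings and 10-D action chunks.
print(train_step(torch.randn(32, OBS_DIM), torch.randn(32, ACT_DIM)))
```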

5. Empirical Results and Generalization

Experimental evaluations substantiate the sample efficiency and embodiment gap closure enabled by Glove2Robot pipelines:

  • RwoR (Vision-Only): On nine Franka+UMI gripper tasks, the method achieved an average success rate of 0.78, closely matching the 0.82 rate for direct UMI gripper demonstrations. Generative-model ablations show peak PSNR/SSIM (33.8/0.86) with the full pipeline, outperforming raw-aligned and inpaint-only baselines. Generalization to unseen actions (rotate block, unstack block) yielded 0.80–0.83 success; new object instances achieved 0.83–0.87 (Heng et al., 5 Jul 2025).
  • OSMO (Tactile+Vision): In sustained contact-rich whiteboard-wiping tasks, tactile+vision+proprio policies achieved a 71.7 ± 27.4% pixel-erasure rate, exceeding vision+proprio (55.8 ± 30.0%) and proprio-only (27.1 ± 32.4%) baselines. Qualitative failure modes for vision-only policies included under/over-pressing and loss of contact, largely mitigated via tactile feedback (Yin et al., 9 Dec 2025).

6. Significance, Limitations, and Comparative Outlook

Glove2Robot pipelines provide data-efficient, hardware-light alternatives to robot teleoperation, supporting high-fidelity demonstration collection for imitation learning. Vision-only variants bridge the human-robot sensor gap through learned generative mappings and visual alignment, while tactile-glove approaches directly transfer continuous force signals when identical instrumentation is available for both human and robot. This suggests broad applicability across manipulation skills where high-quality visual or multimodal alignment is achievable.

A significant tradeoff exists: vision-only pipelines may require complex generative models and careful composite construction, whereas tactile-glove methods demand custom hardware but avoid vision-based force inference and inpainting. Current state-of-the-art demonstrations show that both approaches can reach near-parity with upper-bound teleoperated policies, though contact-rich task performance especially benefits from explicit tactile transfer.

Future directions plausibly include integrating multimodal generative models, adaptive policy architectures for real-time fusion of vision and touch, and open-source dissemination of hardware and software toolchains for widespread reproducibility.


Key References

| Approach | Core Modality | Policy Type | Notable Metrics | Reference |
|----------|---------------|-------------|-----------------|-----------|
| RwoR | Vision (wrist cam) | Diffusion, BC | Success 0.78 (UMI: 0.82) | (Heng et al., 5 Jul 2025) |
| OSMO | Tactile + Vision | Diffusion, BC | 71.7% pixel erasure (tactile+vision+prop.) | (Yin et al., 9 Dec 2025) |

For detailed methodology, figures, and empirical data, refer to (Heng et al., 5 Jul 2025) and (Yin et al., 9 Dec 2025).
