iCub3 Avatar System: Modular Teleoperation
- iCub3 Avatar System is a modular cybernetic teleoperation platform featuring a full-scale 54 DoF humanoid robot with advanced multimodal sensing and actuation.
- It integrates operator wearables with whole-body inverse kinematics and QP solvers to achieve low-latency, high-fidelity remote locomotion and manipulation.
- Field deployments validate its robust performance in haptic feedback and remote embodiment, while highlighting challenges in network stability and operator cognitive load.
The iCub3 Avatar System constitutes a modular cybernetic teleoperation platform, centered on the full-scale, 54 DoF humanoid iCub3 and designed for high-bandwidth, low-latency remote embodiment. Developed by the Istituto Italiano di Tecnologia (IIT), the system allows a remotely located human operator to perform whole-body locomotion, dexterous bimanual manipulation, speech, and facial expressions, and to receive rich multimodal sensory feedback, including stereoscopic vision, binaural audio, and haptic and proprioceptive cues. Its capabilities have been validated in geographically distributed, real-time robot embodiment sessions, including public deployments at the Venice Biennale and We Make Future and in the ANA Avatar XPrize competition, with architecture variants adapted to scenario-specific requirements (Dafarra et al., 2022).
1. Hardware Architecture and Innovations
iCub3 marks a systematic redesign over previous iCub iterations, addressing scale, mechanical stiffness, joint actuation, and multi-modal sensing. The robot stands 125 cm tall and weighs 52 kg (compared to 104 cm and 33 kg for earlier versions), with mass distributed roughly as 45% legs, 20% arms, and 35% torso/head. Rigid serial drives at the shoulder and torso replace the prior tendon-driven configuration, increasing reliability and range of motion. Actuation relies on brushless three-phase motors (110–179 W, 0.18–0.43 Nm rated, via 1:100 or 1:160 reductions) for the primary axes, whereas DC motors actuate the eyes, neck, wrists, and fingers.
The full 54 DoF break down into 4 (eyes/eyelids), 3 (neck), 3 (torso), 14 (arms), 12 (legs), and 18 (hands). End-effectors retain tendon-driven, individually actuated fingers for dexterous manipulation. The sensor configuration comprises eight six-axis force/torque units (F/T-45 and F/T-58) with integrated IMUs, capacitive tactile arrays ("artificial skin") on the upper limbs and fingertips (48 taxels per palm), dual 4K Basler cameras in the eye bulbs (processed by an NVIDIA Xavier NX), a RealSense D435i depth sensor in the torso, binaural microphones, a speaker, and software-addressable RGB facial-expression LEDs (Dafarra et al., 2022).
On-board computation leverages an Intel Core i7-1165G7 (Ubuntu 20.04) with 16 GB RAM and the Xavier NX GPU, while distributed Cortex-M4 microcontrollers, networked over Ethernet, perform real-time motor control. Absolute 18-bit joint encoders and motor-shaft optical encoders underpin robust feedback.
2. Modular System Design and Operator Interface
The iCub3 Avatar System is architected as three principal modules: (1) operator-side retargeting and feedback interface; (2) communication layer (YARP middleware, OpenVPN); (3) robot-side control.
Operator-side wearables consist of:
- HTC VIVE Pro Eye HMD (with eye and facial tracking at ~90 Hz),
- SenseGlove DK1 exoskeleton gloves (finger motion capture, vibro-tactile and brake-based haptic feedback per fingertip, up to 20 N),
- "iFeel" IMU-based suit with joint orientation tracking (70 Hz), F/T-based shoe sensors for body-weight feedback, and vibro-haptic nodes for touch/tactile mapping,
- Two VIVE Trackers (for foot pose in "iFeel Walking" mode),
- Cyberith Virtualizer Elite 2 omnidirectional treadmill for immersive walking input.
Retargeting algorithms fuse HMD, tracker, IMU, and glove data into desired link velocities (SE(3)). Manipulation control passes these velocities into a QP-based whole-body inverse kinematics pipeline. Locomotion input uses the Virtualizer to extract unicycle reference velocities (forward/rotation) via discrete gait intention triggers, or maps foot/waist tracker and shoe F/T data to a unicycle-with-lateral-extension model. Audio and facial expressions are relayed with negligible lag.
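As a concrete illustration of the locomotion retargeting step, the sketch below maps a treadmill forward-speed reading and the operator's heading to unicycle reference velocities. Signal names, gains, and saturation limits are hypothetical placeholders, not the actual Virtualizer/iFeel interface:

```python
import math

def unicycle_reference(treadmill_speed, operator_yaw, robot_yaw,
                       v_max=0.3, k_omega=0.8, omega_max=0.4):
    """Map operator walking input to unicycle reference velocities (v, omega).

    treadmill_speed : forward speed reported by the omnidirectional treadmill [m/s]
    operator_yaw    : operator heading from the HMD/waist tracker [rad]
    robot_yaw       : current robot base heading [rad]
    """
    # Saturate the forward command to a safe walking speed.
    v = max(-v_max, min(v_max, treadmill_speed))

    # Steer the robot toward the operator's heading with a proportional law,
    # wrapping the heading error to (-pi, pi].
    yaw_err = math.atan2(math.sin(operator_yaw - robot_yaw),
                         math.cos(operator_yaw - robot_yaw))
    omega = max(-omega_max, min(omega_max, k_omega * yaw_err))
    return v, omega
```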
Feedback modalities include compressed stereoscopic video from robot eyes (~25 ms network latency), uncompressed audio, fingertip haptic feedback synchronized with robot skin/F/T events, and body-haptic cues mapped to arm contact and force events (Dafarra et al., 2022).
3. Communication, Control, and Retargeting Algorithms
The real-time network stack operates on YARP middleware over a private LAN, with OpenVPN tunneling across wide-area networks. Critical data flows utilize UDP (images/haptics) and TCP (poses), achieving <25 ms core-to-core bi-directional latency under fiber connections, and maintaining operation with occasional spikes >100 ms under adverse Wi-Fi conditions (mitigated by optimized bitrate allocation).
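The transport split can be pictured with a minimal Python sketch using plain sockets rather than YARP carriers; hostnames, ports, and message layouts are placeholders. Latency-sensitive haptic samples travel over UDP, where an occasional lost packet is acceptable, while pose targets travel over TCP, where reliability and ordering matter more than jitter:

```python
import json
import socket
import struct

PEER_HOST = "10.0.0.2"               # placeholder address of the other endpoint
HAPTIC_PORT, POSE_PORT = 9001, 9002  # placeholder ports

# UDP: best-effort, low-latency channel; dropped haptic frames are tolerable.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# TCP: reliable, ordered channel for retargeted pose updates.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect((PEER_HOST, POSE_PORT))

def send_haptic_frame(intensities):
    """Send one frame of per-fingertip vibration intensities (floats in [0, 1])."""
    payload = struct.pack(f"{len(intensities)}f", *intensities)
    udp.sendto(payload, (PEER_HOST, HAPTIC_PORT))

def send_pose(link_name, position_xyz, quaternion_wxyz):
    """Send one retargeted link pose as a length-prefixed JSON record."""
    msg = json.dumps({"link": link_name, "p": position_xyz,
                      "q": quaternion_wxyz}).encode()
    tcp.sendall(struct.pack("!I", len(msg)) + msg)
```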
Robot-side control adopts a three-layered architecture: A) trajectory optimization for walking based on a unicycle model ($\dot{x} = v\cos\theta$, $\dot{y} = v\sin\theta$, $\dot{\theta} = \omega$); B) simplified-model CoM/ZMP trajectory generators (LIPM+DCM); C) a whole-body QP solver (osqp-eigen, 1 kHz) computing joint velocities that enforce hard foot-pose and CoM constraints while trading off weighted soft tasks (e.g., retargeting, postural regularization) (Dafarra et al., 2022).
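Layer C can be sketched as a small velocity-level QP. The sketch below uses the Python `osqp` package in place of the C++ osqp-eigen solver used on the robot; `J_task`, `J_feet`, the weights, and the constraint set are illustrative placeholders rather than the full task stack of the deployed controller:

```python
import numpy as np
import osqp
import scipy.sparse as sp

def solve_wholebody_qp(J_task, v_star, J_feet, qdot_max, w_task=1.0, w_reg=1e-3):
    """One velocity-level whole-body IK step in OSQP standard form.

    Soft tasks:  minimize  w_task * ||J_task @ qdot - v_star||^2 + w_reg * ||qdot||^2
    Hard tasks:  J_feet @ qdot = 0                  (stance feet must not move)
                 -qdot_max <= qdot <= qdot_max      (joint velocity limits)
    """
    n = J_task.shape[1]

    # Quadratic cost 0.5 * x' P x + q' x.
    P = sp.csc_matrix(2.0 * (w_task * J_task.T @ J_task + w_reg * np.eye(n)))
    q = -2.0 * w_task * J_task.T @ v_star

    # Stacked linear constraints l <= A x <= u (equalities have l == u).
    A = sp.csc_matrix(np.vstack([J_feet, np.eye(n)]))
    l = np.concatenate([np.zeros(J_feet.shape[0]), -qdot_max])
    u = np.concatenate([np.zeros(J_feet.shape[0]), qdot_max])

    prob = osqp.OSQP()
    prob.setup(P, q, A, l, u, verbose=False)
    return prob.solve().x  # optimal joint velocities
```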
Inverse differential kinematics map operator sensor measurements (position, orientation, IMU gravity direction, fingertip angles) into desired link velocities using proportional feedback on the pose error (e.g., $v^* = K_p\,(p^{des} - p)$ and $\omega^* = K_\omega\,e_R$, with $e_R$ the orientation error), subject to joint position and velocity limits ($q_{\min} \le q \le q_{\max}$). Locomotion and manipulation signals enter the QP as weighted tasks.
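A minimal sketch of that proportional mapping, using an axis-angle orientation error; the gains and error parameterization are illustrative, not the deployed values:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def desired_link_twist(p_des, R_des, p_meas, R_meas, k_lin=2.0, k_ang=1.0):
    """Proportional feedback on the pose error -> desired link twist [v; omega].

    p_des, p_meas : desired/measured link positions, shape (3,)
    R_des, R_meas : desired/measured link rotation matrices, shape (3, 3)
    The returned 6-vector is fed to the whole-body QP as a weighted soft task.
    """
    # Linear part: proportional correction of the position error.
    v = k_lin * (p_des - p_meas)

    # Angular part: rotation error expressed as a rotation vector (axis * angle).
    e_rot = R.from_matrix(R_des @ R_meas.T).as_rotvec()
    omega = k_ang * e_rot
    return np.concatenate([v, omega])
```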
Haptic and force feedback propagate via vertical force readings from the arm F/T sensors (mapped to body-node vibration with amplitude proportional to the measured vertical force $f_z$) and discrete tactile taxel activations that trigger rapid vibration pulses in the corresponding body area. Touch-event-to-haptic mapping and CoM/ZMP regulation are tightly integrated in the feedback loop to preserve operator embodiment fidelity.
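A possible shape of this mapping, with illustrative thresholds (the 5 N dead-band mirrors the force-estimation noise reported below; the deployed values may differ):

```python
import numpy as np

def vibration_from_force(f_z, f_min=5.0, f_max=40.0):
    """Map the vertical force at an arm F/T sensor to a vibro-haptic intensity
    in [0, 1] for the corresponding iFeel body node.

    f_min is a dead-band rejecting estimation noise; f_max saturates the cue.
    """
    return float(np.clip((abs(f_z) - f_min) / (f_max - f_min), 0.0, 1.0))

def pulse_from_taxels(newly_active_taxels, amplitude=1.0, duration_ms=50):
    """Discrete tactile events: any newly activated skin taxel triggers a short,
    fixed-amplitude vibration pulse on the mapped body area."""
    if newly_active_taxels:
        return {"amplitude": amplitude, "duration_ms": duration_ms}
    return None
```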
4. Scenario-Based Deployments and Quantitative Outcomes
System validation occurred across four primary deployments:
- Biennale di Venezia (290 km, fiber link): an expert operator performed long-distance walking and full-body embodiment with the complete sensor suite at <25 ms round-trip latency, emphasizing whole-body interaction and facial-expressivity retargeting.
- We Make Future (300 km, fiber): Emphasis on collaborative load carrying (0.5 kg payload) and participatory audience interaction, with safety-constrained operator input, and balance-prioritized control.
- ANA Avatar XPrize Semifinals (local): Novice operators, local networking, 0.5 cm puzzle placement, 1 kg object weighing, and tactile classification (AlexNet-based CNN, 78% test accuracy), achieving CoM tracking RMSE <2 cm and overall score 95/100 (second place).
- ANA Avatar XPrize Finals (local Wi-Fi): Use of "iFeel Walking" (to support lateral locomotion), dynamic hand modifications for specific payloads (e.g., drill, canisters), variance in network stability (occasional audio/video loss, latency >100 ms), and final ranking 14th due to a stability-related fall (Dafarra et al., 2022).
Key performance metrics:
- Hand Cartesian RMS error 3–5 cm (peak 8 cm),
- Retargeting lag ~0.5 s,
- CoM tracking RMSE ~1.5 cm over 2 m walking,
- Bi-directional network latency <25 ms (fiber) and <30 ms (Wi-Fi, with spikes),
- Arm F/T force estimation noise ±5 N.
Qualitative assessments included consistent operator reports of high presence and embodiment, with judges noting the system's intuitiveness, responsiveness, and expressive capacity.
5. Comparative Approaches and Control-Theoretic Insights
Complementary research on telemanipulation with avatar systems using similar operator feedback architectures demonstrates the benefits of decoupling operator and robot arm kinematics through a common Cartesian hand frame, enabling bimanual mapping even across disparate arm chains (Lenz et al., 2023). Cartesian impedance control and predictive, operator-side inverse-kinematics models compensate for round-trip delay and enable low-drift operation. Haptic force feedback, provided via high-rate (kHz) F/T sensor readings and admittance-to-torque mappings, augments awareness and safety.
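A generic Cartesian impedance law (a textbook formulation, not the specific controller of Lenz et al.) makes the decoupling argument concrete: the operator's hand pose defines a target for a virtual spring-damper at the robot's hand frame, so the two arms need not share kinematics.

```python
import numpy as np

def cartesian_impedance_torques(J, x_err, xdot_err, tau_gravity,
                                K=np.diag([400.0] * 3 + [20.0] * 3),
                                D=np.diag([40.0] * 3 + [2.0] * 3)):
    """Render a spring-damper behavior at the hand frame.

    J           : (6, n) end-effector Jacobian
    x_err       : (6,) pose error [position; orientation] of the hand frame
    xdot_err    : (6,) velocity error of the hand frame
    tau_gravity : (n,) gravity-compensation torques
    K, D        : illustrative stiffness/damping gains (not tuned values)
    """
    wrench = K @ x_err + D @ xdot_err   # virtual spring-damper wrench at the hand
    return J.T @ wrench + tau_gravity   # map the wrench to joint torques
```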
Evaluations report minimal tracking jitter (6 mm at 44 ms total round-trip delay), and user studies show comparable task completion rates with and without force feedback, while subjective ratings of object handling and finger control improve when feedback is enabled. Lessons underscore the sufficiency of straightforward impedance control and predictive modeling for stability, while flagging the need for further latency reduction (FPGA/RT-EtherCAT), expanded haptic feedback dimensionality, and whole-body predictive dynamics (Lenz et al., 2023).
Alternative mixed real/virtual avatar architectures, such as those implementing augmented reality overlays of a remote 3D human model onto the robot (via ONIA—Optimal Non-Iterative Alignment), reveal that stateless, O(1) geometric alignment outperforms iterative Jacobian/FABRIK solvers in speed and reliability, with deterministic, singularity-free correspondence (ONIA: 0 mm deviation in overlay, ~0.02 ms/solve). Such alignment algorithms, though demonstrated on upper-limb tasks, indicate the value of stateless, closed-form alignment and multi-metric evaluation frameworks for any iCub3-analogous telepresence system (Tejwani et al., 2023).
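ONIA itself is specified in Tejwani et al. (2023); as a loose illustration of what "stateless, closed-form alignment" means, the classic Kabsch/SVD solution below aligns corresponding keypoints in a single non-iterative step, independently at every frame:

```python
import numpy as np

def closed_form_alignment(human_pts, robot_pts):
    """Stateless rigid alignment of corresponding keypoints (Kabsch/SVD).

    This is NOT the ONIA algorithm, only an example of a closed-form,
    non-iterative alignment: each frame is solved independently, with no
    dependence on the previous solution and no iterative refinement.

    human_pts, robot_pts : (N, 3) arrays of corresponding keypoints.
    Returns (Rmat, t) such that Rmat @ h + t best matches the robot points.
    """
    hc, rc = human_pts.mean(axis=0), robot_pts.mean(axis=0)
    H = (human_pts - hc).T @ (robot_pts - rc)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    Rmat = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = rc - Rmat @ hc
    return Rmat, t
```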
6. Limitations, Design Challenges, and Future Directions
Identified limitations include high operator cognitive load from simultaneous head/hands/feet/face control, perceptible video lag (25–100 ms), asynchrony between visual and haptic signals, and the fragility of the complex tendon-driven hands under unexpected payloads and tasks. Wi-Fi operation exposes the system to intermittent network instability, and reliance on small antennas limits robustness. Balance supervision currently offers only minimal autonomous recovery, highlighting the need for more proactive stability control and perception-driven safeguards (e.g., LiDAR-based obstacle avoidance).
Advancing the platform will require enhancing autonomous stability features, expanding the haptic feedback loop, automating hand adaptation, and integrating smarter network management for adverse conditions. An explicit design implication is that modularization (distinct separation of motion control, pose synchronization, geometric alignment, and feedback rendering) facilitates resilience, testability, and extensibility (Dafarra et al., 2022, Lenz et al., 2023, Tejwani et al., 2023).
7. Open-Source Tools and Research Impact
All system software components, including the bipedal locomotion framework, whole-body QP solvers (osqp-eigen), and data logging/monitoring utilities, are open source. The technical advances realized in the iCub3 Avatar System establish a rigorous platform for in-depth investigation of immersive, human-in-the-loop cyberphysical systems, benchmarking remote embodiment fidelity, latency management, and multimodal feedback strategies. By systematically bridging physically capable humanoid robotics with operator-centric immersive interfaces, iCub3 supports robust experimental and applied research in remote intervention, social robotics, and telepresence (Dafarra et al., 2022).