
Olaf’s Mechatronic & RL Architecture

Updated 20 December 2025
  • Olaf’s Mechatronic and RL Architecture is a system that combines custom multi-DoF design with hierarchical RL control to produce stylized, safe motion.
  • It employs a sim-to-real pipeline with dense sensor integration and PD control, ensuring actuator protection and adherence to nonlinear constraints.
  • The framework uses barrier functions and imitation rewards to balance animation fidelity with rigorous thermal and mechanical safety in underactuated dynamics.

Olaf’s Mechatronic and Reinforcement Learning (RL) Architecture integrates advanced mechanical design with state-of-the-art RL-based motion control. The system is engineered to instantiate highly stylized, animation-driven behaviors while ensuring mechanical robustness and safety across nonlinear hardware constraints. Core architectural elements include a fully custom leg mechanism, multi-DoF joint assemblies, dense sensor integration, thermal management, a hierarchical RL control policy shaped by barrier functions and imitation rewards, and a sim-to-real pipeline for policy transfer. This platform demonstrates control of 25 DoFs under nonlinear underactuated dynamics, achieving animation reference fidelity and actuator-protective constraint satisfaction in real hardware (Müller et al., 18 Dec 2025).

1. Mechatronic Structure and Kinematics

The mechanical skeleton contains 25 DoFs: 6 per leg (asymmetric design for collision avoidance), 3 in the neck (supporting an oversized head with critical thermal load), 4 in the eyes (yaw, pitch, eyelid, linkage coupling), and additional DoFs in the jaw, shoulders, and eyebrow. All actuators and linkages are hidden by a composite skirt of soft PU-foam (25 mm, deformable) and stretch fabric. Leg kinematics use Denavit–Hartenberg parameters:

$\begin{array}{c|c|c|c|c}
\text{Joint } i & \alpha_i & a_i & d_i & \theta_i \\ \hline
1\ (\text{hip\_roll}) & -\frac{\pi}{2} & 0 & 0 & \theta_1 \\
2\ (\text{hip\_yaw}) & +\frac{\pi}{2} & 0 & 0 & \theta_2 \\
3\ (\text{hip\_pitch}) & -\frac{\pi}{2} & L_1 & 0 & \theta_3 \\
4\ (\text{knee}) & 0 & L_2 & 0 & \theta_4 \\
5\ (\text{ankle\_pitch}) & +\frac{\pi}{2} & 0 & 0 & \theta_5 \\
6\ (\text{ankle\_roll}) & 0 & 0 & L_3 & \theta_6
\end{array}$

where $L_1 = 0.12$ m, $L_2 = 0.11$ m, $L_3 = 0.045$ m. Shoulder joints utilize spherical 5-bar linkages parameterized for fast lookup-driven inverse kinematics. Eyes and jaw employ planar 4-bar linkages, with kinematic and torque mappings given by:

$\theta_p = \arccos\left(\frac{\ell_p^2 + \ell_c^2 - d^2(\theta_a)}{2\ell_p \ell_c}\right), \quad d(\theta_a) = \sqrt{\ell_e^2 + \ell_f^2 - 2\ell_e \ell_f \cos\theta_a}$

$\tau_{\rm ff}(q_j) = c_0 + c_1 q_j + c_{\cos} \cos(q_j)$
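As an illustration, the 4-bar kinematic mapping and the feed-forward torque model above can be sketched in Python; the link lengths and the coefficients $c_0, c_1, c_{\cos}$ below are placeholder values, not parameters from the paper:

```python
import math

def four_bar_passive_angle(theta_a, l_p, l_c, l_e, l_f):
    """Passive-joint angle of a planar 4-bar linkage as a function of the
    actuated angle theta_a, via the cosine-law mapping through the virtual
    diagonal d(theta_a)."""
    d = math.sqrt(l_e**2 + l_f**2 - 2 * l_e * l_f * math.cos(theta_a))
    cos_arg = (l_p**2 + l_c**2 - d**2) / (2 * l_p * l_c)
    return math.acos(max(-1.0, min(1.0, cos_arg)))  # clamp for numerical safety

def gravity_feedforward(q_j, c0, c1, c_cos):
    """Feed-forward torque tau_ff(q_j) = c0 + c1*q_j + c_cos*cos(q_j)."""
    return c0 + c1 * q_j + c_cos * math.cos(q_j)
```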

Motors selected are Unitree 110 W brushless (legs, neck) and Dynamixel MX-64 (shoulders, eyes, jaw, eyebrows), regulated for torque and thermal budget. Embedded heat sinks and convective venting address critical actuation regions; temperature sensors run at 600 Hz (downsampled to 50 Hz for policy input).

2. State and Action Spaces for RL Control

The RL policy operates on a comprehensive state vector:

$s_t = (p_t^P, \theta_t^P, v_t^R, \omega_t^R, q_t, \dot{q}_t, a_{t-1}, a_{t-2}, T_t, \phi_t)$

encompassing world-frame pose and orientation, root-frame translational and rotational velocities, 15 joint positions and velocities (legs and neck), the last two action vectors, neck-motor temperatures, and gait phase (during walking). All inputs are normalized to $[-1,1]$ by running-mean subtraction and fixed-range rescaling:

$\tilde{x} = \operatorname{clip}\left(\frac{x - \mu_x}{R_x}, -1, 1\right)$
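A minimal sketch of this normalization, assuming per-dimension running-mean and half-range arrays `mu` and `R`:

```python
import numpy as np

def normalize_obs(x, mu, R):
    """Normalize an observation to [-1, 1]: subtract the running mean mu,
    rescale by the fixed half-range R, then clip."""
    return np.clip((x - mu) / R, -1.0, 1.0)
```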

Policy actions are 15-dimensional joint setpoints, mapped to actuator torque commands via PD control:

$a_t \in \mathbb{R}^{15}, \quad \tau_t = K_p(a_t - q_t) - K_d \dot{q}_t$
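The setpoint-to-torque mapping can be written directly; gains may be scalars or per-joint arrays (values here are illustrative, not the paper's):

```python
import numpy as np

def pd_torque(a_t, q_t, qdot_t, Kp, Kd):
    """PD mapping from policy setpoints to joint torques:
    tau = Kp * (a - q) - Kd * qdot."""
    return Kp * (a_t - q_t) - Kd * qdot_t
```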

Peripheral DoFs (eyes, jaw, shoulders) are controlled independently of the RL policy.

3. Policy Network Topology and Training Pipeline

Two separate stochastic actor-critic networks (for standing vs walking) are employed:

  • Actor: 3 hidden layers, 512 ReLU units each; input dimension ≈60, output dimension 15.
  • Critic: identical topology, augmented with privileged simulation data (perfect state, friction, terrain samples).
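The actor's topology can be sketched as a plain NumPy forward pass (the initialization scale is an illustrative assumption; only the layer sizes come from the text):

```python
import numpy as np

def actor_forward(obs, weights):
    """Forward pass of the 3x512 ReLU actor MLP (~60-D obs -> 15 setpoints).
    weights is a list of (W, b) pairs; the output layer is linear."""
    h = obs
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)  # ReLU hidden layers
    W, b = weights[-1]
    return W @ h + b

def init_weights(dims=(60, 512, 512, 512, 15), rng=np.random.default_rng(0)):
    """Small random initialization for the stated layer sizes."""
    return [(rng.normal(0.0, 0.05, (o, i)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]
```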

Training uses PPO (clipping $\epsilon = 0.2$, GAE $\lambda = 0.95$), learning rate $3\times 10^{-4}$, batch size 4,096, 4 epochs per update, and up to 8,192 parallel simulation environments (Isaac Sim, PyBullet contacts) on a single RTX 4090. Convergence criteria: stabilized losses and an imitation-reward plateau within $\pm 1\%$.

Simulation incorporates domain randomization: mass $\pm 10\%$, inertia $\pm 15\%$, damping $\pm 20\%$, friction $\mu \in [0.3, 0.8]$, and external force perturbations. Actuator thermal model parameters ($\alpha$, $\beta$) are fitted on 20 min of hardware data.
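One way to sample these per-environment randomizations (the dictionary layout and nominal values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_params(nominal):
    """Sample one environment's physics parameters with the stated ranges:
    mass +/-10%, inertia +/-15%, damping +/-20%, friction in [0.3, 0.8]."""
    return {
        "mass":     nominal["mass"]    * rng.uniform(0.90, 1.10),
        "inertia":  nominal["inertia"] * rng.uniform(0.85, 1.15),
        "damping":  nominal["damping"] * rng.uniform(0.80, 1.20),
        "friction": rng.uniform(0.3, 0.8),
    }
```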

4. Reward Functions and Barrier Constraints

The per-step reward signal combines multiple weighted objectives:

$r_t = w_{\rm imp}\,r^{\rm imitation} + w_{\rm reg}\,r^{\rm regular} + w_{\rm lim}\,r^{\rm limits} + w_{\rm impact}\,r^{\rm impact}$

Imitation rewards penalize deviation from reference animation for both pose and joint configuration:

$r_x = \begin{cases} \exp(-k_x \|x - \hat{x}\|^2) & \text{torso/velocities} \\ -\|q - \hat{q}\|^2 & \text{joint angles} \\ \mathds{1}[c = \hat{c}] & \text{contact} \end{cases}$
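The case-wise imitation terms can be sketched as follows; the gain `k_x` is a placeholder, as the paper's actual kernel gains are not given here:

```python
import numpy as np

def imitation_terms(x, x_ref, q, q_ref, c, c_ref, k_x=2.0):
    """Per-step imitation terms: an exponential kernel on torso/velocity
    error, a negative squared joint-angle error, and an indicator on
    matching contact states."""
    r_pose    = np.exp(-k_x * np.sum((x - x_ref) ** 2))
    r_joint   = -np.sum((q - q_ref) ** 2)
    r_contact = float(np.all(c == c_ref))
    return r_pose, r_joint, r_contact
```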

Critical barrier control functions enforce thermal and joint constraints. For each neck motor ii:

$h_T(T_i) = T_{\max} - T_i \ge 0$

$\dot{h}_T + \gamma_T h_T = -\dot{T}_i + \gamma_T (T_{\max} - T_i) \ge 0$

$r^{\rm temp} = -\sum_i \left| \min\!\left(-\dot{T}_i + \gamma_T (T_{\max} - T_i),\, 0\right) \right|$
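A minimal sketch of this thermal barrier penalty; the defaults for $T_{\max}$ and $\gamma_T$ below are illustrative ($T_{\max}=80\,^\circ$C matches the operational bound reported later, while $\gamma_T$ is an assumption):

```python
import numpy as np

def thermal_barrier_reward(T, T_dot, T_max=80.0, gamma_T=0.1):
    """Penalty from the barrier condition -T_dot + gamma_T*(T_max - T) >= 0,
    summed over neck motors; zero whenever the barrier is satisfied."""
    margin = -T_dot + gamma_T * (T_max - T)
    return -np.sum(np.abs(np.minimum(margin, 0.0)))
```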

Joint limits are penalized for excursions beyond safety margins. Impact reward terms reduce contact noise by damping excessively large foot velocity changes.

5. Simulation-to-Hardware Policy Transfer

A sim-to-real pipeline leverages system identification, domain randomization, and calibrated physical hardware to ensure transferability. The simulation model features accurate actuator equations, randomized contacts, thermal state propagation, and noise injection. On hardware, motor zero positions are calibrated by magnetic endstops, IMU-to-base transforms by Vicon motion capture, and software torque/velocity limits are set to 90% of hardware spec.

Policies are executed on an onboard Intel i7 CPU at 50 Hz policy rate, upsampled to actuators at 600 Hz using first-order hold and smoothed using a 37.5 Hz low-pass filter. Real-time monitoring enforces thermal caps ($T_i<85\,^\circ$C emergency stop, $T_i<75\,^\circ$C torque clamping), and mechanical venting extracts >1 W per motor at $\Delta T=20\,^\circ$C. Communication with Unitree motors uses EtherCAT; Dynamixels use TTL half-duplex.
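The 50 Hz-to-600 Hz command path can be sketched as follows; the single-pole low-pass structure is an assumption, since only the 37.5 Hz cutoff is stated:

```python
import numpy as np

def upsample_and_filter(setpoints, policy_hz=50, actuator_hz=600, cutoff_hz=37.5):
    """Upsample policy setpoints to the actuator rate with a first-order hold
    (linear interpolation between consecutive setpoints), then smooth with a
    discrete one-pole low-pass filter at cutoff_hz."""
    ratio = actuator_hz // policy_hz                    # actuator ticks per policy step
    dt = 1.0 / actuator_hz
    alpha = dt / (dt + 1.0 / (2 * np.pi * cutoff_hz))   # one-pole smoothing factor
    out, y = [], setpoints[0]
    for a_prev, a_next in zip(setpoints[:-1], setpoints[1:]):
        for k in range(1, ratio + 1):
            target = a_prev + (a_next - a_prev) * k / ratio  # first-order hold
            y = y + alpha * (target - y)                     # low-pass update
            out.append(y)
    return np.array(out)
```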

6. Performance Evaluation and Operational Results

Metrics recorded on hardware include mean joint-tracking error (4.0° ± 2.0°, 5 min walking), footstep sound reduction (−13.5 dB with impact reward), and neck-motor temperature bounded at $T_{\max}=80\,^\circ$C across extended operation. Safety mechanisms (torque/velocity caps, emergency stop triggers) ensure that actuator thermal and load constraints are not violated, substantiating real-world deployability. Policy-driven stylized gait, expressive motions, and noise reduction were reliably achieved; the architecture supports both reference fidelity and robustness to physical constraints (Müller et al., 18 Dec 2025).

A plausible implication is that the multi-objective RL framework and barrier-driven constraint formulation enable physically realized agents to match animation inspiration while maintaining operational safety in the presence of severe mechanical and thermal nonlinearities. The approach scales to complex multi-DoF actors with rich body designs and interaction-driven requirements.
