AhaRobot: Open-Source Robotic Ecosystem
- AhaRobot is an open-source, dual-arm mobile manipulator designed for cost-effective embodied AI research.
- Its hardware features include a differential-drive base, foldable SCARA arms with precision sensors, and robust teleoperation capabilities.
- The integrated software ecosystem combines ROS 2 control, transformer-based behavior cloning for autonomous operation, and real-time multimodal reasoning via the Aha highlight-detection framework.
AhaRobot is a robotic system and hardware-software ecosystem developed for embodied AI research, characterized by accessibility, modular integration, and technical versatility across mobile manipulation, teleoperation, end-to-end learning, and online multimodal reasoning. There are two principal lines of work under the AhaRobot designation: (1) the open-source hardware platform for bimanual mobile manipulation (Cui et al., 13 Mar 2025) and (2) robotic integration of real-time vision-language highlight detection leveraging the Aha framework (Chang et al., 19 Sep 2025). Both facilitate large-scale, cost-effective experimentation in open-world settings.
1. Hardware Architecture and Mechanical Design
AhaRobot is designed as a low-cost, open-source dual-arm mobile manipulator with a total hardware bill of materials near \$1,000, approximately 1/15 the cost of prevalent commercial platforms (Cui et al., 13 Mar 2025). Its architecture comprises:
- Mobile Base: Differential-drive with two front BLDC wheels (Hall-effect encoders, 64 counts/rev), a rear omni-wheel, and a chassis built from T-slot aluminum extrusion. Motor control is handled by ODrive 3.6.
- Lifting Mechanism: Vertical actuation via a belt-driven linear slide, featuring pulley sets and photointerrupter-based homing.
- Manipulators: Two horizontal, foldable SCARA-like arms, each with 3 DoF (shoulder slide, elbow, wrist) and a two-finger parallel gripper. Each joint employs dual Feetech STS3215 geared DC motors for anti-backlash control, with a maximum torque of 35 kg·cm and a payload capacity of 1.5 kg per arm. Maximum horizontal (XY) reach is 750 mm; vertical reach with the lift is 1,250 mm.
- Sensing: Proprioception via magnetic encoders (4,096 cpr) and Hall sensors. Vision via three CMOS cameras (640×360 @30 Hz) distributed across head and wrists. The head camera is mounted on a 2-DoF pan-tilt gimbal.
- Computation and Power: Optional onboard compute via an Intel i5-12700KF/NVIDIA RTX 4060 Mini-ITX system (ROS 2 Humble); distributed control by five ESP32 modules for PID motion profiling; actuators powered by a 20 Ah/24 V Li-Po pack (≈294 Wh), with a separate 1 kWh Jackery AC unit for the computer.
- Safety: Hardware emergency stop circuit.
A breakdown of component costs is summarized in the table below.
| Component | Quantity | Subtotal (\$) |
|---|---|---|
| Feetech STS3215 DC motors (gearbox) | 12 | 180.0 |
| Grippers (2-finger) | 2 | 50.0 |
| Linear belt/pulleys (lift) | 1 | 40.0 |
| Chassis parts (T-slot aluminum profiles) | — | 150.0 |
| ODrive 3.6 motor controller | 1 | 65.0 |
| ESP32 control modules | 5 | 50.0 |
| Magnetic encoders | 8 | 100.0 |
| Hall-effect sensors (base) | 2 | 16.0 |
| Photoswitch homing sensor | 1 | 5.0 |
| Cameras (640×360, 30 Hz) | 3 | 60.0 |
| Pan-tilt gimbal | 1 | 40.0 |
| Wheels & motors (base) | 2 | 70.0 |
| Omni-wheel | 1 | 15.0 |
| Battery (20 Ah/24 V Li-Po) | 1 | 80.0 |
| Wiring, connectors, hardware fasteners | — | 40.0 |
| Total | — | 1,001.0 |
All mechanical CAD, firmware, and control stacks are freely available via aha-robot.github.io.
2. Control Systems and Teleoperation Workflow
AhaRobot incorporates control solutions optimized for low-cost actuation, high positional precision, and robust teleoperation:
- Dual-Motor Backlash Elimination: Each joint is actuated by a pair of motors preloaded in opposing directions, eliminating gear clearance via a feed-forward bias:

$$u_1 = u_{\text{base}} + u_{\text{bias}}, \qquad u_2 = u_{\text{base}} - u_{\text{bias}},$$

where $u_{\text{base}}$ is the base voltage and $u_{\text{bias}}$ is the opposing bias (Cui et al., 13 Mar 2025).
- Static Friction Compensation: A periodic dither is superimposed on the motor command to overcome Coulomb (static) friction:

$$u(t) = u_{\text{cmd}} + A_d\,\operatorname{sgn}\!\big(\sin(2\pi t/T)\big),$$

where $T$ is the dither cycle period and the amplitude $A_d$ keeps the command near the joint's breakaway threshold; a combined sketch of both compensation terms follows this list.
- Teleoperation (RoboPilot): Handles are tracked via AprilTag markers affixed to a 26-faced polyhedral shell, affording 6-DoF pose detection. Four Hall-effect foot pedals enable mode switching between base and arm control, mapped to robot Cartesian and gripper actions through inverse kinematics in the ROS 2 stack. The workstation cost is approximately \$50.
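A minimal Python sketch of how the two compensation terms combine into per-joint motor commands is given below. The variable names and the square-wave dither form are illustrative assumptions based on the description above; the released firmware implements these loops on the ESP32 controllers.

```python
import math

def joint_motor_commands(u_base: float, u_bias: float,
                         dither_amp: float, dither_period: float,
                         t: float) -> tuple[float, float]:
    """Compute voltage commands for the two motors driving one joint.

    u_base        -- base voltage from the position/velocity controller
    u_bias        -- feed-forward preload applied in opposing directions
                     so the gear train stays under tension (anti-backlash)
    dither_amp    -- dither amplitude used to stay near the static-friction
                     breakaway threshold (illustrative)
    dither_period -- dither cycle period in seconds
    t             -- current time in seconds
    """
    # Periodic dither (square wave here, purely illustrative) to defeat
    # Coulomb/static friction without adding net torque over a full cycle.
    dither = dither_amp * math.copysign(1.0, math.sin(2.0 * math.pi * t / dither_period))

    # Opposing preload: motor 1 pushes, motor 2 pulls, so the joint has
    # zero clearance while the net output still tracks u_base.
    u_motor1 = u_base + u_bias + dither
    u_motor2 = u_base - u_bias + dither
    return u_motor1, u_motor2
```

Because the bias terms cancel in the net joint torque, the preload removes backlash without disturbing the tracking command.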
Teleoperation Workflow:
- Client runs AprilTag detection in WebAssembly/OpenCV.js for handle tracking.
- Pedal states are streamed via ESP32/WebSerial.
- Video feeds from all three cameras are transmitted over WebRTC for real-time monitoring, while control commands are sent to the robot over a ROS 2 interface (a minimal bridging sketch follows this list).
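As a sketch of how the streamed handle poses and pedal states could be bridged to robot commands, the following ROS 2 (rclpy) node subscribes to hypothetical `/robopilot/handle_pose` and `/robopilot/pedal_mode` topics and routes them to base-velocity or end-effector targets; the topic names, message types, and mode encoding are assumptions for illustration, not the released RoboPilot stack.

```python
# Minimal ROS 2 (rclpy) sketch of the teleoperation command path.
# Topic names, message choices, and the pedal-mode encoding are hypothetical.
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped, Twist
from std_msgs.msg import Int8


class RoboPilotBridge(Node):
    """Routes tracked handle poses to base or arm commands based on pedal mode."""

    MODE_BASE, MODE_LEFT_ARM, MODE_RIGHT_ARM = 0, 1, 2

    def __init__(self):
        super().__init__("robopilot_bridge")
        self.mode = self.MODE_BASE
        self.create_subscription(Int8, "/robopilot/pedal_mode", self.on_mode, 10)
        self.create_subscription(PoseStamped, "/robopilot/handle_pose", self.on_pose, 10)
        self.base_pub = self.create_publisher(Twist, "/cmd_vel", 10)
        self.arm_pub = self.create_publisher(PoseStamped, "/arm/ee_target", 10)

    def on_mode(self, msg: Int8) -> None:
        self.mode = msg.data  # foot pedals switch between base and arm control

    def on_pose(self, msg: PoseStamped) -> None:
        if self.mode == self.MODE_BASE:
            # Crude mapping of handle displacement to base velocity; a real
            # implementation would scale and convert the quaternion to yaw.
            cmd = Twist()
            cmd.linear.x = msg.pose.position.x
            cmd.angular.z = msg.pose.orientation.z
            self.base_pub.publish(cmd)
        else:
            # Forward the 6-DoF handle pose; the arm IK runs downstream.
            self.arm_pub.publish(msg)


def main():
    rclpy.init()
    rclpy.spin(RoboPilotBridge())


if __name__ == "__main__":
    main()
```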
Empirical studies show RoboPilot achieves a task completion time reduction of approximately 30% over leader-follower and 3D SpaceMouse approaches, with 100% success on benchmark tasks such as dish placement and drawer manipulation.
3. End-to-End Policy Learning and Autonomous Operation
AhaRobot is capable of supporting fully autonomous manipulation via demonstration-driven behavior cloning:
- Demonstration Data: Dense logging of state-action pairs across tasks (50–80 demonstrations per task, with state and action vectors of up to 18 dimensions).
- Modeling: ACT-style imitation learning uses a causal transformer to predict chunks of future actions from recent state histories, $\hat{a}_{t:t+k} = \pi_\theta(s_{t-h:t})$ (a minimal policy sketch follows this list).
- Deployment: Trained policies are executed on the onboard Mini-ITX/RTX4060 at a 30 Hz control loop.
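A minimal PyTorch sketch of such a chunk-predicting transformer policy is shown below. The architecture (a plain encoder rather than ACT's full CVAE encoder-decoder), the omission of camera inputs, and all dimensions are illustrative assumptions.

```python
# Sketch of a transformer behavior-cloning policy in the spirit of ACT:
# map a short state history to a chunk of future actions.
import torch
import torch.nn as nn


class ChunkedBCPolicy(nn.Module):
    def __init__(self, state_dim=18, action_dim=18, chunk_len=20,
                 d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, chunk_len * action_dim)
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, state_history: torch.Tensor) -> torch.Tensor:
        """(batch, history_len, state_dim) -> (batch, chunk_len, action_dim)"""
        h = self.encoder(self.embed(state_history))
        out = self.head(h[:, -1])  # read out from the most recent timestep
        return out.view(-1, self.chunk_len, self.action_dim)


# Behavior-cloning loss over demonstration data (L1 regression on action chunks).
policy = ChunkedBCPolicy()
states = torch.randn(8, 10, 18)          # batch of state histories
target_actions = torch.randn(8, 20, 18)  # demonstrated action chunks
loss = nn.functional.l1_loss(policy(states), target_actions)
loss.backward()
```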
Task generalization has been validated for multi-stage operations—such as “Insert Pen” and “Collect Toy”—with success rates per sub-stage detailed in experimental results.
4. Real-Time Highlight Detection and Vision-Language Reasoning
AhaRobot also denotes the robotic deployment of the Aha framework for online multimodal highlight detection (Chang et al., 19 Sep 2025). The system consumes a continuous video stream together with a natural-language task description and flags task-relevant events using the following architectural elements:
- Vision Encoder: Frozen SigLIP–LargePatch16 encoder yields frame-wise embeddings at 1 fps.
- Multimodal Projection: A learned projector maps frame-wise visual features into the LLM token-embedding space.
- Autoregressive Decoder (Qwen2): Primed with system and task-prompt tokens; new frame tokens are appended in real time as the stream progresses.
- Dynamic SinkCache: Retains a fixed set of prompt (sink) tokens plus a sliding window over the most recent frame tokens, guaranteeing constant memory and strictly online processing (see the sketch below).
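The sketch below illustrates only the sink-plus-sliding-window bookkeeping that yields constant memory; the sink count and window size are placeholder values, and the real system manages key-value caches inside the Qwen2 decoder rather than raw token lists.

```python
from collections import deque


class SinkSlidingWindowCache:
    """Token bookkeeping for strictly online decoding with constant memory.

    The first `num_sink` tokens (system + task prompt) are never evicted;
    frame tokens beyond `window` are dropped oldest-first. Both sizes are
    illustrative placeholders, not the paper's defaults.
    """

    def __init__(self, num_sink: int, window: int):
        self.sink: list = []                        # prompt tokens, kept permanently
        self.num_sink = num_sink
        self.frames: deque = deque(maxlen=window)   # recent frame tokens only

    def add(self, token) -> None:
        if len(self.sink) < self.num_sink:
            self.sink.append(token)                 # still filling the prompt region
        else:
            self.frames.append(token)               # deque evicts the oldest frame token

    def context(self) -> list:
        """Tokens visible to the decoder at this step: sinks + sliding window."""
        return self.sink + list(self.frames)


# Example: 16 prompt tokens are pinned; only the last 256 frame tokens are kept.
cache = SinkSlidingWindowCache(num_sink=16, window=256)
```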
The system comprises three prediction heads:
- Relevance Head: Linear projection, trained with smooth L1 and total variation losses.
- Informativeness Head: Softmax over new/redundant frame classes.
- Uncertainty Head: Log-variance regression, enabling confidence estimation and fail-safe event handling.
The per-frame highlight score is obtained by fusing the outputs of the three heads through a piecewise function, with default heuristics supplied for its parameters.
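A minimal PyTorch sketch of the three heads and an uncertainty-gated fusion is given below; the hidden size, the gating rule, and the `var_max` threshold are illustrative stand-ins, as the paper's exact piecewise function is not reproduced here.

```python
import torch
import torch.nn as nn


class HighlightHeads(nn.Module):
    """Per-frame prediction heads over the decoder hidden state (illustrative sizes)."""

    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.relevance = nn.Linear(hidden_dim, 1)        # trained with smooth-L1 + TV losses
        self.informativeness = nn.Linear(hidden_dim, 2)  # new vs. redundant frame
        self.log_var = nn.Linear(hidden_dim, 1)          # log-variance for confidence

    def forward(self, h: torch.Tensor):
        rel = self.relevance(h).squeeze(-1)
        info_new = self.informativeness(h).softmax(-1)[..., 0]  # P(frame is new)
        log_var = self.log_var(h).squeeze(-1)
        return rel, info_new, log_var


def fuse(rel, info_new, log_var, var_max=1.0):
    """Illustrative piecewise fusion: suppress low-confidence frames entirely,
    otherwise weight relevance by informativeness. Not the paper's exact rule."""
    confident = log_var.exp() < var_max
    return torch.where(confident, rel * info_new, torch.zeros_like(rel))
```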
Workflow:
Each incoming video frame undergoes embedding, SinkCache update, autoregressive decoding, score prediction, and—if flagged as a highlight—automatic invocation of downstream planners (object grasping, path switching) or operator alert.
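The per-frame control flow can be summarized as the loop below, where `encode_frame`, `decode_step`, `score_frame`, `notify_planner`, and `alert_operator` are hypothetical stand-ins for the encoder, decoder step, prediction heads, and downstream interfaces.

```python
def run_highlight_stream(frames, encode_frame, cache, decode_step,
                         score_frame, threshold, notify_planner, alert_operator):
    """Strictly online highlight loop; all callables are hypothetical stand-ins."""
    for frame in frames:                        # frames arrive at ~1 fps
        token = encode_frame(frame)             # SigLIP embedding -> projected frame token
        cache.add(token)                        # sink + sliding-window cache update
        hidden = decode_step(cache.context())   # one autoregressive decoder step
        score, confident = score_frame(hidden)  # fused highlight score + confidence flag
        if score > threshold:
            if confident:
                notify_planner(frame, score)    # e.g. trigger grasping or path switching
            else:
                alert_operator(frame, score)    # fail-safe: defer to the human operator
```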
5. Experimental Validation and Metrics
AhaRobot’s efficacy is supported by empirical studies:
- Teleoperation Performance (Cui et al., 13 Mar 2025):
- RoboPilot: 100% success, 45.22 s avg time; Leader-Follower: 100%, 65.33 s; 3D SpaceMouse: 81.5%, 58.71 s.
- Manipulation precision: 26-faced handle tracking yields 2.1 mm/1.09° accuracy (vs. 9.9 mm/5.4° for cubic handle).
- Long-horizon tasks completed remotely, demonstrating robust networked control for >200 m base motion.
- Highlight Detection and Reasoning (Chang et al., 19 Sep 2025):
- TVSum: mAP = 91.6 (zero-shot), 93.0 (domain tuned; SOTA).
- Mr.HiSum: mAP @ 50 = 64.19 (+8.3 over prior SOTA).
- SCOUT (robotic navigation): 16/18 predictions aligned with human operator instructions.
- Real-time streaming with robust event identification and operator support.
- End-to-End Learning:
- Autonomous manipulation achieves high per-stage and per-task success rates using transformer-based behavior cloning.
6. Software Resources, Open-Source Availability, and Extensibility
All elements—CAD files, parts lists, embedded control firmware, ROS 2 drivers, and teleoperation clients—are publicly available:
- Documentation and build guides: https://aha-robot.github.io
- Hardware/assembly: https://github.com/aha-robot/ahabot-hardware
- ROS 2 control stack: https://github.com/aha-robot/ahabot-control
- RoboPilot client/ESP32 firmware: https://github.com/aha-robot/robopilot
The term "AhaRobot ecosystem" refers to the full hardware/software stack, enabling replication for roughly \$1,000 (robot) plus \$50 (teleoperation workstation).
7. Extensibility, Limitations, and Future Directions
AhaRobot offers modularity for embodied AI research, but some technical constraints remain:
- Backlash and friction compensation are tuned for brushed DC motors; transition to alternative actuation (e.g., stepper, servo) may require control redesign.
- The adopted failure taxonomy (seven failure modes, following the AHA failure-reasoning VLM) does not yet cover complex multi-object or deformable-object manipulation (Duan et al., 1 Oct 2024).
- The open-source platform supports further extension in teleoperated failure logging, adversarial data generation, dexterous manipulation, and continual learning.
- Online highlight detection can be further adapted to new robotics modalities or real-time planning modules.
A plausible implication is that AhaRobot systems facilitate scalable real-world embodied AI experiments, bridging the data and deployment gap for end-to-end learning, multimodal reasoning, and robust manipulation in dynamic environments.