HIDAgent Toolkit for Automated HID Control
- HIDAgent Toolkit is an open-source hardware and software system enabling personal agents to control any HID-compatible device using visual models and large language models.
- It features a modular design separating control logic from target devices, using off-the-shelf components like RP2040, HDMI capture dongles, and CH340 bridges for platform-agnostic interaction.
- The toolkit provides a unified Python API for low-level HID event injection, vision-based UI analysis, and cross-device automation, ensuring reliable performance with human-paced commands.
HIDAgent Toolkit is an open-source hardware and software system enabling automated agents to interact with any HID-compatible computing system—desktop or mobile—by emulating the physical mouse and keyboard. Designed to support research into "personal agents," HIDAgent allows UI agents powered by visual understanding and LLMs to operate computers through their native graphical interfaces, bypassing platform-specific APIs or screen-sharing protocols. By implementing a hardware intermediary that bridges the control computer and the target device, HIDAgent uniquely enables both robust platform-agnostic control and high-fidelity vision-based UI automation using only commodity components and a unified Python API (Bigham, 31 Jan 2026).
1. Architecture and Hardware Composition
HIDAgent consists of a physical hardware stack coupled with a Python library that orchestrates vision-based UI understanding and HID event injection. The architecture separates the control computer, where agent logic and vision models operate, from the target device, which could be a laptop, desktop, phone, or tablet.
The hardware employs three primary off-the-shelf components:
| Component | Role in Stack | Approximate Cost (USD) |
|---|---|---|
| RP2040 (Raspberry Pi Pico) | USB HID gadget (mouse/keyboard emulator) | $4 |
| HDMI-to-USB capture dongle | Captures 1920×1080 video frames from target device | $18 |
| CH340 USB-to-TTL bridge | Serial bridge for command relay | $6 |
Physical connectivity is established as follows:
- The target device's HDMI (or USB-C video) output is routed via the HDMI capture card to the control computer's USB interface, providing real-time screen frames.
- A CH340 bridge connects the control computer's USB to the microcontroller's UART pins, transmitting JSON-encoded control messages.
- The RP2040's USB-C port interfaces with the target device, enumerating as a standard keyboard/mouse for HID report delivery.
This arrangement enables a "personal agent" configuration in which the control computer is decoupled from the target system, maintaining functional separation and versatility (Bigham, 31 Jan 2026).
2. Python Library Interface and Workflow
The HIDAgent Python library (hidagent.py) exposes a single-module API for agent integration. The interface supports both low-level and high-level operations, allowing seamless scripting and programmatic agent design.
Key workflow components include:
- Initialization and serial port declaration:
```python
from hidagent import HIDAgent

agent = HIDAgent(serial_port="/dev/ttyUSB0", capture_device_id=None)
```
- Core event primitives:
  - `get_screenshot()` → `PIL.Image`
  - `move_mouse(x: int, y: int)`
  - `click_mouse(x: int, y: int, button: 'left' | 'right')`
  - `type(text: str)`
  - `keypress(keys: List[str])` (e.g., `["ctrl", "alt", "t"]`)
- Convenience and vision-based helpers:
  - `recognize_gui_elements(image)` (wrapper over Omniparser; returns a list of UI elements with `"type"`, `"bbox"`, and `"label"` fields)
  - `llm_screenshot_query(image, query, model=...)` (invokes LLMs for UI reasoning)
  - `run_application(app_name: str)` (cross-platform spotlight sequence)
  - `gui_diff(img1, img2)` (regions changed between screenshots)
  - `patch_location(patch_img, screenshot)` (finds sub-templates on screen)
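These primitives compose naturally into short scripts. The sketch below shows the shape of such a script; since the real API drives attached hardware, a minimal in-memory stand-in (an assumption, not part of the toolkit) replaces the `HIDAgent` class so the flow can be read end-to-end:

```python
from typing import List

class StubHIDAgent:
    """In-memory stand-in mirroring the hidagent primitives described above,
    so the control flow can be shown without the hardware stack attached."""
    def __init__(self):
        self.log = []

    def keypress(self, keys: List[str]):
        self.log.append(("keypress", tuple(keys)))

    def type(self, text: str):
        self.log.append(("type", text))

# With real hardware this would be: agent = HIDAgent(serial_port="/dev/ttyUSB0")
agent = StubHIDAgent()

# Open a terminal on a Linux target (Ctrl+Alt+T), then run a command.
agent.keypress(["ctrl", "alt", "t"])
agent.type("uname -a\n")

print(agent.log)
```

Against real hardware, each of these calls would emit HID reports through the RP2040 rather than append to a log.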
Underlying all event emission is a JSON serial protocol to the RP2040 (e.g., `{"type": "click", "x": 100, "y": 50}`), with responses confirming success. A deliberate 100–200 ms sleep is inserted between commands to align with human interaction rates and avoid disturbing OS-level pointer acceleration heuristics (Bigham, 31 Jan 2026).
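A minimal sketch of this framing follows. Only the cited click message is taken from the source; the newline termination, the `send_paced` helper, and the exact `PACING_S` value are assumptions modeling the described behavior:

```python
import json
import time

PACING_S = 0.15  # deliberate 100-200 ms gap between commands

def encode_command(cmd_type: str, **fields) -> bytes:
    """Frame one command as a newline-terminated JSON message for the RP2040."""
    msg = {"type": cmd_type, **fields}
    return (json.dumps(msg) + "\n").encode("ascii")

def send_paced(serial_write, commands):
    """Write each framed command, sleeping between them to stay human-paced."""
    for cmd in commands:
        serial_write(cmd)
        time.sleep(PACING_S)

frame = encode_command("click", x=100, y=50)
print(frame)  # b'{"type": "click", "x": 100, "y": 50}\n'
```

In deployment, `serial_write` would be the `write` method of the pyserial port connected to the CH340 bridge.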
3. Platform Support and Compatibility Considerations
HIDAgent targets broad compatibility across desktop (macOS, Windows, Linux) and mobile platforms (iOS, Android), leveraging HID standards and HDMI/USB-C outputs.
Platform-specific considerations:
- macOS/Windows: recognition as a standard keyboard/mouse; "new keyboard" configuration dialogs may appear but do not hinder operation.
- iOS: requires enabling Assistive Touch and "Full Keyboard Access" in Accessibility to accept synthetic HID input; HDMI mirroring functions without further configuration.
- Android: upon initial connection, the user must accept the "Video Mirroring" permission dialog.
- Bluetooth HID: not supported; all control is via USB connections.
Linux support has been validated using USB HDMI capture, and the solution is agnostic to operating system drivers due to the standard USB HID device enumeration (Bigham, 31 Jan 2026).
4. Performance Characteristics and Engineering Trade-Offs
Command latency over the USB HID interface is sub-5 ms; however, the built-in 100–200 ms inter-command delay is the dominant factor in end-to-end throughput. In practical use, scripting and agentic automation occur at "human-like" speeds, ensuring that typical GUIs are not overwhelmed or desynchronized.
The approach ensures reliable delivery of input events on all tested platforms, while also coping gracefully with pointer acceleration and anti-jitter logic present in modern OSes—a design choice to avoid erratic mouse behaviors.
High-resolution streaming (1920×1080) is supported, though some vision models (notably those in UI parsing) may degrade in accuracy above 2K resolution. Downsampling or platform-native screenshot shortcuts can be employed for higher-fidelity vision tasks. For applications requiring maximal visual accuracy, bypassing HDMI capture with OS-specific screenshot APIs is suggested as future work (Bigham, 31 Jan 2026).
5. Use Case Prototypes
HIDAgent was validated across five representative agentic use cases:
- Extensible UI Agent: Integrated prominent LLM-driven "computer-use" tools (e.g., Anthropic Claude, Gemma:27b) by mapping agent tool calls (screenshot, click, keyboard actions) to the HIDAgent API and introducing a "run_application" tool. Notably, robust agentic control is more challenging on Android due to the lack of a universal keyboard shortcut for launching apps. Vision model performance can degrade on high-resolution desktops, necessitating explicit agent configuration.
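The mapping from agent tool calls to HIDAgent operations can be sketched as a small dispatch table. The tool names and argument shapes below are illustrative assumptions, not the toolkit's actual schema; a stub recorder stands in for a real `HIDAgent`:

```python
def dispatch_tool_call(agent, name: str, args: dict):
    """Route one LLM tool call to the corresponding HIDAgent primitive."""
    table = {
        "screenshot":      lambda: agent.get_screenshot(),
        "click":           lambda: agent.click_mouse(args["x"], args["y"],
                                                     args.get("button", "left")),
        "type":            lambda: agent.type(args["text"]),
        "keypress":        lambda: agent.keypress(args["keys"]),
        "run_application": lambda: agent.run_application(args["app_name"]),
    }
    if name not in table:
        raise ValueError(f"unknown tool: {name}")
    return table[name]()

class _Recorder:
    """Stub that records the last call, standing in for a real HIDAgent."""
    def __getattr__(self, method):
        def call(*a, **k):
            self.last = (method, a)
            return self.last
        return call

rec = _Recorder()
dispatch_tool_call(rec, "click", {"x": 10, "y": 20})
print(rec.last)  # ('click_mouse', (10, 20, 'left'))
```

The same table pattern makes it easy to expose an extra tool such as `run_application` to a computer-use model alongside the standard screenshot/click/keyboard trio.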
- Universal UI Data Collection: Automated random UI crawling for dataset construction; the workflow involves capturing screen states, diffing successive UIs, and annotating with Omniparser-extracted elements. HDMI captures may include unwanted overlays (e.g., mouse cursor), particularly in portrait orientations, requiring cropping or substituting with native OS screenshot functions.
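The screen-diffing step of this crawl can be illustrated with a minimal pure-Python version of the idea behind `gui_diff`, operating on 2D pixel grids rather than PIL images (the real helper's input and return formats are not specified in the source):

```python
def changed_bbox(grid_a, grid_b):
    """Return the (left, top, right, bottom) bounding box of differing pixels
    between two equally sized 2D grids, or None if they are identical."""
    rows = [y for y, (ra, rb) in enumerate(zip(grid_a, grid_b)) if ra != rb]
    if not rows:
        return None
    cols = [x for row_a, row_b in zip(grid_a, grid_b)
              for x, (pa, pb) in enumerate(zip(row_a, row_b)) if pa != pb]
    return (min(cols), min(rows), max(cols), max(rows))

before = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
after  = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]
print(changed_bbox(before, after))  # (1, 1, 2, 1)
```

In the crawl workflow, the resulting region would be paired with Omniparser annotations of the elements that changed between the two screen states.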
- Screen Reader Anywhere: Enables external screen reading capability for arbitrary devices by converting Omniparser UI parses into accessible HTML, relayed via HTTP to a browser-based client. The non-local screen reader instance can become stale, so periodic synchronization is required for interactive consistency.
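The parse-to-HTML conversion can be sketched as below. The element dictionaries use the `"type"`/`"bbox"`/`"label"` keys described earlier; the specific HTML mapping (a flat list with role and bounding-box attributes) is an illustrative assumption:

```python
import html

def elements_to_html(elements):
    """Render Omniparser-style UI elements as a flat, screen-reader-friendly list."""
    items = []
    for el in elements:
        label = html.escape(el.get("label", ""))
        role = html.escape(el.get("type", "element"))
        x1, y1, x2, y2 = el["bbox"]
        items.append(
            f'<li role="{role}" data-bbox="{x1},{y1},{x2},{y2}">{label}</li>'
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

page = elements_to_html([
    {"type": "button", "bbox": (10, 10, 90, 40), "label": "Submit"},
])
print(page)
```

A browser-based client polling this HTML over HTTP would re-fetch it periodically, which is exactly the synchronization the staleness issue calls for.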
- Cross-Device Interaction: Supports session migration across heterogeneous target devices (e.g., laptop to phone) with ongoing agentic state, without code modification. Demonstrates true platform-independence for the control logic.
- The Helpful Observer: Implements proactive monitoring for specified visual patterns, coupled with user assistance (e.g., price extraction). The workflow relies on screenshot polling and LLM-powered UI queries. Early versions support CLI integration, with gesture and voice triggers marked as future directions (Bigham, 31 Jan 2026).
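The observer's screenshot-polling loop can be sketched as follows. The predicate stands in for the LLM-powered screenshot query, and the interval, callback, and poll-count parameters are assumptions for illustration:

```python
import time

def watch(get_screenshot, matches, on_match, interval_s=1.0, max_polls=None):
    """Poll screenshots; invoke on_match whenever the predicate fires."""
    polls = 0
    while max_polls is None or polls < max_polls:
        frame = get_screenshot()
        if matches(frame):
            on_match(frame)
        polls += 1
        time.sleep(interval_s)
    return polls

# Demo with stubs: "screenshots" are strings; match when a price appears.
frames = iter(["loading...", "Total: $19.99", "done"])
hits = []
watch(lambda: next(frames),
      lambda f: "$" in f,
      hits.append,
      interval_s=0.0,
      max_polls=3)
print(hits)  # ['Total: $19.99']
```

In the real prototype, `get_screenshot` would be the HDMI-capture helper and `matches` an `llm_screenshot_query` call asking whether the visual pattern of interest is on screen.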
6. Cost, Physical Assembly, and Calibration
HIDAgent leverages commodity hardware for an aggregate component cost under $30. Assembly involves:
- Wiring the CH340 TXD/RXD to the RP2040 UART RX/TX.
- Loading CircuitPython firmware onto the RP2040.
- Connecting RP2040 to the target device's USB-C.
- Routing HDMI capture output from the target device to the control computer.
- Connecting the CH340 bridge to the control computer.
- Installing the hidagent Python package and configuring the CH340 port.
Calibration aligns virtual HID coordinates with physical screen pixels via an automated two-point routine. The agent moves the pointer between two reference points (e.g., $(100,100) \to (200,100)$ horizontally, with an analogous vertical move), then computes per-axis scaling factors by comparing before/after screenshots. Movements are subdivided into 10 px increments to mitigate OS pointer acceleration (Bigham, 31 Jan 2026).
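The arithmetic behind this routine can be sketched in a few lines. The observed deltas would come from locating the pointer in successive screenshots; the helper names and demo values below are illustrative:

```python
def scale_factors(cmd_delta, obs_delta):
    """Per-axis scale from commanded HID deltas to observed on-screen pixels."""
    return (obs_delta[0] / cmd_delta[0], obs_delta[1] / cmd_delta[1])

def stepped_moves(dx, dy, step=10):
    """Split one movement into <=step-px increments to sidestep OS pointer
    acceleration, mirroring the 10 px subdivision described above."""
    moves = []
    while dx or dy:
        sx = max(-step, min(step, dx))
        sy = max(-step, min(step, dy))
        moves.append((sx, sy))
        dx -= sx
        dy -= sy
    return moves

# Commanded 100 px on each axis; screenshots show 150 px and 200 px of travel.
sx, sy = scale_factors((100, 100), (150, 200))
print(sx, sy)                # 1.5 2.0
print(stepped_moves(25, 0))  # [(10, 0), (10, 0), (5, 0)]
```

Dividing every subsequent target coordinate by these factors keeps commanded HID positions aligned with on-screen pixels.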
7. Open Challenges and Prospective Directions
Identified future research avenues include:
- Context-Crawler: Enriching agent perceptual context by programmatically querying system settings, connected hardware, and network state.
- Self-Contained Control Unit: Migrating control logic from a full laptop to compact devices (e.g., Raspberry Pi 4, smartphones), supporting mobile research scenarios.
- Platform Screenshot APIs: Integrating native screenshot capture with HDMI streaming for artifact-free, cursorless imagery at maximal resolution.
- Latency Optimization: Tuning inter-command delays on a per-platform basis to minimize agentic action time while preserving robustness.
- Security and Ethics: Implementing safeguards such as user-presence detection, biometric verification, and system hardening to prevent misuse, especially on locked-down or sensitive systems.
- Expanded Connectivity: Adding Bluetooth HID capability to extend support to devices lacking available USB-C ports.
All hardware schematics, assembly instructions, and example code are provided in the toolkit's open-source repository (Bigham, 31 Jan 2026).