- The paper introduces a multi-agent framework that decomposes natural language instructions into executable code for the Misty social robot.
- The paper refines 136 APIs and employs a dual-layer self-reflective feedback loop to enhance code accuracy and task success rates.
- The paper demonstrates superior performance over baseline models by achieving a 100% task completion rate across 28 tasks of varying complexity.
The paper "AutoMisty: A Multi-Agent LLM Framework for Automated Code Generation in the Misty Social Robot" (arXiv:2503.06791) introduces a framework that enables non-programmers to generate executable code for the Misty social robot from natural language instructions. The core problem addressed is that robot programming remains inaccessible to users without coding skills, even though Misty exposes open APIs.
AutoMisty proposes a multi-agent collaboration framework powered by LLMs to tackle this. The framework consists of four specialized agent modules:
- Planner Agent: Analyzes the user's high-level instruction, decomposes it into manageable subtasks, creates a plan with an execution order, and assigns subtasks to the appropriate specialized agents. It also incorporates a human-in-the-loop step for plan validation.
- Action Agent: Handles tasks related to the robot's physical movements and actions.
- Touch Agent: Manages tasks triggered by or involving the robot's touch sensors.
- Audiovisual Agent: Deals with tasks involving audio processing (like speech recognition via Whisper) and visual processing, enabling the robot to "see" and "hear".
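The planner-to-agent hand-off described above can be sketched as a simple dispatch step. This is a minimal illustration, not the paper's implementation; the agent names follow the three specialized modules, but the function and data shapes are hypothetical.

```python
# Hypothetical sketch of AutoMisty-style planning: decompose an instruction
# into ordered subtasks and assign each to a specialized agent.
AGENTS = {"action", "touch", "audiovisual"}

def plan(subtasks):
    """Assign each (description, category) subtask to a specialized agent
    and return an ordered execution plan."""
    steps = []
    for order, (description, category) in enumerate(subtasks, start=1):
        if category not in AGENTS:
            raise ValueError(f"no agent for subtask category: {category}")
        steps.append({"step": order, "agent": category, "task": description})
    return steps

# Example: a greeting routine decomposed into three subtasks.
steps = plan([
    ("wave the right arm", "action"),
    ("say hello when the head sensor is touched", "touch"),
    ("greet the person the camera detects", "audiovisual"),
])
```

In the actual framework, a human-in-the-loop step would validate this plan before any code generation begins.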
A key aspect of AutoMisty's implementation is the optimization of the Misty APIs. The authors found that the original APIs had issues like unclear descriptions and missing parameters, leading to LLM errors. They refined and restructured 136 APIs, adding comprehensive documentation (input/output formats, parameters, scenarios) to improve LLM comprehension and reduce hallucinations. These optimized APIs are then specifically assigned to the relevant subtask agents.
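To illustrate the kind of documentation the refinement adds, here is a hedged sketch of one enriched API wrapper. The parameter names echo Misty's head-movement endpoint, but the exact signature, ranges, and payload format here are illustrative assumptions, not the authors' refined API set.

```python
# Illustrative example of an API refined with explicit parameters, ranges,
# and usage scenarios so an LLM can call it without guessing (hypothetical).
def move_head(pitch: float, roll: float, yaw: float, velocity: int = 50) -> dict:
    """Move Misty's head to the given orientation.

    Parameters
    ----------
    pitch : float  Up/down angle in degrees (illustrative range: -40..25).
    roll  : float  Side-to-side tilt in degrees.
    yaw   : float  Left/right rotation in degrees.
    velocity : int Movement speed, 1-100.

    Returns
    -------
    dict: the JSON payload that would be sent to the robot.

    Example scenario
    ----------------
    Nod in agreement: move_head(-20, 0, 0) followed by move_head(10, 0, 0).
    """
    if not 1 <= velocity <= 100:
        raise ValueError("velocity must be in 1..100")
    return {"Pitch": pitch, "Roll": roll, "Yaw": yaw, "Velocity": velocity}
```

The point of the refinement is that every parameter, valid range, and scenario is stated explicitly, which is what reduces LLM hallucination when the agents compose calls.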
Each agent within the framework incorporates a two-layer optimization mechanism for robust code generation:
- Layer 1 (Self-Reflective Feedback): Involves a Critic-Designer interaction where the Designer agent proposes solutions (using In-Context Learning with optimized APIs) and the Critic agent evaluates them for accuracy and feasibility. This forms an iterative loop for refinement until the Critic approves the output.
- Layer 2 (Human-in-the-Loop): After Layer 1 approval, the generated output is presented to the user via a Drafter module. The user can provide feedback, triggering Layer 1 for further refinement if necessary, until the user is satisfied.
The framework also includes a memory module to store successful task executions and user preferences, improving teachability and future code generation consistency.
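A minimal sketch of such a memory, assuming a simple key-value store of successful solutions and preferences (the paper's actual storage and retrieval mechanism is not specified here):

```python
# Hypothetical memory module: cache successful task solutions and user
# preferences so identical requests can be answered consistently later.
class Memory:
    def __init__(self):
        self._store = {}  # task description -> {"code": ..., "prefs": ...}

    def save(self, task, code, prefs=None):
        self._store[task] = {"code": code, "prefs": prefs or {}}

    def recall(self, task):
        """Return a previously successful solution for this task, if any."""
        return self._store.get(task)

mem = Memory()
mem.save("greet user", "misty.speak('Hello!')", prefs={"emotion": "happy"})
```

A real implementation would need fuzzier matching than exact task strings; the teachability results in the paper (including some misclassified preferences) suggest retrieval is the hard part.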
For practical deployment, AutoMisty adds a system verification step. Generated MistyCode is first checked in a local environment; if compilation or runtime errors occur, the details are fed back to the Designer agent for debugging. Once the code passes local verification, it is sent to the Misty robot for execution. A User Proxy mechanism (inspired by AutoGen) lets the user monitor the robot's real-world performance and provide feedback for further adjustments.
To evaluate AutoMisty, the authors created a benchmark of 28 tasks classified into four complexity levels: Elementary, Simple, Compound, and Complex. They compared AutoMisty against direct use of ChatGPT-4o and ChatGPT-o1 using metrics like Task Completion (TC), Number of User Interactions (NUI, broken down into User Preference (UPI) and Technical Correctness (TCI) interactions), Code Efficiency (CE), and User Satisfaction (US).
Experimental results showed that while all models performed well on Elementary and Simple tasks, AutoMisty significantly outperformed the direct LLM baselines on Compound and especially Complex tasks. AutoMisty achieved a 100% task completion rate across all complexity levels, whereas ChatGPT-4o failed on many complex tasks and ChatGPT-o1 failed on some. AutoMisty demonstrated higher robustness and adaptability. Ablation studies confirmed the importance of the Self-Reflective Feedback mechanism, showing improved task success rates on complex tasks when it is enabled, despite a slight increase in interaction count for those tasks. The teachability assessment showed the system's ability to learn and correctly retrieve previously saved user preferences for emotions, though some misclassifications occurred.
The authors highlight that AutoMisty's approach, leveraging optimized APIs and in-context learning within a multi-agent framework, provides strong generalization capabilities and allows for low-cost migration to other API-driven social robots. The framework effectively lowers the technical barrier for programming social robots, making customization accessible to non-technical users through natural language conversation. Future work aims to extend the framework to handle collaboration between multiple robots and humans.