Multi-Agent & Human–Robot Interactions

Updated 1 April 2026

Multi-agent and human–robot interaction modes are frameworks that structure collaboration, negotiation, and role dynamics in shared environments.
These systems use architectures such as finite-state machines (FSMs), hierarchical planning, and decentralized algorithms to support real-time safety, scalability, and precision.
Reported applications show improved task success, reduced human workload, and improved communication through multimodal interfaces.

Multi-agent and human–robot interaction modes encompass the architectural, algorithmic, and behavioral protocols that govern how humans and robots, often in groups, collaborate, coordinate, negotiate, or compete in shared environments. The scope includes structured interaction policies, real-time control laws, role arbitration mechanisms, communication protocols, and multimodal user interfaces that enable scalable, transparent, and robust team behaviors.

1. Taxonomy of Multi-Agent and Human–Robot Interaction Modes

The literature delineates several fundamental axes for characterizing interaction modes among humans and robots:

Team Size and Composition: Systems range from dyadic (1H–1R) to large-scale (nH–mR), with either homogeneous or heterogeneous agent types. This impacts interoperability and protocol uniformity (Dahiya et al., 2022).
Interaction Model: Modes are classified as one-to-one, one-to-many, many-to-many, or broadcast. For example, peer-to-peer negotiation, centralized group command, and leader–follower architectures are all common (Dahiya et al., 2022, Regal et al., 2024, Xin et al., 2024, 0908.2661).
Initiative Regimes: Classic models distinguish between robot-initiated, human-initiated, and mixed-initiative modes. Mixed-initiative architectures allow flexible negotiation and dynamic role assignment throughout task execution (Yu et al., 7 Aug 2025).
Physical-Collaboration Regimes: Coexistence (independent but mutually aware tasking) and cooperation (direct, often physical, interaction) represent distinct control paradigms, each with tailored safety and compliance requirements (Huang et al., 2022, Huang et al., 2023).
Interaction Modalities: Multimodal systems combine speech, gesture, gaze, touch, direct teleoperation, shared AR/VR workspaces, and indirect interface-mediated channels (Hasan et al., 24 Mar 2026, Wang et al., 2020, Patel et al., 2021, Qiu et al., 2020).
Control and Planning Hierarchies: Decentralized agent-centric policies, centralized coordinators or planners, hierarchical cognition-to-control stacks, and Markov decision process formulations underpin the variety of interaction schemas (Zhang et al., 4 Mar 2026, Sun et al., 30 Nov 2025, Wang et al., 12 Mar 2025).

This taxonomy enables precise specification, comparison, and design of multi-agent HRI systems across domains.
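
As a rough illustration of how these axes could be recorded for a concrete system, the sketch below encodes them as a small Python data structure. The class and field names (`InteractionMode`, `n_humans`, `modalities`, and so on) are hypothetical and chosen only for this example; they are not drawn from the cited works.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class InteractionModel(Enum):
    ONE_TO_ONE = "one-to-one"
    ONE_TO_MANY = "one-to-many"
    MANY_TO_MANY = "many-to-many"
    BROADCAST = "broadcast"


class Initiative(Enum):
    ROBOT = "robot-initiated"
    HUMAN = "human-initiated"
    MIXED = "mixed-initiative"


class PhysicalRegime(Enum):
    COEXISTENCE = "coexistence"
    COOPERATION = "cooperation"


@dataclass(frozen=True)
class InteractionMode:
    """One point in the taxonomy. All field names are illustrative."""
    n_humans: int
    n_robots: int
    heterogeneous: bool
    model: InteractionModel
    initiative: Initiative
    regime: PhysicalRegime
    modalities: FrozenSet[str]

    @property
    def is_dyadic(self) -> bool:
        # Dyadic means exactly one human and one robot (1H-1R).
        return self.n_humans == 1 and self.n_robots == 1


# Example: a one-human, one-robot physical handover task.
handover = InteractionMode(
    n_humans=1,
    n_robots=1,
    heterogeneous=False,
    model=InteractionModel.ONE_TO_ONE,
    initiative=Initiative.MIXED,
    regime=PhysicalRegime.COOPERATION,
    modalities=frozenset({"gesture", "touch"}),
)
print(handover.is_dyadic, handover.model.value)
```

Making the record immutable (`frozen=True`) keeps it hashable, so system descriptions can be compared or used as dictionary keys when tabulating designs.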

2. Formal Architectures and Control Frameworks

A spectrum of algorithmic and architectural patterns supports scalable human–robot teamwork:

Finite-State Machines (FSMs) for interaction mode switching, e.g., Coexistence $\leftrightarrow$ Cooperation, using intention tracking and event triggers (Huang et al., 2022).
Hierarchical Planning with distinct layers for perception, deliberative skill selection (e.g., System-2 MARL as Markov potential games), and real-time control (whole-body QP) (Zhang et al., 4 Mar 2026).
Multi-Agent Actor–Critic and Dec-POMDPs for social navigation, balancing decentralized robot autonomy with centralized/global constraint enforcement through critics and entropy-based fusion (Wang et al., 12 Mar 2025).
Multi-Agent Federated Learning deploying LfD frameworks across robot-edge nodes, with local updates, global aggregation, per-human profile weighting, and transfer/multi-task regularizers for cross-robot skill and knowledge sharing (Papadopoulos et al., 2020).
Role-Assignment via Optimization: Utility-maximizing integer programming assigns subtasks/roles based on agent capabilities, human labor costs, and robot confidence thresholds (Sun et al., 30 Nov 2025, 0908.2661).
Centralized Coordination Mechanisms for regulating agent participation, turn-taking, and conflict avoidance in multi-agent multimodal dialogue (Hasan et al., 24 Mar 2026).
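
The global aggregation step in the federated item above can be sketched as a weighted parameter average in the style of FedAvg. This is a minimal sketch under stated assumptions: the function name `fedavg`, the flat-list parameter format, and the way `profile_weights` scales each robot's contribution are illustrative choices, not the formulation of Papadopoulos et al. (2020).

```python
from typing import List, Sequence


def fedavg(local_params: Sequence[Sequence[float]],
           n_samples: Sequence[int],
           profile_weights: Sequence[float]) -> List[float]:
    """Weighted average of per-robot parameter vectors.

    Robot i contributes in proportion to n_samples[i] * profile_weights[i].
    The profile weight is a hypothetical stand-in for per-human profile weighting.
    """
    coeffs = [n * w for n, w in zip(n_samples, profile_weights)]
    total = sum(coeffs)
    if total <= 0:
        raise ValueError("aggregation weights must sum to a positive value")
    dim = len(local_params[0])
    return [
        sum(c * p[j] for c, p in zip(coeffs, local_params)) / total
        for j in range(dim)
    ]


# Two robots with two parameters each; the second robot saw three times more data.
global_params = fedavg(
    local_params=[[1.0, 0.0], [3.0, 2.0]],
    n_samples=[10, 30],
    profile_weights=[1.0, 1.0],
)
print(global_params)  # [2.5, 1.5]
```

With equal profile weights this reduces to sample-count weighting; raising one robot's profile weight pulls the global parameters toward that robot's local update.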

Table 1. Representative Control Modes and Architectures

| Paradigm | Key Elements | References |
|---|---|---|
| FSM Mode Switch | Intention tracker, thresholds | (Huang et al., 2022) |
| Hierarchical C2C | VLM grounding, MARL, QP control | (Zhang et al., 4 Mar 2026) |
| Decentralized Actor–Critic | LLM-actors, local/global critics | (Wang et al., 12 Mar 2025) |
| Federated LfD | Edge SGD updates, FedAvg, profiles | (Papadopoulos et al., 2020) |
| Mixed-Initiative Planning | Meta-planner, allocation Q, LLMs | (Yu et al., 7 Aug 2025) |
| Centralized Turn-Taking/Conflict | LLM-scoring, schedule/prune actors | (Hasan et al., 24 Mar 2026) |

These architectures enable modularity, scalability, and adaptability, supporting robust operation across dynamic, human-inhabited environments.
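
To make the role-assignment pattern concrete, the sketch below picks the utility-maximizing assignment of subtasks to agents subject to a robot confidence threshold. It substitutes brute-force enumeration for an integer-programming solver, which is only practical for small problems, and it assumes a simple utility (success probability minus cost) with no capacity limits per agent. The function `assign_roles` and all numbers are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet, Sequence, Tuple

Key = Tuple[str, str]  # (subtask, agent)


def assign_roles(subtasks: Sequence[str],
                 agents: Sequence[str],
                 success_prob: Dict[Key, float],
                 cost: Dict[Key, float],
                 robots: FrozenSet[str],
                 conf_threshold: float) -> Tuple[Dict[str, str], float]:
    """Return the feasible assignment with the highest total utility."""
    best_plan, best_utility = None, float("-inf")
    # Enumerate every way of giving each subtask to one agent.
    for choice in product(agents, repeat=len(subtasks)):
        pairs = list(zip(subtasks, choice))
        # A robot may only take a subtask it is sufficiently confident about.
        if any(a in robots and success_prob[(t, a)] < conf_threshold
               for t, a in pairs):
            continue
        # Assumed utility: success probability minus cost, summed over subtasks.
        utility = sum(success_prob[p] - cost[p] for p in pairs)
        if utility > best_utility:
            best_plan, best_utility = dict(pairs), utility
    if best_plan is None:
        raise ValueError("no feasible assignment")
    return best_plan, best_utility


plan, utility = assign_roles(
    subtasks=["grasp", "inspect"],
    agents=["robot", "human"],
    success_prob={("grasp", "robot"): 0.9, ("grasp", "human"): 0.95,
                  ("inspect", "robot"): 0.5, ("inspect", "human"): 0.9},
    cost={("grasp", "robot"): 0.1, ("grasp", "human"): 0.4,
          ("inspect", "robot"): 0.1, ("inspect", "human"): 0.4},
    robots=frozenset({"robot"}),
    conf_threshold=0.7,
)
print(plan)  # {'grasp': 'robot', 'inspect': 'human'}
```

Here the robot keeps the grasp because it is confident and cheap, while the inspection goes to the human because the robot's confidence falls below the threshold.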

3. Mode-Switching and Intention Tracking

High-reliability collaboration demands principled detection of when and how the system transitions between interaction modes:

Intention Tracking via Sensor Fusion: Integrates vision-based hand pose/detection (e.g., OpenPose + RealSense), force/torque sensing, and robot proprioception to estimate human intent in real time (Huang et al., 2022, Huang et al., 2023).
Scoring and Thresholds: Guidance intention is quantified (e.g., $S_{\rm guide}(t) = \alpha\,\|p_h(t)-p_r(t)\| + \beta\,\mathbf{1}\{F(t)\neq0\}$) and compared to tuned thresholds for FSM transitions (Huang et al., 2022).
Multi-Level Safety Modules: Vision-based workspace protection, contact-triggered halts, and hierarchical mode switching safeguard humans during coexistence, pause, or direct interaction (Huang et al., 2023).
Mode-switch Representation: Two-state FSMs or higher-level sequence models switch between coexistence ($M_0$) and cooperation ($M_1$), with explicit criteria for entry/exit based on proximity/contact force (Huang et al., 2022).
Human-Like Theory-of-Mind Models: Neural encoders predict teammates’ future actions, allowing human guidance of a single agent to propagate to coordinated team policies (Ji et al., 2024).

These mechanisms are realized in real systems operating at loop rates up to 100 Hz with detection latencies below 50 ms, supporting responsive, safe, and fluid role or control negotiation.
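
The scoring rule and the two-state switch between $M_0$ and $M_1$ can be sketched as follows. This is an illustrative sketch, not the cited implementation. It reads $p_h$ and $p_r$ as hand and robot positions and $F$ as a scalar contact force, which the text does not state explicitly. It also assumes a negative $\alpha$ so that a closer hand raises the score, a small force deadband `force_eps` in place of the exact test $F \neq 0$, and separate entry and exit thresholds for hysteresis. All numeric values are placeholders.

```python
import math
from enum import Enum
from typing import List, Sequence


class Mode(Enum):
    COEXISTENCE = "M0"
    COOPERATION = "M1"


def guide_score(p_h: Sequence[float], p_r: Sequence[float], force: float,
                alpha: float = -1.0, beta: float = 1.0,
                force_eps: float = 0.5) -> float:
    """S_guide = alpha * ||p_h - p_r|| + beta * 1{contact}."""
    distance = math.dist(p_h, p_r)
    # Deadband on the force reading (assumption) so sensor noise is not counted as contact.
    contact = 1.0 if abs(force) > force_eps else 0.0
    return alpha * distance + beta * contact


class ModeSwitcher:
    """Two-state FSM with hysteresis; threshold values are hypothetical."""

    def __init__(self, enter: float = 0.5, exit_: float = 0.0) -> None:
        self.mode = Mode.COEXISTENCE
        self.enter = enter
        self.exit_ = exit_

    def step(self, score: float) -> Mode:
        if self.mode is Mode.COEXISTENCE and score >= self.enter:
            self.mode = Mode.COOPERATION
        elif self.mode is Mode.COOPERATION and score < self.exit_:
            self.mode = Mode.COEXISTENCE
        return self.mode


robot_pos = (0.0, 0.0, 0.0)
# (hand position in metres, contact force in newtons) per control tick
readings = [
    ((1.0, 0.0, 0.0), 0.0),  # far away, no contact
    ((0.2, 0.0, 0.0), 2.0),  # close and touching
    ((0.6, 0.0, 0.0), 2.0),  # drifting away but still touching
    ((0.6, 0.0, 0.0), 0.0),  # contact released
]
switcher = ModeSwitcher()
trace: List[Mode] = [
    switcher.step(guide_score(hand, robot_pos, force))
    for hand, force in readings
]
print([m.value for m in trace])  # ['M0', 'M1', 'M1', 'M0']
```

Using a lower exit threshold than entry threshold keeps the system in cooperation during the third reading, which avoids chattering between modes when the score hovers near a single cutoff.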

4. Multimodal Communication and Interfaces

Advanced HRI leverages a wide spectrum of communication and embodiment channels:

Direct vs. Indirect Communication: Direct (speech, gesture) and indirect (interface-mediated, AR overlays) communication are combined to maximize situational awareness and trust in multi-human, multi-robot contexts. Mixed modes yield the highest information quality and user preference (72% of participants ranked mixed first) (Patel et al., 2021).
Multimodal Fusion: Speech, gesture, gaze, and locomotion are fused by LLM-driven planners, with constrained action libraries ensuring embodiable, socially grounded actions (Hasan et al., 24 Mar 2026).
AR/VR Shared Workspaces: Robots and humans share identical, synchronized AR overlays—supporting shared perception, proactive manipulation of virtual objects, and mutual world understanding. Mathematical models of human AR utility (cost) are computed from gaze, pose, and occlusion (Qiu et al., 2020).
Interaction Policies for Large Teams: AR-HMDs and centralized communications enable non-expert supervision, teleoperation, and dynamic command/control of 50 or more autonomous agents, with spatial anchoring and multimodal (air-tap, pinch, voice) input (Regal et al., 2024).
Transparency and Explanation: Multi-agent policy explanation uses sequence-of-landmarks approaches, combining strategy-conditioned state visuals and LLM-generated storyboards to train and support human exploration and collaboration (Pandya et al., 2023).

These modalities enable social context, group awareness, affective expression, and user trust—essential for both proximate and remote, small- and large-scale HRI.

5. Mixed-Initiative, Negotiation, and Role Dynamics

Mixed-initiative dialog and dynamic role assignment are central to adaptability in HRI:

Negotiated Task Allocation: Metaplanners and planners (typically LLM-based) parse human dialog and infer constraints on which partner (human or robot) performs task steps, solving constrained optimizations to minimize human effort given varying willingness signals (Yu et al., 7 Aug 2025).
Proposal, Acceptance, and Rejection Motifs: Agents can propose, counter-propose, accept, or reject assignments, with negotiation running until feasible, mutually acceptable assignments are reached. Dynamic estimation of human willingness ($p_{H,t}$) modulates allocation (Yu et al., 7 Aug 2025).
Automated Delegation Logic: For each subtask, robots estimate probability of success and human/robot cost; human delegation is triggered if confidence $< \theta_{\rm conf}$ or cost conditions are met (Sun et al., 30 Nov 2025).
Leader–Follower, Synchronous, and Peer Coordination: Systems implement leader–follower (supervisory) control, peer-to-peer parallel execution, and dynamic supervisory assignment via top-level planners in multi-agent hierarchies (Xin et al., 2024, 0908.2661).
Residual Adaptation without Explicit Role Assignment: Residual MARL architectures internalize partner dynamics, yielding emergent but unscripted leader–follower or synchronous organizational patterns in physically coupled tasks (Zhang et al., 4 Mar 2026).

User studies show that such interaction schemas measurably increase task success, reduce human workload, and are consistently rated as more communicative and satisfying relative to single-initiative or monolithic alternatives (Yu et al., 7 Aug 2025).
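
The delegation rule and the propose/accept/reject motif described above can be combined in a short sketch. The cost condition is not specified in the text, so this example assumes delegation is also proposed when the human's cost is lower than the robot's. The names `SubtaskEstimate`, `should_delegate`, and `negotiate`, and the deterministic `human_accepts` callback standing in for an estimated willingness signal, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class SubtaskEstimate:
    name: str
    p_success: float   # robot's estimated probability of success
    robot_cost: float
    human_cost: float


def should_delegate(est: SubtaskEstimate, theta_conf: float) -> bool:
    """Propose delegation on low confidence or on the assumed cost condition."""
    low_confidence = est.p_success < theta_conf
    # Assumed cost condition: the human can do the subtask more cheaply.
    cheaper_for_human = est.human_cost < est.robot_cost
    return low_confidence or cheaper_for_human


def negotiate(estimates: Iterable[SubtaskEstimate],
              human_accepts: Callable[[str], bool],
              theta_conf: float = 0.7) -> Dict[str, str]:
    """One proposal round per subtask: propose, then fall back if rejected."""
    plan: Dict[str, str] = {}
    for est in estimates:
        if should_delegate(est, theta_conf) and human_accepts(est.name):
            plan[est.name] = "human"        # proposal accepted
        elif est.p_success >= theta_conf:
            plan[est.name] = "robot"        # robot keeps or takes back the subtask
        else:
            plan[est.name] = "unresolved"   # rejected and robot not confident
    return plan


estimates = [
    SubtaskEstimate("pick", p_success=0.9, robot_cost=1.0, human_cost=2.0),
    SubtaskEstimate("thread", p_success=0.4, robot_cost=1.0, human_cost=2.0),
    SubtaskEstimate("lift", p_success=0.95, robot_cost=5.0, human_cost=1.0),
    SubtaskEstimate("untangle", p_success=0.3, robot_cost=1.0, human_cost=2.0),
]
# The human is only willing to take the threading subtask.
plan = negotiate(estimates, human_accepts=lambda name: name == "thread")
print(plan)
```

Subtasks left as "unresolved" are the cases where further negotiation rounds or a revised plan would be needed; a fuller system would loop until no such entries remain.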

6. Evaluation Metrics, Empirical Studies, and System Limitations

Empirical validation covers both objective and subjective measures across diverse settings:

Task Success and Efficiency: Quantitative benchmarks include success rate, completion rate, redundancy rate, command latency, localization accuracy, and team regroup efficiency (Sun et al., 30 Nov 2025, Regal et al., 2024, Zhang et al., 4 Mar 2026).
User Experience: Likert-scale ratings for trust, information quality, clarity, predictability, fluency, and satisfaction; cognitive load (NASA-TLX), preference rankings, and qualitative feedback (Patel et al., 2021, Pandya et al., 2023, Yu et al., 7 Aug 2025).
Safety and Robustness: Latency and responsiveness under vision/contact-based monitoring; zero collision incidence in pilot studies; false alarm rates under sensor fusion (Huang et al., 2023).
Scalability and Performance: Demonstrations of systems with 50 or more agents in AR-HMD frameworks, minimal degradation in communication and localization, and stress tests in simulated urban or microgravity settings (Regal et al., 2024, Xin et al., 2024).
Model Limitations: No statistically significant differences in raw task completion rate in some settings; scalability of centralized arbitration is an ongoing concern; vision-based intention estimation may not generalize to multi-human, high-occlusion arenas (Huang et al., 2022, Hasan et al., 24 Mar 2026).

Researchers note that full human-subject ablations, multi-agent extension beyond dyads, and formal scaling/robustness studies remain priority avenues for future work (Huang et al., 2022, Dahiya et al., 2022).

In summary, multi-agent and human–robot interaction modes are defined by taxonomic axes spanning team structure, interaction model, and modality; supported by formal FSMs, hierarchical planners, and decentralized MARL; structured by principled mode switching and intention tracking; embedded in multimodal, adaptive interfaces; shaped by mixed-initiative negotiation and emergent roles; and rooted in rigorous empirical evaluation. Their coordinated integration is essential for robust, safe, and socially appropriate robotics in dynamic, human-populated environments (Huang et al., 2022, Hasan et al., 24 Mar 2026, Yu et al., 7 Aug 2025, Xin et al., 2024, Sun et al., 30 Nov 2025, Pandya et al., 2023).