Model Cards for AI Teammates: Comparing Human-AI Team Familiarization Methods for High-Stakes Environments (2505.13773v1)

Published 19 May 2025 in cs.AI, cs.HC, and cs.MA

Abstract: We compare three methods of familiarizing a human with an AI teammate ("agent") prior to operation in a collaborative, fast-paced intelligence, surveillance, and reconnaissance (ISR) environment. In a between-subjects user study (n=60), participants either read documentation about the agent, trained alongside the agent prior to the mission, or were given no familiarization. Results showed that the most valuable information about the agent included details of its decision-making algorithms and its relative strengths and weaknesses compared to the human. This information allowed the familiarization groups to form sophisticated team strategies more quickly than the control group. Documentation-based familiarization led to the fastest adoption of these strategies, but also biased participants towards risk-averse behavior that prevented high scores. Participants familiarized through direct interaction were able to infer much of the same information through observation, and were more willing to take risks and experiment with different control modes, but reported weaker understanding of the agent's internal processes. Significant differences were seen between individual participants' risk tolerance and methods of AI interaction, which should be considered when designing human-AI control interfaces. Based on our findings, we recommend a human-AI team familiarization method that combines AI documentation, structured in-situ training, and exploratory interaction.

This paper (Bowers et al., 19 May 2025) investigates how different methods of familiarizing human operators with an AI teammate affect team performance, strategy, and understanding in a simulated high-stakes intelligence, surveillance, and reconnaissance (ISR) environment. The core problem addressed is the challenge of building effective human-AI teams, particularly in critical domains where misunderstanding the AI's capabilities or behavior can have severe consequences. Unlike human-human teams that share inherent cognitive frameworks, human-AI teams require explicit mechanisms for humans to build a reliable mental model of their AI partners.

The research explores the following questions:

  1. What information about an AI teammate is most effective for human-AI team coordination?
  2. Which familiarization methods are most effective?
  3. How does familiarization affect mission strategies?
  4. How does familiarization affect team performance?

The paper utilized a custom 2D Pygame ISR simulation environment in which a human-controlled aircraft and an AI-controlled aircraft collaborate to search for targets and identify weapons; identifying a weapon requires flying within its range, which incurs damage. The goal is to identify as many targets and weapons as possible within a time limit while preserving aircraft health. The AI teammate uses a heuristic path-planning policy based on search priorities (Target/Weapon, Area) and has three operating modes: Auto (AI chooses priorities based on heuristics), Priorities (human sets priorities, AI plans), and Override (human sets waypoints or commands hold). The AI aircraft was designed to be more resilient to damage, creating an opportunity for strategic risk allocation. The human can switch the AI's mode using commands such as Search Priorities, Hold, Waypoint, and Auto. Team strategies could vary along three axes: Control Mode (Auto, Priorities, Override), Risk Distribution (symmetric vs. asymmetric task allocation based on vulnerability), and Spatial Coordination (divide-and-conquer vs. tag team).
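To make the control scheme concrete, a minimal Python sketch of how such an agent's command interface might be organized is shown below. The class and method names are hypothetical, and the heuristic path planner itself is not reproduced, since the paper does not publish its implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple


class ControlMode(Enum):
    AUTO = auto()        # agent chooses its own search priorities heuristically
    PRIORITIES = auto()  # human sets priorities; agent plans its own path
    OVERRIDE = auto()    # human commands a specific waypoint or a hold


@dataclass
class SearchPriorities:
    focus: str  # e.g. "target" or "weapon"
    area: str   # e.g. a named search sector


class AITeammate:
    """Hypothetical command interface mirroring the paper's three control levels."""

    def __init__(self) -> None:
        self.mode = ControlMode.AUTO
        self.priorities: Optional[SearchPriorities] = None
        self.waypoint: Optional[Tuple[float, float]] = None
        self.holding = False

    # Commands available to the human operator.
    def command_auto(self) -> None:
        self.mode, self.holding = ControlMode.AUTO, False

    def command_priorities(self, p: SearchPriorities) -> None:
        self.mode, self.priorities, self.holding = ControlMode.PRIORITIES, p, False

    def command_waypoint(self, wp: Tuple[float, float]) -> None:
        self.mode, self.waypoint, self.holding = ControlMode.OVERRIDE, wp, False

    def command_hold(self) -> None:
        self.mode, self.holding = ControlMode.OVERRIDE, True

    def next_goal(self, position: Tuple[float, float]) -> Tuple[float, float]:
        """Return the agent's next navigation goal for this tick."""
        if self.mode is ControlMode.OVERRIDE:
            return position if (self.holding or self.waypoint is None) else self.waypoint
        # In AUTO the agent would also pick its own SearchPriorities here; the paper's
        # heuristic scoring of candidate search areas is not reproduced in this sketch.
        return position  # stand-in for a waypoint produced by the heuristic path planner
```

In Priorities mode the human supplies only a `SearchPriorities` object and leaves routing to the agent, while Override reduces the agent to a waypoint-follower or a hold; these are the levels of control the groups traded off against each other in the study.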

A between-subjects user study with 60 participants was conducted. Participants were divided into three groups, each receiving a different type of familiarization before completing four ISR rounds with the AI teammate:

  • Documentation: Participants read a 6-page document describing the AI's policy, strengths/weaknesses, algorithms, and technical details (inspired by model cards).
  • In-situ: Participants completed a training round with the AI teammate, allowing for observation and experimentation.
  • Control: Participants received minimal information, only being told the AI would help and shown how to command it.

Various metrics were collected: Mission Performance (points for IDs, remaining health/time, penalties for destruction), Strategy Metrics (time in each AI mode, target/weapon ID percentages by human/AI, spatial proximity), Situation Awareness (SAGAT scores), Workload (NASA TLX), Cognitive Trust, and AI Understanding (a 3-question quiz on AI behavior vignettes).
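As a rough illustration of how the mission-performance ingredients could combine into a single score, the function below is a hypothetical sketch; the point weights are placeholders, as the paper does not list its exact scoring constants.

```python
def mission_score(target_ids: int, weapon_ids: int,
                  health_remaining: float, time_remaining: float,
                  aircraft_destroyed: int,
                  # Placeholder weights: the paper does not publish its scoring constants.
                  pts_per_target: float = 50.0, pts_per_weapon: float = 100.0,
                  health_bonus_rate: float = 1.0, time_bonus_rate: float = 1.0,
                  destruction_penalty: float = 300.0) -> float:
    """Hypothetical mission score combining the elements described above:
    points for identifications, bonuses for remaining health and time,
    and penalties for losing an aircraft."""
    score = target_ids * pts_per_target + weapon_ids * pts_per_weapon
    score += health_remaining * health_bonus_rate + time_remaining * time_bonus_rate
    score -= aircraft_destroyed * destruction_penalty
    return score


# Example: 8 targets and 5 weapons identified, 60% average health, 30 s left, no losses.
print(mission_score(8, 5, health_remaining=60.0, time_remaining=30.0, aircraft_destroyed=0))
```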

Key Findings and Practical Implications:

Value of AI Information:

  • Participants found understanding the AI's decision-making processes (path planning, target prioritization) and its relative strengths/weaknesses (more resilient to damage) most valuable. This allowed them to anticipate the AI's actions and delegate tasks effectively, particularly assigning risky weapon identification to the AI.
  • Highly technical details like the specific heuristic formulas were generally not useful during gameplay.
  • Documentation provided a strong theoretical foundation but required time (1-2 rounds) for participants to operationalize.
  • In-situ familiarization led to a more behaviorist understanding and encouraged experimentation, particularly with the Auto mode which Control and Documentation groups were hesitant to use due to lack of understanding or low-risk bias.

Team Strategy Differences:

  • All groups preferred the medium-level Priorities mode, allowing the human to set goals while the AI handled path planning.
  • Compared to the In-situ group, the Documentation group (especially in early rounds) was more likely to use Override mode for direct control and less likely to use Auto mode.
  • Familiarization groups (Documentation and In-situ) were significantly more likely to command the AI to identify weapons (affirming H2) and less likely to command target identification compared to the Control group, especially in early rounds. This demonstrates a quicker adoption of the asymmetric risk distribution strategy enabled by understanding the AI's relative resilience.
  • The "divide-and-conquer" spatial strategy was preferred across groups.

Performance and Risk Delegation:

  • There was no statistically significant difference in average scores across groups (refuting H1).
  • However, the Documentation group showed less score variance and more consistent score growth across rounds.
  • The analysis revealed that Documentation-based familiarization led to faster adoption of delegating weapon ID to the AI. While this low-risk strategy ensured a consistent score (around 1300 points), it often kept teams from reaching the higher scores attainable when the human also identified weapons and earned significant bonuses for finishing early. The In-situ group, more willing to experiment, had higher variance and included some of the top-scoring teams. This suggests documentation can bias users towards specific strategies, potentially limiting exploration of higher-reward, higher-risk approaches.

AI Understanding and Workload:

  • While familiarization groups scored higher on the AI Understanding quiz, the difference was not statistically significant (refuting H3). In-situ participants seemed to infer path planning from observation better than Documentation participants.
  • Situation awareness scores were similar.
  • The In-situ group reported higher effort and frustration (refuting H4), likely due to the active experimentation required during training and gameplay.

AI/ML Experience:

  • Counter-intuitively, participants with prior AI/ML experience performed worse on the AI Understanding quiz within the familiarization groups, potentially due to applying incorrect preconceptions about heuristic policies.
  • Participants with AI/ML experience were also less likely to delegate weapon identification to the agent.

Implementation Recommendations:

Based on these findings, the authors recommend a hybrid human-AI team familiarization process combining:

  1. Documentation: Providing explicit information about the AI's decision-making processes and capabilities/limitations, akin to a 'model card'. The content should be carefully curated, focusing on operationally relevant details rather than complex formulas, and potentially tailored to the user's technical expertise.
  2. Structured In-situ Training: Including training sessions where the human works directly with the AI on representative tasks. This allows observational learning and building a behaviorist understanding.
  3. Exploratory Interaction: Providing a low-risk environment (like a dedicated training mode or sandbox) where humans can freely experiment with AI commands and modes to understand its behavior and develop their preferred interaction style without mission consequences.

Practical implementation of this hybrid approach would involve developing training modules that integrate these elements. For high-stakes environments, simulators like the ISR game used in the paper would be crucial for in-situ and exploratory training. Model cards for AI systems should prioritize explaining how the AI makes decisions and what its operational strengths and weaknesses are, rather than exposing raw algorithms. Interface design should account for individual variability in desired control levels and risk tolerance, potentially offering different levels of AI autonomy or guidance based on user preference or experience. Addressing common misconceptions about AI behavior (both for novices and potentially experts) during training is also vital. The finding that documentation can sometimes lead to overly conservative strategies suggests that training should also encourage appropriate risk-taking where beneficial to mission objectives.
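As a concrete illustration of that guidance, an operator-facing "teammate card" might expose only the fields the study found valuable, as in the sketch below; the field names and contents are illustrative assumptions, not the document used in the experiment.

```python
# Illustrative operator-facing "teammate card"; field names and contents are
# assumptions based on what participants found most useful, not the paper's document.
TEAMMATE_CARD = {
    "decision_making": {
        "policy": "heuristic path planning driven by search priorities (target/weapon, area)",
        "control_modes": ["Auto", "Priorities", "Override (waypoint or hold)"],
    },
    "strengths": [
        "more resilient to weapon damage than the human-piloted aircraft",
        "consistent coverage of an assigned search area",
    ],
    "weaknesses": [
        "no awareness of the human's intent beyond explicit commands",   # assumed example
        "heuristic priorities may miss time-critical opportunities",      # assumed example
    ],
    "recommended_use": "delegate risky weapon identification; divide the map to avoid overlap",
    "omitted_by_design": ["raw heuristic scoring formulas"],  # found not useful during gameplay
}
```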

Authors (4)
  1. Ryan Bowers (1 paper)
  2. Richard Agbeyibor (1 paper)
  3. Jack Kolb (7 papers)
  4. Karen Feigh (3 papers)