Interactive Evaluation in AI Systems
- Interactive evaluation is a dynamic assessment paradigm that evaluates systems through multi-turn trajectories, capturing actions, feedback, and state transitions.
- It utilizes diverse inputs—from human interactions and tool use to multi-agent scenarios—coupled with multidimensional metrics like task success and process efficiency.
- This approach offers scalable, robust insights into system adaptability, enabling richer and more actionable performance evaluations than static benchmarks.
Interactive evaluation is a model assessment paradigm in which system behavior is measured not via fixed-response outputs but through dynamic, multi-turn trajectories resulting from consequential interaction with environments, users, tools, or other agents. Unlike traditional response-centered benchmarks that judge isolated predictions on static inputs, interactive evaluation encompasses evidence generated from live processes—whether human-bot dialog, collaborative code repair, tool use sessions, or co-adaptive learning—and uses a trajectory-to-judgment protocol to support richer claims about process, recoverability, robustness, and system adaptability (Xuan et al., 18 May 2026).
1. Conceptual Foundations and Formal Definition
Interactive evaluation is formally defined as a mapping from evidence to judgments: the evidence comprises trajectory artifacts (sequences of actions, observations, feedback, and state transitions) generated by a model's interaction in a specified environment, and the judgments (scores, pass/fail verdicts, rankings, or diagnostics) are produced by an autonomous evaluation procedure (Xuan et al., 18 May 2026). This formalism generalizes classical evaluation, in which the evidence consists of static input–output pairs, to settings with temporally extended sequences and contextual dependencies.
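The mapping from trajectory artifacts to judgments can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`Step`, `Trajectory`, `Judgment`), the `goal_state_judge` helper, and the string-valued states are hypothetical and are not taken from Xuan et al.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One turn of interaction: what the system did and what came back."""
    action: str
    observation: str
    feedback: str = ""  # optional user or environment feedback


@dataclass
class Trajectory:
    """Evidence: the ordered steps produced by one interaction episode."""
    task_id: str
    steps: List[Step] = field(default_factory=list)
    final_state: str = ""


@dataclass
class Judgment:
    """Output of the evaluation program for one trajectory."""
    success: bool
    num_steps: int


# An evaluation program is any function from trajectories to judgments.
EvaluationProgram = Callable[[Trajectory], Judgment]


def goal_state_judge(goal: str) -> EvaluationProgram:
    """Build a judge that passes a trajectory whose final state equals `goal`."""
    def judge(traj: Trajectory) -> Judgment:
        return Judgment(success=traj.final_state == goal,
                        num_steps=len(traj.steps))
    return judge


# Hypothetical two-step episode, used only to exercise the types.
traj = Trajectory(
    task_id="demo",
    steps=[Step("open_form", "form shown"), Step("submit", "order placed")],
    final_state="order placed",
)
verdict = goal_state_judge("order placed")(traj)
print(verdict)
```

In this framing, a classical static benchmark is the special case of a one-step trajectory, where the judge only inspects a single input–output pair.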
This shift in evidence—away from isolated responses toward process-generated traces—is required as AI systems increasingly act through tools, respond over several iterations, coordinate with users, or play with and against other agents, making the quality of interaction dynamics as central as final output. Interactive evaluation thus subsumes and extends notions from usability studies, multi-agent testing, human-in-the-loop ML, and agentic system assessment.
2. Methodologies and Taxonomies
A principled approach to interactive evaluation requires careful design of both the admissible trajectory artifacts and the judgment logic. Xuan et al. (Xuan et al., 18 May 2026) propose a two-axis taxonomy:
- Evaluation Inputs:
  - Tools/Environments: Agents operate on evolving state via tool use, web navigation, program execution, or external APIs (e.g., WebArena, OSWorld, AppWorld).
  - Users: Human feedback, clarification, and error-correction drive the interaction (τ-bench, ToolSandbox, IQA-EVAL).
  - Other Agents: Multi-agent settings with negotiation, planning, or competition (e.g., SOTOPIA, BattleAgentBench).
  - Hybrid/Dynamic: Persistent state or cross-session contexts (MemoryArena, ARC-AGI-3).
- Evaluation Programs:
  - Task Success: Whether the system reaches a prescribed goal or completes a benchmark task.
  - Process Quality & Efficiency: Intermediate measures such as action localization, number of calls, tool selection quality, and economic use of resources.
  - Recoverability/Robustness: Ability to recover from error, adapt under distributional shift, or resist perturbations.
  - Safety, Alignment, Social Competence: Honesty, cooperation, constraint adherence in user/agent interaction.
This framework elucidates where extant interactive benchmarks are concentrated (principally tools/environments × task success) and highlights under-explored combinations, such as user-driven process quality or agent robustness to adversarial perturbation.
3. Exemplary Interactive Evaluation Systems Across AI Modalities
Interactive evaluation is instantiated in diverse research domains, each adapting the key principles to modality-specific artifacts and judgments:
- Dialog Systems and Human–Bot Interaction:
DSTC9’s Interactive Evaluation Track (Mehri et al., 2022) deployed competing dialog models in live user chat, logging multi-turn transcripts and engagement metrics (e.g., average dialog length, per-turn feedback, qualitative ratings), and reported a strong correlation between dialog length and human-perceived quality.
- Task-Oriented Dialogue With Simulated Users:
Cheng et al. (Cheng et al., 2022) introduce a goal-oriented user simulator that generates dynamic utterances in response to system outputs. It measures not just inform and success rates but also fluency (average negative log-probability under a fine-tuned LM) and session coherence (BERT-based classifier scores), which help detect degradation that task completion rates alone can mask.
- LLM Generation and Prompt Engineering:
EvalLM (Kim et al., 2023) and EvalAssist (Ashktorab et al., 2 Jul 2025) support criteria-driven, LLM-mediated, interactive prompt evaluation. Outputs are compared under user-defined rubrics in pairwise or scalar fashion, with iterative revision, LLM-generated explanations, and reliability analysis (e.g., Fleiss’ κ).
- Software Engineering and Code Feedback:
Interactive protocols for coding assistants (Pan et al., 25 Feb 2025, Rontogiannis et al., 26 Aug 2025) model error-correction as a multi-turn loop, where a simulated user or “interviewer” issues targeted hints or corrections, and success is measured by requirement-graph satisfaction, code test-case accuracy, and efficiency of repair.
- Data-Centric and Visual Generation:
DyEval (Mi et al., 2024) for text-to-image systems uses a dynamic tree of test topics and inputs, growing coverage adaptively based on failure patterns and leveraging LLMs for failure-trigger analysis; InteractScience (Chen et al., 10 Oct 2025) measures not only code correctness via programmatic testing but also visual output alignment using CLIP and VLM-based judges.
- Medical Imaging Segmentation:
A clinically grounded evaluation pipeline (Esmaeili et al., 10 Oct 2025) models user interactions (clicks, scribbles) as part of the test scenario, quantifies information retention, convergence, and robustness, and uses process metrics (normalized AUC, number of interactions to threshold Dice) tied to the annotation budget.
- Dialog Self-Play and Unsupervised Metrics:
Self-play evaluation (Ghandeharioun et al., 2019) for conversational agents uses trajectories of model–model interaction, computing sentiment and semantic coherence as linear proxy metrics that are reported to correlate (Pearson) with human ratings of interactive quality.
A summary of core systems and their evaluation axes:
| System/Paper | Evidence | Judgments |
|---|---|---|
| DSTC9 (Mehri et al., 2022) | Real user–bot dialog trajectories | Human/auto dialog quality, engagement |
| IQA-EVAL (Li et al., 2024) | Simulated dialog with persona variants | Fluency, helpfulness, efficiency, accuracy |
| EvalLM (Kim et al., 2023) | Prompt–output pairs, user criteria logs | Automated/rubric-based LLM judgments |
| When Benchmarks Talk (Pan et al., 25 Feb 2025) | Multi-turn code–feedback chains | Test-case accuracy, steerability, ranking shifts |
| DFEE (He et al., 2022) | DataFlow graphs, execution traces | Execution accuracy, graph diagnostics |
| DyEval (Mi et al., 2024) | Tree of test–prompt–image triplets | Bug count, pass-rate, failure triggers |
| InteractScience (Chen et al., 10 Oct 2025) | Unit-test logs, visual renderings | PFT pass rate, CLIP/VLM score |
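Several of the systems above validate an automatic or engagement-based proxy by correlating it with human ratings. A minimal sketch of that check, using a plain Pearson correlation, is shown below. The numbers are made up for illustration and do not reproduce any result from the cited papers; the variable names are likewise hypothetical.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Illustrative per-system values only (not data from any benchmark).
avg_dialog_length = [4.0, 6.5, 7.0, 9.5, 11.0]   # proxy metric
human_quality = [2.1, 3.0, 2.8, 3.9, 4.4]        # mean human rating
r = pearson_r(avg_dialog_length, human_quality)
print(f"Pearson r = {r:.3f}")
```

With only a handful of systems, such a coefficient is a rough sanity check rather than strong evidence; a real study would also report significance and sample size.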
4. Interactive Evaluation Workflows and Metrics
Designing interactive evaluation protocols requires explicit specification of:
- System and Evidence: Model identity, tool access, wrappers, and which trajectory artifacts (action/state logs, dialog turns, code edits) are captured (Xuan et al., 18 May 2026).
- Interaction Protocol: Initial conditions, allowed actions, user or agent (counterpart) policies, random seeding, and budget/horizon constraints (Chen et al., 10 Oct 2025).
- Multi-Dimensional Metrics:
  - Task success, e.g., whether all dialog slots are filled (Cheng et al., 2022), code passes all tests (Rontogiannis et al., 26 Aug 2025), or visual output meets checkpoints (Chen et al., 10 Oct 2025).
  - Process measures, such as action-count, dialog length, number of interactions to achieve threshold Dice or success (Esmaeili et al., 10 Oct 2025, Mehri et al., 2022).
  - Recoverability and robustness, such as post-error recovery rates (Pan et al., 25 Feb 2025), process repairs (Chen et al., 10 Oct 2025), and failure handling (Xuan et al., 18 May 2026).
  - Efficiency, such as number of hints to solution (Rontogiannis et al., 26 Aug 2025), minimal edit distance per behavior change (Pan et al., 25 Feb 2025), or bug discovery rate (Mi et al., 2024).
- Aggregation and Variance: Statistical measures across seeds, environments, persona variants, or evaluator models to capture brittleness and stability (Li et al., 2024, Xuan et al., 18 May 2026).
- Reporting Standards: Comprehensive documentation of system, protocol, evidence, metric definitions, and aggregation procedures (Xuan et al., 18 May 2026).
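The specification items above can be captured in a small, explicit record, with scores aggregated across seeds so that variance is reported alongside the mean. The sketch below is one possible shape, not a standard: `ProtocolSpec`, `aggregate`, and `toy_episode` are hypothetical names, and `toy_episode` is a deterministic stand-in for a real rollout.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ProtocolSpec:
    """What must be written down before any episode is run."""
    system: str               # model identity plus wrappers
    environment: str          # tool or environment the system interacts with
    max_turns: int            # interaction budget / horizon
    seeds: Tuple[int, ...]    # random seeds to repeat the protocol over


def aggregate(spec: ProtocolSpec,
              run_episode: Callable[[ProtocolSpec, int], float]) -> Dict[str, float]:
    """Run one episode per seed and summarize the spread of scores."""
    scores = [run_episode(spec, seed) for seed in spec.seeds]
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def toy_episode(spec: ProtocolSpec, seed: int) -> float:
    """Placeholder rollout: returns a score that depends only on the seed."""
    return 1.0 if seed % 2 == 0 else 0.5


spec = ProtocolSpec(system="model-x", environment="toy-env",
                    max_turns=10, seeds=(0, 1, 2, 3))
summary = aggregate(spec, toy_episode)
print(summary)
```

Reporting the minimum and standard deviation next to the mean makes brittleness visible: two systems with the same mean can differ sharply in worst-case behavior.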
Interaction-driven metrics often require normalization for action budgets, adaptive trajectories, and mixed-initiative feedback loops, reflecting the move from pass/fail or static BLEU/accuracy scores to continuous, multidimensional surfaces.
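Two budget-aware process metrics mentioned in this section, interactions needed to reach a score threshold and area under the score-versus-interaction curve, can be sketched as follows. The padding rule for runs that stop early and the trapezoidal normalization are assumptions made here for illustration; the cited pipelines may define them differently.

```python
from typing import Optional, Sequence


def interactions_to_threshold(scores: Sequence[float],
                              threshold: float) -> Optional[int]:
    """1-based interaction count at which the score first reaches `threshold`.

    Returns None if the threshold is never reached within the run.
    """
    for count, score in enumerate(scores, start=1):
        if score >= threshold:
            return count
    return None


def normalized_auc(scores: Sequence[float], budget: int) -> float:
    """Trapezoidal area under the score curve over a fixed budget, in [0, 1]
    when scores lie in [0, 1].

    Assumption: a run shorter than the budget is padded with its last score,
    and a run longer than the budget is truncated.
    """
    if budget < 2 or not scores:
        raise ValueError("need a budget of at least 2 and a non-empty run")
    padded = list(scores[:budget]) + [scores[-1]] * (budget - len(scores))
    area = sum((padded[i] + padded[i + 1]) / 2 for i in range(budget - 1))
    return area / (budget - 1)


# Illustrative Dice scores after each user interaction (made-up values).
dice_per_interaction = [0.5, 0.7, 0.9]
print(interactions_to_threshold(dice_per_interaction, 0.85))
print(normalized_auc(dice_per_interaction, budget=5))
```

Normalizing by the budget lets runs with different horizons be compared on the same scale, which is the point of tying process metrics to an annotation or action budget.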
5. Design Principles, Challenges, and Best Practices
Interactive evaluation imposes unique demands and risks (Xuan et al., 18 May 2026):
- Design for visibility and interpretability: Documenting all protocol elements and trajectory artifacts is needed for reproducible and interpretable results.
- Perturbation and repair: Evaluation protocols should probe not only ideal operation but also model behavior under misleading feedback, distribution shift, or adversarial conditions (Xuan et al., 18 May 2026, Pan et al., 25 Feb 2025).
- Multi-dimensional judgment: Report outcome, process, and risk separately; do not collapse all axes into a single scalar.
- Shared infrastructure: Use toolkits enabling log replay, querying, and trajectory exploration (e.g., InFerActive’s tree visualization (Hwangbo et al., 11 Dec 2025), BotEval’s interaction dashboard (Cho et al., 2024)).
- Evaluator/model dependence: Results may differ under different user simulators, persona settings, or LLM-as-judge variants; robustness to such heterogeneity should be measured (Li et al., 2024, Ashktorab et al., 2 Jul 2025).
- Standardization vs. diversity: Format standardization should not preclude protocol innovation; benchmarking infrastructure must allow adding new axes of interaction and judgment.
A key theme is that process-level artifacts—how and why a system acts, adapts, or fails—are first-class evidence; only through capturing and evaluating these can generalizable system-level claims be made.
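The advice to report outcome, process, and risk separately, and to probe recovery after errors, can be made concrete with a small report function. This is a sketch under assumptions: the `EpisodeLog` fields and the definition of recovery rate (success among episodes in which an error occurred) are chosen here for illustration and are not prescribed by the cited work.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EpisodeLog:
    success: bool      # outcome: did the episode reach the goal?
    num_actions: int   # process: how many actions were taken
    had_error: bool    # an injected or natural error occurred mid-episode
    violations: int    # risk: number of constraint violations observed


def report(logs: List[EpisodeLog]) -> Dict[str, float]:
    """Summarize outcome, process, recovery, and risk as separate fields."""
    if not logs:
        raise ValueError("no episodes to report on")
    errored = [e for e in logs if e.had_error]
    return {
        "success_rate": sum(e.success for e in logs) / len(logs),
        "mean_actions": sum(e.num_actions for e in logs) / len(logs),
        # Recovery is only defined over episodes where something went wrong.
        "recovery_rate": (sum(e.success for e in errored) / len(errored)
                          if errored else math.nan),
        "violation_rate": sum(e.violations > 0 for e in logs) / len(logs),
    }


# Made-up logs for four episodes.
logs = [
    EpisodeLog(success=True, num_actions=5, had_error=False, violations=0),
    EpisodeLog(success=True, num_actions=9, had_error=True, violations=0),
    EpisodeLog(success=False, num_actions=12, had_error=True, violations=1),
    EpisodeLog(success=True, num_actions=4, had_error=False, violations=0),
]
print(report(logs))
```

Keeping the four fields apart avoids the situation where a high success rate hides a poor recovery rate or a non-trivial violation rate.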
6. Domain-Specific Advances and Ongoing Research
Recent research illustrates the breadth of interactive evaluation:
- Knowledge Graph Completion: PROBE-Web defines a continuous landscape of sharpness and popularity-bias for rank-based evaluation, allowing in situ exploration of trade-offs relevant to KGC practitioners (Moon et al., 8 Jun 2026).
- Explorable Explanations: By extracting finite state machine models from AI-generated HTML/JS demos, EE-Eval measures structural, semantic, and behavioral interactivity alignment with pedagogical intent, outperforming pure code-execution baselines (Wang et al., 30 Jun 2026).
- Human–Model QA: IQA-EVAL shows that persona-conditioned, LLM-based evaluation agents align closely with human ratings of interactive QA, allowing principled, scalable assessment of dialogue-like QA dynamics (Li et al., 2024).
- Medical Segmentation: Evaluation pipelines are proposed that enforce native-space, information-preserving prompts and rigorous reporting of convergence/failure, advancing from synthetic “toy” evaluations to clinically actionable assessment (Esmaeili et al., 10 Oct 2025).
- Adaptive Visual Assessment: DyEval demonstrates that LLM-driven, user-steered prompt expansion uncovers up to 2.56× more model failures than static test sets in text-to-image evaluation, exposing rare, compositional, and culturally sensitive failure modes (Mi et al., 2024).
- Model Debugging and Alignment: Interactive systems such as EvalLM and EvalAssist support not only prompt tuning and model selection but also iterative refinement of rubrics and reliability-calibrated LLM-based judgments, facilitating more aligned deployment (Kim et al., 2023, Ashktorab et al., 2 Jul 2025).
7. Implications and Future Directions
The emergence of interactive evaluation as a rigorous evaluation paradigm has several implications:
- Scientific Rigor: Merely adopting fixed-dataset paradigms in agentic settings leads to brittle, incomplete, and non-comparable results. Explicitly treating evaluation itself as a subject of design science clarifies what claims are supported by a given protocol, and what cannot be inferred (Xuan et al., 18 May 2026).
- Customization and Personalization: Interactive evaluation enables direct alignment of system performance to heterogeneous user goals, demands for novelty, robustness, or risk tolerance, as realized in PROBE-Web’s perspective-aware landscape and IQA-EVAL’s persona conditioning (Moon et al., 8 Jun 2026, Li et al., 2024).
- Scalability and Diagnosis: LLM-powered or programmatic interactive assessment (e.g., self-play, agent-simulated feedback, LLM-as-judge) offers scalable proxies that approach human-level diagnostic power while enabling precise error localization and process analysis in otherwise intractable evaluation domains (Ghandeharioun et al., 2019, Kim et al., 2023, Li et al., 2024).
- Robustness to Distribution Shift: Interactive protocols directly assess recoverability, resistance to perturbation, and adaptation to feedback, exposing brittleness that is invisible in snapshot static testing (Pan et al., 25 Feb 2025, Esmaeili et al., 10 Oct 2025).
A plausible implication is that interactive evaluation will increasingly underpin both model development and deployment monitoring, not only as a means of research benchmarking but as a necessary substrate for robust, adaptive, and responsible real-world AI systems. The field continues to advance methodologies for standardization without rigidity, for scalable, personalized, and context-driven judgment, and for closing the gap between abstract capability claims and evidence-based system-level performance.