Evaluation of Agentic AI Systems Under Realistic Constraints
Establish reproducible, realistic evaluation frameworks and benchmarks for agentic artificial intelligence systems used across scientific workflows. These should accurately measure performance under real-world constraints, including adherence to provenance, claim-level citation correctness, tool-call success rates, inter-agent handoff quality, confidence calibration, and time or cost to solution, all while facing API failures, shifting corpora, and strict latency or compute budgets.
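As a concrete illustration, here is a minimal Python sketch of how two of the metrics named above, tool-call success rate and confidence calibration (via expected calibration error), might be computed while injecting transient API failures. All names (flaky_tool, EpisodeResult, the failure rate, the stand-in agent loop) are hypothetical and not drawn from the cited paper.

```python
"""Sketch: tool-call success rate and expected calibration error (ECE)
for an agent evaluated under injected API failures. Illustrative only."""

import random
import time
from dataclasses import dataclass, field

@dataclass
class ToolCallRecord:
    succeeded: bool
    latency_s: float

@dataclass
class EpisodeResult:
    correct: bool            # ground-truth check of the final answer
    confidence: float        # agent's self-reported confidence in [0, 1]
    tool_calls: list = field(default_factory=list)

def flaky_tool(failure_rate: float):
    """Wrap tool invocations so each one fails with the given
    probability, simulating transient API outages."""
    def call(fn, *args, **kwargs):
        start = time.perf_counter()
        ok = random.random() >= failure_rate
        result = fn(*args, **kwargs) if ok else None
        return result, ToolCallRecord(ok, time.perf_counter() - start)
    return call

def tool_call_success_rate(episodes):
    calls = [c for e in episodes for c in e.tool_calls]
    return sum(c.succeeded for c in calls) / max(len(calls), 1)

def expected_calibration_error(episodes, n_bins: int = 10):
    """Standard binned ECE: |mean confidence - accuracy| per confidence
    bin, weighted by the fraction of episodes in the bin."""
    bins = [[] for _ in range(n_bins)]
    for e in episodes:
        idx = min(int(e.confidence * n_bins), n_bins - 1)
        bins[idx].append(e)
    ece, total = 0.0, len(episodes)
    for b in bins:
        if not b:
            continue
        acc = sum(e.correct for e in b) / len(b)
        conf = sum(e.confidence for e in b) / len(b)
        ece += len(b) / total * abs(conf - acc)
    return ece

if __name__ == "__main__":
    random.seed(0)
    call = flaky_tool(failure_rate=0.2)
    episodes = []
    for _ in range(100):
        # Stand-in "agent": one tool call; the answer is correct only
        # if the call succeeds, and confidence reflects that outcome.
        result, rec = call(lambda: 42)
        correct = result == 42
        confidence = 0.9 if rec.succeeded else 0.5
        episodes.append(EpisodeResult(correct, confidence, [rec]))
    print(f"tool-call success rate: {tool_call_success_rate(episodes):.2f}")
    print(f"ECE: {expected_calibration_error(episodes):.3f}")
```

A real harness would replace the stand-in agent with actual task episodes and add the remaining dimensions (provenance checks, citation correctness, handoff quality, cost accounting); the point here is only that each metric needs a logged, reproducible record of tool calls and confidences.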
References
Evaluating agentic systems under the constraints they face in practice remains an open problem.
— Rethinking Science in the Age of Artificial Intelligence
(arXiv:2511.10524, Eren et al., 13 Nov 2025), Subsection "Realistic Benchmarks and Evaluation", Section "AI and Science"