Evaluation of Agentic AI Systems Under Realistic Constraints

Establish reproducible, realistic evaluation frameworks and benchmarks for agentic artificial intelligence systems used across scientific workflows. Such frameworks should accurately measure performance under real-world constraints, including adherence to provenance, claim-level citation correctness, tool-call success rates, inter-agent handoff quality, confidence calibration, and time or cost to solution in the face of API failures, shifting corpora, and strict latency or compute budgets.
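
To make the list of metrics above concrete, here is a minimal illustrative sketch (not from the paper) of how a benchmark harness might record them per run; all class, field, and method names are assumptions, and calibration is measured with a standard expected-calibration-error (ECE) binning scheme.

from dataclasses import dataclass, field

@dataclass
class AgentRunMetrics:
    """Hypothetical per-run record for the metrics listed above."""
    tool_calls_attempted: int = 0
    tool_calls_succeeded: int = 0
    claims_total: int = 0              # claims requiring a citation
    claims_correctly_cited: int = 0    # claim-level citation correctness
    handoffs_total: int = 0            # inter-agent handoffs attempted
    handoffs_accepted: int = 0         # downstream agent could proceed
    confidences: list = field(default_factory=list)  # (stated_conf, was_correct)
    wall_clock_s: float = 0.0          # time to solution
    dollars_spent: float = 0.0         # cost to solution

    def tool_call_success_rate(self) -> float:
        return self.tool_calls_succeeded / max(self.tool_calls_attempted, 1)

    def citation_correctness(self) -> float:
        return self.claims_correctly_cited / max(self.claims_total, 1)

    def handoff_quality(self) -> float:
        return self.handoffs_accepted / max(self.handoffs_total, 1)

    def calibration_error(self, bins: int = 10) -> float:
        """Expected calibration error over (confidence, correctness) pairs."""
        if not self.confidences:
            return 0.0
        totals = [[0, 0.0, 0.0] for _ in range(bins)]  # count, conf sum, correct sum
        for conf, correct in self.confidences:
            b = min(int(conf * bins), bins - 1)
            totals[b][0] += 1
            totals[b][1] += conf
            totals[b][2] += float(correct)
        n = len(self.confidences)
        return sum(abs(s / c - k / c) * c for c, s, k in totals if c) / n

Aggregating such records across runs, rather than reporting task accuracy alone, is one way to operationalize the process-oriented evaluation the problem statement calls for.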

Background

The paper argues that agentic AI systems (those that plan, act, and reflect across stages of scientific work) are increasingly used throughout the research pipeline, from literature retrieval to experiment design and execution. Despite this growing role, the authors note that assessing such systems under conditions that mirror real scientific practice remains unresolved.

They highlight that current testbeds reveal issues such as data contamination, brittle scaling with compute, and poor generalization beyond memorization. The authors suggest that evaluation must go beyond task accuracy to include process-oriented metrics and robustness to operational constraints commonly encountered in research environments.
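
As a minimal sketch of what testing robustness to one such operational constraint might look like (this is an illustration, not the paper's method), a harness could wrap each tool invocation in a fault injector; the failure rate, retry budget, and tool interface below are all assumptions.

import random
import time

def call_with_faults(tool_fn, *args, failure_rate=0.2, max_retries=3,
                     backoff_s=0.5, rng=None, **kwargs):
    """Invoke a tool while randomly injecting simulated API failures,
    so a harness can measure tool-call success and time-to-solution
    under outages. Returns (result, attempts_used)."""
    rng = rng or random.Random(0)  # seeded for reproducible benchmark runs
    for attempt in range(1, max_retries + 1):
        if rng.random() < failure_rate:
            time.sleep(backoff_s * attempt)  # simulated outage plus linear backoff
            continue
        return tool_fn(*args, **kwargs), attempt
    raise RuntimeError("injected API failure persisted past retry budget")

# Example: probe a hypothetical literature-search tool at a 30% fault rate.
# result, attempts = call_with_faults(search_corpus, "agentic evaluation",
#                                     failure_rate=0.3)

Sweeping failure_rate (or, analogously, swapping in a shifted corpus or tightening a latency budget) then exposes how success rates and time or cost to solution degrade under the operational stresses described above.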

References

"Evaluating agentic systems under the constraints they face in practice remains an open problem."

Rethinking Science in the Age of Artificial Intelligence (Eren et al., arXiv:2511.10524, 13 Nov 2025), Subsection "Realistic Benchmarks and Evaluation" under Section "AI and Science".