Definition of Hilbert’s avg. pass@ metric on PutnamBench
Determine the precise definition and computation procedure of the "avg. pass@" metric used to report Hilbert’s performance on the PutnamBench benchmark, including how parallelized sub-agent calls are aggregated and how the metric relates to standard pass@n evaluation.
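To make the distinction concrete, the sketch below contrasts two quantities that the label could plausibly refer to: a standard pass@n success rate (the fraction of problems solved within n attempts) and an average count of sub-agent calls per problem, which is only the interpretation guessed at in the source quoted under References. The function names and the toy numbers are hypothetical, chosen purely for illustration. Nothing here is taken from Hilbert's implementation or from PutnamBench results.

```python
from statistics import mean

def pass_at_n(attempts_to_solve, n):
    """Standard pass@n: fraction of problems solved within n attempts.

    attempts_to_solve[i] is the index (1-based) of the first successful
    attempt on problem i, or None if the problem was never solved.
    """
    solved = [a is not None and a <= n for a in attempts_to_solve]
    return sum(solved) / len(solved)

def avg_calls(calls_per_problem):
    """Assumed reading of 'avg. pass@': mean number of sub-agent calls
    spent per problem. This is a guess, not a confirmed definition."""
    return mean(calls_per_problem)

# Made-up toy data for four problems (not real benchmark numbers).
toy_attempts = [1, 3, None, 2]      # first successful attempt, None = unsolved
toy_calls = [120, 450, 900, 210]    # hypothetical sub-agent calls per problem

print(pass_at_n(toy_attempts, n=2))
print(avg_calls(toy_calls))
```

Under this reading the two numbers measure different things: pass@n is a success rate under a fixed attempt budget, while an average call count is a cost measure, so the two would not be directly comparable without the exact definition.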
References
avg. pass@ is used for Hilbert, an agentic framework that parallelizes reasoning and verification at different levels. The exact definition of this metric is unclear; our best assumption is that it reflects the average number of calls to Hilbert's sub-agents.
— Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics
(2510.12787 - Tredici et al., 14 Oct 2025) in Section 5.2 (Results), PutnamBench paragraph; PutnamBench results table (label tab:putnam_results)