Definition of Hilbert’s avg. pass@ metric on PutnamBench

Determine the precise definition and computation procedure of the "avg. pass@" metric used to report Hilbert’s performance on the PutnamBench benchmark, including how parallelized sub-agent calls are aggregated and how the metric relates to standard pass@n evaluation.

Background

In reporting leaderboard comparisons on PutnamBench, the authors explain that the Compute column lists pass@n for most systems, whereas Hilbert reports an "avg. pass@" reflecting its parallelized agentic setup.

The authors explicitly note that the meaning of this averaged metric is not clearly specified, complicating fair comparison across systems and raising the need to precisely define the metric and its aggregation method.
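To make the comparison problem concrete, the sketch below contrasts a standard pass@n computation with the interpretation the authors tentatively assume for "avg. pass@", namely an average number of sub-agent calls per problem. This is an illustration under stated assumptions, not Hilbert's actual procedure: the function names, the data layout, and the numbers are all hypothetical, and the true definition of "avg. pass@" remains unspecified.

```python
from statistics import mean

def pass_at_n(attempt_results, n):
    """Standard pass@n: fraction of problems solved by at least one
    of the first n independent attempts.

    attempt_results: one list of booleans per problem (hypothetical layout).
    """
    return mean([1.0 if any(results[:n]) else 0.0 for results in attempt_results])

def avg_calls(calls_per_problem):
    """ASSUMED reading of "avg. pass@": mean number of sub-agent calls
    spent per problem. This is only the authors' best guess, not a
    confirmed definition."""
    return mean(calls_per_problem)

# Hypothetical data: two problems, two attempts each.
attempts = [[False, True], [False, False]]
# Hypothetical per-problem sub-agent call counts.
calls = [10, 30]

solved_rate = pass_at_n(attempts, n=2)   # a success rate between 0 and 1
budget_used = avg_calls(calls)           # a compute figure, not a success rate
```

Under this reading the two quantities measure different things: pass@n fixes a sampling budget and reports a success rate, while an average call count reports how much compute was consumed, which is why the two columns are hard to compare directly.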

References

avg. pass@ is used for Hilbert, an agentic framework that parallelizes reasoning and verification at different levels. The exact definition of this metric is unclear; our best assumption is that it reflects the average number of calls to Hilbert's sub-agents.

Source: Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics (2510.12787 - Tredici et al., 14 Oct 2025), Section 5.2 (Results), PutnamBench paragraph; PutnamBench results table (label tab:putnam_results)
