Ascertain tool usage in closed-weight API models
Ascertain whether closed-weight large language models accessed via APIs employ external tools, such as calculators or theorem provers, within their inference pipelines. Such tool use could affect measured reasoning gaps.
References
For closed weights models accessed through APIs, it is unclear whether tool usage is part of the inference pipeline.
— Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
(2402.19450 - Srivastava et al., 29 Feb 2024) in Threats to the validity of results (Tool usage)