Ascertain tool usage in closed-weight API models

Ascertain whether closed-weight large language models accessed via APIs employ external tools, such as calculators or theorem provers, within their inference pipelines, since such tool use could affect measured reasoning gaps.

Background

Reasoning gap measurements can be confounded if a model relies on external tools during inference, especially on arithmetic or symbolic tasks, because the measured performance would then reflect the tool as well as the model. The authors explicitly note that it is unclear whether closed-weight models served through APIs use tools, and they suggest that future evaluations control for tool usage by using open-weight models.

References

For closed weights models accessed through APIs, it is unclear whether tool usage is part of the inference pipeline.

— Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap (2402.19450 - Srivastava et al., 2024) in Threats to the validity of results (Tool usage)
