Ascertain tool usage in closed-weight API models

Ascertain whether closed-weight large language models accessed via APIs employ external tools—such as calculators or theorem provers—within their inference pipelines, which could affect measured reasoning gaps.

Background

Reasoning gap measurements could be confounded if models rely on external tools during inference, especially for arithmetic or symbolic tasks. The authors explicitly note that it is unclear whether closed-weight, API-served models use such tools, and suggest that future evaluations control for tool usage with open-weight models.

References

For closed weights models accessed through APIs, it is unclear whether tool usage is part of the inference pipeline.

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap (2402.19450 - Srivastava et al., 29 Feb 2024) in Threats to the validity of results (Tool usage)