Cause of mistral-small3.2:24b collapse on Tier B despite strong Tier A0/A performance

Investigate why mistral-small3.2:24b achieves high task completion rates on AgentFloor Tier A0 (96%) and Tier A (93%) but collapses to 16% on Tier B (sequential two-tool chaining). Determine whether a tool-template incompatibility in the native tool-calling path, rather than a capability ceiling, explains the drop.

Background

Tier B evaluates sequential two-tool chains, in which the output of one tool feeds the next. mistral-small3.2:24b reaches 96% on Tier A0 and 93% on Tier A but only 16% on Tier B.
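To make the Tier B task shape concrete, the following is a minimal sketch of a sequential two-tool chain. The tools `lookup_user_id` and `get_balance`, the `TOOLS` registry, and the `run_chain` helper are hypothetical names invented for illustration; they are not taken from AgentFloor, and the benchmark's actual tools and harness may differ.

```python
from typing import Any, Callable, Dict

# Hypothetical tools, for illustration only; not taken from AgentFloor.
def lookup_user_id(name: str) -> int:
    """First tool: map a user name to a numeric ID."""
    directory = {"alice": 101, "bob": 102}
    return directory[name]

def get_balance(user_id: int) -> float:
    """Second tool: map a numeric ID to an account balance."""
    balances = {101: 250.0, 102: 75.5}
    return balances[user_id]

# Registry a harness might use to dispatch tool calls by name (assumed design).
TOOLS: Dict[str, Callable[[Any], Any]] = {
    "lookup_user_id": lookup_user_id,
    "get_balance": get_balance,
}

def run_chain(first_tool: str, first_arg: Any, second_tool: str) -> Any:
    """Call the first tool, then pass its output as the second tool's input."""
    intermediate = TOOLS[first_tool](first_arg)
    return TOOLS[second_tool](intermediate)

if __name__ == "__main__":
    print(run_chain("lookup_user_id", "alice", "get_balance"))
```

In this shape the model must emit two well-formed tool calls in order and carry the first result into the second call, so a formatting problem in either call is enough to fail the task even when the model has chosen the right tools.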

The authors note that an SDR of 28.6% on Tier B is consistent with a tool-template incompatibility in the native tool-calling path rather than a capability ceiling. They present this as an unresolved explanation and list the result among three observations they explicitly state they do not fully understand.

References

Three observations we do not fully understand. Three cells in the corpus are descriptively striking and resist clean explanation. Mistral-small3.2:24b reaches 96% on A0 and 93% on A but collapses to 16% on B; an SDR of 28.6% on B is consistent with a tool-template incompatibility on the native tool-calling path rather than a capability ceiling.

AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go? (2605.00334 - Karmakar et al., 1 May 2026) in Discussion, Section 7 ("Three observations we do not fully understand.")