Cause of mistral-small3.2:24b collapse on Tier B despite strong Tier A0/A performance
Investigate why mistral-small3.2:24b reaches high task completion rates on AgentFloor Tier A0 (96%) and Tier A (93%) but collapses to 16% on Tier B (sequential two-tool chaining). Determine whether a tool-template incompatibility on the native tool-calling path, rather than a capability ceiling, explains the drop.
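One way to start separating the two explanations is to label each failed Tier B episode by where it first breaks. The sketch below is only an illustration under stated assumptions: the log format (a list of raw tool-call strings per episode, each expected to be a JSON object with a "name" field), the label names, and the helper functions classify_episode and tally are all hypothetical and are not taken from AgentFloor or from the paper's method.

```python
import json
from collections import Counter


def classify_episode(raw_calls, expected_tools):
    """Label the first point of failure in one sequential tool-chaining episode.

    raw_calls: raw tool-call strings emitted by the model, in order
               (hypothetical log format, assumed for this sketch).
    expected_tools: tool names the task requires, in order.
    """
    for step, expected in enumerate(expected_tools, start=1):
        if step > len(raw_calls):
            # The model stopped before issuing this call.
            return f"missing_call_step{step}"
        try:
            call = json.loads(raw_calls[step - 1])
        except json.JSONDecodeError:
            # The call could not be parsed at all.
            return f"format_error_step{step}"
        if not isinstance(call, dict) or "name" not in call:
            # Parsed, but not shaped like a tool call.
            return f"format_error_step{step}"
        if call["name"] != expected:
            # Well-formed call to the wrong tool.
            return f"wrong_tool_step{step}"
    return "ok"


def tally(episodes):
    """Count failure labels over (raw_calls, expected_tools) pairs."""
    return Counter(classify_episode(raw, exp) for raw, exp in episodes)


# Made-up episodes for illustration only; this is not AgentFloor data.
EXAMPLE_EPISODES = [
    (['{"name": "search", "arguments": {}}',
      '{"name": "fetch", "arguments": {}}'], ["search", "fetch"]),
    (['{"name": "search", "arguments": {}}',
      '[TOOL_CALLS] fetch'], ["search", "fetch"]),
    (['{"name": "search"}'], ["search", "fetch"]),
    (['{"name": "search"}', '{"name": "search"}'], ["search", "fetch"]),
]

example_counts = tally(EXAMPLE_EPISODES)
```

If failures on real logs concentrated in the format_error and missing_call labels at the second step, that would be consistent with (though not proof of) a template problem on the native path; a concentration in wrong_tool labels would point more toward a capability limit.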
References
Three observations we do not fully understand. Three cells in the corpus are descriptively striking and resist clean explanation. Mistral-small3.2:24b reaches 96% on A0 and 93% on A but collapses to 16% on B; an SDR of 28.6% on B is consistent with a tool-template incompatibility on the native tool-calling path rather than a capability ceiling.
— AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
(2605.00334 - Karmakar et al., 1 May 2026) in Discussion, Section 7 ("Three observations we do not fully understand.")