Non-monotonic scaling on Tier D (multi-source synthesis with recovery)
Determine why, on the Tier D tasks of the AgentFloor benchmark (multi-source synthesis with conflict recovery), the 4B model nemotron-3-nano:4b (36% task completion rate, reported interval [28, 44]) outperforms the 26B model gemma4:26b (32% task completion rate). The result indicates that this capability does not scale monotonically with parameter count in this corpus.
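Before looking for a mechanistic explanation, it may help to check how far apart the two completion rates above are relative to sampling noise. The following is a minimal sketch, not the paper's method: it assumes a Wilson score interval and a hypothetical count of 100 trials per cell (N_TRIALS), neither of which is stated in the source. The helper names wilson_interval and intervals_overlap are illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# ASSUMPTION: the number of trials per cell is hypothetical, not from the source.
N_TRIALS = 100

small_model = wilson_interval(36, N_TRIALS)  # nemotron-3-nano:4b at 36%
large_model = wilson_interval(32, N_TRIALS)  # gemma4:26b at 32%
overlap = intervals_overlap(small_model, large_model)

print(f"4B  interval: [{small_model[0]:.3f}, {small_model[1]:.3f}]")
print(f"26B interval: [{large_model[0]:.3f}, {large_model[1]:.3f}]")
print(f"intervals overlap: {overlap}")
```

Overlapping intervals are only a rough screen, not a formal test of the difference. If the real per-cell trial counts are available, substituting them for N_TRIALS gives a first indication of whether the 4-point gap needs a capability-level explanation at all.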
References
Three observations we do not fully understand. Three cells in the corpus are descriptively striking and resist clean explanation. The highest tier-D cell is a 4B model (nemotron-3-nano:4b, 36% [28, 44]), not the 26B mid-scale (gemma4:26b, 32%): multi-source synthesis with recovery does not scale monotonically with parameter count in this corpus.
— AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
(2605.00334 - Karmakar et al., 1 May 2026) in Discussion, Section 7 ("Three observations we do not fully understand.")