Non-monotonic scaling on Tier D (multi-source synthesis with recovery)

Determine why, on the AgentFloor benchmark’s Tier D tasks, which require multi-source synthesis with conflict recovery, the 4B model nemotron-3-nano:4b (36% task completion rate) outperforms the 26B model gemma4:26b (32% task completion rate). The result indicates that this capability does not scale monotonically with parameter count.

Background

Tier D in AgentFloor tests multi-source synthesis with conflict recovery in an abstract, deterministic tool environment. In the reported results, the best Tier D performance among open-weight models is achieved by a 4B model (nemotron-3-nano:4b, 36%) rather than by the larger 26B model (gemma4:26b, 32%).

The authors explicitly state that they do not fully understand this observation and that it resists a clean explanation, suggesting that capability on Tier D does not increase monotonically with model size in their corpus.
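
The comparison can be made concrete with a minimal Python sketch using only the figures quoted in the reference below. The `Cell` type and the helper names `is_monotonic` and `within_interval` are hypothetical, introduced here for illustration. The bracketed range [28, 44] is assumed to be a confidence interval for the 4B cell; the excerpt does not state its level, and no interval is quoted for the 26B cell.

```python
from typing import NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    model: str
    params_b: float                            # parameter count, in billions
    rate: float                                # Tier D completion rate, percent
    ci: Optional[Tuple[float, float]] = None   # quoted interval, percent (if any)


# Figures as quoted from the paper's discussion section.
CELLS = [
    Cell("nemotron-3-nano:4b", 4, 36.0, (28.0, 44.0)),
    Cell("gemma4:26b", 26, 32.0),  # no interval quoted for this cell
]


def is_monotonic(cells):
    """True if completion rate never decreases as parameter count grows."""
    ordered = sorted(cells, key=lambda c: c.params_b)
    return all(a.rate <= b.rate for a, b in zip(ordered, ordered[1:]))


def within_interval(point, ci):
    """True if a point estimate lies inside a (low, high) interval."""
    low, high = ci
    return low <= point <= high


small, large = CELLS
print(is_monotonic(CELLS))                    # False: 36% at 4B, then 32% at 26B
print(within_interval(large.rate, small.ci))  # True: 32 lies inside [28, 44]
```

The second check is not a significance test. It only shows that the 26B point estimate falls inside the interval quoted for the 4B model, so the 4-point gap may be within sampling noise, which is consistent with the authors' caution about the result.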

References

Three observations we do not fully understand. Three cells in the corpus are descriptively striking and resist clean explanation. The highest tier-D cell is a 4B model (nemotron-3-nano:4b, 36% [28, 44]), not the 26B mid-scale (gemma4:26b, 32%): multi-source synthesis with recovery does not scale monotonically with parameter count in this corpus.

AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go? (2605.00334 - Karmakar et al., 1 May 2026) in Discussion, Section 7 ("Three observations we do not fully understand.")