Scaling Behavior with More Diverse Model Mixtures

Determine how the accuracy and convergence behavior of the Fortytwo Protocol's pairwise ranking consensus scale with the number of participating nodes when the swarm comprises a more diverse mixture of models (heterogeneous architectures, sizes, and specializations). In particular, characterize whether the performance plateau observed at approximately 30 nodes shifts or changes under increased model diversity.

Background

The paper evaluates Fortytwo's performance on GPQA Diamond as the number of nodes increases from 3 to 35, observing rapid gains up to 7 nodes and a plateau around 30 nodes. This analysis uses a particular mix of models and temperature settings, and shows consistent superiority over majority voting across swarm sizes.

In the key observations following the scaling experiment, the authors explicitly note an open question regarding how scaling might behave with a more diverse mixture of models. This highlights uncertainty about whether greater architectural and capability heterogeneity would alter the plateau point, improve asymptotic performance, or change convergence dynamics.
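The open question above can be probed with a minimal Monte Carlo sketch. The model below is entirely illustrative and makes strong simplifying assumptions: each node answers a binary question correctly with some competence probability, a `diversity` parameter widens the spread of competences across the swarm, and a hypothetical competence-weighted vote stands in for the protocol's actual pairwise ranking consensus. The base competence (0.6), spread values, and weighting scheme are assumptions for illustration, not figures from the paper.

```python
import random

def simulate(n_nodes, diversity, n_trials=2000, seed=0):
    """Estimate swarm accuracy on a binary task for two aggregation rules.

    diversity in [0, 1] widens the spread of per-node competence,
    loosely modeling a more heterogeneous model mixture.
    Returns (majority_vote_accuracy, weighted_consensus_accuracy).
    """
    rng = random.Random(seed)
    correct_majority = 0
    correct_weighted = 0
    for _ in range(n_trials):
        # Per-node competence: base 0.6, spread widened by diversity.
        probs = [min(0.95, max(0.05, rng.gauss(0.6, 0.05 + 0.25 * diversity)))
                 for _ in range(n_nodes)]
        answers = [rng.random() < p for p in probs]

        # Plain majority voting (use odd n_nodes to avoid ties).
        if 2 * sum(answers) > n_nodes:
            correct_majority += 1

        # Hypothetical stand-in for peer-ranked consensus: weight each
        # vote by a noisy estimate of that node's competence.
        weights = [p + rng.gauss(0, 0.05) for p in probs]
        score = sum(w for a, w in zip(answers, weights) if a)
        if 2 * score > sum(weights):
            correct_weighted += 1
    return correct_majority / n_trials, correct_weighted / n_trials

# Sweep swarm sizes at a fixed diversity to look for a plateau.
for n in (3, 7, 15, 31):
    maj, wgt = simulate(n, diversity=0.5)
    print(n, round(maj, 3), round(wgt, 3))
```

Sweeping `diversity` as well as `n_nodes` in such a toy model shows how a wider competence spread can move the point where additional nodes stop helping, which is the qualitative question the authors leave open; it does not reproduce the paper's experiment.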

References

Efficient convergence: Performance plateaus around 30 nodes, suggesting that moderate-sized swarms capture most of the benefits, although an open question remains as to how this scaling would behave with a more diverse mixture of models.

Fortytwo: Swarm Inference with Peer-Ranked Consensus (2510.24801 - Larin et al., 27 Oct 2025) in Section 6.2.2 (Ablation Studies: Swarm Size Scaling)