Verify capacity-driven gains from multilingual SFT

Determine, through direct experimental confirmation or refutation, whether the 27B TranslateGemma model, owing to its higher capacity, benefits more than the smaller TranslateGemma models from exposure to the large number of languages used during supervised fine-tuning (SFT).
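
One way to obtain the missing confirmation is an SFT ablation in which each model size (4B, 12B, 27B) is fine-tuned on mixtures covering different numbers of languages, followed by a test for a size-by-language-breadth interaction in the resulting translation scores. The sketch below is illustrative only: the CSV path, the column names (model_size, n_sft_languages, score), and the choice of an ordinary-least-squares interaction test are assumptions, not procedures from the TranslateGemma report.

    # Hypothetical analysis sketch: test whether larger models gain more from
    # broader multilingual SFT. Assumes an ablation has already produced a CSV
    # with one row per (model size, SFT language count, language pair) evaluation.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumed columns (not from the report): model_size in {"4B", "12B", "27B"},
    # n_sft_languages = number of languages in the SFT mixture, score = any
    # translation quality metric computed on a fixed set of language pairs.
    df = pd.read_csv("sft_language_ablation.csv")

    # Fit an OLS model with a model-size x language-breadth interaction term.
    # A significantly positive interaction coefficient for the 27B level would
    # support the hypothesis that higher capacity amplifies gains from breadth.
    model = smf.ols(
        "score ~ C(model_size, Treatment(reference='4B')) * n_sft_languages",
        data=df,
    ).fit()
    print(model.summary())

Any such design would also need to control for compute and data volume per mixture, so that gains attributed to language breadth are not confounded with simply training on more examples.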

Background

In the automatic evaluation of text translation, the report notes consistent improvements across 55 language pairs and discusses scale effects across the 4B, 12B, and 27B model sizes.

The authors explicitly hypothesize that the 27B model may have benefited more from the breadth of languages seen during SFT, but state that they lack direct experimental confirmation of this effect.

References

We also hypothesize that the 27B model, with its higher capacity, will have benefited more from the vast amount of languages seen during the SFT phase (detailed in Appendix~\ref{sec:list-of-languages}), although we do not have direct experimental confirmation of this.

TranslateGemma Technical Report (2601.09012, Finkelstein et al., 13 Jan 2026), Section 6.1 (Text translation)