Cause of JA–EN performance discrepancy under mixed Jagle+FineVision training

Determine the cause of the observed discrepancy when training a 2.2B-parameter vision-language model (Qwen3-1.7B-Instruct paired with SigLIP2-so400m-patch16-512): the macro-averaged Japanese score is higher when training on Jagle alone than on the Jagle+FineVision mixture, whereas the macro-averaged English score improves for the mixture relative to FineVision alone. In particular, evaluate whether the data-size imbalance between Jagle and FineVision contributes to this effect.
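
Here "macro-averaged" is read in the standard sense (an assumption; the excerpt does not define the metric): each language-level score is the unweighted mean of that language's per-benchmark scores, so every benchmark contributes equally regardless of its test-set size. In LaTeX:

    \mathrm{JA\,Avg} = \frac{1}{N_{\mathrm{JA}}} \sum_{i=1}^{N_{\mathrm{JA}}} s^{\mathrm{JA}}_i,
    \qquad
    \mathrm{EN\,Avg} = \frac{1}{N_{\mathrm{EN}}} \sum_{j=1}^{N_{\mathrm{EN}}} s^{\mathrm{EN}}_j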

Background

In the Results section, the authors report that mixing Jagle with FineVision improves English performance over FineVision-only but yields lower Japanese performance than training on Jagle alone. This asymmetry contrasts with their expectation that increased data diversity benefits both languages.

They hypothesize that the smaller size of Jagle relative to FineVision may partly explain the Japanese performance drop in the mixed setting, but they explicitly state that the reason is not entirely clear and leave a deeper investigation to future work.
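
One concrete way to test the size-imbalance hypothesis is a mixture-ratio ablation at a fixed total training size, which decouples the Jagle fraction from Jagle's raw size by sampling with replacement. The sketch below is minimal and assumes in-memory lists of examples; load_jagle, load_finevision, and train_and_eval are hypothetical helpers, not the paper's actual pipeline.

    import random

    def make_mixture(jagle, finevision, jagle_weight, target_size, seed=0):
        """Sample a fixed-size training set with a controlled Jagle fraction.

        Upsampling (sampling with replacement) lets the smaller Jagle corpus
        occupy any target fraction, so the mixture ratio is decoupled from
        the corpora's raw sizes.
        """
        rng = random.Random(seed)
        n_jagle = int(round(jagle_weight * target_size))
        n_finevision = target_size - n_jagle
        mix = [rng.choice(jagle) for _ in range(n_jagle)]
        mix += [rng.choice(finevision) for _ in range(n_finevision)]
        rng.shuffle(mix)
        return mix

    # Ablation grid: if the size-imbalance hypothesis holds, JA Avg should
    # recover as the Jagle fraction rises toward parity at fixed total size.
    # jagle, finevision = load_jagle(), load_finevision()  # hypothetical loaders
    # for w in (0.1, 0.25, 0.5, 0.75):
    #     train_and_eval(make_mixture(jagle, finevision, w, len(finevision)))

If JA Avg stays below the Jagle-only baseline even at high Jagle fractions, the drop is more plausibly attributable to distributional interference from FineVision than to size imbalance alone.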

References

On the other hand, the Japanese task average is higher for Jagle alone than for Jagle combined with FineVision. The reason for this discrepancy between JA Avg and EN Avg is not entirely clear, though it may partly stem from the smaller data size of Jagle relative to FineVision; we leave a deeper investigation to future work.

Sugiura et al., "Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models" (arXiv:2604.02048, 2 Apr 2026), Section 5 (Experiments), Subsection "Results", paragraph "Impact on English tasks".