Cause of JA–EN performance discrepancy under mixed Jagle+FineVision training
Determine why, when training the 2.2B vision-language model (Qwen3-1.7B-Instruct plus SigLIP2-so400m-patch16-512), the macro-averaged Japanese score is higher for training on Jagle alone than for training on the mixture of Jagle and FineVision, whereas the macro-averaged English score is higher for the mixture than for FineVision alone. Specifically, evaluate whether the data size imbalance between Jagle and FineVision contributes to this effect.
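As a starting point for the imbalance check, the following minimal sketch shows how a macro average is formed and how the share of each dataset in a mixture could be quantified. It assumes the mixture is a plain concatenation of the two datasets, which the source does not state; the function names and the dataset sizes are hypothetical placeholders, not reported values.

```python
from statistics import fmean


def macro_average(task_scores: dict[str, float]) -> float:
    """Unweighted mean over tasks: each task counts equally, whatever its size."""
    return fmean(task_scores.values())


def mixture_shares(num_samples: dict[str, int]) -> dict[str, float]:
    """Fraction of a concatenated training mixture contributed by each dataset."""
    total = sum(num_samples.values())
    return {name: n / total for name, n in num_samples.items()}


def upsample_factor(num_samples: dict[str, int], name: str, target_share: float) -> float:
    """Repetition factor f for `name` such that f*n / (f*n + rest) == target_share."""
    rest = sum(n for key, n in num_samples.items() if key != name)
    return target_share * rest / ((1.0 - target_share) * num_samples[name])


# Placeholder sizes for illustration only, NOT the real dataset sizes.
sizes = {"jagle": 1_000, "finevision": 3_000}
shares = mixture_shares(sizes)
factor = upsample_factor(sizes, "jagle", 0.5)
```

One way to test the hypothesis would be to retrain with Jagle repeated by such a factor (or FineVision subsampled to match) and compare the macro-averaged Japanese score against the Jagle-only run.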
References
On the other hand, the Japanese task average is higher for Jagle alone than for Jagle combined with FineVision. The reason for this discrepancy between JA Avg and EN Avg is not entirely clear, though it may partly stem from the smaller data size of Jagle relative to FineVision; we leave a deeper investigation to future work.