Effects of data proportion at scales near model capability limits

Investigate how the ratio of mathematics to computer‑use data affects performance when multimodal reasoning models are trained at scales near their capability limits, and determine whether the uniform gains seen at smaller scales persist or trade‑offs between these reasoning tasks emerge.

Background

Experiments conducted at a relatively small data scale show that increasing total data improves overall performance and that adding more data from one domain can also benefit another (e.g., math data improving some computer‑use benchmarks).

The authors emphasize that their present scale may not challenge model capacity. It therefore remains uncertain whether the observed uniform gains will hold at larger scales or whether inter‑task trade‑offs will appear as models approach saturation.
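
One way to frame such a study is to hold the total data budget fixed, vary only the math fraction, and label each change in the two benchmark scores as a uniform gain or a trade-off. The sketch below is illustrative only and is not the report's method: the function names (`mixture_counts`, `classify_shift`), the two-domain split, and the tolerance parameter are all assumptions introduced here.

```python
from typing import Dict, Tuple


def mixture_counts(total_examples: int, math_fraction: float) -> Dict[str, int]:
    """Split a fixed data budget between two domains (hypothetical helper)."""
    if not 0.0 <= math_fraction <= 1.0:
        raise ValueError("math_fraction must lie in [0, 1]")
    math_n = round(total_examples * math_fraction)
    # The remainder goes to computer-use data, so the total stays constant.
    return {"math": math_n, "computer_use": total_examples - math_n}


def classify_shift(
    before: Tuple[float, float],
    after: Tuple[float, float],
    tol: float = 0.0,
) -> str:
    """Compare (math_score, computer_use_score) pairs from two mixtures.

    `tol` is an assumed noise threshold: score changes with magnitude
    at or below it are treated as no change.
    """
    d_math = after[0] - before[0]
    d_cu = after[1] - before[1]
    if d_math > tol and d_cu > tol:
        return "uniform gain"
    if d_math < -tol and d_cu < -tol:
        return "uniform loss"
    if (d_math > tol and d_cu < -tol) or (d_math < -tol and d_cu > tol):
        return "trade-off"
    return "no clear change"
```

Holding the total constant separates the effect of proportion from the effect of simply adding data, which matters because the small-scale results attribute part of the gains to total data volume. Running the same sweep at several total budgets would then show whether "trade-off" labels become more frequent as the budget grows.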

References

A clear open question is to study the effects of data proportion at a scale which challenges the edge of current models' capabilities: do our insights about strong uniform performance hold, or do trade-offs between different reasoning tasks become more obvious at larger scales?

Phi-4-reasoning-vision-15B Technical Report (2603.03975 - Aneja et al., 4 Mar 2026) in Open research questions, Section 3.2 (Mathematics and Science vs. Computer-Use Data Proportion)