Conjecture on data quality and distribution explaining LLaDA’s differential performance
Ascertain whether differences in training data quality and distribution, largely due to the closed-source nature of large language model datasets, explain both LLaDA 8B's stronger results on mathematics and Chinese tasks and its relatively weaker performance on some other benchmarks.
References
We conjecture that the strengths stem from the same factors as its relatively weaker performance in some tasks—differences in data quality and distribution, largely due to the closed-source situation of LLM datasets.
— Large Language Diffusion Models
(arXiv:2502.09992, Nie et al., 14 Feb 2025), Section 4.2, Benchmark Results