Conjecture on data quality and distribution explaining LLaDA’s differential performance

Determine whether differences in training data quality and distribution, which are difficult to assess because the training datasets of comparable LLMs are closed-source, explain LLaDA 8B's stronger results on mathematics and Chinese tasks and its relatively weaker performance on some other benchmarks.

Background

In comprehensive benchmark comparisons, LLaDA 8B performs strongly on certain tasks, particularly math and Chinese, while being comparatively weaker on others.

Because the training datasets of comparable LLMs are often closed-source, the authors hypothesize that differences in data quality and distribution drive both the strengths and the weaknesses; this hypothesis has not been verified.

References

We conjecture that the strengths stem from the same factors as its relatively weaker performance on some tasks: differences in data quality and distribution, largely due to the closed-source nature of LLM datasets.

Large Language Diffusion Models (2502.09992 - Nie et al., 14 Feb 2025) in Benchmark Results, Section 4.2