Conjecture on SFT data quality causing declines in some benchmarks

Determine whether the declines observed on certain benchmarks, such as MMLU, after supervised fine-tuning (SFT) of LLaDA 8B are attributable to the suboptimal quality of the SFT dataset.

Background

After supervised fine-tuning, LLaDA 8B improves on most downstream tasks, but some metrics decline, including MMLU.

The authors speculate that these declines may be attributable to the quality of the SFT data, leaving an open question about data curation and its impact on post-training performance.
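One way to probe this conjecture empirically is a data-quality ablation: fine-tune on progressively stricter quality-filtered subsets of the SFT data and check whether the MMLU decline shrinks. The sketch below illustrates that protocol under stated assumptions; it is not the authors' method. The per-example quality score, the threshold values, and the finetune_and_eval callback are hypothetical stand-ins for a real scoring model, an actual SFT run of LLaDA 8B, and a benchmark harness.

```python
"""Minimal sketch of a data-quality ablation for the SFT-decline conjecture.

All helpers here are illustrative stubs, not the LLaDA training pipeline.
"""

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SFTExample:
    prompt: str
    response: str
    quality: float  # assumed per-example score, e.g., from an LLM judge


def filter_by_quality(data: Sequence[SFTExample], threshold: float) -> list[SFTExample]:
    """Keep only examples whose quality score meets the threshold."""
    return [ex for ex in data if ex.quality >= threshold]


def run_ablation(
    data: Sequence[SFTExample],
    thresholds: Sequence[float],
    finetune_and_eval: Callable[[Sequence[SFTExample]], dict[str, float]],
) -> dict[float, dict[str, float]]:
    """Fine-tune on each filtered subset and collect benchmark scores.

    `finetune_and_eval` stands in for the real SFT + evaluation loop
    (e.g., fine-tune LLaDA 8B on the subset, then score MMLU).
    """
    return {t: finetune_and_eval(filter_by_quality(data, t)) for t in thresholds}


if __name__ == "__main__":
    # Toy data: quality scores spread evenly over [0, 1).
    demo_data = [SFTExample("q", "a", q / 10) for q in range(10)]

    def fake_finetune_and_eval(subset: Sequence[SFTExample]) -> dict[str, float]:
        # Toy proxy: pretend MMLU tracks average data quality (illustration only).
        avg_q = sum(ex.quality for ex in subset) / max(len(subset), 1)
        return {"MMLU": 60.0 + 5.0 * avg_q}

    results = run_ablation(demo_data, [0.0, 0.5, 0.8], fake_finetune_and_eval)
    for threshold, scores in results.items():
        print(f"quality >= {threshold:.1f}: MMLU = {scores['MMLU']:.2f}")
```

If MMLU scores recover monotonically as lower-quality examples are removed (while other metrics hold steady), that would support the data-quality explanation; flat results would point elsewhere, for example toward a distribution mismatch between the SFT data and the benchmark.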

References

A few metrics, such as MMLU, showed declines, which we conjecture may be due to the suboptimal quality of the SFT data.

Large Language Diffusion Models (Nie et al., arXiv:2502.09992, 14 Feb 2025), Section 4.2, Benchmark Results.