Conjecture on the source of DecLLM’s superior pretraining performance

Establish whether the superior pretraining performance of decoder-only language models (DecLLM) trained with a causal language modeling objective arises primarily from alignment between that objective and downstream evaluation protocols, rather than from intrinsically greater modeling capability.

Background

During pretraining, DecLLM (the decoder-only LLM) shows stronger zero- and few-shot performance than RedLLM (the encoder-decoder LLM), even when their perplexities are comparable. After instruction tuning (finetuning), RedLLM matches or surpasses DecLLM on many downstream tasks, suggesting a difference in adaptability.

To explain this discrepancy, the authors conjecture that DecLLM's apparent superiority during pretraining stems from a closer match between its causal LM objective and the zero-/few-shot evaluation setup, rather than from an intrinsic advantage in modeling capability. The conjecture invites a targeted investigation to confirm or refute this proposed cause.
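To make the conjectured alignment concrete, the sketch below (Python, assuming the Hugging Face transformers library; the "gpt2" checkpoint and the example prompt are illustrative placeholders, not the paper's DecLLM or evaluation harness) shows a common likelihood-based zero-shot multiple-choice protocol: each candidate answer is scored by the summed next-token log-likelihood of its tokens under the model, which is exactly the quantity a causal LM objective maximizes during pretraining.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative checkpoint only; any causal (decoder-only) LM works here.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    @torch.no_grad()
    def continuation_logprob(prompt: str, continuation: str) -> float:
        """Summed log p(token | prefix) over the continuation tokens only.
        (Sketch: ignores tokenizer boundary effects at the prompt/answer seam.)"""
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
        targets = full_ids[0, prompt_len:]             # continuation tokens
        preds = log_probs[0, prompt_len - 1 : -1]      # positions that predict them
        return preds.gather(-1, targets.unsqueeze(-1)).sum().item()

    # Zero-shot multiple choice: pick the option the causal LM finds most likely.
    prompt = "Question: What is the capital of France?\nAnswer:"
    options = [" Paris", " Berlin", " Madrid"]
    scores = {opt: continuation_logprob(prompt, opt) for opt in options}
    print(max(scores, key=scores.get))

Because the evaluation score coincides with DecLLM's training loss, comparable perplexity can still translate into an advantage under this protocol, which is the alignment the conjecture attributes DecLLM's pretraining-time edge to.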

References

"We conjecture that the superior pretraining performance of DecLLM is mostly caused by the higher degree of matching between its pretraining objective and the downstream evaluation, rather than its stronger capability."

Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model (2510.26622 - Zhang et al., 30 Oct 2025) in Section 6, subsection "RedLLM shows high adaptability: matching and even surpassing DecLLM across scales after finetuning"