Scaling Diffusion Language Models via Adaptation from Autoregressive Models (2410.17891v1)
Abstract: Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR LLMs, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (with 127M, 355M, and 7B parameters) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions: https://github.com/HKUNLP/DiffuLLaMA
- Shansan Gong (14 papers)
- Shivam Agarwal (10 papers)
- Yizhe Zhang (127 papers)
- Jiacheng Ye (21 papers)
- Lin Zheng (31 papers)
- Mukai Li (17 papers)
- Chenxin An (17 papers)
- Peilin Zhao (127 papers)
- Wei Bi (62 papers)
- Jiawei Han (263 papers)
- Hao Peng (291 papers)
- Lingpeng Kong (134 papers)
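
As a rough illustration of the kind of continual pre-training objective the abstract describes (converting a pretrained AR Transformer into a masked-diffusion text model), the sketch below shows one training step under an absorbing-state ("mask") discrete diffusion loss. This is not the authors' released code: the HuggingFace-style `model(...).logits` interface, the `mask_id` token, the fully bidirectional attention mask, and the 1/t loss weighting are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumptions noted above) of one masked-diffusion
# continual pre-training step on top of a pretrained Transformer.
import torch
import torch.nn.functional as F

def diffusion_lm_step(model, input_ids, mask_id):
    """Corrupt a random fraction t of tokens to [MASK] and train the model
    to recover the originals, weighting the per-token loss by 1/t."""
    B, L = input_ids.shape
    device = input_ids.device

    # Sample a corruption level t ~ U(0, 1] per sequence.
    t = torch.rand(B, 1, device=device).clamp(min=1e-3)
    # Independently replace each position with the mask token w.p. t.
    is_masked = torch.rand(B, L, device=device) < t
    noisy_ids = torch.where(is_masked,
                            torch.full_like(input_ids, mask_id),
                            input_ids)

    # Full (bidirectional) attention: the adapted model is no longer
    # restricted to a causal mask, so every position may attend everywhere.
    attention_mask = torch.ones_like(noisy_ids)
    logits = model(input_ids=noisy_ids, attention_mask=attention_mask).logits

    # Cross-entropy only on masked positions, reweighted by 1/t so that
    # lightly corrupted sequences do not dominate the objective.
    loss_tok = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        input_ids.view(-1),
        reduction="none",
    ).view(B, L)
    weights = is_masked.float() / t  # broadcast (B, 1) over (B, L)
    loss = (loss_tok * weights).sum() / (B * L)
    return loss
```

In this framing, the only change from ordinary next-token continual pre-training is the corrupted input, the bidirectional attention, and the reweighted denoising loss; the pretrained weights are reused as-is, which is what makes adaptation far cheaper than training a diffusion model from scratch.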