
Scaling Diffusion Language Models via Adaptation from Autoregressive Models (2410.17891v1)

Published 23 Oct 2024 in cs.CL

Abstract: Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (with 127M, 355M, and 7B parameters) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions (https://github.com/HKUNLP/DiffuLLaMA).

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

The paper "Scaling Diffusion LLMs via Adaptation from Autoregressive Models" presents an innovative methodology for scaling Diffusion LLMs (DLMs) by leveraging pre-trained autoregressive (AR) LLMs. Diffusion models offer a promising alternative to AR models by facilitating parallel and any-order text generation, which could effectively address some limitations inherent in the sequential nature of AR models. However, until now, the scaling of DLMs has been limited by computational challenges and the absence of optimization techniques akin to those available for AR models.

Key Contributions

The authors propose an approach for adapting existing AR models into DLMs that bridges the objective differences between the two paradigms. They introduce attention mask annealing, which gradually relaxes the causal attention mask toward a fully bidirectional one, together with a shift operation that keeps the training objective aligned with the AR next-token prediction convention. This allows models such as GPT2 and LLaMA2 to be converted into diffusion models at scales from 127M to 7B parameters, using less than 200 billion tokens of continual pre-training.
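To make the mask-annealing idea concrete, here is a minimal, hypothetical PyTorch sketch. The linear random-reveal schedule and the function name are illustrative assumptions rather than the paper's actual implementation (the reference code lives in the linked repository), and the shift operation is omitted.

```python
import torch

def annealed_attention_mask(seq_len: int, progress: float) -> torch.Tensor:
    """Boolean attention mask interpolating between causal and bidirectional.

    `progress` in [0, 1]: 0.0 yields a standard causal (lower-triangular) mask,
    1.0 yields a fully bidirectional mask. Intermediate values reveal future
    positions at random with probability `progress` -- an illustrative schedule,
    not necessarily the one used in the paper. True marks positions a query
    token is allowed to attend to.
    """
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Randomly expose future (upper-triangular) positions as training progresses.
    reveal = torch.rand(seq_len, seq_len) < progress
    return causal | (reveal & ~causal)

# Example: halfway through mask annealing for an 8-token sequence.
mask = annealed_attention_mask(8, progress=0.5)
print(mask.int())
```

During continual pre-training, `progress` would move from 0 to 1 on some schedule, so the adapted model starts from the causal attention it was pre-trained with and ends fully bidirectional, as a diffusion model requires.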

Numerical Results and Benchmarks

The experimental analysis shows that the adapted DLMs not only exceed the performance of previous, smaller diffusion models but also remain competitive with their AR counterparts across a variety of tasks. Notably, the DiffuGPT models surpass GPT2 on language modeling, reasoning, and commonsense understanding benchmarks. The paper also highlights the strength of DLMs on global reasoning tasks, such as mathematics and coding, where strictly left-to-right AR generation can fall short. These findings establish that scaling DLMs through adaptation is both feasible and effective.

Implications and Future Directions

The successful adaptation and scaling of DLMs from AR models open significant pathways for future research in natural language processing. This work suggests that the constraints of sequential AR decoding can be relaxed, yielding models capable of parallel, any-order text generation. Moreover, the computational efficiency indicated by the reduced latency of DiffuLLaMA for long-sequence generation is a practical advantage for deploying large-scale models in real-world applications.
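As a rough illustration of why parallel decoding reduces latency for long sequences, the toy sketch below contrasts AR decoding (one forward pass per new token) with diffusion-style decoding (a fixed number of parallel refinement passes). The placeholder `model`, vocabulary size, and greedy refinement loop are assumptions for illustration, not the DiffuLLaMA sampler.

```python
import torch

VOCAB, MASK_ID = 100, 0

def model(tokens: torch.Tensor) -> torch.Tensor:
    """Placeholder network: random per-position logits over the vocabulary."""
    return torch.randn(*tokens.shape, VOCAB)

def ar_decode(prompt: torch.Tensor, new_tokens: int) -> torch.Tensor:
    """Autoregressive decoding: one forward pass per generated token."""
    seq = prompt
    for _ in range(new_tokens):
        logits = model(seq)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=-1)
    return seq  # cost: new_tokens forward passes

def diffusion_decode(prompt: torch.Tensor, new_tokens: int, steps: int = 8) -> torch.Tensor:
    """Diffusion-style decoding: all masked positions refined in parallel."""
    masked = torch.full((prompt.size(0), new_tokens), MASK_ID, dtype=torch.long)
    seq = torch.cat([prompt, masked], dim=-1)
    for _ in range(steps):
        logits = model(seq)
        seq[:, prompt.size(1):] = logits[:, prompt.size(1):].argmax(-1)
    return seq  # cost: `steps` forward passes, independent of new_tokens
```

The point of the comparison is that the diffusion decoder's forward-pass count is governed by the number of refinement steps rather than by the output length, which is where the long-sequence latency advantage comes from.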

Future work could explore instruction tuning and inference-time planning to further exploit the capabilities of DLMs, and could refine fine-tuning and sampling techniques to improve their adaptability across domains and contexts.

Overall, the paper marks a substantial step for diffusion-based text generation, showing that DLMs are a viable alternative to AR language modeling when existing investments in pre-trained models are leveraged. These results are likely to catalyze continued exploration and optimization of diffusion-based frameworks for text generation.

Authors (12)
  1. Shansan Gong (14 papers)
  2. Shivam Agarwal (10 papers)
  3. Yizhe Zhang (127 papers)
  4. Jiacheng Ye (21 papers)
  5. Lin Zheng (31 papers)
  6. Mukai Li (17 papers)
  7. Chenxin An (17 papers)
  8. Peilin Zhao (127 papers)
  9. Wei Bi (62 papers)
  10. Jiawei Han (263 papers)
  11. Hao Peng (291 papers)
  12. Lingpeng Kong (134 papers)