Scaling up Masked Diffusion Models on Text (2410.18514v3)

Published 24 Oct 2024 in cs.AI, cs.CL, and cs.LG

Abstract: Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective unsupervised classifier-free guidance that effectively exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, the 1.1B MDM outperforms the 1.1B TinyLlama model trained on the same data across four of eight zero-shot benchmarks. Notably, it achieves competitive math reasoning ability with the 7B Llama-2 model on the GSM8K dataset. In text generation, MDMs with 16 times more pre-training time offer a flexible trade-off against ARMs with the accelerated sampling technique KV-Cache: MDMs match ARMs in performance while being 1.4 times faster during sampling. Moreover, MDMs address challenging tasks for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks the reverse curse encountered by much larger ARMs with significantly more data and computation, such as 13B Llama-2 and 175B GPT-3. Our code is available at https://github.com/ML-GSAI/SMDM.

Summary

  • The paper establishes the first scaling law for masked diffusion models, showing a scaling rate comparable to that of autoregressive models, and trains MDMs of up to 1.1B parameters that are competitive with autoregressive baselines.
  • The research introduces an unsupervised classifier-free guidance technique that leverages large-scale unpaired data, enabling the 1.1B MDM to outperform a larger 1.5B GPT-2 model on several zero-shot benchmarks.
  • The study demonstrates that masked diffusion models address limitations of autoregressive models, such as bidirectional reasoning and adaptation to temporal shifts in the data.

Overview of "Scaling up Masked Diffusion Models on Text"

The research paper "Scaling up Masked Diffusion Models on Text" explores advancements in Masked Diffusion Models (MDMs) for language modeling. Traditionally, Autoregressive Models (ARMs) have dominated this field because their sequential, left-to-right formulation aligns well with language generation tasks. However, ARMs have limitations in bidirectional context utilization and bidirectional reasoning, problems that MDMs aim to overcome.

Key Findings and Contributions

  1. Scalability and Performance: The paper establishes the first scaling law for MDMs and finds that they exhibit a scaling rate comparable to ARMs, with a relatively small compute gap. To benchmark against ARMs of similar or larger sizes, the authors train a family of MDMs with up to 1.1 billion parameters.
  2. Unsupervised Classifier-Free Guidance (CFG): The paper proposes an unsupervised CFG for MDMs that exploits large-scale unpaired data, circumventing the need for extensive labeled datasets and significantly enhancing conditional inference. With this guidance, the 1.1B MDM outperforms a larger 1.5B GPT-2 model on multiple zero-shot benchmarks (see the guidance sketch after this list).
  3. Efficiency in Text Generation: MDMs offer a flexible trade-off against ARMs accelerated with a KV-cache: with 16 times more pre-training time, MDMs match ARM quality while sampling about 1.4 times faster, or they can spend additional sampling compute to push quality higher (the sampling sketch after this list illustrates the step-budget knob behind this trade-off).
  4. Addressing ARM Limitations: MDMs handle tasks that are difficult for ARMs, such as bidirectional reasoning and adapting to temporal shifts in data. For instance, on the reverse curse (struggling to infer relationships in the reverse direction), a 1.1B MDM succeeds where much larger ARMs trained with far more data and compute, such as 13B Llama-2 and 175B GPT-3, fail.
  5. Implications for Future AI Development: The findings suggest that MDMs hold significant potential for expanding the capabilities of AI in language modeling, particularly as an alternative to ARMs. Their inherent structure allows more robust handling of diverse tasks without being overly reliant on large, labeled datasets.
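
As a concrete illustration of item 2, the sketch below shows one common way classifier-free guidance is applied at the logit level for a masked diffusion model, with the unconditional branch obtained by masking out the prompt. It is a minimal sketch under assumed interfaces (the names `model`, `cond_ids`, `x_masked`, `mask_token_id`, and `guidance_scale` are hypothetical), not the paper's exact formulation; the authors' implementation is in the linked repository.

```python
import torch

def guided_logits(model, cond_ids, x_masked, mask_token_id, guidance_scale=1.0):
    """Generic classifier-free guidance sketch for a masked diffusion model.

    `model` is assumed to map a batch of token ids to per-position logits.
    The response tokens in `x_masked` are (partially) masked; the prompt
    `cond_ids` is fully visible only in the conditional pass.
    """
    # Conditional pass: the prompt is visible to the model.
    logits_cond = model(torch.cat([cond_ids, x_masked], dim=-1))

    # Unconditional pass: replace the prompt with mask tokens so the model
    # predicts the response without any conditioning information.
    uncond_prompt = torch.full_like(cond_ids, mask_token_id)
    logits_uncond = model(torch.cat([uncond_prompt, x_masked], dim=-1))

    # One common CFG parameterization: extrapolate away from the
    # unconditional prediction; guidance_scale = 0 recovers plain sampling.
    return (1.0 + guidance_scale) * logits_cond - guidance_scale * logits_uncond
```

Only the logits at response positions would be used for unmasking; the guidance scale trades sample diversity for adherence to the prompt.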
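
The speed/quality trade-off in item 3 comes from the number of denoising steps: each step is one parallel forward pass that fills in several masked positions at once, unlike an ARM's one-token-per-step decoding. Below is a minimal, generic sketch of confidence-based iterative unmasking (in the spirit of MaskGIT-style samplers); the function and its interface are illustrative assumptions, not the paper's actual sampler.

```python
import torch

@torch.no_grad()
def iterative_unmask(model, seq_len, mask_token_id, num_steps=32):
    """Fill an all-mask sequence over `num_steps` parallel refinement steps.

    Fewer steps means fewer forward passes (faster sampling); more steps
    generally improves quality, which is the knob behind the trade-off.
    """
    x = torch.full((1, seq_len), mask_token_id, dtype=torch.long)
    for step in range(num_steps):
        still_masked = x == mask_token_id
        if not still_masked.any():
            break
        logits = model(x)                              # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~still_masked, -1.0)   # ignore filled positions
        # Reveal a fixed share of the remaining masked positions each step,
        # keeping the most confident predictions.
        k = max(1, int(still_masked.sum().item() / (num_steps - step)))
        top = conf.topk(k, dim=-1).indices[0]
        x[0, top] = pred[0, top]
    return x
```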

Theoretical and Practical Implications

Theoretically, the introduction of a scaling law for MDMs paves the way for a standardized framework for developing future text-based diffusion models, informing choices about model size, data, and training regime.
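
As an illustration of what such a scaling-law analysis involves, the sketch below fits a simple power law L(C) ≈ a·C^(-b) to compute/loss pairs in log-log space. The functional form and the numbers are assumptions for illustration only, not the fit reported in the paper.

```python
import numpy as np

# Hypothetical (training compute, validation loss) measurements.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])  # training FLOPs (made up)
loss = np.array([3.9, 3.6, 3.3, 3.1, 2.9])          # validation loss (made up)

# A pure power law L(C) = a * C**(-b) is linear in log-log space:
# log L = log a - b * log C, so a degree-1 fit recovers the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
b, a = -slope, np.exp(intercept)

print(f"fitted scaling exponent b ≈ {b:.3f}")
print(f"extrapolated loss at 1e21 FLOPs ≈ {a * 1e21 ** (-b):.2f}")
```

In the paper, an analogous fit for MDMs yields a scaling rate comparable to that of ARMs, with a relatively small compute gap.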

Practically, the efficiencies gained through MDMs' approaches, such as unsupervised CFG, have meaningful implications for reducing the resource burden of training powerful LLMs, enabling broader accessibility and adoption in real-world applications.

Scope for Future Research

That MDMs can match or exceed much larger ARMs on several tasks while using less data and compute presents intriguing possibilities for the field. Future research can explore emergent behaviors in even larger MDMs and assess their capabilities in specialized settings such as dialogue systems and other interactive AI applications. Furthermore, the flexible sampling cost of MDMs offers an avenue for more sustainable model development, which is increasingly important given the resource demands of modern AI systems.

In summary, the paper makes a compelling case for the scalability, efficiency, and potential of Masked Diffusion Models as competitive alternatives to traditional ARMs in language modeling. The combination of theoretical insights and practical advancements positions MDMs as a promising direction in the ongoing evolution of artificial intelligence.
