SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control (2210.17432v2)

Published 31 Oct 2022 in cs.CL and cs.LG

Abstract: Despite the growing success of diffusion models in continuous-valued domains (e.g., images), similar efforts for discrete domains such as text have yet to match the performance of autoregressive language models. In this work, we present SSD-LM -- a diffusion-based language model with two key design choices. First, SSD-LM is semi-autoregressive, iteratively generating blocks of text, allowing for flexible output length at decoding time while enabling local bidirectional context updates. Second, it is simplex-based, performing diffusion on the natural vocabulary space rather than a learned latent space, allowing us to incorporate classifier guidance and modular control using off-the-shelf classifiers without any adaptation. We evaluate SSD-LM on unconstrained text generation benchmarks, and show that it matches or outperforms strong autoregressive GPT-2 models across standard quality and diversity metrics, while vastly outperforming diffusion-based baselines. On controlled text generation, SSD-LM also outperforms competitive baselines, with an extra advantage in modularity.

Authors (3)
  1. Xiaochuang Han
  2. Sachin Kumar
  3. Yulia Tsvetkov
Citations (64)

Summary

Overview of the SSD-LM Model: Advancing Text Generation Through Diffusion and Semi-autoregression

The paper introduces SSD-LM, a diffusion-based language model that addresses the challenges of applying diffusion processes to text generation. Recognizing the limitations of current diffusion models in discrete text domains, the work introduces two key innovations: semi-autoregressive generation and simplex-based diffusion.

SSD-LM departs from traditional autoregressive language models (AR-LMs) by iteratively generating blocks of text, allowing for greater flexibility in output length while retaining the context awareness of autoregressive setups. This approach strikes a balance between token-by-token autoregressive generation and non-autoregressive models, which produce entire sequences simultaneously and require predefined sequence lengths. Moreover, the model can refine tokens within the current block, circumventing a principal drawback of conventional token-level autoregressive generation, where earlier tokens cannot be modified once generated.
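
At a high level, decoding proceeds block by block: each new block starts as Gaussian noise in logit space, is iteratively denoised with the already-decoded text as left context, and is then appended to that context before the next block begins. The following is a minimal sketch of this loop; `denoise_block`, `model.vocab_size`, and the block/step sizes are hypothetical placeholders rather than the authors' actual interface.

```python
# Minimal sketch of semi-autoregressive (block-wise) decoding.
# `denoise_block` and `model.vocab_size` are hypothetical placeholders.
import torch

def generate(model, denoise_block, prompt_ids, block_size=25,
             num_blocks=4, num_steps=1000):
    context = prompt_ids  # (1, prompt_len) tokens decoded so far
    for _ in range(num_blocks):
        # Initialize the new block as pure noise in the vocabulary-logit space.
        noisy_logits = torch.randn(1, block_size, model.vocab_size)
        for t in reversed(range(num_steps)):
            # The model conditions on the fixed left context and the full
            # noisy block, so updates within the block are bidirectional.
            noisy_logits = denoise_block(model, context, noisy_logits, t)
        block_ids = noisy_logits.argmax(dim=-1)          # read off tokens
        context = torch.cat([context, block_ids], dim=-1)
    return context
```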

The model performs diffusion directly on the vocabulary simplex rather than in a learned latent space, enabling seamless integration of classifier guidance for controlled text generation. This design choice allows SSD-LM to leverage existing classifiers without adaptation, supporting modularity in controlled generation tasks.
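
Concretely, "simplex-based" means each token is represented as a vocabulary-sized vector of logits rather than a learned embedding, and Gaussian noise is added directly to that continuous representation. The snippet below sketches one plausible version of this setup, with an "almost-one-hot" logit vector taking values ±K and a DDPM-style corruption step; the constant K and the noise schedule are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of a simplex-style token representation and its forward noising.
# The constant K and the noise schedule are illustrative assumptions.
import torch
import torch.nn.functional as F

def tokens_to_logit_simplex(token_ids, vocab_size, K=5.0):
    # +K at the true token's index, -K everywhere else ("almost one-hot").
    one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
    return K * (2.0 * one_hot - 1.0)

def add_noise(w0, alpha_bar_t):
    # DDPM-style corruption applied in the continuous logit space:
    # w_t = sqrt(alpha_bar_t) * w_0 + sqrt(1 - alpha_bar_t) * eps
    eps = torch.randn_like(w0)
    return (alpha_bar_t ** 0.5) * w0 + ((1 - alpha_bar_t) ** 0.5) * eps
```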

Key Findings and Implications

  1. Benchmarking Against AR-LMs: SSD-LM matches or outperforms strong GPT-2 baselines on unconstrained text generation, achieving high scores on quality metrics such as MAUVE as well as diversity metrics, including Dist-n and the Zipf coefficient.
  2. Flexibility and Modularity: The model excels in controlled text generation by integrating off-the-shelf sentiment classifiers without retraining, a feature that distinguishes it from earlier diffusion-based approaches, which must train custom classifiers because they operate in a different input space (a sketch of this guidance mechanism follows the list).
  3. Controlled Generation: On sentiment-guided generation tasks, SSD-LM achieves high target accuracy while balancing relevance and fluency, compared with previous approaches that require customized architectures or control functions.
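
Because the diffusion state already lives in the vocabulary space, an off-the-shelf attribute classifier can steer decoding by back-propagating the log-probability of the target attribute into the intermediate logits. The sketch below shows one plausible form of such a guidance step, assuming a HuggingFace-style sequence classifier that shares the generator's vocabulary and accepts soft, probability-weighted input embeddings; the `guidance_scale` value and the exact interface are assumptions for illustration, not the paper's precise procedure.

```python
# Sketch of one classifier-guidance step on the vocabulary simplex.
# Assumes a HuggingFace-style classifier sharing the generator's vocabulary;
# `guidance_scale` is an illustrative value.
import torch

def guidance_step(noisy_logits, classifier, target_label, guidance_scale=100.0):
    logits = noisy_logits.detach().requires_grad_(True)
    probs = torch.softmax(logits, dim=-1)                  # (1, block, vocab)
    # Feed the classifier a probability-weighted mixture of its own input
    # embeddings instead of hard token ids, keeping everything differentiable.
    soft_inputs = probs @ classifier.get_input_embeddings().weight
    out = classifier(inputs_embeds=soft_inputs)
    log_p = torch.log_softmax(out.logits, dim=-1)[0, target_label]
    log_p.backward()
    # Nudge the noisy logits toward the desired attribute.
    return (noisy_logits + guidance_scale * logits.grad).detach()
```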

Theoretical Contributions

The paper extends the theoretical framework of diffusion models by adapting simplex-based representations to discrete text, in contrast to prior approaches that diffuse over learned embeddings, latent variables, or bit-level encodings of tokens. The semi-autoregressive design not only facilitates variable-length sequence generation but also combines the strengths of the autoregressive and diffusion paradigms, producing coherent token blocks with bidirectional context.

Future Directions

The paper suggests avenues for further research, such as improving decoding efficiency and reducing its computational cost. Exploring more dynamic block-length configurations during both training and generation could also yield more adaptable text generation models. As diffusion models continue to advance in other domains such as images and audio, the innovations presented here open up possibilities for more flexible, controllable language models.

Overall, SSD-LM represents a significant step in applying diffusion processes to text generation, addressing previous limitations and offering practical benefits in modular and controlled generation scenarios. Future work could build on these advances to tackle complex language modeling tasks across diverse applications and languages.
