Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed (2512.14067v1)
Abstract: Diffusion LLMs (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) LLMs when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
Explain it Like I'm 14
What is this paper about?
This paper is about making LLMs write faster without losing accuracy. Today’s popular models usually write one word at a time, left to right. That’s safe but slow. The authors show how to turn those slow-but-smart models into a different kind called diffusion LLMs that can fill in many words at once, speeding things up, while keeping the original model’s skills.
What questions did the researchers ask?
They focused on three simple questions:
- How can we convert a well-trained, left-to-right model into a faster, fill-in-many-words model without breaking what it already knows?
- What training tricks and attention patterns help keep accuracy high while enabling speed?
- How do choices like “how big a chunk to fill at once” and “which words to hide during training” affect quality and speed?
How did they do it?
Think of writing a story in two ways:
- Autoregressive (AR): You write one word at a time, always from left to right.
- Diffusion (dLM): You start with a sentence that has blanks, and you fill in the blanks step by step. You can fill several blanks at once, which is faster if you can trust your guesses.
The paper’s main idea: start with a strong AR model and carefully retrain it so it becomes a dLM that fills in words in parallel. Two key tricks make this work.
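Before the tricks, here is a tiny, hedged sketch of what each kind of model is trained to predict. The example sentence and the choice of masked positions are made up purely for illustration; real training uses token IDs and much longer sequences.

```python
tokens = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "[MASK]"

# Autoregressive (AR) view: each position learns to predict the NEXT token.
ar_inputs, ar_targets = tokens[:-1], tokens[1:]

# Diffusion (dLM) view: hide some tokens, then learn to predict the token
# AT each blank, given everything that is still visible around it.
hidden = {1, 3, 5}                                   # illustrative masked positions
dlm_inputs  = [MASK if i in hidden else t for i, t in enumerate(tokens)]
dlm_targets = [t if i in hidden else "-" for i, t in enumerate(tokens)]  # "-" = no loss here

print("AR :", ar_inputs, "->", ar_targets)
print("dLM:", dlm_inputs, "->", dlm_targets)
```

Because every blank gets its own prediction in a single pass, several of them can be committed at once, which is where the speedup comes from.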
Trick 1: Block-wise attention with clean context
Analogy: Imagine a long story broken into paragraphs (blocks). When you edit paragraph 3, you can fully read paragraphs 1 and 2 (they’re clean, no blanks) so you know the context. Inside paragraph 3, you’re allowed to look both left and right to fill blanks, but you can’t peek into paragraph 4 yet.
- “Block-wise” means the model looks both ways inside the current block, but only looks left across blocks (keeps the story’s forward flow).
- “Clean context” means previous blocks are complete and uncorrupted during training, just like they will be when the model is used.
- Why this helps:
  - It lets the model use a fast memory trick called KV caching (like keeping notes from what you’ve already read so you don’t reread it).
  - It changes the original model’s brain (its weights) less, so it keeps its skills.
They also found it’s better to directly predict the hidden words instead of predicting the “next” word (this is called removing token shift).
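To make the block-wise pattern concrete, here is a minimal sketch (not the paper's code) of the attention mask it implies; the sequence length and block size below are just illustrative.

```python
import numpy as np

def blockwise_attention_mask(seq_len: int, block_size: int) -> np.ndarray:
    """True at (i, j) means position i may attend to position j:
    bidirectionally within its own block, causally to all earlier blocks,
    and never to later blocks."""
    block_id = np.arange(seq_len) // block_size       # which block each position belongs to
    return block_id[:, None] >= block_id[None, :]

print(blockwise_attention_mask(seq_len=8, block_size=4).astype(int))
# Positions 0-3 (block 0) see only block 0; positions 4-7 (block 1) see all of
# blocks 0 and 1 -- including positions to their right inside block 1,
# which a plain causal mask would forbid.
```

During conversion training, earlier blocks hold clean (unmasked) tokens and only the current block contains blanks, matching how the model sees the text when it actually generates.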
Trick 2: Position-dependent masking (hide more at the end)
In practice, even a diffusion model tends to settle words roughly from left to right as it generates. But most previous training hid words uniformly at random across positions, so training and generation didn’t match.
Their fix: during training, hide later words slightly more often than earlier ones. This matches how the model actually fills in blanks at test time, especially in the final steps. Result: better quality when generating many words in parallel.
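One simple way to realize this is to let the per-position masking probability ramp up across the block. The linear ramp and the `low`/`high` values below are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def position_dependent_mask(block_len: int, low: float = 0.3, high: float = 0.9) -> np.ndarray:
    """Sample which positions in a block get hidden: later positions are
    masked more often, mimicking the roughly left-to-right order in which
    tokens get settled at generation time (uniform masking would use one
    probability everywhere)."""
    probs = np.linspace(low, high, block_len)          # masking probability per position
    return rng.random(block_len) < probs                # True = replace with [MASK]

print(position_dependent_mask(block_len=16).astype(int))  # 1s cluster toward the block's end
```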
Picking good block sizes and how parallel to go
- Block size (how big a paragraph is) matters. Too small: not enough context to guess blanks well. Too big: too many blanks and too much confusion. There’s a sweet spot that depends on model size.
- At test time, using larger blocks often helps when you want to generate more words in one go.
They also show that training longer steadily improves how confidently and correctly the model fills blanks, which lets you push parallel generation further without losing accuracy.
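The interplay between block size, confidence, and parallelism can be pictured with a toy decoding loop. Everything here is a hedged illustration: `toy_predict` stands in for the real model (it returns random guesses), and the confidence threshold and block size are the two knobs one would actually tune.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1                                              # placeholder id for a still-blank slot

def toy_predict(block):
    """Stand-in for the dLM: a (token_id, confidence) guess for every blank."""
    return {i: (int(rng.integers(0, 1000)), float(rng.random()))
            for i, tok in enumerate(block) if tok == MASK}

def decode_block(block_size: int = 8, threshold: float = 0.7) -> int:
    """Fill one block, committing every guess above `threshold` per pass
    (and at least one, so the loop always finishes). Returns passes used."""
    block, passes = [MASK] * block_size, 0
    while MASK in block:
        passes += 1
        guesses = toy_predict(block)
        keep = [i for i, (_, conf) in guesses.items() if conf >= threshold]
        if not keep:                                    # nothing confident: commit the single best guess
            keep = [max(guesses, key=lambda i: guesses[i][1])]
        for i in keep:
            block[i] = guesses[i][0]
    return passes

# A lower bar commits more tokens per pass (faster); a higher bar is more cautious.
print("passes needed at threshold 0.5:", decode_block(threshold=0.5))
print("passes needed at threshold 0.9:", decode_block(threshold=0.9))
```

As training improves the model's confidence estimates, more guesses clear the bar on each pass, which is exactly the "more parallel with less risk" effect described above. In the real model, blocks are processed left to right and earlier blocks are reused via the KV cache.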
What did they find, and why does it matter?
Here are the main takeaways, in plain terms:
- Keep the model’s habits: Training with block-wise attention and clean context preserves the original model’s abilities better than the fully bidirectional training used by earlier conversion methods. It also enables fast memory reuse (the KV cache), boosting speed.
- No need for “token shift”: Directly predicting the hidden word is simpler and works better than predicting the next word after it.
- Match training to reality: Hiding more later-position words during training mirrors how decoding actually proceeds and improves results, especially when generating many tokens at once.
- Choose block sizes wisely: There’s a sweet spot that gives the best balance between context and confusion.
- Longer training helps: As the model learns to estimate word likelihoods better, you can generate more words in parallel with less risk.
Why it matters: The authors built a family of models called Efficient-DLM (1.5B, 4B, 8B parameters) that beats prior systems on the accuracy-versus-speed tradeoff. For example, Efficient-DLM 8B reaches similar or slightly better accuracy than Qwen3 8B while running much faster when generating in parallel; compared to Dream 7B, it gets about 5.4% higher accuracy with around 4.5× higher throughput. Compared to Qwen3 4B, it gets about 2.7% higher accuracy with about 2.7× higher throughput. In short, you get strong results faster.
They also show an extra benefit: because diffusion models can look both left and right within a block, they produce better text embeddings (useful for search, clustering, and matching text), outperforming same-size AR models on many embedding benchmarks.
What could this change?
- Faster apps: Chatbots, coding assistants, and math solvers can respond faster, especially on hardware that can’t handle big batches efficiently.
- Lower costs and energy: More tokens per second means better use of GPUs and less waiting, potentially saving money and power.
- One model, many speeds: You can “turn the dial” to trade a little accuracy for a lot of speed, or vice versa, by adjusting how many tokens to fill in at once, with no need to train a new model.
- Better embeddings: Tasks that need understanding both past and future words (like semantic search or document comparison) may work better with these diffusion-style models.
- Practical roadmap: The paper gives clear, actionable training rules for anyone wanting to convert a strong AR model into a fast, high-quality diffusion model.
In short, this work shows a reliable, scalable path to make LLMs much faster while keeping their smarts—by training them to fill in smart chunks with the right kind of attention, context, and masking.