Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment

Published 7 May 2026 in cs.LG and cs.AI | (2605.06885v1)

Abstract: Diffusion LLMs (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non-sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregressive checkpoints can be converted into diffusion LLMs, existing recipes primarily transfer parameters through continued denoising training with objective- and attention-level modifications. We instead ask whether the internal representation geometry learned by next-token prediction can be explicitly preserved during AR-to-DLM conversion. We hypothesize that much of the semantic structure learned by AR pretraining can transfer across generation orders, and thus DLM training should be viewed as relearning the decoding path rather than relearning language representations. To investigate this, we introduce REPR-ALIGN, a representation alignment objective that adapts a bidirectional masked diffusion model to reuse representations from a pretrained AR model of identical architecture. Concretely, we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective. This simple alignment, with no adapters and no architectural changes beyond the attention mask, yields up to 4x training acceleration in our setting and is particularly effective in low-data regimes. Our results suggest that linguistic representations can transfer across generation order, and that representation alignment provides a simple and effective technique for training diffusion LLMs. Code is available at https://github.com/pengzhangzhi/Open-dLLM.

Authors (4)

Summary

The paper introduces REPR-ALIGN, a representation alignment method that converts AR models to DLMs without retraining from scratch.
It applies a layer-wise cosine loss to align the DLM student with a frozen AR teacher, achieving up to 4x faster training and significant pass@10 improvements.
Experimental results show that aligned DLMs outperform baselines on code generation tasks while using standard architectures and lower compute budgets.

Representation Alignment for Efficient Autoregressive-to-Diffusion LLM Conversion

Background and Context

Contemporary large-scale language modeling is dominated by autoregressive (AR) transformers, which factorize sequence likelihoods left to right. Diffusion LLMs (DLMs), by contrast, replace this sequential causality with non-sequential, any-order generation, which intrinsically supports bidirectional infilling and iterative refinement. Although large-scale DLMs have advanced substantially, their training remains significantly more resource-intensive: denoising training must cover an $\mathcal{O}(L!)$ space of generation paths for a sequence of length $L$, compared to the single left-to-right trajectory in AR models.

Recent work has converted pretrained AR checkpoints into DLMs to leverage the semantic structure acquired during next-token pretraining. These approaches largely focus on adapting objectives, modifying attention masks, or re-initializing parameters, and they leave open the question of whether the learned hidden-state geometry is preserved. This paper introduces a representation alignment approach, REPR-ALIGN, for efficient AR-to-DLM conversion without architectural modification or retraining from scratch.

Methodology: Representation Alignment Objective

The proposed method starts with two models of identical architecture and initialization: a pretrained AR transformer (causal attention, serving as a frozen teacher) and a DLM student (bidirectional attention). During DLM training with a masked denoising objective, REPR-ALIGN supplements the standard cross-entropy loss with a layer-wise cosine similarity loss aligning each hidden state of the DLM student to its counterpart in the frozen AR teacher, using clean input for the teacher and masked input for the student.
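The two-stream setup described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the helper names (`causal_mask`, `bidirectional_mask`, `mask_tokens`), the `[MASK]` token string, and the fixed 50% mask ratio are assumptions chosen for illustration. It shows only that the teacher sees the clean sequence under a causal mask while the student sees a corrupted copy under a full (bidirectional) mask.

```python
import random


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def mask_tokens(tokens, mask_ratio, mask_token, rng):
    """Replace a random subset of tokens with mask_token.

    Returns the corrupted sequence and the sorted list of masked positions.
    The fixed mask_ratio is an illustrative assumption, not the paper's schedule.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = mask_token
    return corrupted, positions


tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
rng = random.Random(0)

# Frozen AR teacher: clean input, causal attention.
teacher_input = tokens
teacher_mask = causal_mask(len(tokens))

# DLM student: masked input, bidirectional attention.
student_input, masked_positions = mask_tokens(tokens, 0.5, "[MASK]", rng)
student_mask = bidirectional_mask(len(tokens))
```

Both streams share the same sequence length, so hidden states at the same layer and position can be compared directly.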

The overall training loss is:

$\mathcal{L}(\theta) = \mathcal{L}_{\text{diff}}(\theta) + \lambda \mathcal{L}_{\text{align}}(\theta),$

where $\mathcal{L}_{\text{diff}}$ is the standard masked denoising loss and $\mathcal{L}_{\text{align}}$ is the mean cosine distance between AR and DLM hidden states. The alignment term acts as a representational anchor, compelling the DLM student to reuse semantic and syntactic features from the AR teacher, thereby reframing DLM training as a mechanism adaptation problem rather than de novo representation learning.
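As a hedged illustration of the combined objective, the sketch below computes a mean cosine distance over layers and positions and adds it to a given denoising loss. The nested-list layout `[layer][position][hidden_dim]`, the function names, and the choice to average over all positions (rather than, say, only masked ones) are assumptions; the summary does not specify these details. The default $\lambda = 10$ is the value the ablations report as the best trade-off.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def alignment_loss(student_states, teacher_states):
    """Mean cosine distance (1 - cos) over all layers and positions.

    Both arguments are indexed as [layer][position][hidden_dim].
    Averaging over every position is an assumption of this sketch.
    """
    total, count = 0.0, 0
    for s_layer, t_layer in zip(student_states, teacher_states):
        for s_vec, t_vec in zip(s_layer, t_layer):
            total += 1.0 - cosine_similarity(s_vec, t_vec)
            count += 1
    return total / count


def total_loss(diff_loss, student_states, teacher_states, lam=10.0):
    """L = L_diff + lambda * L_align, with diff_loss computed elsewhere."""
    return diff_loss + lam * alignment_loss(student_states, teacher_states)


# Toy example: one layer, two positions, hidden size 2.
teacher = [[[1.0, 0.0], [0.0, 1.0]]]
student = [[[2.0, 0.0], [1.0, 0.0]]]  # first vector aligned, second orthogonal

align = alignment_loss(student, teacher)
loss = total_loss(2.0, student, teacher)
```

In the toy example the first student vector points in the teacher's direction (distance 0) and the second is orthogonal (distance 1), so the alignment term is their mean. In a real training loop the teacher states would be computed without gradients, since the teacher is frozen.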

Experimental Results

Conversion experiments were performed using Qwen3 autoregressive models at multiple parameter scales (0.6B, 1.7B, 4B), adapting them to masked DLMs on the Nemotron-SFT-Code corpus with up to 50B tokens. The evaluation focuses on code generation using HumanEval and MBPP benchmarks.

Key empirical findings:

Conversion Efficiency and Sample Efficiency: With representation alignment, DLM adaptation trains up to 4x faster than pure denoising-based conversion, demonstrating substantially improved sample efficiency. For instance, at 1.7B scale, representation-aligned DLMs surpass the baseline by 9.4 pass@10 points under identical optimization budgets.
Performance Scaling: Alignment gains increase monotonically with model size. On HumanEval, oDLM (the representation-aligned DLM) improves pass@10 from 24.9 to 31.0 at 0.6B (+6.1) and from 31.1 to 40.5 at 1.7B (+9.4).
Data Efficiency and Freezing: When trained on a reduced subset (0.8B tokens), aligned DLMs matched or surpassed models trained on the full data stream, highlighting the redundancy of full-scale retraining once AR representations are anchored. Further, freezing the embedding and MLP blocks during adaptation yielded approximately 2x higher training throughput without compromising task performance, indicating that only mechanism adaptation is needed once representations are preserved.
Zero Architectural Overhead: The protocol requires no adapters, auxiliary modules, or capacity expansion—conversion is achieved simply by switching attention masks and applying the alignment loss.
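The freezing result above can be pictured as a name-based filter over model parameters. The sketch below is an assumption-laden illustration: the parameter names and the substring rule (`"embed"`, `"mlp"`) are hypothetical and not taken from the paper or from any specific library. In a real framework the selected parameters would be excluded from gradient updates.

```python
def is_frozen(param_name, frozen_keywords=("embed", "mlp")):
    """Return True if the parameter belongs to a block that is kept fixed.

    The keyword rule is a hypothetical convention for this sketch.
    """
    return any(keyword in param_name for keyword in frozen_keywords)


# Hypothetical parameter names for a single transformer layer.
param_names = [
    "embed_tokens.weight",
    "layers.0.self_attn.q_proj.weight",
    "layers.0.self_attn.o_proj.weight",
    "layers.0.mlp.up_proj.weight",
    "layers.0.mlp.down_proj.weight",
    "layers.0.input_layernorm.weight",
]

trainable = [name for name in param_names if not is_frozen(name)]
frozen = [name for name in param_names if is_frozen(name)]
```

Under this rule only the attention projections and normalization weights remain trainable, which matches the reading that adaptation mainly needs to change how tokens attend to one another.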

Comparison with Public DLMs: The oDLM-4B model (aligned using REPR-ALIGN) exhibited favorable efficiency-accuracy tradeoffs. Despite having fewer parameters and dramatically reduced compute/data budgets, oDLM outperformed strong public DLMs like Dream-7B in code pass@10 (gains of +2.39 on HumanEval and +2.40 on HumanEval+).
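For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: given $n$ samples per problem of which $c$ pass the tests, pass@k $= 1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below implements that estimator; whether the paper uses exactly this procedure, and with what $n$, is not stated in this summary, so treat the setup as an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass, k: evaluation budget.
    math.comb returns 0 when n - c < k, so the estimate is then exactly 1.0.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy usage: 20 samples per problem, 3 of which pass on the first problem.
single = pass_at_k(20, 3, 10)
benchmark = mean_pass_at_k([(20, 3), (20, 0), (20, 20)], 10)
```

A problem with no passing samples contributes 0 and a problem where every sample passes contributes 1, so the benchmark score is the average probability that at least one of $k$ draws succeeds.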

Ablations

Critical choices were systematically ablated:

Cosine vs. L2 Alignment: Cosine loss outperformed L2 loss, revealing that feature direction rather than raw scale is the transferable signal for representation preservation.
Alignment Weight: The best trade-off was observed at $\lambda=10$; a higher weight constrained denoising adaptation, while a lower weight failed to leverage the AR teacher.
Layer Selection: Aligning all hidden states provided the strongest performance on both pass@1 and pass@10, indicating utility is distributed across network depth.
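The cosine versus L2 distinction can be made concrete with a small example: cosine distance ignores vector magnitude, while L2 distance penalizes it. The sketch below uses toy two-dimensional vectors and hypothetical helper names; it illustrates the property behind the ablation and does not reproduce the paper's experiment.

```python
import math


def cosine_distance(u, v):
    """1 - cos(u, v); depends only on direction, not on vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean distance; sensitive to both direction and scale."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


teacher_vec = [3.0, 4.0]
student_vec = [0.3, 0.4]  # same direction as the teacher, ten times smaller

cos_d = cosine_distance(student_vec, teacher_vec)
l2_d = l2_distance(student_vec, teacher_vec)
```

Here the cosine distance is numerically zero while the L2 distance is large, so an L2 objective would also push the student to match the teacher's activation scale. This is consistent with the reported finding that direction, rather than raw scale, carries the transferable signal.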

Implications and Future Directions

These results challenge the strict dichotomy between AR and DLM paradigms. They indicate that the geometry of learned AR representations is substantially reusable across generation orders, so long as the adaptation protocol explicitly preserves it. The bottleneck in DLM scaling is thus reframed from representation learning to mechanism adaptation, enabling substantial cost savings and short adaptation cycles. Representation alignment provides a foundation for efficiently deploying non-sequential generation mechanisms with pretrained LMs.

From a theoretical perspective, this suggests the latent structure of next-token-trained models suffices to support both sequential and any-order token prediction. Practically, this enables efficient repurposing of existing AR checkpoints for DLM applications such as iterative refinement, bidirectional editing, and infilling, without retraining from scratch or introducing additional model complexity.

Further directions include:

Extending the alignment protocol to encoder-decoder transformers or multimodal DLMs.
Investigating bounded alignment schedules, hybrid attention regimes, or knowledge injection from more general teacher models.
Analyzing transferability across modalities and downstream tasks with alternate masking/generation strategies, particularly in instruction, reasoning, and multilingual settings.

Conclusion

This paper demonstrates that autoregressive pretraining imbues transformers with robust, order-agnostic internal representations. By anchoring DLM adaptation to these representations through layer-wise cosine alignment, one can efficiently repurpose AR LMs for bidirectional, non-sequential generation, achieving competitive or superior performance with drastically reduced compute and data requirements. Representation alignment is established as a practical, theoretically grounded recipe for scalable and efficient AR-to-DLM conversion.
