- The paper proposes a diffusion-based LLM that replaces autoregressive generation with parallel token denoising for faster inference.
- It employs standard Transformer architectures paired with a novel discrete diffusion training method, with training and serving optimized for NVIDIA H100 GPUs.
- Empirical results on coding benchmarks show up to 10× speedup while maintaining competitive output quality and scalability.
Mercury: Ultra-Fast LLMs Based on Diffusion
Mercury introduces a new class of commercial-scale LLMs that employ a diffusion-based generation mechanism, diverging from the prevailing autoregressive (AR) paradigm. The work demonstrates that diffusion models can achieve highly competitive accuracy on coding-related benchmarks while establishing a clear lead in inference speed. The core contribution lies in transplanting the generative strengths of diffusion (historically applied to continuous domains like images and video) into the discrete token-sequence domain of language modeling, at modern LLM scale. This enables effective parallel token generation and thus greatly improved inference throughput without sacrificing output quality.
Architecture and Training
Mercury models employ standard Transformer architectures, deliberately maintaining architectural compatibility with established optimization and serving pipelines. The innovation arises not at the network level but within the generative process: Mercury models are trained using a discrete diffusion objective. In this approach, training alternates between noising token sequences through a Markov chain and learning a denoising model that reconstructs data from partially corrupted versions. At inference time, the generative process starts from noise and iteratively refines the output sequence in parallel via learned denoising steps.
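The exact corruption process is not spelled out in public detail; a common choice for discrete diffusion over tokens is an absorbing-state (masking) Markov chain, sketched below under that assumption. The `MASK_ID` value and the per-sequence noise level `t` are illustrative, not documented parameters.

```python
import torch

MASK_ID = 32000  # hypothetical id reserved for the absorbing [MASK] token

def corrupt(tokens: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Forward (noising) process q(x_t | x_0) for absorbing-state discrete diffusion.

    Each token is independently replaced by [MASK] with probability t, so the
    sequence is clean at t=0 and fully masked at t=1. `t` has shape (batch, 1),
    letting every sequence in the batch sit at a different noise level.
    """
    replace = torch.rand_like(tokens, dtype=torch.float) < t
    return torch.where(replace, torch.full_like(tokens, MASK_ID), tokens)
```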
Several engineering considerations underlie the scaling of this approach:
- Mercury leverages large-scale proprietary training datasets, spanning trillions of tokens, and benefits from a tightly optimized pipeline on clusters of NVIDIA H100 GPUs.
- The models natively support a context length of 32k tokens, extendable to 128k tokens with established context extension strategies.
The training loss is a weighted sum of cross-entropy terms computed over token sequences corrupted at different noise levels. The denoising procedure is tuned for parallel execution on modern accelerator hardware.
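A minimal sketch of such an objective, assuming the absorbing-state corruption above and a per-sequence 1/t weighting (a common choice in masked-diffusion language models; the paper's exact weighting is not public):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Weighted cross-entropy over sequences corrupted at random noise levels.

    `model` is any Transformer mapping token ids (B, L) to logits (B, L, V);
    `corrupt` and MASK_ID are the illustrative helpers defined above.
    """
    B, L = x0.shape
    t = torch.rand(B, 1, device=x0.device).clamp_min(1e-3)  # noise level per sequence
    xt = corrupt(x0, t)                                      # run the forward process
    logits = model(xt)                                       # denoiser predicts x_0
    ce = F.cross_entropy(
        logits.reshape(B * L, -1), x0.reshape(B * L), reduction="none"
    ).reshape(B, L)
    masked = (xt == MASK_ID).float()
    # Average cross-entropy over corrupted positions, weighted by 1/t per sequence.
    per_seq = (ce * masked).sum(dim=1) / masked.sum(dim=1).clamp_min(1.0)
    return (per_seq / t.squeeze(1)).mean()
```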
Inference and Serving Pipeline
The generation process in Mercury is inherently parallel: multiple tokens are sampled and refined in blocks at each denoising step. This contrasts with AR models, which are limited by strict left-to-right token-by-token decoding. To capitalize on this, Mercury employs a proprietary inference engine that features dynamically batched diffusion sampling, custom kernel optimizations, and dynamic navigation of the speed-quality trade-off.
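The production scheduler and kernels are proprietary; the sketch below only illustrates the general shape of confidence-based parallel denoising, where each step commits the most confident predictions for an entire block rather than a single left-to-right token. The step count and unmasking schedule are illustrative.

```python
import torch

@torch.no_grad()
def parallel_generate(model, prompt: torch.Tensor, gen_len: int = 128, steps: int = 8):
    """Iteratively refine a fully masked block, committing several tokens per step.

    `prompt` has shape (1, P); returns (1, P + gen_len) token ids. MASK_ID is the
    illustrative mask token defined earlier.
    """
    x = torch.cat(
        [prompt, torch.full((1, gen_len), MASK_ID, device=prompt.device)], dim=1
    )
    for step in range(steps):
        still_masked = x == MASK_ID
        remaining = int(still_masked.sum().item())
        if remaining == 0:
            break
        logits = model(x)                                # (1, P+gen_len, V)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)  # per-position confidence
        conf = conf.masked_fill(~still_masked, -1.0)     # consider only masked slots
        # Unmask an equal share of the remaining positions at each step.
        k = max(1, remaining // (steps - step))
        idx = conf.topk(k, dim=-1).indices
        x[0, idx[0]] = pred[0, idx[0]]
    return x
```

In a scheme like this, the number of denoising steps is the natural knob for the speed-quality trade-off mentioned above.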
Key inference properties include:
- Throughput: Mercury achieves 1109 tokens/sec (Mini) and 737 tokens/sec (Small) on H100s, according to independent benchmarking. These rates are up to 10× faster than comparable open- and closed-weights AR models within the same quality tier.
- Prompting: The models support both unconditional sequence generation and conditional generation akin to traditional LLMs, covering standard prompting modalities (zero-shot, few-shot, chain-of-thought).
- API Compatibility: A drop-in replacement for existing AR APIs (e.g., OpenAI standard), facilitating rapid migration and interoperability for current downstream systems.
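Because the serving layer mirrors the OpenAI API shape, existing clients can usually be redirected by changing only the base URL and model name. The endpoint and model identifier below are placeholders, not documented values:

```python
from openai import OpenAI

# Hypothetical endpoint and model id; substitute the values from your provider.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="mercury-coder-small",  # placeholder model name
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
)
print(response.choices[0].message.content)
```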
Empirical Evaluation
The Mercury Coder variants are extensively evaluated on widely adopted coding benchmarks (HumanEval, MBPP, EvalPlus, MultiPL-E, FIM, LiveCodeBench, BigCodeBench), and their output is validated both quantitatively and via large-scale human ratings (Copilot Arena).
Highlights from the reported results:
| Model | HumanEval | MBPP | MultiPL-E | FIM Avg. | Speed (t/s) |
| --- | --- | --- | --- | --- | --- |
| Mercury Coder Mini | 88.0 | 77.1 | 74.1 | 82.2 | 1109 |
| Mercury Coder Small | 90.0 | 76.6 | 76.2 | 84.8 | 737 |
| Claude 3.5 Haiku | 86.0 | 78.0 | 72.3 | 45.5 | 61 |
| GPT-4o Mini | 88.0 | 74.6 | 72.0 | 60.9 | 59 |
| Codestral 2501 | 85.0 | 72.2 | 73.4 | 82.5 | 171 |
On FIM-style infilling tasks—crucial for code completion and editing—both Mercury variants surpass all compared models, including frontier open- and closed-weights models, establishing a new state-of-the-art for fill-in-the-middle code synthesis under high-throughput constraints.
On Copilot Arena, Mercury Coder Mini ranks second in output quality but first by a substantial margin in latency, returning completions in 25 ms—approximately 4× faster than GPT-4o Mini.
Theoretical and Practical Implications
Parallel Token Generation: By sidestepping the sequential dependency of AR decoding, diffusion-based generation enables substantial latency reductions in large-batch, multi-user settings, making LLM-based tooling feasible for real-time and edge applications previously constrained by token-by-token decoding bottlenecks.
Scalability: The denoising paradigm exhibits promising scaling characteristics, with larger Mercury variants delivering monotonic improvements in accuracy across all benchmarks. This suggests that diffusion approaches could continue to yield quality improvements as model and data size increase, potentially closing any residual quality gap with the largest AR models.
Integration and Deployment: The architecture's compatibility with established serving pipelines and APIs removes typical friction associated with deploying fundamentally novel models. Migration from AR to diffusion-based backends can be accomplished transparently for most downstream services.
Limitations and Future Directions
- The present results focus on coding tasks; transferability of the speed-quality trade-off to general natural language or multi-modal settings, though plausible, awaits further empirical validation.
- While Mercury models maintain competitive accuracy, the absolute state of the art on some coding metrics is still held by larger AR models (e.g., GPT-4o, Claude 3.5 Sonnet). Continued investigation into scaling laws for diffusion LLMs (dLLMs) and refinement of denoising objectives will be necessary to surpass the AR baseline in unconstrained generation tasks.
- The proprietary nature of some inference infrastructure may limit reproducibility or community-driven optimization in open settings.
Outlook
The introduction of diffusion mechanisms at this scale for LLMs establishes diffusion not only as a method for high-fidelity generative modeling of continuous data but also as a practical approach for discrete sequence tasks with demanding throughput and latency requirements. Given the demonstrated compatibility with fine-tuning, alignment (RLHF/DPO), and prompting regimes, diffusion-based LLMs such as Mercury are positioned as a strong basis for next-generation AI systems operating in latency-sensitive production scenarios. Further research into architectures, objectives, and tokenization strategies tailored for diffusion may yield additional gains in both performance and controllability for sequence modeling.