- The paper proposes a diffusion-based LLM that replaces autoregressive generation with parallel token denoising for faster inference.
- It employs standard Transformer architectures paired with a novel discrete diffusion training method, with training and serving optimized for NVIDIA H100 GPUs.
- Empirical results on coding benchmarks show up to 10× speedup while maintaining competitive output quality and scalability.
Mercury: Ultra-Fast LLMs Based on Diffusion
Mercury introduces a new class of commercial-scale LLMs that employ a diffusion-based generation mechanism, diverging from the prevailing autoregressive (AR) paradigm. The work demonstrates that diffusion models can achieve highly competitive accuracy on coding-related benchmarks while establishing a clear lead in inference speed. The core contribution lies in transplanting the generative strengths of diffusion (historically applied to continuous domains like images and video) into the discrete token-sequence domain of language modeling, at modern LLM scale. This enables effective parallel token generation and thus greatly improved inference throughput without sacrificing output quality.
Architecture and Training
Mercury models employ standard Transformer architectures, deliberately maintaining architectural compatibility with established optimization and serving pipelines. The innovation arises not at the network level but within the generative process: Mercury models are trained using a discrete diffusion objective. In this approach, training alternates between noising token sequences through a Markov chain and learning a denoising model that reconstructs data from partially corrupted versions. At inference time, the generative process starts from noise and iteratively refines the output sequence in parallel via learned denoising steps.
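The exact corruption process is not spelled out in public detail; a common choice for discrete diffusion over tokens is an absorbing-state (masking) Markov chain, sketched below under that assumption. The `MASK_ID` value and the per-sequence noise level `t` are illustrative, not documented parameters.

```python
import torch

MASK_ID = 32000  # hypothetical id reserved for the absorbing [MASK] token

def corrupt(tokens: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Forward (noising) process q(x_t | x_0) for absorbing-state discrete diffusion.

    Each token is independently replaced by [MASK] with probability t, so the
    sequence is clean at t=0 and fully masked at t=1. `t` has shape (batch, 1),
    letting every sequence in the batch sit at a different noise level.
    """
    replace = torch.rand_like(tokens, dtype=torch.float) < t
    return torch.where(replace, torch.full_like(tokens, MASK_ID), tokens)
```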
Several engineering considerations underlie the scaling of this approach:
- Mercury leverages large-scale proprietary training datasets, spanning trillions of tokens, and benefits from a tightly optimized pipeline on clusters of NVIDIA H100 GPUs.
- The models natively support a context length of 32k tokens, extendable to 128k tokens with established context extension strategies.
The training loss is a weighted sum of cross-entropy terms computed over token sequences corrupted at different noise levels. The denoising procedure is tuned for parallel execution on modern accelerator hardware.
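A minimal sketch of such an objective, assuming the absorbing-state corruption above and a per-sequence 1/t weighting (a common choice in masked-diffusion language models; the paper's exact weighting is not public):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Weighted cross-entropy over sequences corrupted at random noise levels.

    `model` is any Transformer mapping token ids (B, L) to logits (B, L, V);
    `corrupt` and MASK_ID are the illustrative helpers defined above.
    """
    B, L = x0.shape
    t = torch.rand(B, 1, device=x0.device).clamp_min(1e-3)  # noise level per sequence
    xt = corrupt(x0, t)                                      # run the forward process
    logits = model(xt)                                       # denoiser predicts x_0
    ce = F.cross_entropy(
        logits.reshape(B * L, -1), x0.reshape(B * L), reduction="none"
    ).reshape(B, L)
    masked = (xt == MASK_ID).float()
    # Average cross-entropy over corrupted positions, weighted by 1/t per sequence.
    per_seq = (ce * masked).sum(dim=1) / masked.sum(dim=1).clamp_min(1.0)
    return (per_seq / t.squeeze(1)).mean()
```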
Inference and Serving Pipeline
The generation process in Mercury is inherently parallel: multiple tokens are sampled and refined in blocks at each denoising step. This contrasts with AR models, which are limited by strict left-to-right token-by-token decoding. To capitalize on this, Mercury employs a proprietary inference engine that features dynamically batched diffusion sampling, custom kernel optimizations, and dynamic navigation of the speed-quality trade-off.
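The production scheduler and kernels are proprietary; the sketch below only illustrates the general shape of confidence-based parallel denoising, where each step commits the most confident predictions for an entire block rather than a single left-to-right token. The step count and unmasking schedule are illustrative.

```python
import torch

@torch.no_grad()
def parallel_generate(model, prompt: torch.Tensor, gen_len: int = 128, steps: int = 8):
    """Iteratively refine a fully masked block, committing several tokens per step.

    `prompt` has shape (1, P); returns (1, P + gen_len) token ids. MASK_ID is the
    illustrative mask token defined earlier.
    """
    x = torch.cat(
        [prompt, torch.full((1, gen_len), MASK_ID, device=prompt.device)], dim=1
    )
    for step in range(steps):
        still_masked = x == MASK_ID
        remaining = int(still_masked.sum().item())
        if remaining == 0:
            break
        logits = model(x)                                # (1, P+gen_len, V)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)  # per-position confidence
        conf = conf.masked_fill(~still_masked, -1.0)     # consider only masked slots
        # Unmask an equal share of the remaining positions at each step.
        k = max(1, remaining // (steps - step))
        idx = conf.topk(k, dim=-1).indices
        x[0, idx[0]] = pred[0, idx[0]]
    return x
```

In a scheme like this, the number of denoising steps is the natural knob for the speed-quality trade-off mentioned above.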
Key inference properties include:
- Throughput: Mercury achieves 1109 tokens/sec (Mini) and 737 tokens/sec (Small) on H100s, according to independent benchmarking. These rates are up to 10× faster than comparable open- and closed-weights AR models within the same quality tier.
- Prompting: The models support both unconditional sequence generation and conditional generation akin to traditional LLMs, covering standard prompting modalities (zero-shot, few-shot, chain-of-thought).
- API Compatibility: A drop-in replacement for existing AR APIs (e.g., OpenAI standard), facilitating rapid migration and interoperability for current downstream systems.
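Because the serving layer mirrors the OpenAI API shape, existing clients can usually be redirected by changing only the base URL and model name. The endpoint and model identifier below are placeholders, not documented values:

```python
from openai import OpenAI

# Hypothetical endpoint and model id; substitute the values from your provider.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="mercury-coder-small",  # placeholder model name
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
)
print(response.choices[0].message.content)
```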
Empirical Evaluation
The Mercury Coder variants are extensively evaluated on widely adopted coding benchmarks (HumanEval, MBPP, EvalPlus, MultiPL-E, FIM, LiveCodeBench, BigCodeBench), and their output is validated both quantitatively and via large-scale human ratings (Copilot Arena).
Highlights from the reported results:
| Model | HumanEval | MBPP | MultiPL-E | FIM Avg. | Speed (t/s) |
| --- | --- | --- | --- | --- | --- |
| Mercury Coder Mini | 88.0 | 77.1 | 74.1 | 82.2 | 1109 |
| Mercury Coder Small | 90.0 | 76.6 | 76.2 | 84.8 | 737 |
| Claude 3.5 Haiku | 86.0 | 78.0 | 72.3 | 45.5 | 61 |
| GPT-4o Mini | 88.0 | 74.6 | 72.0 | 60.9 | 59 |
| Codestral 2501 | 85.0 | 72.2 | 73.4 | 82.5 | 171 |
On FIM-style infilling tasks—crucial for code completion and editing—both Mercury variants surpass all compared models, including frontier open- and closed-weights models, establishing a new state-of-the-art for fill-in-the-middle code synthesis under high-throughput constraints.
On Copilot Arena, Mercury Coder Mini ranks second in output quality but first by a substantial margin in latency, returning completions in 25 ms—approximately 4× faster than GPT-4o Mini.
Theoretical and Practical Implications
Parallel Token Generation: By sidestepping the sequential dependency of AR decoding, diffusion-based generation enables substantial latency reductions in large-batch, multi-user settings, making LLM-based tooling feasible for real-time and edge applications previously constrained by token-by-token decoding bottlenecks.
Scalability: The denoising paradigm exhibits promising scaling characteristics, with larger Mercury variants delivering monotonic improvements in accuracy across all benchmarks. This suggests that diffusion approaches could continue to yield quality improvements as model and data size increase, potentially closing any residual quality gap with the largest AR models.
Integration and Deployment: The architecture's compatibility with established serving pipelines and APIs removes typical friction associated with deploying fundamentally novel models. Migration from AR to diffusion-based backends can be accomplished transparently for most downstream services.
Limitations and Future Directions
- The present results focus on coding tasks; transferability of the speed-quality trade-off to general natural language or multi-modal settings, though plausible, awaits further empirical validation.
- While Mercury models maintain competitive accuracy, the absolute state of the art on some coding metrics is still held by larger AR models (e.g., GPT-4o, Claude 3.5 Sonnet). Continued investigation into scaling laws for diffusion LLMs (dLLMs) and refinement of denoising objectives will be necessary to surpass the AR baseline in unconstrained generation tasks.
- The proprietary nature of some inference infrastructure may limit reproducibility or community-driven optimization in open settings.
Outlook
The introduction of diffusion mechanisms at this scale for LLMs establishes diffusion not only as a method for high-fidelity generative modeling of continuous data but also as a practical approach for discrete sequence tasks with demanding throughput and latency requirements. Given the demonstrated compatibility with fine-tuning, alignment (RLHF/DPO), and prompting regimes, diffusion-based LLMs such as Mercury are positioned as a strong basis for next-generation AI systems operating in latency-sensitive production scenarios. Further research into architectures, objectives, and tokenization strategies tailored for diffusion may yield additional gains in both performance and controllability for sequence modeling.