Mercury: Ultra-Fast Language Models

Updated 1 July 2025
  • Mercury models are a class of diffusion-based language models that use parallel token generation to achieve unprecedented throughput in code generation tasks.
  • They leverage a reverse denoising process within the Transformer framework to iteratively refine token sequences, ensuring high accuracy and robust performance.
  • Mercury models deliver exceptional results on benchmarks like HumanEval and MBPP while supporting expansive context windows for diverse coding applications.

Mercury refers to a class of commercial-scale ultra-fast LLMs that leverage diffusion-based generation within the Transformer framework. Mercury models represent a shift from autoregressive token-by-token prediction to parallel, coarse-to-fine sequence denoising, enabling unprecedented decoding throughput while preserving state-of-the-art quality metrics across multiple programming languages and coding benchmarks.

1. Diffusion-Based Generation and Architectural Principles

Mercury employs discrete diffusion processes to generate multiple tokens in parallel—a fundamental departure from conventional autoregressive Transformers. The core idea centers on the application of a Markov diffusion process to token sequences: starting from random noise, the model iteratively denoises the entire output sequence toward a valid completion within a bounded number of steps. At each iteration, the Transformer parameterization refines a partially denoised token sequence, mapping higher-noise states to distributions over the clean vocabulary.

The generative process is governed by two Markov chains:

  • The forward (noising) process $q$: $x_0 \to x_1 \to \dots \to x_T$ adds noise to the ground-truth sequence.
  • The reverse (denoising) process $p_\theta$: $x_T \to x_{T-1} \to \dots \to x_0$, parameterized by a Transformer, iteratively removes noise.

Training minimizes a denoising loss over all timesteps:

$$\mathcal{L}(x) = -\,\mathbb{E}_t \left[ \gamma(t)\; \mathbb{E}_{x_t \sim q} \log p_\theta(x \mid x_t) \right]$$

where $x_t$ is a noisy version of the token sequence at time $t$, and $\gamma(t)$ weights the loss for each noise level.
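A minimal PyTorch sketch of such an objective is shown below. It assumes a masked (absorbing-state) discrete diffusion variant in which the forward process replaces tokens with a [MASK] symbol at a rate that grows with $t$; the mask id, the weighting $\gamma(t) = 1/t$, and the model interface are illustrative placeholders rather than Mercury's published implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0          # placeholder id for the absorbing [MASK] token (assumption)
VOCAB_SIZE = 32000   # illustrative vocabulary size

def forward_noise(x0, t, T):
    """Forward process q: independently replace each token of x0 with [MASK]
    with probability t/T, so that x_T is (almost) fully masked."""
    mask = torch.rand_like(x0, dtype=torch.float) < (t / T)
    return torch.where(mask, torch.full_like(x0, MASK_ID), x0), mask

def diffusion_loss(model, x0, T=64):
    """One-sample estimate of L(x) = -E_t[ gamma(t) E_{x_t ~ q} log p_theta(x | x_t) ]."""
    t = torch.randint(1, T + 1, (1,)).item()          # sample a noise level
    x_t, mask = forward_noise(x0, t, T)               # corrupt the clean sequence
    logits = model(x_t)                               # Transformer maps noisy seq -> [B, L, V] logits
    # Cross-entropy against the clean tokens approximates -log p_theta(x | x_t);
    # only the corrupted positions contribute to the loss.
    ce = F.cross_entropy(
        logits.view(-1, VOCAB_SIZE), x0.view(-1), reduction="none"
    ).view_as(x0)
    gamma_t = 1.0 / t                                  # illustrative per-noise-level weight gamma(t)
    return gamma_t * (ce * mask.float()).sum() / mask.float().sum().clamp(min=1)
```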

This approach allows Mercury models to use native Transformer stacks—including attention, feed-forward networks, and established LLM optimization—while replacing the customary autoregressive next-token loss with a diffusion-based sequence denoising objective.

2. Parallel Inference and Throughput Gains

The primary advantage of Mercury’s diffusion paradigm is the capacity for massively parallel token generation. Rather than emitting one token per decoding step (as in standard LLMs), Mercury denoises potentially hundreds of tokens per step, enabling up to an order of magnitude higher throughput.
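The contrast with autoregressive decoding can be seen in a schematic decoding loop: the sequence starts fully noised and is refined in a small, fixed number of denoising passes, each updating every position in parallel, instead of one sequential pass per token. The step count, the confidence-based unmasking rule, and the helper names below are illustrative assumptions, not Mercury's published algorithm.

```python
import math
import torch

def parallel_denoise(model, length, num_steps=8, mask_id=0):
    """Schematic coarse-to-fine decoding: every position starts as [MASK] and the
    whole sequence is refined in num_steps parallel passes instead of `length`
    sequential ones."""
    x = torch.full((1, length), mask_id, dtype=torch.long)     # x_T: fully noised sequence
    for step in range(num_steps):
        still_masked = x.eq(mask_id)
        remaining = int(still_masked.sum().item())
        if remaining == 0:
            break
        logits = model(x)                                       # one pass updates all positions
        conf, pred = logits.softmax(dim=-1).max(dim=-1)         # per-position confidence / argmax
        # Commit an even share of the remaining masked positions, highest confidence first
        # (a common masked-diffusion heuristic, assumed here for illustration).
        k = math.ceil(remaining / (num_steps - step))
        conf = conf.masked_fill(~still_masked, float("-inf"))   # never overwrite committed tokens
        top = conf.topk(k, dim=-1).indices
        x.scatter_(1, top, pred.gather(1, top))
    return x
```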

On NVIDIA H100 GPUs, Mercury Coder Mini achieves 1,109 tokens/sec and Mercury Coder Small achieves 737 tokens/sec (latency-optimized batch size 1). In direct comparisons, Mercury models outperform all tested speed-optimized decoding baselines—including Llama 3.1 8B Instruct and Codestral 2501—by up to 10 times in average throughput, with no material loss in downstream code generation accuracy.

Empirical results demonstrate that Mercury models maintain state-of-the-art coding performance on HumanEval, MBPP, EvalPlus, MultiPL-E, LiveCodeBench, and BigCodeBench benchmarks. For example, Mercury Coder Small achieves 90.0% HumanEval, 76.6% MBPP, and 80.4% EvalPlus pass rates, with throughput vastly exceeding that of next-best open-weight or proprietary solutions.

Model                 | HumanEval | MBPP | EvalPlus | Speed (tokens/sec)
Mercury Coder Mini    | 88.0      | 77.1 | 78.6     | 1,109
Mercury Coder Small   | 90.0      | 76.6 | 80.4     | 737
Codestral 2501        | 85.0      | 72.2 | 75.6     | 171
Llama 3.1 8B Instruct | 66.5      | 59.2 | 60.2     | 153

On Copilot Arena, a developer-facing, head-to-head evaluation platform, Mercury Coder Mini ranks first in responsiveness and second in overall code quality, with a latency of 0.25 seconds per response and an Elo quality score of 993.

3. Multilingual and Code-Specific Capabilities

Mercury is primarily optimized for code-centric use cases, targeting high-throughput environments such as code completion, editing, agentic (multi-step) coding, and chain-of-thought reasoning. The models support native context windows of up to 32,768 tokens (extendable to 128,000), enabling full-file, multi-document, or agentic scenarios. Performance on MultiPL-E reveals high accuracy across C++, Java, JavaScript, PHP, Bash, and TypeScript, with average scores of 74.1% (Mini) and 76.2% (Small). High accuracy on fill-in-the-middle (FIM) tasks further demonstrates Mercury’s utility for IDE and practical code editing workflows.

Model               | MultiPL-E Avg. | FIM Avg. | Context Window
Mercury Coder Mini  | 74.1           | 82.2     | 32,768 tokens
Mercury Coder Small | 76.2           | 84.8     | 32,768 tokens
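Fill-in-the-middle evaluation is typically framed by giving the model the code before and after a hole and asking it to produce only the missing span. The sketch below illustrates one such prompt; the sentinel tokens follow a generic infilling convention and are an assumption, not Mercury's documented prompt format.

```python
# Hypothetical fill-in-the-middle (FIM) prompt construction. The <|fim_*|> sentinels
# are a common infilling convention and an assumption, not Mercury's documented spec.
prefix = "def factorial(n):\n    if n == 0:\n        return 1\n"
suffix = "\nprint(factorial(5))\n"

fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
# The model is expected to emit only the missing middle, e.g.:
#     return n * factorial(n - 1)
```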

4. Integration and Developer Accessibility

The Mercury models are publicly deployed and accessible via API (OpenAI-compatible) and playground endpoints, removing integration barriers for developers and product teams. The models are architecturally compatible with existing LLM serving frameworks and open-weight infrastructure, since the only modifications relative to a vanilla Transformer architecture are the loss function and the generation algorithm.

This enables direct experimentation and comparison in production coding assistants, chatbots, and research pipelines without refactoring existing codebases.
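Because the endpoints follow the OpenAI chat-completions convention, an existing OpenAI client can be pointed at them by changing only the base URL and model name. The URL and model identifier below are placeholders; substitute the values from the provider's documentation.

```python
from openai import OpenAI

# Placeholder endpoint and model name; replace with the provider's documented values.
client = OpenAI(
    base_url="https://api.example.com/v1",   # hypothetical OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="mercury-coder-small",             # illustrative model identifier
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
)
print(response.choices[0].message.content)
```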

5. Evaluation and Real-World Validation

Independent evaluations by third-party AI benchmarking services (e.g., Artificial Analysis) and human-in-the-loop feedback (Copilot Arena) confirm that Mercury Coder models set a new Pareto frontier on the speed-quality plane for language and code models at their scale. Mercury Coder Mini notably surpasses widely used "frontier" models such as GPT-4o Mini, Llama 3.1 8B, and Codestral 2501 in both throughput and real-world judged quality. On Copilot Arena, Mercury achieves a fourfold lower response latency compared to GPT-4o Mini.

These findings hold across languages, code task difficulty, and prompt types, including complex agentic and fill-in-the-middle scenarios.

6. Implications and Research Outlook

The Mercury diffusion-Transformer approach demonstrates that parallel, diffusion-based generation is viable at commercial scale, delivering ultra-fast inference previously unattainable with autoregressive architectures. This development challenges the belief that large LLMs must sacrifice throughput for quality. Mercury models are architecturally and operationally compatible with RLHF, direct preference optimization (DPO), and context extension tooling.

A plausible implication is that diffusion-based methods may become foundational for future ultra-low-latency LLMs in latency-critical domains, provided appropriate parallel hardware and optimization support.

Potential directions for future research and practical deployment include:

  • Adapting diffusion LLMs for multi-modal or conversational agents beyond code
  • Systematic study of the interaction between denoising schedule length, quality, and real-world latency (especially for longer contexts)
  • Exploring optimal denoising strategies and Transformer modifications tailored to discrete text rather than continuous modalities

7. Summary Table: Mercury Versus Prior State-of-the-Art

Property              | Mercury Coder Mini | Llama 3.1 8B | Codestral 2501 | GPT-4o Mini
HumanEval (%)         | 88.0               | 66.5         | 85.0           | n/a
MBPP (%)              | 77.1               | 59.2         | 72.2           | n/a
Throughput (tokens/s) | 1,109              | 153          | 171            | ~200 (est.)
Copilot Arena Latency | 0.25 s             | n/a          | 0.31 s         | 0.84 s
Copilot Arena Elo     | 993                | n/a          | 992            | 939

Mercury models exemplify a new era of ultra-fast, high-quality LLMs blending diffusion generation and Transformer expressiveness, now validated both by public benchmarks and large-scale developer adoption.