Mercury Coder Models: Diffusion & Transformer

Updated 6 November 2025
  • Mercury Coder models are large language models that employ a diffusion-based generative paradigm integrated with transformers to generate code in parallel.
  • They are pre-trained on trillions of code tokens with a denoising objective and diffusion-aware fine-tuning strategies, achieving up to 1109 tokens/s on an NVIDIA H100 and state-of-the-art fill-in-the-middle performance.
  • Engineered for practical deployment, Mercury Coder models offer API access and compatibility with existing systems, optimizing both latency and code quality for developer tools.

Mercury Coder models are a class of LLMs for code generation and completion that introduce a diffusion-based generative paradigm, achieving parallel text generation via iterative denoising rather than conventional autoregressive token-by-token decoding. Developed by Inception Labs, Mercury Coder establishes a new state-of-the-art on the speed-quality frontier for code-centric LLMs, leveraging the transformer architecture within a diffusion modeling framework. The suite includes two models—Mini and Small—delivering substantial gains in throughput on current accelerators while maintaining or exceeding the quality of competitive open and proprietary models.

1. Architectural Foundations: Diffusion-Transformer Synthesis

Mercury Coder integrates a diffusion process with a transformer backbone as the central architectural innovation. Diffusion modeling, adapted from generative denoising paradigms, enables prediction and refinement of entire token sequences in parallel.

  • Forward (Noising) Process $q(z_t \mid z_{t-1})$: Starting from the clean token sequence $x \in \mathcal{X}$, progressively adds noise in a Markov chain across $T$ steps, culminating in a maximally noisy latent $z_T$ sampled from a fixed noise prior.
  • Reverse (Denoising) Process $p_\theta(z_{t-1} \mid z_t)$: Begins from the noise prior and iteratively denoises via a transformer at each step, gradually reconstructing the original sequence.
  • Parallel Generation: Unlike autoregressive LLMs, which decode a single token at a time, Mercury Coder executes denoising across the entire sequence synchronously at each diffusion step (a schematic sampling loop is sketched after this list).
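
The following is a minimal, schematic sketch of such a parallel denoising sampler, not Inception Labs' implementation: it assumes a masked (absorbing-state) discrete diffusion variant in which noising replaces tokens with a mask token and each reverse step commits the positions the model is most confident about. The `denoiser` interface, unmasking rule, and step schedule are illustrative assumptions.

```python
# Schematic reverse-diffusion sampler for a masked discrete diffusion model.
import torch

def diffusion_sample(denoiser, seq_len, vocab_size, mask_id, num_steps=8):
    """Generate a full sequence in parallel by iterative denoising.

    denoiser(tokens) -> logits of shape (seq_len, vocab_size); in Mercury Coder
    this role is played by a transformer conditioned on the noisy sequence.
    """
    # z_T: the maximally noisy latent -- here, every position is masked.
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(num_steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        logits = denoiser(tokens)                   # one parallel pass over all positions
        probs = torch.softmax(logits, dim=-1)
        confidence, prediction = probs.max(dim=-1)  # per-position best guess
        # Commit a growing fraction of masked positions each step (coarse-to-fine).
        num_to_commit = max(1, int(still_masked.sum() * (step + 1) / num_steps))
        candidate_conf = torch.where(still_masked, confidence, torch.tensor(-1.0))
        commit = candidate_conf.topk(num_to_commit).indices
        tokens[commit] = prediction[commit]
    return tokens

# Toy usage with a random "denoiser" standing in for the transformer.
vocab_size, mask_id, seq_len = 101, 100, 16
toy_denoiser = lambda toks: torch.randn(toks.shape[0], vocab_size)
print(diffusion_sample(toy_denoiser, seq_len, vocab_size, mask_id))
```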

Loss Function: The primary training objective is $\mathcal{L}(x) = -\mathbb{E}_t \left[ \gamma(t) \cdot \mathbb{E}_{z_t \sim q} \log p_\theta( x \mid z_t ) \right]$, where $\gamma(t)$ weights noise levels and $p_\theta(x \mid z_t)$ is the transformer's modeled conditional output.
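
A minimal sketch of how a loss of this form can be computed for a single example, assuming a masked-token corruption process; the linear noise schedule, the choice $\gamma(t) = 1/t$, and the model interface are illustrative assumptions, not the published training recipe.

```python
# Schematic computation of L(x) = -E_t[ gamma(t) * E_{z_t ~ q} log p_theta(x | z_t) ].
import torch
import torch.nn.functional as F

def denoising_loss(model, x, mask_id, num_steps):
    """model(z_t) -> logits over the vocabulary at every position of z_t."""
    t = torch.randint(1, num_steps + 1, ())          # sample a noise level t ~ Uniform{1..T}
    noise_frac = t.float() / num_steps               # fraction of tokens corrupted at step t
    corrupted = torch.rand(x.shape) < noise_frac     # q(z_t | x): mask a t-dependent fraction
    z_t = torch.where(corrupted, torch.full_like(x, mask_id), x)
    logits = model(z_t)                              # transformer's conditional p_theta(x | z_t)
    nll = F.cross_entropy(logits, x, reduction="mean")  # -log p_theta(x | z_t), averaged
    gamma = 1.0 / t.float()                          # assumed noise-level weighting gamma(t)
    return gamma * nll

# Toy usage with a random "model" standing in for the transformer.
vocab_size, mask_id, seq_len, num_steps = 101, 100, 16, 8
x = torch.randint(0, 100, (seq_len,))
toy_model = lambda z: torch.randn(z.shape[0], vocab_size)
print(denoising_loss(toy_model, x, mask_id, num_steps).item())
```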

Significance: This architecture realizes substantial computational parallelism, aligning with modern accelerator capabilities, and retains compatibility with proven transformer optimizations.

2. Training Regimen and Scaling

  • Data Volume: Mercury Coder is pretrained on trillions of code tokens sourced from both web-crawled open datasets and proprietary, curated codebases (including real and synthetic code).
  • Alignment and Fine-Tuning: Supports continued pre-training, fine-tuning, and RLHF/DPO-based alignment. The training pipeline is adaptation-aware, mapping reward-based or preference-based objectives into the denoising loss landscape.
  • Context Handling: Native support for contexts up to 32,768 tokens—with extensions to 128,000 tokens—suits large, real-world codebases.

Technical Note: The model’s training replaces conventional (auto)regressive negative log-likelihood with an iterative denoising likelihood over the noisy sequence space, reflecting the structure of its generative process.
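
For contrast, the standard autoregressive objective and the diffusion objective from Section 1 can be written side by side (the autoregressive form is the textbook negative log-likelihood, stated here only for comparison):

$\mathcal{L}_{\mathrm{AR}}(x) = -\sum_{i=1}^{n} \log p_\theta(x_i \mid x_{<i}) \qquad \text{vs.} \qquad \mathcal{L}_{\mathrm{diff}}(x) = -\mathbb{E}_t\left[\gamma(t)\, \mathbb{E}_{z_t \sim q} \log p_\theta(x \mid z_t)\right]$

The autoregressive sum over positions implies $n$ sequential decoding steps at inference, whereas the diffusion expectation is taken over noise levels, so each training target and each generation step covers the whole sequence at once.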

3. Model Variants and Resource Performance

| Model | Throughput (tokens/s, H100) | HumanEval | MBPP | EvalPlus | MultiPL-E | Copilot Arena Latency (s) | Copilot Arena Elo | API Access |
|---|---|---|---|---|---|---|---|---|
| Mercury Coder Mini | 1109 | 88.0 | 77.1 | 78.6 | 74.1 | 0.25 | 993 (2nd) | Yes (API, playground) |
| Mercury Coder Small | 737 | 90.0 | 76.6 | 80.4 | 76.2 | --- | --- | Yes |
  • Throughput values are measured on NVIDIA H100 GPUs under realistic serving conditions.
  • Quality: Both Mini and Small deliver HumanEval/MBPP/EvalPlus scores on par with or above competitive open and commercial LLMs while running 5–10× faster.
  • Mercury Coder Mini is the fastest LLM in Copilot Arena, with latency of 0.25 seconds and tied for second-highest quality, as ranked by developer Elo.

4. Benchmarking: Mainstream and Real-World Code Tasks

Core Code Benchmarks:

  • HumanEval, MBPP, EvalPlus, MultiPL-E: Mercury Coder models consistently match or outperform speed-optimized baselines while delivering roughly an order of magnitude higher throughput.

Fill-in-the-Middle (FIM):

| Model | FIM Single-Line | FIM Random-Span | Average |
|---|---|---|---|
| Mercury Coder Mini | 92.9 | 71.5 | 82.2 |
| Mercury Coder Small | 93.1 | 76.5 | 84.8 |

Both models set state-of-the-art scores on FIM, making them highly effective for code completion and infill applications.
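
To make the task concrete, a fill-in-the-middle example gives the model the code before and after a gap and asks for the missing span. The task below and the exact-match scoring are schematic assumptions used to explain the benchmark, not its actual harness.

```python
# Illustration of a single-line fill-in-the-middle (FIM) task and a simple
# exact-match score. The example and scoring rule are assumptions for clarity.
prefix = "def is_even(n):\n    return "
suffix = "\n\nprint(is_even(4))"
reference_middle = "n % 2 == 0"

def score_fim(completion: str, reference: str) -> bool:
    """Score a completion by exact match after whitespace normalization."""
    return completion.strip() == reference.strip()

# A hypothetical model completion for the gap between prefix and suffix:
model_middle = "n % 2 == 0"
print(score_fim(model_middle, reference_middle))  # True -> counts as a correct infill
```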

Multilingual Coding (MultiPL-E):

Mercury Coder models perform at or above the top open and commercial LLMs in languages such as Java, JavaScript, and TypeScript.

Copilot Arena (Human Validation):

Mercury Coder Mini achieves leading rankings in latency and is tied for second in quality, confirming practical effectiveness in developer-facing settings.

5. Innovations and Systems Integration

  • Diffusion-Based Parallelism: Enables coarse-to-fine, non-sequential code generation, breaking the conventional bottleneck of autoregressive decoding.
  • Transformer Compatibility: Leverages all established scaling, optimization, and engineering advances in transformer-based systems, ensuring maturity and extensibility.
  • Custom Inference Kernels: Incorporates proprietary batching, paging, and kernel approaches for full accelerator efficiency. The API is OpenAI-compatible, allowing integration as a no-code-change substitute in downstream systems.
  • Long Contexts and Infill Excellence: Built-in support for large contexts and SOTA fill-in-the-middle capacity, targeting LLM assistant and code-copilot workloads.

6. Practical Access, Implementation, and Deployment

  • API and Playground: A public API is available at platform.inceptionlabs.ai with OpenAI-compatible interfaces (a request sketch follows this list); free interactive testing is available via chat.inceptionlabs.ai.
  • Deployment: High-throughput design minimizes infrastructure cost for latency- and scale-sensitive deployments, such as IDE assistants or enterprise devtools.
  • Extension: The architecture supports further model scaling, new pretraining sources, task-specific fine-tuning, and integration with code search, review, and annotation pipelines.
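
Because the API follows OpenAI interface conventions, existing clients can be pointed at it by changing the base URL. Below is a minimal sketch using the openai Python SDK; the base URL path and model identifier are placeholders, so consult platform.inceptionlabs.ai for the actual values.

```python
# Minimal sketch of calling an OpenAI-compatible endpoint with the openai SDK.
# The base_url and model name below are assumptions, not documented values.
from openai import OpenAI

client = OpenAI(
    base_url="https://platform.inceptionlabs.ai/v1",  # assumed endpoint path
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="mercury-coder-small",  # hypothetical model identifier
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a linked list."}
    ],
)
print(response.choices[0].message.content)
```

Since only the base URL and model name change, existing OpenAI-based tooling can target the service without restructuring, matching the no-code-change substitution described in Section 5.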

7. Distinctiveness and Impact

Mercury Coder models signify a transition from autoregressive, sequential LLMs to parallel, diffusion-based contenders in code modeling. The adoption of a diffusion process for token sequence generation is a first at commercial scale for language modeling, achieving:

  • Substantially higher throughput (up to 1109 tokens/s for Mini) than comparable open and proprietary coding models at matching or superior quality.
  • Leading fill-in-the-middle code-infill performance.
  • Immediate usability via standards-compatible APIs, facilitating widespread deployment without migration frictions.

These characteristics position Mercury Coder as the current Pareto frontier for speed-quality trade-off in practical code LLM applications (Labs et al., 17 Jun 2025).
