Create a Video View Paper

Mellum 2: A 12B MoE Model That Codes Like a Frontier System on a Single GPU

This presentation explores Mellum 2, a 12-billion-parameter Mixture-of-Experts language model purpose-built for software engineering. Despite activating only 2.5 billion parameters per token, it matches or surpasses much larger dense models on coding, reasoning, and tool-use benchmarks while running efficiently on commodity hardware. We examine the architectural decisions, training strategies, and post-training techniques that enable this breakthrough in practical AI deployment for developers.

Script

Most frontier coding models demand clusters of expensive accelerators, but what if you could get comparable performance from just 2.5 billion active parameters on a single card? Mellum 2 is a 12-billion-parameter Mixture-of-Experts model that activates only 8 of its 64 experts per token, bridging the gap between small dense models that plateau on hard reasoning and massive systems too costly to deploy.

The architecture makes every choice count. Three-quarters of attention layers use a sliding window of 1024 tokens to slash latency on long inputs, while the remaining global layers preserve long-range reasoning. Grouped-Query Attention with just 4 key-value heads keeps the cache small under high concurrency, and a Multi-Token Prediction head trained jointly as an auxiliary task boosts HumanEval by 10 points and GSM8K by 3 without adding inference cost.

Training on 10.65 trillion tokens follows a three-phase curriculum that shifts from broad web data toward curated code and math. The first 58% of training establishes foundations with mostly web text. The middle phase upsamples code to 42% and injects high-quality instruction data. The final 16% is dominated by synthetic and expert-verified datasets, with code samples accounting for nearly 60%, sharpening the model's programming and reasoning capabilities.

Efficiency is where the architecture pays off. Mellum 2 delivers the same single-token latency as Qwen 2.5 7B at 192 tokens per second, but under high concurrency it's 21% faster, and compared to Qwen 3 8B it's 79% faster. The sparsity, grouped-query attention, and sliding windows combine to let a 12-billion-parameter model outpace denser competitors in real-world serving scenarios.

Post-training with supervised fine-tuning and reinforcement learning from verifiable rewards pushes Mellum 2 to the top of its class. On EvalPlus function-level coding it scores 78.4, on LiveCodeBench competitive programming the thinking variant hits 75.1, and on tool use it leads at 45.6. It also achieves 58.4 on the AIME math benchmark and ranks as the safest in its category with only 8.4% harmful responses on HarmBench.

Mellum 2 proves that frontier-level coding and reasoning don't require data-center budgets. With all weights open and a reproducible recipe, developers can deploy a 12-billion-parameter assistant on a single H100, getting the quality of much larger systems at a fraction of the cost. Learn more about Mellum 2 and create your own deep-dive videos at EmergentMind.com.