Magistral Medium: Advanced Reinforcement Learning for LLM Reasoning

Updated 24 June 2025

Magistral Medium is an LLM for advanced reasoning, introduced as part of the Magistral project by Mistral. It is constructed via a scalable, asynchronous distributed reinforcement learning (RL) pipeline, departing from earlier methods that rely on supervised fine-tuning (SFT) or distillation from preexisting reasoning traces. Magistral Medium is trained entirely on custom infrastructure, with no dependence on proprietary RL traces or reasoners, and leverages a suite of architectural, algorithmic, and reward design choices that set new baselines for RL-driven LLMs on mathematical, coding, instruction-following, and multilingual tasks.

1. Scalable RL Training Pipeline

Magistral Medium is trained from the Mistral Medium 3 base checkpoint using a pure RL objective, Group Relative Policy Optimization (GRPO), with no SFT on reasoning traces. The underlying infrastructure is an asynchronous, distributed RL system optimized for large GPU clusters, organized around three classes of worker processes:

  • Generators: Continuously generate completions with the most recent weights.
  • Verifiers: Score completions for correctness, formatting, and language consistency.
  • Trainers: Aggregate batches of verified outputs to compute gradients and update model weights, which are then broadcast to generators without pausing ongoing generations.

The asynchronous update mechanism leverages GPU-to-GPU transfers (via NCCL), allowing in-flight generation to proceed using slightly outdated key/value caches. This maximizes hardware utilization and throughput, facilitating stable and scalable RL from scratch.
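
To make the division of labor concrete, here is a minimal, single-machine sketch of the generator/verifier/trainer loop. The `Model` stub, its method names, and the queue wiring are hypothetical illustrations; the actual system spans GPU clusters, broadcasts weights via NCCL, and lets in-flight generations continue on slightly stale KV caches rather than reloading weights per prompt.

```python
# Toy sketch of the asynchronous generator / verifier / trainer loop.
# `model`, its methods, and the queue layout are hypothetical stand-ins.
import queue

prompt_q = queue.Queue()       # prompts waiting to be rolled out
completion_q = queue.Queue()   # (prompt, completion) pairs awaiting scoring
scored_q = queue.Queue()       # verified samples ready for the trainer

latest_weights = {"version": 0, "state": None}   # broadcast checkpoint stub

def generator_worker(model):
    """Continuously sample completions with the most recent weights."""
    while True:
        prompt = prompt_q.get()
        model.load_weights(latest_weights["state"])   # toy: reload per prompt
        completion_q.put((prompt, model.generate(prompt)))

def verifier_worker(verify_fn):
    """Score completions for correctness, formatting, and language."""
    while True:
        prompt, completion = completion_q.get()
        scored_q.put((prompt, completion, verify_fn(prompt, completion)))

def trainer_loop(model, batch_size=64):
    """Aggregate verified samples, update weights, and publish them."""
    while True:
        batch = [scored_q.get() for _ in range(batch_size)]
        model.update(batch)                           # GRPO gradient step
        latest_weights["state"] = model.export_weights()
        latest_weights["version"] += 1                # generators pick this up
```

In the real deployment each worker class runs as a separate pool of processes on dedicated hardware, so generation, verification, and training proceed concurrently rather than as threads sharing one queue in memory.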

GRPO Algorithm

The training objective follows the GRPO framework. For a prompt $q$ and policy $\pi_\theta$, the loss function is:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)} \Bigg[ \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Bigg( \min\Bigg[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \mathrm{clip}\Bigg( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}, 1-\varepsilon, 1+\varepsilon \Bigg) \hat{A}_{i,t} \Bigg] - \beta\, D_{\mathrm{KL}}\big[ \pi_\theta(\cdot \mid q) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid q) \big] \Bigg) \Bigg]
$$

Key choices and modifications include:

  • No KL Penalty: The KL-divergence term is removed ($\beta = 0$), enabling the model to diverge significantly from the initial checkpoint and reducing computation and communication overhead.
  • Advantage Normalization: Advantages $\hat{A}_{i,t}$ are normalized first within prompt groups (using $r_i - \mu$, where $\mu$ is the group's mean reward) and then over minibatches, facilitating stable learning.
  • Length Normalization: Loss is normalized over output token count, mitigating length bias.
  • Aggressive Exploration: The upper PPO clipping bound $\varepsilon_{\text{high}}$ is set to 0.26–0.28, promoting probability increases for rare but valuable reasoning steps.
  • Group Diversity Filtering: Prompt groups with homogeneous correctness (all completions correct or all wrong) are filtered out to ensure learning signal.
  • Asynchronous Data Handling: Minibatches aggregate completed sequences across trainers, while ongoing generations are not interrupted by parameter updates.

This pipeline departs from earlier, largely synchronous RLHF implementations and is oriented toward sustained exploration and high system throughput; a simplified sketch of the modified objective follows.
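
The sketch below implements the modifications listed above for a single prompt group: group-relative advantages ($r_i - \mu$), asymmetric clipping with a raised upper bound, no KL term, per-sequence length normalization as in the formula above, and skipping of homogeneous groups. Tensor shapes, default clipping values, and the omission of the second minibatch-level normalization are simplifications, not the exact production settings.

```python
import torch

def grpo_loss(logprobs, old_logprobs, rewards, mask,
              eps_low=0.2, eps_high=0.27):
    """Modified GRPO loss for one group of G completions of one prompt.

    logprobs, old_logprobs: [G, T] per-token log-probs under the current
        and behaviour policies; mask: [G, T] float, 1 for generated tokens.
    rewards: [G] scalar reward per completion.
    The KL penalty is dropped (beta = 0); the additional minibatch-level
    advantage normalization is omitted here for brevity.
    """
    # Group-relative advantage: r_i minus the group's mean reward.
    adv = rewards - rewards.mean()
    if adv.abs().max() < 1e-8:
        return logprobs.new_zeros(())       # homogeneous group: no signal
    adv = adv[:, None]                      # broadcast over token positions

    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    per_token = torch.minimum(ratio * adv, clipped * adv)

    # Length normalization: average over each completion's own token count.
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return -per_seq.mean()                  # negate to maximize the objective
```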

2. Structured Reasoning and Reward Shaping

Magistral Medium internalizes a structured reasoning "language" enforced both at training and inference:

  • Format Enforcement: Model outputs must open with a <think> tag and close with a </think> tag, encapsulating the chain-of-thought (CoT). For mathematics, the final answer is presented outside the tags, wrapped in a LaTeX \boxed{} command; for code, in a markdown code block. Any formatting deviation receives zero reward.
  • Language Alignment: Both the internal reasoning and the answer must be in the user's language, mediated by reward assignments and SFT data augmentation (10% of SFT traces are translated into non-English languages).
  • Reward Shaping Dimensions:
    • Formatting: Correct tags and answer format.
    • Correctness: For mathematics, answers are parsed and symbolically checked with sympy; for code, completion is benchmarked via test-case execution.
    • Length Penalty: Output approaching the maximal token length incurs a soft penalty, balancing thoroughness with conciseness.
    • Language Consistency: Assessed with a fastText classifier; rewards favor outputs whose structure, chain-of-thought, and answer are consistently in the prompt's language.

The system prompt further instructs the model to "draft your thinking process (inner monologue)" and then give a summary answer "in the same language as the task," encouraging higher-entropy exploration and detailed exposition of the reasoning process. A simplified reward sketch follows.
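
The sketch below composes the reward dimensions described above for a math prompt, with sympy used for the symbolic equivalence check as the text describes. The tag handling, the reward constants, the character-based (rather than token-based) length penalty, and the `lang_ok` flag are illustrative assumptions, not the exact production reward.

```python
import re
from sympy import simplify, sympify

THINK_RE = re.compile(r"^<think>(.*?)</think>(.*)$", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^}]*)\}")

def math_reward(output, reference, lang_ok, max_len=40000, soft_margin=4000):
    """Compose format, correctness, length, and language rewards.

    lang_ok: result of a language-consistency check (e.g. a fastText
    classifier applied to the CoT and answer). Constants are illustrative,
    and length is measured in characters here instead of tokens.
    """
    m = THINK_RE.match(output.strip())
    if m is None:
        return 0.0                      # formatting violation: zero reward
    boxed = BOXED_RE.search(m.group(2))
    if boxed is None:
        return 0.0                      # final answer must be \boxed{...}

    reward = 0.1                        # small credit for correct format
    try:
        if simplify(sympify(boxed.group(1)) - sympify(reference)) == 0:
            reward += 0.9               # symbolically equivalent answer
    except Exception:
        pass                            # unparsable answers earn no credit

    # Soft penalty as the output approaches the maximum allowed length.
    overflow = len(output) - (max_len - soft_margin)
    if overflow > 0:
        reward -= 0.1 * min(1.0, overflow / soft_margin)

    if lang_ok:
        reward += 0.1                   # bonus for language consistency
    return reward
```

For code prompts, the correctness term would instead come from executing the generated program against the associated test cases, with the other reward dimensions unchanged.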

3. Empirical Performance

Magistral Medium demonstrates substantial improvements over its initialization checkpoint across a range of benchmarks:

| Task | Mistral Medium 3 (base) | Magistral Medium (RL only) | Improvement |
|---|---|---|---|
| AIME'24 (pass@1 / maj@64) | 26.8 / 43.4 | 73.6 / 90.0 | +46.8 / +46.6 |
| LiveCodeBench (v5) | 29.1 | 59.4 | +30.3 |
| MATH-500 | 91.0 | 94.3 | +3.3 |
| GPQA | 59.6 | 70.8 | +11.2 |
| Instruction Following | 86.8 | 87.4 | +0.6 |
| Function Calling | 87.2 | 87.4 | +0.2 |
| Multimodal (MMMU-Pro, etc.) | (base) | — | +5 to +12 points |

  • Mathematical Reasoning: 73.6% pass@1 and 90.0% maj@64 on AIME'24, a gain of roughly 47 percentage points over the base model.
  • Code Generation: Performance on LiveCodeBench (v5) roughly doubles, and Aider Polyglot improves from 28.9% to 47.1%.
  • Multimodal and STEM QA: Gains of +5–12 points on MathVista, MMMU, and MMMU-Pro, despite RL training on text only.
  • General Knowledge: GPQA sees an 11-point gain; Humanity’s Last Exam doubles from 4.4% to 9.0%.
  • Multilingual Reasoning: Degradation is limited to 4–10% compared to English, similar to the starting point.

These results indicate that pure RL, when applied to text data with a robust reward structure, can maintain or enhance base model capabilities across reasoning, multimodal, and function-calling tasks.

4. Cold-Start Data and Knowledge Transfer

After training Magistral Medium, its verified, high-quality reasoning completions are aggregated as "cold-start data" (internally generated reasoning traces). These traces are then used for SFT on Mistral Small 3, producing Magistral Small. Supplementary sources, including OpenThoughts and OpenR1 code data, add prompt and reasoning diversity. Whereas prior work suggests that RL alone is generally ineffective for smaller models, this dual approach of SFT on cold-start traces followed by RL successfully endows Magistral Small with robust reasoning capabilities.
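
A brief sketch of how verified completions from the RL-trained model might be filtered into a cold-start SFT set is shown below. The field names, length bounds, and chat-message layout are hypothetical; the source describes the filtering criteria only at a high level.

```python
def build_cold_start_set(samples, min_len=500, max_len=30000):
    """Keep only verified, well-formed reasoning traces for SFT.

    samples: iterable of dicts with hypothetical fields
      {"prompt", "completion", "correct", "lang_ok"}.
    """
    kept = []
    for s in samples:
        if not (s["correct"] and s["lang_ok"]):
            continue                     # only verified, consistent traces
        if not (min_len <= len(s["completion"]) <= max_len):
            continue                     # drop degenerate or truncated CoTs
        kept.append({"messages": [
            {"role": "user", "content": s["prompt"]},
            {"role": "assistant", "content": s["completion"]},
        ]})
    return kept
```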

5. Open-Source Release and Ecosystem Impact

Magistral Small (24B parameters), constructed from Magistral Medium’s cold-start traces and further RL, is open-sourced under the Apache 2.0 license. This release has several implications:

  • Transparency and Reproducibility: Enables broad evaluation and replication, as the model’s traces and RL pipeline are not tainted by proprietary outputs.
  • Research Baseline: Serves as a public, high-performance baseline for reasoning-centric LLMs, suitable for benchmarking various training paradigms.
  • Ecosystem Uplift: Provides the first fully open, state-of-the-art reasoning model with open data provenance, addressing previous limitations of commercial models that only offered indirect or closed-source reasoning traces.

This opens avenues for further innovation in reward design for RL with verifiable rewards (RLVR), infrastructure optimization, and investigations into hybrid SFT–RL methods.

6. Evaluation, Technical Practices, and System Design

The technical rigor of Magistral Medium spans several key practices:

  • Reward Assignment: Grounded in verifiable correctness (math via symbolic checks, code via execution), augmenting reliability of learning signals.
  • Batching and Asynchronous Handling: Efficient, asynchronous updates enable trainers to operate at full capacity while generators and verifiers handle the evaluation pipeline in parallel.
    • Data Filtration: Only challenging, verifiable math and code problems are selected, with difficulty scored using RL-trained models to ensure dataset quality (see the sketch below).
  • Multilingualism: Language consistency rewards and SFT data augmentation ensure broad applicability.

Together, these practices underpin the model's robustness in both benchmarking and deployment.
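
The following is a sketch of pass-rate-based difficulty filtering of the kind referenced in the Data Filtration item: problems are kept only if they are neither trivially easy nor effectively unsolvable for a grading model. The sampling budget, thresholds, and `model.solve` interface are assumptions for illustration.

```python
def filter_by_difficulty(problems, model, n_samples=16,
                         min_pass=0.05, max_pass=0.75):
    """Keep problems that retain a usable learning signal.

    model.solve(prompt) -> bool is a hypothetical check that samples a
    completion and grades it (as in the reward pipeline).
    """
    kept = []
    for p in problems:
        passes = sum(model.solve(p["prompt"]) for _ in range(n_samples))
        rate = passes / n_samples
        if min_pass <= rate <= max_pass:
            kept.append(p)               # neither trivial nor unsolvable
    return kept
```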

Conclusion

Magistral Medium is a milestone in the development of reasoning-specialized LLMs, demonstrating that large-scale RL, run on asynchronous, high-throughput infrastructure with verifiable reward signals, can produce models that outperform SFT- or distillation-based LLMs on complex reasoning and coding tasks. The release of Magistral Small under a permissive license establishes a new reference point for open LLM research and application, supporting transparent, community-driven progress in the field.