Magistral Medium: Advanced Reasoning with Reinforcement Learning

Last updated: June 15, 2025

Background and Significance

Magistral Medium is the first reasoning-focused model from Mistral, introduced alongside a proprietary reinforcement learning (RL) pipeline designed entirely in-house (Mistral-AI et al., 12 Jun 2025). Unlike prior LLMs that rely on teacher traces, distillation, or inherited infrastructure, Magistral Medium explores RL-trained reasoning skills from the ground up. The model is distinct in its direct application of RL for advanced mathematical, coding, and chain-of-thought (CoT) tasks, establishing itself as both a technical advance and an open-science landmark.

Historically, RL approaches for LLMs have favored instruction-following or distilled reasoning, often through transferred traces. Magistral Medium diverges from this practice, structuring its pipeline and reward mechanisms to directly target reasoning and multilingual CoT skills (Mistral-AI et al., 12 Jun 2025).

Foundational Approach: Scalable RL Pipeline and Group Relative Policy Optimization

Asynchronous RL Pipeline

The Magistral Medium RL pipeline orchestrates three principal asynchronous, distributed roles:

  • Generators (Actors): Continuously generate completions in response to prompts.
  • Trainers: Aggregate these completions, organize them into batches/minibatches, and perform optimization.
  • Verifiers: Automatically assess completions for formatting, correctness (mathematical or code), and language consistency, assigning structured rewards (Mistral-AI et al., 12 Jun 2025).

Model weights are updated asynchronously via NCCL, so generators never block on training and generation stays close to on-policy while GPU throughput remains maximal. Hyperparameters governing batch size, minibatch size, and the degree of asynchrony are tuned to avoid length bias, maintaining the ratio $n_\text{async}/n_\text{batch} \leq 2$ for pipeline stability (Mistral-AI et al., 12 Jun 2025).
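
As a rough illustration of how these three roles interact, the following minimal, single-process sketch wires generators, verifiers, and trainers together with queues. All names, the toy reward, and the batch handling are illustrative assumptions, not Mistral's actual implementation, which shards these roles across GPU workers and broadcasts updated weights over NCCL.

```python
# Minimal single-process sketch of the generator -> verifier -> trainer loop.
# The real pipeline distributes these roles across GPU workers and syncs weights
# asynchronously over NCCL; here queues and threads stand in for that plumbing.
import queue
import random
import threading
import time

prompt_queue = queue.Queue()
completion_queue = queue.Queue()   # completions awaiting verification
reward_queue = queue.Queue()       # (prompt, completion, reward) tuples for the trainer

def generator(stop: threading.Event) -> None:
    """Continuously sample completions for incoming prompts; never blocks on training."""
    while not stop.is_set():
        try:
            prompt = prompt_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        completion = f"{prompt} -> answer_{random.randint(0, 9)}"  # stand-in for model sampling
        completion_queue.put((prompt, completion))

def verifier(stop: threading.Event) -> None:
    """Score completions (format, correctness, language) and emit structured rewards."""
    while not stop.is_set():
        try:
            prompt, completion = completion_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        reward = 1.0 if completion.endswith("7") else 0.0  # toy correctness check
        reward_queue.put((prompt, completion, reward))

def trainer(stop: threading.Event, batch_size: int = 8) -> None:
    """Aggregate rewarded completions into batches and perform a (mock) policy update."""
    batch = []
    while not stop.is_set():
        try:
            batch.append(reward_queue.get(timeout=0.1))
        except queue.Empty:
            continue
        if len(batch) >= batch_size:
            mean_reward = sum(r for _, _, r in batch) / len(batch)
            print(f"policy update on {len(batch)} completions, mean reward {mean_reward:.2f}")
            batch.clear()  # in the real system, new weights are broadcast to generators here

if __name__ == "__main__":
    stop = threading.Event()
    threads = [threading.Thread(target=fn, args=(stop,)) for fn in (generator, verifier, trainer)]
    for t in threads:
        t.start()
    for i in range(64):
        prompt_queue.put(f"problem_{i}")
    time.sleep(2)   # let the asynchronous loop drain the prompts
    stop.set()      # in practice the loop runs indefinitely
    for t in threads:
        t.join()
```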

Group Relative Policy Optimization (GRPO)

Magistral Medium employs a version of Group Relative Policy Optimization (GRPO), deviating from standard Proximal Policy Optimization (PPO) and KL-regularized algorithms:

$$
\mathcal{J}_\mathrm{GRPO}(\theta) = \mathbb{E}_{q,\{o_i\}}\left[\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}^{\text{norm}}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_\text{low},\,1+\varepsilon_\text{high}\big)\,\hat{A}^{\text{norm}}_{i,t}\Big)\right],
\qquad
r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t}\mid q,o_{i,<t})}
$$

Key differences include:

  • No KL penalty: The pipeline omits KL-reference modeling for efficiency, as sufficient divergence is encountered during training.
  • Length normalization: Loss is normalized by sequence length to avoid bias against longer, multi-step completions.
  • Advantage normalization: Group-level normalization $\hat{A}_{i,t}^\text{norm}$ reduces gradient noise.
  • Flexible upper clipping: Allows reinforcement of rare, informative reasoning sequences.
  • Diversity filtering: Only groups containing at least one correct and one incorrect sequence are included, omitting non-informative batches (Mistral-AI et al., 12 Jun 2025).

This design enables efficient, robust RL training without external traces or distillation.
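
The PyTorch sketch below shows one plausible reading of this objective: rewards are normalized within each group to form advantages, the per-token importance ratio is clipped with a looser upper bound, the loss is normalized by total token count, and groups whose completions all share the same reward are discarded. The function signature and hyperparameter values (e.g., eps_low, eps_high) are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of the group-relative objective described above:
# group-normalized advantages, token-level ratio clipping with a looser upper bound
# ("clip-higher"), loss normalized by total sequence length, and groups in which all
# completions share the same reward filtered out. Hyperparameter values are placeholders.
import torch

def grpo_loss(logp_new, logp_old, rewards, eps_low=0.2, eps_high=0.3):
    """
    logp_new, logp_old: lists of 1-D tensors, per-token log-probs of each completion
                        under the current and the sampling policy.
    rewards: 1-D tensor, one scalar reward per completion in the group.
    """
    # Diversity filtering: a group with identical rewards carries no learning signal.
    if rewards.max() == rewards.min():
        return None

    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    total_tokens = sum(lp.numel() for lp in logp_new)
    loss = 0.0
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        ratio = torch.exp(lp_new - lp_old)                      # per-token importance ratio
        unclipped = ratio * a
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * a
        loss = loss - torch.min(unclipped, clipped).sum()       # negate to maximize the objective
    return loss / total_tokens                                  # length normalization

# Toy usage: a group of 3 completions of different lengths.
logp_old = [torch.randn(5), torch.randn(9), torch.randn(7)]
logp_new = [lp + 0.05 * torch.randn_like(lp) for lp in logp_old]
rewards = torch.tensor([1.0, 0.0, 1.0])
print(grpo_loss(logp_new, logp_old, rewards))
```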

Reasoning Language, Reward Model, and Enforcement

Magistral Medium employs a strictly enforced, structured reasoning format in every response, underpinned by both system prompting and reward mechanisms:

  • Formatting: All answers must begin with a <think> ... </think> block containing an informal, detailed chain-of-thought, followed by a concise summary with a boxed answer or a code block (for programming).
  • Verifiers enforce:
    • Formatting compliance: Rewards 0 for incorrect, +0.1 for correct format.
    • Mathematical or code correctness: +0.9 reward for correct answers via symbolic math checks or code test suites.
    • Language consistency: +0.1 if all parts (prompt, reasoning, answer) match, verified with a fastText-based classifier.
    • Length and completeness: Penalties for cut-off or excessively brief solutions; shaping to encourage full reasoning (Mistral-AI et al., 12 Jun 2025).
  • Multilingual coverage: 10% of RL training employs data translated into French, German, Spanish, Italian, Chinese, and Russian. The reward system ensures the full CoT and final answer match the prompt’s language.

By forcing both structure and language consistency, the model delivers interpretable, multilingual CoT output without external translation or additional data processing.
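
A hedged sketch of how such a composite reward could be computed is shown below. The regex, the answer checker, and detect_language are placeholders for the paper's actual verifiers (symbolic math checking or code test suites, and a fastText-based language classifier); only the reward magnitudes (+0.1 format, +0.9 correctness, +0.1 language) follow the description above.

```python
# Hedged sketch of the composite reward described above. The format regex, the
# answer checker, and detect_language are placeholders: the paper uses symbolic
# math verification / code test suites and a fastText-based language classifier.
import re

THINK_BLOCK = re.compile(r"^\s*<think>.*?</think>", re.DOTALL)

def detect_language(text: str) -> str:
    """Placeholder for a fastText-style language identifier."""
    return "en"

def answer_is_correct(response: str, reference: str) -> bool:
    """Placeholder for symbolic math checking or running a code test suite."""
    return reference in response

def reward(prompt: str, response: str, reference: str) -> float:
    r = 0.0
    if THINK_BLOCK.match(response):            # formatting: reasoning block present
        r += 0.1
    else:
        return 0.0                             # malformed responses earn no reward
    if answer_is_correct(response, reference): # correctness of the final answer
        r += 0.9
    if detect_language(response) == detect_language(prompt):
        r += 0.1                               # language consistency across prompt/CoT/answer
    return r

print(reward("Compute 2+2.", "<think>2+2=4</think>\nThe answer is \\boxed{4}.", "4"))
```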

Performance Metrics

Quantitative Benchmark Outcomes

Magistral Medium demonstrates significant improvements over Mistral Medium 3 on a range of reasoning tasks:

| Task | Mistral Medium 3 | Magistral Medium | Improvement |
|------|------------------|------------------|-------------|
| AIME’24 (pass@1) | 26.8 | 73.6 | +46.8 |
| AIME’24 (majority@64) | 43.4 | 90.0 | +46.6 |
| LiveCodeBench (v5) | 29.1 | 59.4 | +30.3 |
| GPQA | 59.6 | 70.8 | +11.2 |
| Instruction following | 86.8 | 87.4 | +0.6 |
| Function calling | 87.2 | 87.4 | ≈0 |
| Multimodal (MMMU, etc.) | | SOTA/improved | up to +12% |

Source: (Mistral-AI et al., 12 Jun 2025)

Magistral Medium nearly triples the pass@1 score on AIME’24 and roughly doubles its LiveCodeBench accuracy compared to its base checkpoint. Instruction-following and function-calling performance is maintained or slightly improved, with no degradation (Mistral-AI et al., 12 Jun 2025).

Multimodal Reasoning

Despite RL being performed solely on text data, Magistral Medium maintains or improves performance on multimodal benchmarks (e.g., MMMU and MathVista), with gains of up to 12%. The model remains capable of complex image+text question answering with CoT reasoning (Mistral-AI et al., 12 Jun 2025).

Instruction Following and Function Calling

Evaluations show that general instruction-following and internal function-calling abilities are preserved or marginally enhanced by the RL procedure, with no notable performance decline (Mistral-AI et al., 12 Jun 2025).

Training Data and Magistral Small

Following RL training, solutions generated by Magistral Medium are extracted, filtered for diversity and quality, and repurposed as supervised fine-tuning (SFT) data for Magistral Small, a 24B-parameter model (Mistral-AI et al., 12 Jun 2025). Magistral Small is trained on these traces, supplemented by open-dataset prompts and a 10% fraction of instruction-tuning data to preserve general skills, and then undergoes RL finetuning. This approach leverages high-quality reasoning traces to reliably transfer capabilities to smaller models.

Results show that this SFT + RL procedure outperforms pure SFT or RL alone for models of similar size, demonstrating the value of RL-bootstrapped data (Mistral-AI et al., 12 Jun 2025).
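
To make the recipe concrete, here is a rough sketch of how RL-generated traces might be filtered into an SFT mixture for a smaller model. The record fields, the per-prompt cap, and the helper name build_sft_mixture are illustrative assumptions; only the ideas of keeping verified-correct solutions, filtering for diversity, and mixing in roughly 10% instruction-tuning data come from the description above.

```python
# Illustrative sketch of filtering RL-generated traces into SFT data for a smaller model:
# keep verified-correct solutions, cap duplicates per prompt for diversity, and blend in
# a ~10% fraction of general instruction data. Field names and thresholds are placeholders.
import random
from collections import defaultdict

def build_sft_mixture(rl_traces, instruction_data, max_per_prompt=4, instruct_frac=0.10):
    # Keep only traces whose final answer was verified as correct.
    correct = [t for t in rl_traces if t["reward"] >= 1.0]

    # Diversity filter: cap the number of retained solutions per prompt.
    per_prompt = defaultdict(list)
    for t in correct:
        if len(per_prompt[t["prompt"]]) < max_per_prompt:
            per_prompt[t["prompt"]].append(t)
    reasoning_sft = [t for group in per_prompt.values() for t in group]

    # Blend in ~10% general instruction-tuning data to preserve non-reasoning skills.
    n_instruct = int(instruct_frac * len(reasoning_sft) / (1 - instruct_frac))
    mixture = reasoning_sft + random.sample(instruction_data, min(n_instruct, len(instruction_data)))
    random.shuffle(mixture)
    return mixture

# Toy usage with clearly synthetic records (illustrative only).
traces = [{"prompt": f"p{i % 8}", "completion": f"c{i}", "reward": random.choice([0.0, 1.1])} for i in range(100)]
instruct = [{"prompt": f"inst{i}", "completion": f"a{i}", "reward": None} for i in range(50)]
print(len(build_sft_mixture(traces, instruct)))
```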

Open-Source Contribution and Ecosystem Impact

Magistral Small is released under the Apache 2.0 license, permitting unrestricted commercial, academic, and further developmental use. This open-source release makes an openly licensed, reasoning-optimized model available to the community for reproduction, fine-tuning, and downstream research.

Limitations and Prospective Developments

While Magistral Medium represents a significant advance in reasoning RL from scratch, important considerations remain:

  • RL training is performed exclusively on text, with multimodal generalization empirically observed on specific benchmarks, pending broader validation.
  • Omitting external teacher traces may limit some forms of nuanced or creative reasoning present in teacher-enriched approaches.
  • Pipeline stability depends on careful tuning of batch-related hyperparameters, reflecting the inherent complexity of scalable distributed RL for LLMs (Mistral-AI et al., 12 Jun 2025).

Speculative Note:

The finding that RL on text alone preserves or enhances multimodal capabilities hints at substantial cross-modal transfer within LLMs. This could motivate RL training pipelines that generalize across modalities or support the development of “universal” reward models. Broader empirical confirmation is required (Mistral-AI et al., 12 Jun 2025).

Conclusion

Magistral Medium demonstrates that RL pipelines built from scratch can endow LLMs with advanced multilingual and structured reasoning capabilities, rivaling models distilled from existing experts. Through a combination of a custom RL architecture, group-normalized objectives, enforced reasoning language, and carefully designed rewards, Magistral Medium and its open-source sibling Magistral Small contribute valuable resources for the advancement and reproducibility of reasoning-optimized LLMs (Mistral-AI et al., 12 Jun 2025).