Magistral Medium: Advanced Reasoning with Reinforcement Learning
Last updated: June 15, 2025
Background and Significance
Magistral Medium is the first reasoning-focused model from Mistral, introduced alongside a reinforcement learning (RL) pipeline designed entirely in-house (Mistral-AI et al., 12 Jun 2025). Unlike prior LLMs that rely on teacher traces, distillation, or inherited infrastructure, Magistral Medium develops RL-trained reasoning skills from the ground up. The model is distinguished by its direct application of RL to advanced mathematical, coding, and chain-of-thought (CoT) tasks, establishing itself as both a technical advance and an open-science landmark.
Historically, RL approaches for LLMs have favored instruction following or distilled reasoning, often through transferred traces. Magistral Medium diverges from this practice, structuring its pipeline and reward mechanisms to directly target reasoning and multilingual CoT skills (Mistral-AI et al., 12 Jun 2025).
Foundational Approach: Scalable RL Pipeline and Group Relative Policy Optimization
Asynchronous RL Pipeline
The Magistral Medium RL pipeline orchestrates three principal asynchronous, distributed roles:
- Generators (Actors): Continuously generate completions in response to prompts.
- Trainers: Aggregate these completions, organize them into batches/minibatches, and perform optimization.
- Verifiers: Automatically assess completions for formatting, correctness (mathematical or code), and language consistency, assigning structured rewards (Mistral-AI et al., 12 Jun 2025).
Model weights are pushed to generators asynchronously via NCCL, so generators never block and training stays close to on-policy while maintaining maximal GPU throughput. Hyperparameters governing batch size, minibatch size, and the degree of asynchrony are tuned to avoid length bias and keep the batch-to-minibatch ratio stable for pipeline reliability (Mistral-AI et al., 12 Jun 2025). A minimal sketch of this generator/trainer split appears below.
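The following schematic sketch shows one way such a generator/trainer split can be organized. It is illustrative only: `policy`, `verify`, and `broadcast_weights` are hypothetical placeholders, not Mistral's actual implementation, and the real system runs these roles as distributed GPU workers rather than Python threads.

```python
# Schematic sketch of the asynchronous generator/trainer split (illustrative only;
# `policy`, `verify`, and `broadcast_weights` are hypothetical stand-ins).
import queue
import threading

completion_queue: queue.Queue = queue.Queue(maxsize=1024)  # completions awaiting training

def generator_loop(prompts, policy, verify, stop: threading.Event):
    """Actor: keeps sampling completions and pushing scored results to the trainer."""
    while not stop.is_set():
        prompt = next(prompts)
        completion = policy.generate(prompt)      # sample with the latest weights received
        reward = verify(prompt, completion)       # format / correctness / language reward
        completion_queue.put((prompt, completion, reward))

def trainer_loop(policy, broadcast_weights, batch_size: int, stop: threading.Event):
    """Trainer: aggregates completions into batches, optimizes, then pushes new weights."""
    while not stop.is_set():
        batch = [completion_queue.get() for _ in range(batch_size)]
        policy.optimize(batch)                    # e.g. a GRPO update on this minibatch
        broadcast_weights(policy)                 # asynchronous weight update (NCCL in Magistral)
```

Because weight broadcasts happen while generators keep sampling, some completions are produced with slightly stale weights; the hyperparameter tuning described above is what keeps this drift acceptable.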
Group Relative Policy Optimization (GRPO)
Magistral Medium employs a modified version of Group Relative Policy Optimization (GRPO), deviating from standard Proximal Policy Optimization (PPO) and KL-regularized algorithms. Key differences include:
- No KL penalty: The pipeline omits KL-reference modeling for efficiency, as sufficient divergence is encountered during training.
- Length normalization: Loss is normalized by sequence length to avoid bias against longer, multi-step completions.
- Advantage normalization: Group-level normalization reduces gradient noise.
- Flexible upper clipping: Allows reinforcement of rare, informative reasoning sequences.
- Diversity filtering: Only groups with at least one correct and one incorrect sequence are included, omitting non-informative batches (Mistral-AI et al., 12 Jun 2025).
This design enables efficient, robust RL training without external traces or distillation; a simplified sketch of the group-relative advantage and the diversity filter follows.
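The snippet below sketches only the standard group-relative advantage (reward normalized against its group's mean and standard deviation) and the diversity filter; the actual Magistral objective additionally applies length normalization and relaxed upper clipping as described above, and is not reproduced here.

```python
# Simplified group-relative advantage computation (standard GRPO form; not Mistral's code).
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: rewards of all completions sampled for one prompt (one 'group')."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def keep_group(rewards: np.ndarray) -> bool:
    """Diversity filter: keep only groups containing both correct and incorrect samples."""
    return rewards.max() != rewards.min()
```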
Reasoning Language, Reward Model, and Enforcement
Magistral Medium enforces a structured reasoning format in every response, underpinned by both system prompting and reward mechanisms:
- Formatting: All answers must begin with a `<think> ... </think>` block containing an informal, detailed chain-of-thought, followed by a concise summary with a boxed final answer or a code block (for programming).
- Verifiers enforce:
  - Formatting compliance: reward of 0 for incorrect format, +0.1 for correct format.
  - Mathematical or code correctness: +0.9 reward for correct answers, verified via symbolic math checks or code test suites.
  - Language consistency: +0.1 if all parts (prompt, reasoning, answer) use the same language, verified with a fastText-based classifier.
  - Length and completeness: penalties for cut-off or excessively brief solutions; reward shaping encourages full reasoning (Mistral-AI et al., 12 Jun 2025).
- Multilingual coverage: 10% of RL training employs data translated into French, German, Spanish, Italian, Chinese, and Russian. The reward system ensures the full CoT and final answer match the prompt’s language.
By forcing both structure and language consistency, the model delivers interpretable, multilingual CoT output without external translation or additional data processing.
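To make the reward structure concrete, the sketch below combines the components listed above into a single scoring function. The boolean inputs stand in for hypothetical verifier checks, and the truncation penalty shown is an assumption; the exact shaping used by Mistral may differ.

```python
# Illustrative composite reward over verifier outcomes (hypothetical shaping details).
def reasoning_reward(well_formatted: bool, correct: bool,
                     language_consistent: bool, truncated: bool) -> float:
    if not well_formatted:
        return 0.0                 # malformed responses earn nothing
    reward = 0.1                   # formatting compliance
    if correct:
        reward += 0.9              # verified math/code correctness
    if language_consistent:
        reward += 0.1              # prompt, CoT, and answer share one language
    if truncated:
        reward -= 0.1              # assumed penalty for cut-off solutions; exact value differs
    return reward
```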
Performance Metrics
Quantitative Benchmark Outcomes
Magistral Medium demonstrates significant improvements over Mistral Medium 3 on a range of reasoning tasks:
| Task | Mistral Medium 3 | Magistral Medium | Improvement |
|---|---|---|---|
| AIME’24 (pass@1) | 26.8 | 73.6 | +46.8 |
| AIME’24 (majority@64) | 43.4 | 90.0 | +46.6 |
| LiveCodeBench (v5) | 29.1 | 59.4 | +30.3 |
| GPQA | 59.6 | 70.8 | +11.2 |
| Instruction following | 86.8 | 87.4 | +0.6 |
| Function calling | 87.2 | 87.4 | ≈0 |
| Multimodal (MMMU, etc.) | — | SOTA/improved | up to +12% |
Source: Mistral-AI et al., 12 Jun 2025.
Magistral Medium nearly triples the pass@1 result on AIME’24 and roughly doubles accuracy on major coding benchmarks compared to its base checkpoint. Instruction-following and function-calling performance is maintained or slightly improved, with no degradation (Mistral-AI et al., 12 Jun 2025).
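For reference, pass@1 and majority@k are standard sampling metrics: pass@1 estimates the probability that a single sampled answer is correct, while majority@k checks whether the most frequent final answer across k samples matches the reference. A minimal sketch, assuming answers are comparable by equality:

```python
# Minimal sketch of pass@1 and majority@k scoring over sampled answers.
from collections import Counter

def pass_at_1(samples_per_problem, reference_answers):
    """Average per-problem probability that a single sampled answer is correct."""
    per_problem = [
        sum(s == ref for s in samples) / len(samples)
        for samples, ref in zip(samples_per_problem, reference_answers)
    ]
    return sum(per_problem) / len(per_problem)

def majority_at_k(samples_per_problem, reference_answers):
    """Majority vote over each problem's k samples, then compare to the reference."""
    hits = 0
    for samples, ref in zip(samples_per_problem, reference_answers):
        voted, _ = Counter(samples).most_common(1)[0]
        hits += (voted == ref)
    return hits / len(reference_answers)
```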
Multimodal Reasoning
Despite RL being performed solely on text data, Magistral Medium maintains or improves performance in multimodal domains (e.g., MMMU and MathVista), with gains of up to 12%. The model remains capable of complex image+text question answering with CoT reasoning (Mistral-AI et al., 12 Jun 2025).
Instruction Following and Function Calling
Evaluation shows that general instruction-following and internal function-calling abilities are unaffected or marginally enhanced by the RL procedure, with no notable performance decline (Mistral-AI et al., 12 Jun 2025).
Training Data and Magistral Small
Following RL training, solutions generated by Magistral Medium are extracted, filtered for diversity and quality, and repurposed as supervised fine-tuning (SFT) data for Magistral Small, a 24B-parameter model (Mistral-AI et al., 12 Jun 2025). Magistral Small is trained on these traces, supplemented by open-dataset prompts and a 10% fraction of instruction-tuning data to preserve general skills, and then undergoes its own RL finetuning. This approach leverages high-quality reasoning traces to reliably transfer capabilities to smaller models.
Results show that this SFT + RL procedure outperforms pure SFT or pure RL for models of similar size, demonstrating the value of RL-bootstrapped data (Mistral-AI et al., 12 Jun 2025).
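A rough sketch of turning RL-generated solutions into SFT data might look like the following. The criteria shown (verified correctness, a minimum length, and naive prompt de-duplication for diversity) are illustrative assumptions, not Mistral's exact filtering recipe.

```python
# Illustrative filtering of RL-generated traces into SFT examples (assumed criteria).
def build_sft_dataset(traces, is_correct, min_len: int = 200):
    """traces: iterable of (prompt, completion); is_correct: verifier callback."""
    seen_prompts = set()
    dataset = []
    for prompt, completion in traces:
        if not is_correct(prompt, completion):
            continue                       # keep only verified-correct solutions
        if len(completion) < min_len:
            continue                       # drop overly brief reasoning (assumed heuristic)
        if prompt in seen_prompts:
            continue                       # naive diversity / de-duplication filter
        seen_prompts.add(prompt)
        dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```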
Open-Source Contribution and Ecosystem Impact
Magistral Small is released under the Apache 2.0 license, offering unrestricted commercial, academic, and further developmental use. This open-source release:
- Facilitates research and product deployment without vendor lock-in.
- Provides a robust baseline for RL and reasoning-model experimentation.
- Contributes to transparency and reproducibility in RL-finetuned LLM research (Mistral-AI et al., 12 Jun 2025).
Limitations and Prospective Developments
While Magistral Medium represents a significant advance in reasoning RL from scratch, important considerations remain:
- RL training is performed exclusively on text, with multimodal generalization empirically observed on specific benchmarks, pending broader validation.
- Omitting external teacher traces may limit some forms of nuanced or creative reasoning present in teacher-enriched approaches.
- Pipeline stability depends on careful tuning of batch-related hyperparameters, reflecting the inherent complexity of scalable distributed RL for LLMs (Mistral-AI et al., 12 Jun 2025).
Speculative Note:
The finding that RL on text alone preserves or enhances multimodal capabilities hints at substantial cross-modal transfer within LLMs. This could motivate RL training pipelines that generalize across modalities or support the development of “universal” reward models. Broader empirical confirmation is required (Mistral-AI et al., 12 Jun 2025).
Conclusion
Magistral Medium demonstrates that RL pipelines built from scratch can endow LLMs with advanced multilingual and structured reasoning capabilities, rivaling models distilled from existing experts. Through a combination of a custom RL architecture, group-normalized objectives, enforced reasoning language, and carefully designed rewards, Magistral Medium and its open-source sibling Magistral Small contribute valuable resources for the advancement and reproducibility of reasoning-optimized LLMs (Mistral-AI et al., 12 Jun 2025).