
Instruction-Tuned Mistral 24B

Updated 14 November 2025
  • The paper demonstrates that Instruction-Tuned Mistral 24B leverages supervised fine-tuning on human instruction–response pairs to excel in zero-shot biomedical text simplification, achieving SARI scores around 42.4 and BERTScore of 0.91.
  • Instruction-Tuned Mistral 24B is a 24-billion-parameter, decoder-only Transformer that uses both strict (T=0.2) and flexi (T=0.4) temperature settings to balance deterministic output and creative variability.
  • Its tempered simplification strategy conservatively swaps jargon with common synonyms, maintains high vocabulary retention, and reduces the difficult-word rate, establishing it as a robust baseline for biomedical text simplification.

Instruction-tuned Mistral 24B is a 24-billion-parameter, decoder-only Transformer LLM refined via supervised fine-tuning on human-written instruction–response pairs and oriented primarily toward effective zero-shot performance on complex instruction-following tasks. In biomedical text simplification, Mistral 24B exhibits a “tempered” lexical simplification strategy, delivering high readability gains with near-human discourse fidelity. Quantitative analyses place Mistral 24B in the upper echelon of current LLMs for zero-shot simplification of health-domain text and identify architectural and operational properties linked to its performance profile.

1. Model Architecture and Instruction-Tuning Procedure

Mistral 24B adheres to a decoder-only, autoregressive Transformer design, comprising approximately 24 billion parameters and utilizing Transformer-XL/Rotary-Embedding components with deep multi-head self-attention and feed-forward layers. Pretraining is conducted on a large, multi-domain text corpus via the canonical next-token (cross-entropy) prediction objective.

Instruction tuning is performed post hoc: Mistral 24B is fine-tuned with supervised learning on large-scale, human-constructed instruction–response datasets, including content sourced from forums, tutorials, and Q&A data. The model is optimized to condition responses on explicit, task-oriented prompts—“instructions”—fostering generalized instruction-following in downstream zero-shot scenarios. Crucially, the publicly released Mistral 24B model introduces no architectural modifications relative to the core Mistral blueprint; only the training corpus and objective shift with instruction tuning.
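
To make the procedure concrete, a minimal sketch of the supervised instruction-tuning objective is shown below, assuming a Hugging Face-style causal LM and pre-tokenized instruction–response pairs; the loss masking follows common SFT practice rather than any configuration reported for Mistral 24B specifically.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # convention: positions with this label are excluded from the loss

def sft_loss(model, instruction_ids: torch.Tensor, response_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over the response only; the instruction
    (prompt) tokens are conditioned on but not trained on."""
    input_ids = torch.cat([instruction_ids, response_ids], dim=-1)
    labels = input_ids.clone()
    labels[..., : instruction_ids.size(-1)] = IGNORE_INDEX  # mask the prompt span

    logits = model(input_ids).logits                 # (batch, seq_len, vocab)
    shift_logits = logits[..., :-1, :].contiguous()  # predict token t+1 from prefix <= t
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```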

Inference parameters for biomedical text simplification are tuned across two temperature regimes: a “strict” regime (T = 0.2) for near-deterministic outputs and a “flexi” regime (T = 0.4) for heightened variability. All other sampling parameters, including top-k, top-p, and repetition penalties, remain fixed.
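
A minimal sketch of the two decoding regimes, using the Hugging Face GenerationConfig API; the specific top-k, top-p, and repetition-penalty values are illustrative assumptions, since only the temperatures are reported:

```python
from transformers import GenerationConfig

# Shared sampling parameters (values are assumed; only temperature varies).
COMMON = dict(do_sample=True, top_k=50, top_p=0.95, repetition_penalty=1.0)

strict_cfg = GenerationConfig(temperature=0.2, **COMMON)  # "strict": near-deterministic
flexi_cfg = GenerationConfig(temperature=0.4, **COMMON)   # "flexi": more variability
```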

2. Lexical Simplification Operational Strategy

When deployed for biomedical abstract simplification as a zero-shot prompt task, Mistral 24B exhibits a conservative lexical operation profile:

  • Primary tactic—jargon/parlance swap: Specialized scientific or clinical terms are mapped to common-language synonyms prioritizing accessibility while preserving meaning.
  • Selective omission: Decorative or granular detail may be excised (“omit unnecessary detail”), though this is secondary to synonymization.
  • Minimal expansion: The model rarely adds explanatory clauses or new information beyond the original; this contrasts with human editors, who may favor pedagogical expansions.
  • High vocabulary retention: Key source domain terms are frequently preserved, with a vocabulary-match score near 0.65, supporting fidelity to technical content.
  • Controlled jargon reduction: The difficult-word rate, defined as the fraction of tokens that are absent from a 3,000-word simple lexicon or contain three or more syllables, drops from approximately 60% (human reference) to 48% (Mistral output), demonstrating substantial yet non-aggressive simplification (a minimal computation sketch follows this list).
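
A minimal sketch of the difficult-word rate under this definition; the vowel-group syllable counter is a rough heuristic, and the simple lexicon is supplied by the caller:

```python
import re

def difficult_word_rate(text: str, simple_lexicon: set[str]) -> float:
    """Fraction of word tokens that are 'difficult': absent from the
    ~3,000-word simple lexicon or containing three or more syllables."""
    def syllables(word: str) -> int:
        # Count vowel groups as an approximation of syllable count.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    tokens = re.findall(r"[A-Za-z]+", text)
    difficult = [t for t in tokens
                 if t.lower() not in simple_lexicon or syllables(t) >= 3]
    return len(difficult) / len(tokens) if tokens else 0.0
```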

This operational profile, referenced here as “tempered simplification” (Editor's term), seeks a balance between enhancing readability and minimizing information loss.

3. Quantitative Performance Across Simplification Metrics

Performance is evaluated with a suite of metrics over a benchmark biomedical dataset, emphasizing both readability improvement and preservation of original semantic content.

3.1 SARI (Add/Keep/Delete)

The SARI metric is formally defined for a test set of $N$ examples as:

$$\text{SARI} = \frac{1}{N}\sum_{n=1}^{N} \frac{\mathrm{F1}_{\mathrm{Add},n} + \mathrm{F1}_{\mathrm{Keep},n} + \mathrm{P}_{\mathrm{Del},n}}{3}$$

This aggregates the $F_1$ score for word additions ($\mathrm{F1}_{\mathrm{Add},n}$), the $F_1$ score for word retention ($\mathrm{F1}_{\mathrm{Keep},n}$), and the precision of deletions ($\mathrm{P}_{\mathrm{Del},n}$), each computed against human references.
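
For intuition, a simplified sketch of the three components is given below, restricted to unigrams and a single reference; the published metric additionally averages over n-gram orders 1–4 and multiple references:

```python
def sari_unigram(source: str, candidate: str, reference: str) -> float:
    """Single-reference, unigram-only SARI sketch (0-100 scale)."""
    src, cand, ref = (set(t.lower().split()) for t in (source, candidate, reference))

    def f1(p: float, r: float) -> float:
        return 2 * p * r / (p + r) if (p + r) else 0.0

    # ADD: words introduced by the candidate, judged against the reference.
    add_c, add_r = cand - src, ref - src
    tp = len(add_c & add_r)
    f1_add = f1(tp / len(add_c) if add_c else 0.0, tp / len(add_r) if add_r else 0.0)

    # KEEP: source words retained by both candidate and reference.
    keep_c, keep_r = cand & src, ref & src
    tp = len(keep_c & keep_r)
    f1_keep = f1(tp / len(keep_c) if keep_c else 0.0, tp / len(keep_r) if keep_r else 0.0)

    # DELETE: source words dropped by the candidate, scored by precision only.
    del_c, del_r = src - cand, src - ref
    p_del = len(del_c & del_r) / len(del_c) if del_c else 0.0

    return 100 * (f1_add + f1_keep + p_del) / 3
```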

  • Mistral 24B (flexi): SARI = 42.46 (95% CI [41.86, 43.05])
  • Mistral 24B (strict): SARI = 42.37 (95% CI [41.77, 42.96])
  • Previous Transformer baselines (T5, BART): SARI ≈ 34.0; GPT-4.1-mini: SARI ≈ 43.8

Mistral 24B thus establishes a strong baseline for zero-shot biomedical simplification with instruction-tuned LLMs, approaching GPT-4.1-mini while clearly surpassing earlier fine-tuned Transformer baselines.

3.2 BERTScore (Discourse Fidelity)

BERTScore is a semantic similarity measure computed from greedy token alignments in contextual embedding space:

$$P_{\mathrm{BERT}} = \frac{1}{|C|}\sum_{w_i \in C} \max_{w'_j \in R} \cos\big(\mathbf{e}(w_i), \mathbf{e}(w'_j)\big)$$

$$R_{\mathrm{BERT}} = \frac{1}{|R|}\sum_{w'_j \in R} \max_{w_i \in C} \cos\big(\mathbf{e}(w'_j), \mathbf{e}(w_i)\big)$$

$$\text{BERTScore} = \frac{2\,P_{\mathrm{BERT}}\,R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$$

where $C$ is the candidate text, $R$ the reference text, and $\mathbf{e}(\cdot)$ denotes the contextual embedding of a token.
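
A compact sketch of these three quantities from precomputed token embeddings; obtaining the embeddings from a BERT-family encoder is assumed and omitted here:

```python
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """BERTScore F1 given (n_tokens, dim) contextual embeddings for the
    candidate and reference texts."""
    # Normalize rows so dot products are cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T  # (n_cand, n_ref) cosine similarity matrix

    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)
```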

  • Mistral 24B: BERTScore = 0.91 (matches human ground truth)
  • Qwen2.5 32B: BERTScore = 0.89 (p < 0.05, significantly lower)

This positions Mistral 24B as matching human-level discourse preservation while achieving advanced simplification.

4. Correlation and Redundancy Among Simplification Metrics

Comprehensive correlation analysis on 21 metrics—covering readability (6 indices), discourse fidelity (5), content safety, and distributional statistics—yields several key findings (a minimal analysis sketch follows this list):

  • Strong redundancy among grade-based readability indices: Flesch Reading Ease, FKGL, ARI, SMOG, and Gunning Fog scores exhibit intercorrelations |ρ| ≥ 0.7; Dale-Chall shows a moderate, less redundant correlation (ρ ≈ 0.4–0.6) owing to its vocabulary-list focus.
  • Syntactic vs. lexical simplification: Readability correlates more strongly with mean sentence length (ρ ≈ 0.4–0.7) than with difficult-word rate (Mistral: ρ ≈ 0.4; human: ρ ≈ 0.2–0.3; Qwen: ρ ≈ 0.1), indicating LLMs simplify syntax more readily than lexicon.
  • Discourse fidelity cluster: BERTScore aligns tightly with ROUGE-L and SacreBLEU (ρ ≈ 0.8–0.9), reflecting overlapping capture of n-gram and paraphrase-level fidelity; LDA-Topic metrics contribute a distinct topicality dimension.
  • System-dependent cross-metric associations: Mistral 24B’s readability-accuracy coefficients (e.g., BERTScore vs. FKGL, Dale-Chall) are positive and moderate (ρ ≈ 0.2–0.4), whereas Qwen’s are near zero or negative (ρ ≈ −0.2 to 0.1). Difficult-word reduction is positively associated with discourse preservation for Mistral (ρ ≈ 0.2–0.5), suggesting aligned simplification; this pattern does not hold for Qwen.
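
A minimal sketch of how such a redundancy screen can be reproduced, assuming a hypothetical per-output table of metric scores (metric_scores.csv is a placeholder):

```python
import pandas as pd

# One row per system output, one column per evaluation metric
# (e.g., 'fkgl', 'smog', 'dale_chall', 'bertscore', 'rouge_l', ...).
scores = pd.read_csv("metric_scores.csv")  # hypothetical input file

rho = scores.corr(method="spearman")  # pairwise Spearman correlation matrix

# Flag metric pairs exceeding the |rho| >= 0.7 redundancy threshold.
redundant = [
    (a, b, round(rho.loc[a, b], 2))
    for i, a in enumerate(rho.columns)
    for b in rho.columns[i + 1:]
    if abs(rho.loc[a, b]) >= 0.7
]
print(redundant)
```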

5. Comparative Architectural Advantage and Domain Adaptation Implications

Instruction-tuned Mistral 24B exhibits distinctive properties relative to alternate architectures:

  • Balance of readability and fidelity: The model achieves readability gains (SARI ≈ 42.4; US grade level 12–14) while maintaining high semantic fidelity (BERTScore 0.91, LDA-Topics 0.37).
  • Stability across generation variability: Performance is robust to temperature (T = 0.2 vs. T = 0.4), evidencing strong task-specific priors conferred by instruction tuning.
  • Conservative lexical control: Domain-term retention remains high (vocab-match ≈ 0.65) while difficult-word usage is substantially reduced; metric correlations indicate tight coupling between lexical simplification and accuracy, contrasting with the looser coupling observed for Qwen.

A plausible implication is that instruction-tuned LLMs possess an architectural predisposition toward integrated simplification and discourse preservation strategies, distinct from other LLM classes.

6. Practical Considerations and Future Directions

  • Out-of-the-box utility: Instruction-tuned LLMs like Mistral 24B deliver competitive text simplification performance on health content without task-specific fine-tuning, contingent on prompt design.
  • Domain adaptation priorities: With syntactic simplification largely resolved, future adaptation should emphasize lexical resource enrichment—e.g., domain-specific glossaries and paraphrase repositories—rather than exhaustive retraining (see the prompt sketch after this list).
  • Metric selection heuristics: Given the documented redundancy among readability formulas, evaluators need only monitor a minimal subset (e.g., a polysyllable-based index, Dale-Chall), with BERTScore supplanting multiple n-gram metrics.
  • Generalization hypothesis: Validation across additional domains (legal, financial) and LLMs is warranted to establish the generality of the observed instruction-tuning advantage—the “tempered” simplification-accuracy coupling.
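
A minimal prompt-design sketch combining the zero-shot framing with a small glossary, assuming an OpenAI-compatible endpoint serving the model; the endpoint URL, model identifier, and glossary entries are illustrative placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

GLOSSARY = {  # illustrative domain-specific substitutions
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
}

def simplify(abstract: str, temperature: float = 0.2) -> str:
    glossary = "\n".join(f"- {k} -> {v}" for k, v in GLOSSARY.items())
    prompt = (
        "Rewrite the following biomedical abstract in plain language for a "
        "general reader. Replace jargon with common synonyms, keep all key "
        "findings, and omit unnecessary detail.\n"
        f"Preferred substitutions:\n{glossary}\n\nAbstract:\n{abstract}"
    )
    resp = client.chat.completions.create(
        model="mistral-small-24b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content
```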

These findings establish instruction-tuned Mistral 24B as a reference model in the evolving landscape of automatic biomedical text simplification, and provide data-driven heuristics for both quantitative evaluation and domain-adaptation strategies (Githinji et al., 7 Nov 2025).
