
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments (2509.14233v1)

Published 17 Sep 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We present Apertus, a fully open suite of LLMs designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.

Summary

  • The paper introduces a compliance-focused LLM pipeline that retroactively enforces robots.txt exclusions and applies robust filtering, supporting data privacy and ethical pretraining.
  • The study presents techniques such as the xIELU activation and the Goldfish loss, reaching comparable training loss with 30–40% fewer tokens and improved training stability.
  • Extensive multilingual benchmarks and transparent, open-sourced artifacts validate the models' performance and pave the way for reproducible, global LLM research.

Apertus: Open, Compliant, and Multilingual LLMs for Global Language Environments

Introduction and Motivation

The Apertus project presents a suite of LLMs designed to address two persistent deficits in the open LLM ecosystem: (1) the lack of data compliance and transparency, and (2) insufficient multilingual representation, especially for low-resource languages. The work is distinguished by its rigorous approach to data governance, including retroactive enforcement of robots.txt exclusions, comprehensive filtering for non-permissive, toxic, and personally identifiable content, and a strong focus on reproducibility. Apertus models are released at 8B and 70B parameter scales, with all scientific artifacts—data preparation scripts, checkpoints, evaluation suites, and training code—publicly available under permissive licenses.

Data Compliance and Filtering Pipeline

Apertus sets a new standard for data compliance in LLM pretraining. The pretraining corpus is constructed exclusively from openly available data, with retroactive application of robots.txt exclusions as of January 2025, ensuring that content owners' opt-out preferences are respected even for historical web crawls. The pipeline further removes personally identifiable information (PII) and applies multilingual toxicity filtering using language-specific classifiers trained on annotated corpora.

Figure 1: Document filtering pipeline for selected resource datasets used during pretraining, encompassing all compliance and quality filters.

This approach results in a measurable reduction in available data (e.g., 8% token loss in English), but is necessary for legal compliance under frameworks such as the EU AI Act. The filtering pipeline is fully reproducible and open-sourced, enabling external audit and extension.
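
For a concrete sense of the retroactive opt-out check, the sketch below (a minimal illustration, not the released pipeline) drops crawled documents whose URLs the host's current robots.txt disallows, using Python's urllib.robotparser; the crawler user-agent and the document record layout are assumptions.

```python
# Minimal sketch (not the Apertus pipeline): drop crawled documents whose source
# URL is disallowed by the host's *current* robots.txt, approximating retroactive
# opt-out enforcement. The user-agent and record layout are assumptions.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical; the real pipeline names specific crawler bots
_parsers: dict[str, RobotFileParser] = {}

def allowed(url: str) -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    parts = urlsplit(url)
    host = f"{parts.scheme}://{parts.netloc}"
    rp = _parsers.get(host)
    if rp is None:
        rp = RobotFileParser(host + "/robots.txt")
        try:
            rp.read()                   # fetch and parse the live robots.txt
        except OSError:
            rp.disallow_all = True      # unreachable: conservatively treat as opted out
        _parsers[host] = rp
    return rp.can_fetch(USER_AGENT, url)

def filter_documents(docs):
    """Yield only documents whose 'url' field passes the robots.txt check."""
    for doc in docs:
        if allowed(doc["url"]):
            yield doc
```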

Multilinguality and Tokenization

Apertus is pretrained on 15T tokens spanning 1811 languages, with approximately 40% of the data allocated to non-English content. This is a significant expansion over prior open models, which typically cover an order of magnitude fewer languages. The tokenizer is a byte-level BPE model adapted from Mistral-Nemo, selected via a rigorous evaluation of fertility rate, compression ratio, vocabulary utilization, and Gini coefficient for fairness.

Figure 2: The Mistral-Nemo tokenizer achieves superior efficiency and fairness across languages compared to other major multilingual tokenizers.

This choice ensures equitable tokenization costs and efficient representation for both high- and low-resource languages, supporting the model’s global ambitions.
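
To make two of these selection metrics concrete, the sketch below (an illustration, not the paper's evaluation code) computes per-language fertility, i.e., tokens per whitespace-delimited word, with a Hugging Face tokenizer, and a Gini coefficient over those fertilities as a crude fairness proxy; the checkpoint id and the tiny sample texts are placeholders.

```python
# Illustrative sketch of two tokenizer-selection metrics: per-language fertility
# (tokens per whitespace word) and a Gini coefficient over those fertilities as a
# rough fairness proxy. The checkpoint id and toy samples are placeholders.
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")  # assumed Hub id

samples = {  # toy per-language samples; the real evaluation uses held-out corpora
    "eng": "The quick brown fox jumps over the lazy dog.",
    "deu": "Der schnelle braune Fuchs springt über den faulen Hund.",
    "swa": "Mbweha mwepesi wa kahawia anaruka juu ya mbwa mvivu.",
}

def fertility(text: str) -> float:
    """Tokens emitted per whitespace-delimited word (lower is better)."""
    n_words = len(text.split())
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    return n_tokens / n_words

def gini(values) -> float:
    """Gini coefficient of a 1-D array (0 = perfectly even across languages)."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    cum = np.cumsum(v)
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)

ferts = np.array([fertility(t) for t in samples.values()])
for lang, f in zip(samples, ferts):
    print(f"{lang}: fertility = {f:.2f}")
print(f"Gini over fertilities: {gini(ferts):.3f}")
```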

Model Architecture and Training Recipe

Apertus employs a dense decoder-only Transformer architecture with several notable modifications:

  • xIELU activation: A non-gated, piecewise activation function with trainable parameters, empirically shown to outperform SwiGLU in ablations.
  • QK-Norm: Query-key normalization in attention layers for improved stability.
  • RMSNorm and Pre-Norm: For enhanced training stability and efficiency.
  • Grouped-Query Attention (GQA): For inference efficiency at long context lengths.
  • Goldfish Loss: A next-token objective that masks a pseudorandom subset of token positions out of the loss, suppressing verbatim memorization without degrading downstream performance (a minimal sketch follows this list).
  • AdEMAMix Optimizer: A momentum-based optimizer that adds a long-term exponential moving average (EMA) of gradients, yielding faster convergence and improved scaling.
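
Of these components, the Goldfish objective is the least conventional. The snippet below is a minimal sketch of the idea, not the Apertus training code: a deterministic hash of each position's local context drops roughly 1/k of token positions from the next-token cross-entropy, so no exact sequence is ever fully supervised; the drop rate and hash window are placeholder choices.

```python
# Minimal sketch of a Goldfish-style loss (placeholder hyperparameters, not the
# Apertus training code): a deterministic hash of each position's local target
# context drops roughly 1/k of positions from the next-token cross-entropy, so
# the model is never supervised on any exact sequence end to end.
import torch
import torch.nn.functional as F

def goldfish_loss(logits: torch.Tensor, targets: torch.Tensor,
                  k: int = 4, context: int = 13) -> torch.Tensor:
    """logits: [B, T, vocab]; targets: [B, T] token ids (already shifted)."""
    B, T = targets.shape
    MOD = 2_147_483_647  # large prime keeps the rolling hash inside int64
    weights = torch.tensor([pow(31, i, MOD) for i in range(context)],
                           device=targets.device)
    padded = F.pad(targets, (context - 1, 0), value=0)   # left-pad so every position has a window
    windows = padded.unfold(1, context, 1)               # [B, T, context] sliding windows
    h = (windows * weights).sum(-1)                      # rolling polynomial hash per position
    keep = (h % k) != 0                                  # drop ~1/k of positions deterministically
    loss = F.cross_entropy(logits.reshape(B * T, -1), targets.reshape(B * T),
                           reduction="none").reshape(B, T)
    return (loss * keep).sum() / keep.sum().clamp(min=1)
```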

Ablation studies demonstrate that the combination of these architectural and recipe changes enables the model to match the final training loss of a strong Llama baseline with 30–40% fewer tokens, with improved stability and gradient norms.

Figure 3: The final Apertus architecture achieves improved stability and matches baseline loss with substantially fewer tokens.

Pretraining is conducted on up to 4096 GPUs, with context lengths extended up to 65k tokens via staged long-context adaptation.

Pretraining Stability and Efficiency

The pretraining process is characterized by exceptional stability, with no major loss spikes or rollbacks, even during batch size doubling and data mixture transitions.

Figure 4: Pretraining loss curves and gradient norms for both 8B and 70B models, demonstrating stable training dynamics.

Aggressive gradient clipping (at 0.1) is applied, which is especially important for the AdEMAMix optimizer. Experiments with FP8 training yielded a 26% throughput increase but caused loss instability, necessitating a rollback to BF16.
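
For reference, global-norm clipping at such a threshold is a one-line addition to a standard training step; the fragment below shows the usual PyTorch call under the assumption of a simple single-device loop (the actual runs use a large-scale distributed trainer, and the model, optimizer, and batch layout here are placeholders).

```python
# Illustrative training-step fragment (placeholder model/optimizer, not the
# distributed Apertus trainer): clip the global gradient norm to 0.1 before
# the optimizer step, as described for the AdEMAMix runs.
import torch

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(batch["input_ids"]), batch["labels"])
    loss.backward()
    # Aggressive clipping: rescale gradients so their global L2 norm is <= 0.1.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    return loss.detach()
```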

Post-Training: Instruction Tuning and Alignment

Post-training follows a two-stage process: supervised finetuning (SFT) on a curated, license-compliant, and decontaminated instruction-following corpus (3.8M+ examples), and alignment using Quantile Reward Policy Optimization (QRPO). QRPO enables direct optimization of absolute reward signals, supporting both reward model and LLM-as-judge feedback, and is shown to outperform DPO at scale.
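
QRPO itself is a full alignment method; as a minimal sketch of the quantile-reward idea its name refers to, and under the assumption that it maps a raw scalar reward to its quantile within a reference reward distribution for the same prompt, the snippet below implements just that transformation (reward values and sample counts are placeholders, and the policy-optimization step is omitted).

```python
# Minimal sketch of a quantile-reward transformation assumed to underlie QRPO:
# each candidate response's raw reward is mapped to its quantile within rewards
# of reference-policy samples for the same prompt. This is only the reward
# preprocessing, not the full policy-optimization objective.
import numpy as np

def quantile_reward(raw_reward: float, reference_rewards) -> float:
    """Fraction of reference-sample rewards below this reward (value in [0, 1])."""
    ref = np.asarray(reference_rewards, dtype=float)
    return float((ref < raw_reward).mean())

# Toy usage: rewards for 64 reference-model completions of one prompt,
# plus one on-policy candidate scored by the same reward model.
rng = np.random.default_rng(0)
reference_rewards = rng.normal(loc=0.0, scale=1.0, size=64)   # placeholder scores
candidate_reward = 1.3
print(quantile_reward(candidate_reward, reference_rewards))   # e.g. ~0.9
```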

Alignment is further extended to encode constitutional values, specifically the Swiss AI Charter, validated via public surveys and operationalized through LLM-as-judge scoring for controversial prompts.

Figure 5: Survey-based rankings of the relative importance of Swiss AI Charter principles for LLM alignment.

Evaluation: Multilingual, Cultural, and Safety Benchmarks

Apertus models are evaluated on an extensive suite of benchmarks, including general language understanding, factual knowledge, reasoning, mathematics, coding, instruction following, cultural knowledge, and safety. The evaluation suite covers 94 languages, with particular attention to African and Eurasian languages.

Figure 6: Downstream evaluation curves during pretraining, showing strong multilingual and regional performance.

Apertus-70B achieves the highest score among all evaluated models on the multilingual XCOPA benchmark and outperforms all other fully open models on INCLUDE V1/V2 and SwitzerlandQA. The models also demonstrate robust performance on low-resource translation tasks, notably for Romansh and its dialects.

Memorization Mitigation and Analysis

Apertus employs the Goldfish loss to suppress verbatim memorization. Empirical analysis using injected Gutenberg sequences at controlled frequencies shows that both 8B and 70B models maintain baseline Rouge-L scores across all exposure frequencies and prefix lengths, indicating effective mitigation.

Figure 7: Heatmaps of Rouge-L scores for suffixes of 500 tokens, demonstrating successful mitigation of verbatim memorization across exposure frequencies and prefix lengths.
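
As an illustration of this style of memorization probe (not the authors' evaluation harness), the sketch below prompts a model with a prefix of an injected passage and scores the greedy continuation against the true suffix with Rouge-L; the checkpoint id, file path, and probe lengths are assumptions.

```python
# Illustrative memorization probe (checkpoint id, file path, and probe lengths
# are placeholders, not the paper's harness): prompt the model with a prefix of
# an injected sequence and score its continuation against the true suffix.
import torch
from rouge_score import rouge_scorer
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "swiss-ai/Apertus-8B-2509"       # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

document = open("injected_gutenberg_passage.txt").read()   # placeholder file
prefix_tokens, suffix_tokens = 500, 500                    # illustrative probe lengths
ids = tok.encode(document, add_special_tokens=False)
prefix = tok.decode(ids[:prefix_tokens])
true_suffix = tok.decode(ids[prefix_tokens:prefix_tokens + suffix_tokens])

with torch.no_grad():
    out = model.generate(**tok(prefix, return_tensors="pt"),
                         max_new_tokens=suffix_tokens, do_sample=False)
generated_suffix = tok.decode(out[0][len(tok(prefix)["input_ids"]):],
                              skip_special_tokens=True)

# A Rouge-L score near the no-injection baseline indicates suppressed memorization.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
print(scorer.score(true_suffix, generated_suffix)["rougeL"].fmeasure)
```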

Further analysis reveals that Goldfish loss disrupts the typical positional fragility of memorization, with recall fluctuating rather than decaying sharply with offset. However, the approach is fragile to near-duplicates due to deterministic masking, and high recall is still observed for highly repetitive or canonical content.

Figure 8: Memorization dynamics for sequences injected at different pretraining stages, showing a primacy effect for earlier exposures.

Figure 9: Negative correlation between type-token ratio (TTR) and Rouge-L, indicating that low-diversity sequences are more susceptible to template memorization.
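
As a small illustration of the diversity measure behind this correlation, the snippet below computes type-token ratios and their Pearson correlation with Rouge-L scores; all values are hypothetical placeholders rather than results from the paper.

```python
# Illustrative computation of type-token ratio (TTR) and its correlation with
# Rouge-L memorization scores; token lists and scores below are placeholders.
import numpy as np

def type_token_ratio(token_ids) -> float:
    """Unique tokens divided by total tokens: low values indicate repetitive text."""
    return len(set(token_ids)) / max(len(token_ids), 1)

# Hypothetical per-sequence statistics standing in for the probe's outputs.
sequences = [
    [5, 7, 5, 7, 5, 7, 5, 7],        # templated/repetitive: low TTR
    [5, 7, 11, 7, 13, 5, 17, 19],    # moderately diverse
    [2, 3, 5, 7, 11, 13, 17, 19],    # fully diverse: TTR = 1.0
]
ttrs = np.array([type_token_ratio(s) for s in sequences])
rouge_l = np.array([0.91, 0.40, 0.15])   # hypothetical Rouge-L for the same sequences

# Pearson correlation; the paper reports a negative relationship.
print(np.corrcoef(ttrs, rouge_l)[0, 1])
```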

Infrastructure and Scaling

Apertus was trained on the Alps supercomputing infrastructure, leveraging up to 4096 NVIDIA Grace-Hopper GPUs. The training pipeline incorporates containerized environments, node-vetting, and robust checkpointing strategies. After extensive systems-level and software optimizations, the 70B model achieved 80% strong scaling efficiency at 4096 GPUs, with a throughput of 723 tokens/sec/GPU.

Figure 10: Token throughput during training for 70B and 8B models.

Figure 11: Throughput improvements before and after stability enhancements, highlighting the impact of storage and driver fixes.

Figure 12: Strong and weak scaling efficiency for the 70B model, demonstrating high parallel efficiency at scale.
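
As a back-of-the-envelope check on these throughput numbers, and under the assumption that strong scaling efficiency is defined relative to ideal linear scaling from a smaller reference run, the snippet below reproduces the arithmetic; the 512-GPU reference point is a placeholder.

```python
# Back-of-the-envelope scaling arithmetic. Only the 723 tok/s/GPU throughput and
# the 80% strong-scaling figure come from the text; the 512-GPU reference run is
# a placeholder assumption used to illustrate the efficiency definition.
def strong_scaling_efficiency(agg_ref: float, gpus_ref: int,
                              agg: float, gpus: int) -> float:
    """Measured aggregate throughput divided by ideal linear scaling from a reference run."""
    return agg / (agg_ref * gpus / gpus_ref)

per_gpu_4096 = 723                          # tokens/sec/GPU at 4096 GPUs
aggregate_4096 = per_gpu_4096 * 4096        # ~2.96M tokens/sec overall
print(f"aggregate throughput: {aggregate_4096 / 1e6:.2f}M tok/s")

# 80% efficiency implies roughly 723 / 0.8 ≈ 904 tok/s/GPU at the reference scale.
per_gpu_ref = per_gpu_4096 / 0.80
print(strong_scaling_efficiency(per_gpu_ref * 512, 512, aggregate_4096, 4096))  # ≈ 0.80
```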

Safety and Security

Safety evaluations include multilingual toxicity and bias benchmarks, as well as manual spot-testing for high-risk scenarios. The models perform comparably to other fully open models on BBQ, HarmBench, and RealToxicityPrompts; jailbreak resistance is not pursued, given the open-weight nature of the release. The safety pipeline is designed to be robust across languages, with explicit recognition of the challenges posed by low-resource and non-English settings.

Conclusion

Apertus establishes a new baseline for open, compliant, and multilingual LLMs. The project demonstrates that it is feasible to train large-scale, high-performing models with rigorous data governance, full transparency, and strong multilingual coverage. The release of all artifacts under permissive licenses, together with detailed documentation and reproducibility scripts, enables independent audit and extension. The work highlights several open directions, including further scaling, distillation, data-to-performance mapping, adaptive compute, RL with verifiers, multimodality, societal alignment, and field evaluation.

Apertus provides a robust foundation for future research and deployment of trustworthy, globally relevant LLMs, and sets a precedent for open, reproducible, and compliant model development in the field.
