
Apertus Project: Open Multilingual LLMs

Updated 26 September 2025
  • Apertus is a fully open suite of large language models built under strict data-compliance constraints, with multilingual coverage spanning 1811 languages.
  • It employs a dense, decoder-only Transformer architecture and introduces the Goldfish objective to minimize verbatim memorization during training.
  • The project releases full scientific artifacts and source code under permissive licenses, ensuring transparent, reproducible, and legally aligned AI research.

The Apertus Project is a comprehensive, fully open suite of LLMs developed to address critical limitations in the open-model ecosystem, with particular emphasis on data compliance and multilingual inclusivity. Apertus distinguishes itself through its rigorous adherence to open data standards, systematic avoidance of non-permissive and personally identifiable content, and expansive linguistic coverage, all while advancing technical transparency via full release of scientific artifacts under permissive licensing. The models are pretrained exclusively on sources that permit redistribution and exhibit strong performance on multilingual benchmarks, thereby setting new standards for both responsible AI development and global language representation.

1. Data Collection, Filtering, and Compliance

Apertus's data pipeline is explicitly designed to uphold strict data compliance. Pretraining leverages openly available web data, including repositories such as FineWeb-2 and FineWeb-Edu, which collectively encompass 15 trillion tokens across 1811 languages. Around 40% of this corpus is non-English, reflecting a direct effort to counter the typical Anglo-centrism of LLM training data. The pipeline retroactively respects robots.txt opt-outs, ensuring exclusion of web content where content owners have denied crawling or reuse. Further, rigorous automated filters remove copyrighted, toxic, and personally identifiable data. Multilingual fairness is reinforced by applying toxicity classifiers to at least nine languages and by language-specific filtering thresholds.
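The retroactive robots.txt check described above can be sketched with Python's standard-library parser. This is an illustrative assumption about the mechanics of such a filter, not the Apertus codebase; the crawler name and URLs are hypothetical.

```python
# Sketch of a robots.txt opt-out check for retroactively excluding crawled
# pages from a pretraining corpus. Assumed mechanics, not the Apertus pipeline.
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "ResearchCrawler", "https://example.com/private/page"))  # False
print(is_allowed(rules, "ResearchCrawler", "https://example.com/docs/intro"))    # True
```

A compliance pass would evaluate every retained URL against the site's current robots.txt and drop disallowed documents before training.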

Intrinsic evaluation of tokenization incorporates a suite of metrics—including fertility, compression ratio, vocabulary utilization, and the Gini coefficient—to ensure balanced representation, particularly for low-resource languages. This systematic approach aims to maximize linguistic inclusivity while minimizing ethical and legal risk in global deployment contexts.
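Two of the metrics named above are simple to state concretely: fertility is the average number of subword tokens produced per word, and the Gini coefficient measures how unevenly vocabulary usage is distributed. The sketch below uses a crude character-bigram "tokenizer" as a stand-in assumption for a real subword model.

```python
# Illustrative implementations of two tokenizer-quality metrics:
# fertility (tokens per word) and the Gini coefficient over token frequencies.

def fertility(words, tokenize):
    """Average number of subword tokens produced per word (lower is better)."""
    return sum(len(tokenize(w)) for w in words) / len(words)

def gini(counts):
    """Gini coefficient of a token-frequency distribution (0 = perfectly even)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Placeholder subword model: split words into character bigrams.
toy_tokenize = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]

print(fertility(["hello", "world", "ai"], toy_tokenize))  # ≈ 2.33
print(gini([5, 5, 5, 5]))  # 0.0 for a uniform distribution
```

In a multilingual setting, fertility computed per language flags languages that the tokenizer fragments excessively, while a high Gini coefficient indicates that a few tokens dominate the vocabulary's usage.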

2. Pretraining Methodology and the Goldfish Objective

Apertus employs a dense, decoder-only Transformer architecture at two primary scales: 8B and 70B parameters. The models are pretrained on the full 15T-token corpus with a conventional autoregressive objective, but critically, the Goldfish objective is introduced to minimize the risk of verbatim memorization and potential data leakage. The Goldfish objective augments the cross-entropy loss with a per-batch, random binary mask G over token targets, formalized as:

L(\theta) = -\frac{1}{|G|} \sum_{i} G_i \log P_\theta(x_i \mid x_{<i})

Here, G is sampled per batch, suppressing verbatim prediction of longer snippet spans while maintaining overall language modeling competence. This design mitigates the risk of memorizing proprietary or sensitive training data, aligning with content-owner rights and privacy standards that surpass conventional open-weights practices.
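The masked loss above can be sketched in a few lines of NumPy. The deterministic pseudo-random mask below is an illustrative assumption (the actual masking scheme and drop rate are choices of the training recipe); it drops roughly one in every k target tokens from the loss so that no long span is supervised verbatim.

```python
# Minimal sketch of a goldfish-style masked cross-entropy loss.
import numpy as np

def goldfish_mask(token_ids, k=4):
    """1 = token contributes to the loss, 0 = dropped (pseudo-randomly, ~1-in-k)."""
    return np.array([0 if ((int(t) * 2654435761 + i) % k) == 0 else 1
                     for i, t in enumerate(token_ids)])

def goldfish_loss(log_probs, targets, mask):
    """-(1/|G|) * sum_i G_i * log P(x_i | x_<i), over unmasked targets only."""
    picked = log_probs[np.arange(len(targets)), targets]  # log-prob of each target
    return -float(np.sum(mask * picked)) / max(int(mask.sum()), 1)

# Toy check: uniform predictions over a 4-token vocabulary.
log_probs = np.full((3, 4), np.log(0.25))
targets = np.array([0, 1, 2])
print(goldfish_loss(log_probs, targets, np.ones(3)))  # ≈ 1.386 (= ln 4)
```

Because masked positions contribute no gradient, the model never receives a complete supervised copy of any sufficiently long training span, which is what blunts verbatim regurgitation.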

3. Model Scaling, Fine-Tuning, and Multilingual Evaluation

Apertus models are released at both 8B and 70B scales. Pretraining recipes are held constant across both variants, with subsequent post-training (instruction tuning) specifically optimized for multilingual use cases. Performance is systematically evaluated against a range of benchmarks:

  • MMLU and Global-MMLU: Assessing generic language understanding
  • INCLUDE & SwitzerlandQA: Addressing culturally sensitive and regional nuances

Apertus consistently exhibits competitive, and often superior, scores on these multilingual tasks compared to other open-weight models such as EuroLLM and OLMo2. The extensive language coverage (over 1800 languages) directly underlies this performance, supporting robust representation for both high-resource and low-resource linguistic contexts.
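When comparing models across this many languages, a macro average (unweighted mean over per-language accuracies) prevents high-resource languages from dominating the headline score. The record format below is an assumption for illustration, not the schema of the Apertus evaluation suite.

```python
# Hedged sketch of per-language score aggregation for multilingual benchmarks.
from collections import defaultdict

def per_language_accuracy(records):
    """records: iterable of (language, is_correct) pairs -> {language: accuracy}."""
    tally = defaultdict(lambda: [0, 0])
    for lang, correct in records:
        tally[lang][0] += int(correct)
        tally[lang][1] += 1
    return {lang: hits / n for lang, (hits, n) in tally.items()}

def macro_average(per_lang):
    """Unweighted mean over languages, so low-resource languages count equally."""
    return sum(per_lang.values()) / len(per_lang)

records = [("de", True), ("de", True), ("rm", False), ("rm", True)]
scores = per_language_accuracy(records)
print(scores)                 # {'de': 1.0, 'rm': 0.5}
print(macro_average(scores))  # 0.75
```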

4. Release of Scientific Artifacts and Licensing

All components of the Apertus Project are released under permissive licenses intended to facilitate commercial deployment and collaborative research, while remaining compliant with evolving global regulations, such as the EU AI Act. The project distributes:

  • Full model weights and checkpoints (intermediate and final)
  • Data preparation scripts (including those for robots.txt compliance, toxicity filtering)
  • Pretraining and fine-tuning codebases
  • Evaluation suite for reproducible benchmarking
  • Comprehensive technical documentation

This openness enables transparent audit, replication, and extension of both models and development workflows, setting a new standard of trustworthiness in the LLM field.

5. Impact, Significance, and Comparative Positioning

Apertus sets a new benchmark for responsible, richly multilingual open LLMs. Its rigorous data compliance reduces legal risk associated with open-weight releases and assures respect for privacy and intellectual property. The proactive linguistic inclusivity allows deployment across culturally and linguistically diverse environments. In performance, the project rivals and, in certain benchmarks, surpasses contemporaneous open models, attesting to the efficacy of its training corpus and methodology.

This suggests that Apertus could become a fundamental reference for future research into ethically scaled and multilingual LLMs, especially where regulatory constraints are prominent.

6. Strategic Directions for Extension and Research

The project outlines explicit future directions:

  • Scaling to larger models and long-context architectures while retaining compliance and transparency
  • Distillation of large models into smaller, more efficient variants, expanding accessibility to resource-constrained environments
  • In-depth analysis of data selection and filtering thresholds with respect to model fairness, capability, and memorization
  • Integration of adaptive compute and reinforcement learning techniques (e.g., verifier-based RL) to enhance reasoning and factual accuracy
  • Extension to multimodal settings (e.g., joint text+image models) under the same ethical and legal constraints
  • Direct community engagement to validate model fairness and to align outputs with heterogeneous societal values

A plausible implication is the emergence of more granular methodologies for balancing compliance, utility, and multilingualism in large-scale generative models.

7. Ethical and Regulatory Considerations

Apertus’s commitment to transparency and compliance prefigures broader industry trends regarding regulatory standards for AI. By providing complete, reproducible data pipelines and making all artifacts openly available, Apertus facilitates independent verification and audit—critical for institutional and governmental oversight in sensitive domains. The adherence to data owner directives and regulatory frameworks such as the EU AI Act marks a significant shift toward ethical, legally robust machine learning practice.

In conclusion, the Apertus Project advances open, reproducible, and ethically aligned language modeling through technical rigor and comprehensive release practices, fostering greater innovation and trust within the global research and deployment community.
