Propella-1: Structured Multilingual Annotation LLMs
- Propella-1 is a family of small, multilingual decoder-only LLMs that replaces single scalar scores with detailed, compositional JSON annotations.
- It employs a rigorous JSON schema with 18 properties in 6 categories, enabling flexible, rule-based filtering and comprehensive corpus analysis.
- Built on Qwen-3 with a 64K context length, the models achieve high throughput and strong evaluation metrics, scaling to annotate billions of documents.
Propella-1 is a family of small multilingual decoder-only LLMs for structured document annotation in LLM data curation. Introduced as an alternative to assigning each pretraining document a single scalar “quality” score, it emits a typed JSON object over 18 properties organized into six categories and supports 57 languages. Its stated role spans pretraining data selection, filtering, corpus analysis, and interpretability, and the flagship 4B model is paired with a public release, propella-annotations, containing 3,005,080,817 document annotations over major pretraining corpora (Idahl et al., 12 Feb 2026).
1. Origin, scope, and problem formulation
Propella-1 was introduced against a backdrop in which data curation for LLM pretraining had “predominantly relied on single scalar quality scores produced by small classifiers.” The paper identifies two major problems with that paradigm: single-score conflation, in which one scalar mixes together multiple latent notions such as reasoning depth, educational structure, objectivity, integrity, density, and safety; and low interpretability and inflexible filtering, in which thresholding a score does not support queries such as “high reasoning + low commercial bias + no PII + expert audience.” It also highlights limited multilingual coverage in prior work, noting that multilingual extensions still usually reduce documents to one score (Idahl et al., 12 Feb 2026).
Within that framing, Propella-1 is positioned not as another scalar scorer but as an annotation layer for compositional data curation. Instead of predicting one value, it outputs a structured profile over content quality, content type, educational and reasoning value, audience, safety, and geographic relevance. This supports filtering over fields rather than over a black-box aggregate number. A plausible implication is that the system is meant to shift curation from scalar thresholding to rule-based selection over interpretable metadata.
2. Model family, architecture, and inference format
The released family contains three models, all based on Qwen-3 and all decoder-only: propella-1-0.6b, propella-1-1.7b, and propella-1-4b. The paper states a 64K context length and gives three reasons for the decoder-only design: the models can handle long documents natively, emit all 18 labels in one pass as structured JSON, and internalize the detailed annotation rubric via fine-tuning (Idahl et al., 12 Feb 2026).
The models support 57 languages. The training set is about 35% English, with the remainder covering many European languages plus Arabic, Chinese, Japanese, Korean, Thai, and others. The paper also states dedicated support for code, mathematical content, and post-training data such as instruction-following and conversations (Idahl et al., 12 Feb 2026).
At inference time, each model emits a JSON object conforming to a predefined schema with enumerated values for all categorical properties. The output contains no whitespace to minimize token count. Operationally, the system is served with SGLang and llguidance, enforcing strict schema-constrained generation. The paper states that, as a result, every response is guaranteed to be valid JSON, fields stay within allowed labels, no post-hoc validation or retry logic is needed, filtering can be done directly over typed fields, and annotations can be joined to source corpora via document IDs (Idahl et al., 12 Feb 2026).
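To illustrate what this guarantee means for a downstream consumer, the following minimal sketch parses one compact annotation and checks a few fields against their label sets. The sample record is hypothetical, the field name `content_integrity` is an assumption modeled on the snake_case names in the paper's example, and only properties whose full label list is given in the ontology description are checked.

```python
import json

# Label sets copied from the ontology description; the key "content_integrity"
# is an assumed field name, not confirmed by the paper.
ALLOWED = {
    "content_integrity": {"complete", "mostly_complete", "fragment", "severely_degraded"},
    "content_safety": {"safe", "mild_concerns", "nsfw", "harmful", "illegal"},
    "audience_level": {"expert", "advanced", "general", "beginner", "youth", "children"},
}


def check_annotation(raw: str) -> list[str]:
    """Return a list of problems found in one annotation string (empty if none)."""
    record = json.loads(raw)
    problems = []
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems


# Hypothetical model output, serialized without whitespace.
raw = json.dumps(
    {"content_integrity": "complete", "content_safety": "safe", "audience_level": "expert"},
    separators=(",", ":"),
)
print(raw)
print(check_annotation(raw))
```

With schema-constrained decoding, a check like this should never report a problem; it is mainly useful when annotations arrive from a store whose provenance is uncertain.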
| Model | Parameters | Reported H100 throughput |
|---|---|---|
| propella-1-0.6b | 0.6B | 39.9 docs/s |
| propella-1-1.7b | 1.7B | 39.1 docs/s |
| propella-1-4b | 4B | 27.0 docs/s |
The 4B throughput figures are reported in two places. One benchmark gives propella-1-4b (fp8) at 27.0 docs/s, equivalent to 10.3 GPU-hours per 1M documents, with 50.1K prompt tokens/s and 3.9K output tokens/s on one H100 96GB GPU. A second set of numbers for the 4B model lists 10.3 docs/s (27.0 h per 1M docs) on an A100 80GB and 22.4 docs/s (12.4 h per 1M docs) on an H100 96GB. The paper attributes the runtime profile to prefill, since inputs are full documents and outputs are short JSON objects (Idahl et al., 12 Feb 2026).
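The per-million figures follow directly from the reported throughput. The short calculation below reproduces them; the helper name is illustrative, "K" is assumed to mean 1,000, and the per-document token averages are derived here from the reported rates rather than stated in the paper.

```python
def gpu_hours_per_million(docs_per_second: float) -> float:
    """Convert single-GPU throughput into GPU-hours per 1M documents."""
    return 1_000_000 / docs_per_second / 3600


# Reported single-GPU throughput for propella-1-4b (docs/s).
reported = {
    "H100 96GB, fp8 benchmark": 27.0,
    "A100 80GB": 10.3,
    "H100 96GB": 22.4,
}
hours = {name: gpu_hours_per_million(rate) for name, rate in reported.items()}
for name, value in hours.items():
    print(f"{name}: {value:.1f} GPU-hours per 1M docs")

# Implied per-document averages for the fp8 benchmark (derived, not reported).
prompt_tokens_per_doc = 50_100 / 27.0
output_tokens_per_doc = 3_900 / 27.0
print(round(prompt_tokens_per_doc), round(output_tokens_per_doc))
```

Under these assumptions the prompt-to-output token ratio is roughly 13 to 1, which is consistent with a prefill-dominated workload.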
3. Annotation ontology and compositional curation
The ontology comprises 18 properties in 6 categories, with one free-text field and 17 quantitatively evaluated fields. The six higher-level groups are Core Content, Classification, Quality and Value, Audience and Purpose, Safety and Compliance, and Geographic Relevance (Idahl et al., 12 Feb 2026).
| Category | Properties |
|---|---|
| Core Content | Content Integrity; Content Ratio; Content Length |
| Classification | One-Sentence Description; Content Type; Business Sector; Technical Content |
| Quality and Value | Content Quality; Information Density; Educational Value; Reasoning Indicators |
| Audience and Purpose | Audience Level; Commercial Bias; Time-Sensitivity |
| Safety and Compliance | Content Safety; PII Presence |
| Geographic Relevance | Regional Relevance; Country Relevance |
The ontology is unusually granular. Content Integrity distinguishes complete, mostly_complete, fragment, and severely_degraded. Content Ratio measures substantive content versus boilerplate, ranging from complete_content to minimal_content. Content Length measures meaningful-content length, with substantial defined as 2,000+ words and minimal as under 100 words (Idahl et al., 12 Feb 2026).
The Classification group separates function, domain, and technicality. Content Type is multi-select over 18 possible functional genres/purposes, including analytical, instructional, reference, technical_documentation, source_code, and structured_data. Business Sector is multi-select over 37 sectors. Technical Content is multi-select over 7 technicality types: code_heavy, math_heavy, scientific, data_heavy, engineering, basic_technical, and non_technical (Idahl et al., 12 Feb 2026).
The Quality and Value group explicitly decomposes “quality” into orthogonal axes. Content Quality runs from excellent to unacceptable; Information Density from dense to empty; Educational Value from high to none; and Reasoning Indicators from analytical to none. The paper stresses that this makes it possible to preserve technically valuable documents that may score poorly on simplistic educational metrics, or to exclude documents that are polished but commercially biased or thin in signal (Idahl et al., 12 Feb 2026).
The Audience and Purpose group introduces expert, advanced, general, beginner, youth, and children audience labels; commercial-bias labels from none to pure_marketing; and time-sensitivity labels from evergreen to time_sensitive. The Safety and Compliance group contains safe, mild_concerns, nsfw, harmful, and illegal, plus a binary PII field. The Geographic Relevance group separates language from geography through 14 regional categories and country-level labels using ISO-3166 country names or the special values supranational and none (Idahl et al., 12 Feb 2026).
A representative example in the paper includes fields such as "content_quality":"excellent", "information_density":"dense", "educational_value":"high", "reasoning_indicators":"analytical", "audience_level":"expert", "commercial_bias":"none", "content_safety":"safe", and "pii_presence":"no_pii". This suggests a document-level metadata layer that can be queried compositionally rather than linearly ranked.
4. Training data, supervision, and multilingual coverage
The training data was created by annotating a diverse document sample using multiple frontier LLMs, prompted with the full annotation rubric and a strict response schema. The detailed annotator prompt used for data generation was about 14K tokens, while the deployed Propella prompt is about 800 tokens, because the fine-tuned model has already internalized the task (Idahl et al., 12 Feb 2026).
The training set covers 57 languages, with approximately 35% English. Dedicated categories include code (2.82%), math (2.77%), and sft (2.41%). The larger language shares listed in the appendix are eng_Latn 35.08, spa_Latn 3.98, ita_Latn 3.97, fra_Latn 3.95, deu_Latn 3.86, and pol_Latn 3.81 (Idahl et al., 12 Feb 2026).
The source mix is intentionally heterogeneous. Main sources include HPLT 3.0 (unfiltered): 39.59%, FineWeb: 16.01%, FineWeb-2: 13.23%, FinePDFs: 8.09%, FineWeb-2 (removed): 6.28%, FineWeb-Edu (dedup): 3.92%, The Stack: 2.04%, FineMath: 2.00%, OpenHermes: 2.00%, and FineWiki: 1.35%, plus smaller RedPajama and Nemotron sources. This spans web crawls, PDFs, code, math, synthetic instruction data, and curated corpora (Idahl et al., 12 Feb 2026).
The paper is only partially explicit about the teacher setup. It states that training labels were created using multiple frontier LLMs, and that the reference annotator in evaluation is Gemini-3-Pro with reasoning effort set to "high". It also notes that, because API content filters sometimes refused problematic documents, a small subset was annotated manually. The paper does not provide the exact commercial models used in training, teacher-voting mechanics, supervised sample counts, or training loss formulas (Idahl et al., 12 Feb 2026).
Training uses a 64K context, fp8 mixed precision, and 4 × H100 GPUs, and completes within a few hours per model variant. The paper reports neither loss functions nor calibration equations, which is a gap in its level of disclosure.
5. Evaluation protocol, agreement scores, and large-scale deployment
Evaluation uses a test set of 3,000 documents. For these documents, the authors obtain annotations from Gemini-3-Pro with reasoning effort "high" and treat them as reference labels. The paper explicitly acknowledges that this is not human gold annotation and may reflect shared LLM biases (Idahl et al., 12 Feb 2026).
The metrics vary by property type: Quadratic Weighted Kappa (QWK) for the 11 ordinal properties, F1 for the binary PII field, and Intersection-over-Union (IoU, also known as the Jaccard index) for the 5 multi-select properties. The one-sentence description field is excluded from quantitative evaluation. The paper gives the overall metric as
$\text{overall} = \frac{11}{17} \times \overline{\text{QWK}} + \frac{1}{17} \times \text{F1} + \frac{5}{17} \times \overline{\text{IoU}}$
Read with the overlines as means over the ordinal and multi-select properties respectively, the weights match the property counts, so each of the 17 evaluated properties contributes equally (Idahl et al., 12 Feb 2026).
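The weighting can be made concrete with a small sketch. It implements QWK and IoU from their standard definitions and combines per-property scores with weights 11/17, 1/17, and 5/17; the function names are illustrative, the input scores in the usage lines are made up, and this is not the authors' evaluation code.

```python
def quadratic_weighted_kappa(ref: list[int], pred: list[int], n_labels: int) -> float:
    """QWK for ordinal labels encoded as integers 0..n_labels-1.

    Undefined (division by zero) if both raters use one single identical label.
    """
    n = len(ref)
    observed = [[0] * n_labels for _ in range(n_labels)]
    for r, p in zip(ref, pred):
        observed[r][p] += 1
    ref_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_labels)) for j in range(n_labels)]
    num = den = 0.0
    for i in range(n_labels):
        for j in range(n_labels):
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            num += weight * observed[i][j]                  # observed disagreement
            den += weight * ref_hist[i] * pred_hist[j] / n  # chance disagreement
    return 1.0 - num / den


def jaccard(ref: set[str], pred: set[str]) -> float:
    """IoU of two label sets; two empty sets count as full agreement."""
    if not ref and not pred:
        return 1.0
    return len(ref & pred) / len(ref | pred)


def overall_score(qwk_scores: list[float], f1: float, iou_scores: list[float]) -> float:
    """Combine 11 QWK scores, one F1 score, and 5 mean-IoU scores."""
    if len(qwk_scores) != 11 or len(iou_scores) != 5:
        raise ValueError("expected 11 ordinal and 5 multi-select properties")
    mean_qwk = sum(qwk_scores) / 11
    mean_iou = sum(iou_scores) / 5
    return (11 * mean_qwk + 1 * f1 + 5 * mean_iou) / 17


# Made-up inputs, purely to exercise the functions.
print(quadratic_weighted_kappa([0, 1, 2], [0, 1, 1], n_labels=3))
print(jaccard({"scientific", "data_heavy"}, {"scientific"}))
print(overall_score([0.8] * 11, 0.9, [0.7] * 5))
```

In practice a library routine such as scikit-learn's `cohen_kappa_score` with `weights="quadratic"` would usually replace the hand-written QWK.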
The reported overall scores are 0.729 for propella-1-0.6b, 0.737 for propella-1-1.7b, and 0.779 for propella-1-4b. The paper states that propella-1-4b exceeds Gemini-3-Flash and all open-weight baselines despite being significantly smaller, and that it achieves higher agreement than much larger general-purpose models. A precision comparison is also reported: for the 4B model, bf16 0.779 versus fp8 0.783, a difference of +0.51%; for the 1.7B model, bf16 0.737 versus fp8 0.731, a difference of −0.81% (Idahl et al., 12 Feb 2026).
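The precision differences appear to be relative changes with respect to the bf16 score. That reading is an assumption, but the arithmetic below reproduces the reported percentages under it.

```python
def relative_change_percent(baseline: float, other: float) -> float:
    """Relative change of `other` with respect to `baseline`, in percent."""
    return (other - baseline) / baseline * 100


# (bf16, fp8) overall scores as reported.
scores = {"propella-1-4b": (0.779, 0.783), "propella-1-1.7b": (0.737, 0.731)}
changes = {model: relative_change_percent(bf16, fp8) for model, (bf16, fp8) in scores.items()}
for model, change in changes.items():
    print(f"{model}: {change:+.2f}%")
```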
For production use, the deployment evidence is unusually explicit. The paper states that the system was scaled across thousands of GPUs using inference-hive, and one reported run used 3,936 A100 GPUs to annotate about 500 million FineWeb-2 documents in ~3.5 hours. This suggests that the design target is not merely annotation quality but annotation at web-corpus scale.
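A back-of-the-envelope check shows that this run is consistent with the single-GPU numbers. The sketch assumes that the run used propella-1-4b, that the reported A100 rate of 10.3 docs/s applies to every GPU, and that throughput scales linearly with GPU count; none of these assumptions is confirmed for this specific run.

```python
gpus = 3_936
docs = 500_000_000
a100_docs_per_second = 10.3  # assumption: reported A100 80GB rate for the 4B model

fleet_rate = gpus * a100_docs_per_second      # docs/s across all GPUs
estimated_hours = docs / fleet_rate / 3600    # wall-clock hours under linear scaling
print(f"{fleet_rate:,.0f} docs/s across the fleet, about {estimated_hours:.1f} hours")
```

The estimate lands slightly under the reported ~3.5 hours, which leaves a small margin for scheduling and I/O overhead.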
6. Propella-annotations and corpus-level analysis
A major contribution accompanying the models is propella-annotations, a released dataset of 3,005,080,817 document annotations generated with propella-1-4b. The source breakdown is: FineWeb-2: 1,632,650,735, HPLT 3.0: 694,920,477, FinePDFs: 365,048,869, Nemotron-CC (EN high-quality split): 155,688,999, SYNTH: 77,908,583, finewiki: 43,097,138, and German Commons: 35,716,016. While the models support 57 languages, the released annotation dataset currently covers English plus 14 European languages, with the largest volumes in German, Spanish, French, and Italian (Idahl et al., 12 Feb 2026).
Each record contains the full set of 18 annotations keyed by the document identifier from the original corpus, allowing metadata to be joined back to the source corpus without rehosting source text. The licensing is Apache 2.0 for model weights and CC-BY-4.0 for annotations, with release on the Hugging Face Hub (Idahl et al., 12 Feb 2026).
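The join back to a source corpus can be sketched as a dictionary lookup on the document identifier. Everything concrete here is hypothetical: the key name `id`, the identifiers, the sample texts, and the JSON Lines layout are illustrative assumptions, not the released dataset's actual schema.

```python
import json

# Hypothetical source corpus rows; the key name "id" is an assumption.
source_rows = [
    {"id": "fw2-0001", "text": "Eine Einfuehrung in lineare Algebra."},
    {"id": "fw2-0002", "text": "Jetzt kaufen! Nur heute 50% Rabatt."},
    {"id": "fw2-0003", "text": "A document without an annotation."},
]

# Hypothetical annotation records in JSON Lines form, keyed by the same ID.
annotation_lines = [
    '{"id":"fw2-0001","content_quality":"excellent","commercial_bias":"none"}',
    '{"id":"fw2-0002","content_quality":"unacceptable","commercial_bias":"pure_marketing"}',
]

# Index annotations by document ID, dropping the ID from the stored payload.
annotations = {}
for line in annotation_lines:
    record = json.loads(line)
    annotations[record.pop("id")] = record

# Inner join: keep only source rows that have an annotation.
joined = [
    {**row, "annotation": annotations[row["id"]]}
    for row in source_rows
    if row["id"] in annotations
]
for row in joined:
    print(row["id"], row["annotation"]["content_quality"])
```

At corpus scale the same join would normally run in a columnar engine rather than in memory, but the keying logic is the same.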
The paper presents three case studies to argue that multi-property annotation reveals structure hidden by scalar scoring. In German multi-source profiling, FinePDFs has 21.4% “excellent” quality versus 2.4% in FineWeb-2, about 12× more content with analytical reasoning indicators, and 7.4% high educational value versus 0.6% in FineWeb-2; HPLT 3.0 shows more content fragments and degraded documents than FineWeb-2. In a Nemotron-CC audit, even the “high” tier still contains documents with heavy or pure-marketing commercial bias, thin information density, and incomplete content integrity. In FineWeb-2, the paper reports notable cross-language variation in content quality, commercial bias, educational value, and content type composition, with commercial bias and information density showing the largest variation, suggesting that multilingual curation likely needs language-specific thresholds (Idahl et al., 12 Feb 2026).
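Profiling of this kind reduces to computing label shares per group. The sketch below shows the idea on a toy sample; the `source` key and the records are hypothetical, and the resulting shares do not correspond to the percentages reported in the paper.

```python
from collections import Counter, defaultdict

# Toy annotation records; the "source" key is an assumed join result.
records = [
    {"source": "FinePDFs", "content_quality": "excellent"},
    {"source": "FinePDFs", "content_quality": "good"},
    {"source": "FineWeb-2", "content_quality": "good"},
    {"source": "FineWeb-2", "content_quality": "good"},
    {"source": "FineWeb-2", "content_quality": "excellent"},
    {"source": "FineWeb-2", "content_quality": "unacceptable"},
]


def label_shares(rows, group_key, prop):
    """Return {group: {label: share}} for one annotated property."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[group_key]][row[prop]] += 1
    return {
        group: {label: n / sum(counter.values()) for label, n in counter.items()}
        for group, counter in counts.items()
    }


shares = label_shares(records, "source", "content_quality")
for source, dist in shares.items():
    print(source, f"excellent: {dist.get('excellent', 0.0):.0%}")
```

Swapping `group_key` for a language field gives the cross-language comparison described for FineWeb-2.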
These analyses are central to the paper’s conceptual claim. The release is not only an annotation model and not only a metadata dataset; it is also an empirical argument that pretraining corpora differ substantially in quality, reasoning depth, and composition along axes obscured by scalar filtering.
7. Limitations, cautions, and significance
The paper is explicit about several limitations. Evaluation is against Gemini-3-Pro, not humans; teacher-generated training labels and evaluation labels may share biases; LLM judges may prefer outputs similar to their own generations; multilingual judging reliability may vary by language, especially low-resource ones; underrepresented languages and content types may be less consistent; and the rubric reflects one team’s design choices. Most importantly, the paper states that it has not yet shown downstream pretraining gains from Propella-based filtering (Idahl et al., 12 Feb 2026).
It also cautions that safety and PII annotations are useful signals, not substitutes for dedicated moderation or privacy review. This is a significant boundary condition. A plausible misconception is to treat Propella-1 as a general-purpose safety or governance system; the paper instead presents it as a structured annotation layer within a broader curation pipeline.
Operationally, the intended workflow is straightforward: run Propella-1 over a raw or partially filtered corpus, store JSON metadata keyed by document ID, profile the corpus by language, source, domain, or property distributions, define compositional filtering rules, and sample or weight documents according to those structured conditions. The paper gives examples such as keeping documents with high reasoning_indicators, low commercial_bias, and no_pii, or building scientific subsets using technical_content ∈ {scientific,data_heavy} with content_quality ∈ {good,excellent} (Idahl et al., 12 Feb 2026).
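The two example rules can be expressed as plain predicates over annotation records. This is a minimal sketch under several assumptions: "high reasoning" is mapped to the label `analytical` and "low commercial bias" to `none` (the strictest reading), `technical_content` is assumed to be stored as a list, and the sample records and their `id` values are hypothetical.

```python
def keep_for_reasoning(record: dict) -> bool:
    """High reasoning, low commercial bias, and no PII (strictest reading)."""
    return (
        record["reasoning_indicators"] == "analytical"
        and record["commercial_bias"] == "none"
        and record["pii_presence"] == "no_pii"
    )


def in_scientific_subset(record: dict) -> bool:
    """technical_content overlaps {scientific, data_heavy} and quality is good or better."""
    return (
        bool(set(record["technical_content"]) & {"scientific", "data_heavy"})
        and record["content_quality"] in {"good", "excellent"}
    )


# Hypothetical annotation records.
records = [
    {"id": "a", "reasoning_indicators": "analytical", "commercial_bias": "none",
     "pii_presence": "no_pii", "technical_content": ["scientific"],
     "content_quality": "excellent"},
    {"id": "b", "reasoning_indicators": "analytical", "commercial_bias": "pure_marketing",
     "pii_presence": "no_pii", "technical_content": ["non_technical"],
     "content_quality": "good"},
    {"id": "c", "reasoning_indicators": "none", "commercial_bias": "none",
     "pii_presence": "no_pii", "technical_content": ["data_heavy", "code_heavy"],
     "content_quality": "good"},
]

reasoning_ids = [r["id"] for r in records if keep_for_reasoning(r)]
scientific_ids = [r["id"] for r in records if in_scientific_subset(r)]
print(reasoning_ids, scientific_ids)
```

Because each rule names its fields explicitly, thresholds can be loosened per language or per source without retraining anything, which is the practical advantage over a single scalar cutoff.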
In that sense, Propella-1 marks a shift from “one document = one quality number” to “one document = a structured profile over 18 curated properties.” The paper’s strongest claim is not merely that a small model can imitate a larger annotator, but that compositional metadata can expose and operationalize distinctions that single-score pipelines systematically collapse: reasoning versus pedagogy, density versus polish, objectivity versus marketing, and language versus geography.