Propella-Annotations: Multi-Property LLM Dataset

Updated 4 July 2026

Propella-Annotations is a large-scale dataset of structured, multi-property document annotations designed for LLM pretraining and fine-tuning.
The system replaces a single-score paradigm with 18 properties across six categories to enable flexible, interpretable filtering and rebalancing.
It supports actionable corpus curation by providing detailed metrics on quality, safety, and language relevance at billion-document scale.

Propella-Annotations is a large-scale, open dataset of structured, multi-property document annotations designed specifically for curating LLM pretraining and fine-tuning corpora. Produced by the propella-1 family of small multilingual annotator LLMs, it replaces the single-score paradigm of corpus filtering with 18 properties organized into six categories, yielding per-document JSON annotations that support flexible, interpretable filtering, rebalancing, and exclusion decisions at billion-document scale. The release contains over three billion document-level annotations spanning several major pretraining sources, and is described as the largest open release of its kind (Idahl et al., 12 Feb 2026).

1. Rationale and problem setting

Propella-Annotations is motivated by two limitations of single-score curation pipelines. First, a scalar quality score collapses multiple dimensions into one opaque number, conflating reasoning depth, commercial bias, information density, safety and compliance, and geographic relevance. Second, many pipelines and regressors are trained and tuned for English only, creating monolingual bias in multilingual curation settings. Propella-Annotations addresses both limitations by assigning each document 18 complementary properties across six categories, with typed fields and enumerated value sets where applicable, so that practitioners can compose explicit predicates rather than rely on a single threshold (Idahl et al., 12 Feb 2026).

The system’s practical claim is not merely that more labels are available, but that these labels are structurally separable. Technical specifications or legal analysis can have low educational value but high reasoning relevance; conversely, high educational scores can coexist with heavy marketing bias or PII presence. Propella-Annotations therefore supports filters such as “reasoning-rich AND high information density AND no PII AND low commercial bias AND relevant to target geography,” making trade-offs explicit rather than implicit.

This design places Propella-Annotations within data-centric LLM training rather than within conventional document ranking. Its annotations are intended to inform what to keep, rebalance, or exclude from a training mix, and its broader significance lies in enabling transparent, governance-aware curation across multilingual corpora.

2. Annotation schema, categories, and corpus coverage

The dataset consists of per-document JSON annotations covering six categories and 18 properties. Each property takes one of four value types: ordinal, binary, multi-select, or free text. Enumerated value sets are provided where applicable, enabling interpretable analytics and precise downstream filtering (Idahl et al., 12 Feb 2026).

Category	Properties
Core content	Content Integrity, Content Ratio, Content Length
Classification	One-Sentence Description, Content Type, Business Sector, Technical Content
Quality and value	Content Quality, Information Density, Educational Value, Reasoning Indicators
Audience and purpose	Audience Level, Commercial Bias, Time-Sensitivity
Safety and compliance	Content Safety, PII Presence
Geographic relevance	Regional Relevance, Country Relevance

Several properties have especially consequential semantics for corpus curation. Content Integrity measures completeness and technical readability using four ordinal values: complete, mostly_complete, fragment, and severely_degraded. Content Ratio captures the proportion of substantive content versus navigation or UI. Quality and value are decomposed into Content Quality, Information Density, Educational Value, and Reasoning Indicators. Audience and purpose include Audience Level, Commercial Bias, and Time-Sensitivity. Safety and compliance include Content Safety and PII Presence. Geographic relevance is modeled through both Regional Relevance, with 14 enumerated region values, and Country Relevance, which is multi-select free text constrained to ISO-3166 country names and also supports supranational and none.
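
To illustrate how typed fields with enumerated values can be handled in code, the following minimal Python sketch encodes the four Content Integrity levels listed above and checks a Country Relevance selection. The function names, the rank convention (0 for complete), the exact spelling of the special values, and the `known_countries` argument are assumptions made for illustration and are not taken from the release.

```python
# Content Integrity levels in the order given in the text (assumed best to worst).
CONTENT_INTEGRITY_LEVELS = ("complete", "mostly_complete", "fragment", "severely_degraded")

# Special Country Relevance values named in the text; exact spelling is assumed.
SPECIAL_COUNTRY_VALUES = {"supranational", "none"}


def integrity_rank(value: str) -> int:
    """Return the ordinal position of a Content Integrity value (0 = complete)."""
    if value not in CONTENT_INTEGRITY_LEVELS:
        raise ValueError(f"unknown content_integrity value: {value!r}")
    return CONTENT_INTEGRITY_LEVELS.index(value)


def is_valid_country_relevance(values: list[str], known_countries: set[str]) -> bool:
    """Check that a non-empty multi-select holds only allowed entries."""
    allowed = known_countries | SPECIAL_COUNTRY_VALUES
    return len(values) > 0 and all(v in allowed for v in values)


# Hypothetical country list; a real check would use the full ISO 3166 name list.
countries = {"Germany", "Finland", "Sweden"}
print(integrity_rank("fragment"))                                          # 2
print(is_valid_country_relevance(["Germany", "supranational"], countries))  # True
print(is_valid_country_relevance(["Atlantis"], countries))                  # False
```

Keeping the levels in an ordered tuple makes distance-aware comparisons possible, for example keeping only documents whose rank is at most 1.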

The release covers several major pretraining corpora. FineWeb-2 contributes 1,632,650,735 annotations across DE, ES, FR, IT, SV, and FI; HPLT 3.0 contributes 694,920,477 across DE and FI; FinePDFs contributes 365,048,869 across English and 14 European languages; Nemotron-CC contributes 155,688,999 from its English high-quality split; SYNTH contributes 77,908,583 multilingual synthetic SFT and conversation records; finewiki contributes 43,097,138 multilingual encyclopedic documents; and German Commons contributes 35,716,016 German documents, for a total of 3,005,080,817 annotations. The underlying models support 57 languages, while the current release covers English and 14 European languages.

3. propella-1 models and annotation pipeline

Propella-Annotations is generated by the propella-1 model family, which comprises three decoder-only models based on Qwen-3 with 0.6B, 1.7B, and 4B parameters. These models are trained on 57 languages with about 35% English and the remainder spread across European languages, Arabic, Chinese, Japanese, Korean, Thai, and others. They handle long documents natively up to 64K context and are fine-tuned to emit strict JSON conforming to a predefined schema. The output is compact JSON with no whitespace, explicitly to minimize output tokens (Idahl et al., 12 Feb 2026).
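
The effect of emitting compact JSON with no whitespace can be shown with Python's standard `json` module. This is only a sketch of the serialization idea: the two-field record is a hypothetical fragment, not a full annotation, and the key spellings are assumed.

```python
import json

# Hypothetical partial annotation; a real record carries all 18 properties.
annotation = {"content_integrity": "complete", "pii_presence": "no_pii"}

# Default-style pretty printing versus whitespace-free separators.
pretty = json.dumps(annotation, indent=2)
compact = json.dumps(annotation, separators=(",", ":"))

print(compact)
print(len(pretty), len(compact))
```

Every character of whitespace removed is output the annotator model does not have to generate, which matters when the same schema is emitted billions of times.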

Training labels were produced using frontier commercial LLMs prompted with the full rubric and strict schema, while a small subset was manually annotated where API filters refused problematic content. Fine-tuning was performed in fp8 mixed precision with 64K context on 4× H100 GPUs per variant, completing in hours per model. At serving time, structured output is enforced by SGLang with llguidance, which guarantees schema-conformant JSON with correct enumerated values and removes the need for post-hoc validation or retries.

Throughput is dominated by prefill. For the 4B model on H100 in fp8, reported performance is 27.0 documents per second per GPU, corresponding to approximately 10.3 GPU-hours per million documents, with prompt throughput of approximately 50.1K tokens per second and output throughput of approximately 3.9K tokens per second. Other measured configurations are the 4B model on A100 at 10.3 documents per second (27.0 GPU-hours per million documents); the 4B model on H100 in bf16 at 22.4 documents per second (12.4 GPU-hours per million); the 1.7B model on H100 in fp8 at 39.1 documents per second (7.1 GPU-hours per million); and the 0.6B model on H100 at 39.9 documents per second (7.0 GPU-hours per million).
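
The relationship between the two reported quantities is a unit conversion, which the short Python check below makes explicit. The function name and dictionary labels are illustrative; the numbers are the ones quoted above.

```python
def gpu_hours_per_million(docs_per_second: float) -> float:
    """Convert per-GPU throughput into GPU-hours needed for one million documents."""
    return 1_000_000 / docs_per_second / 3600


# (documents per second per GPU, reported GPU-hours per million documents)
reported = {
    "4B H100 fp8": (27.0, 10.3),
    "4B A100": (10.3, 27.0),
    "4B H100 bf16": (22.4, 12.4),
    "1.7B H100 fp8": (39.1, 7.1),
    "0.6B H100": (39.9, 7.0),
}

for name, (rate, hours) in reported.items():
    print(f"{name}: computed {gpu_hours_per_million(rate):.1f}, reported {hours}")
```

All five computed values agree with the reported figures after rounding to one decimal place.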

The release also reports cluster-scale execution. Using inference-hive over a SLURM cluster, one deployment ran propella-1-4b on 3,936 A100 GPUs to annotate approximately 500 million FineWeb-2 documents in approximately 3.5 hours, or roughly 142.9 million documents per hour across the cluster. Deduplication, additional post-processing, and model calibration beyond strict schema enforcement are not described for the release.
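
As a rough consistency check on these cluster figures, the arithmetic can be worked through in a few lines of Python. The variable names are illustrative, and the per-GPU rate is derived here rather than reported for this run.

```python
documents = 500_000_000   # approximate FineWeb-2 documents annotated
hours = 3.5               # approximate wall-clock duration
gpus = 3_936              # A100 GPUs in the deployment

cluster_docs_per_hour = documents / hours
per_gpu_docs_per_second = documents / (hours * 3600) / gpus

print(f"{cluster_docs_per_hour / 1e6:.1f} million documents per hour")
print(f"{per_gpu_docs_per_second:.2f} documents per second per GPU")
```

The implied rate of roughly 10 documents per second per GPU is close to the 10.3 documents per second measured for the 4B model on a single A100, suggesting near-linear scaling under these approximate inputs.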

4. Evaluation protocol and agreement results

Evaluation is conducted against a reference annotator rather than against direct human adjudication. The reference system is Gemini-3-Pro with reasoning effort set to “high,” evaluated on a 3,000-document test set. The one-sentence description field is excluded from quantitative scoring. Metrics are assigned by property type: Quadratic Weighted Kappa for the 11 ordinal properties, F1 for the binary PII Presence property, and Intersection-over-Union averaged across samples for the five multi-select properties. The overall score is defined as a weighted average by property-type counts, with weights $11/17$, $1/17$, and $5/17$ assigned to the means of QWK, F1, and IoU respectively (Idahl et al., 12 Feb 2026).
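
The overall score described above is a count-weighted mean of the three per-type means. A minimal Python sketch follows; the function and parameter names are assumptions, and the input values in the example are arbitrary placeholders rather than results from the paper.

```python
def overall_score(mean_qwk: float, mean_f1: float, mean_iou: float,
                  n_ordinal: int = 11, n_binary: int = 1, n_multi: int = 5) -> float:
    """Weight each metric mean by the number of properties of that type."""
    total = n_ordinal + n_binary + n_multi  # 17 scored properties
    return (n_ordinal * mean_qwk + n_binary * mean_f1 + n_multi * mean_iou) / total


# Arbitrary placeholder inputs, chosen only to show the weighting.
print(round(overall_score(0.5, 1.0, 0.2), 4))  # (5.5 + 1.0 + 1.0) / 17 = 0.4412
```

Because 11 of the 17 scored properties are ordinal, the QWK mean carries most of the weight, while the single binary property contributes only 1/17.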

The principal reported result is that propella-1-4b reaches an overall score of 0.779, exceeding Gemini-3-Flash and all larger open-weight baselines on this task. The 0.6B model reaches 0.729, which the paper presents as strong quality for a small annotator model. Precision-mode comparisons are close: the 4B model scores 0.779 in bf16 and 0.783 in fp8, a $+0.51\%$ difference; the 1.7B model scores 0.737 in bf16 and 0.731 in fp8, a $-0.81\%$ difference. Per-property breakdowns are reported as robust across all evaluated properties, with full figures deferred to the appendices.

The evaluation section also clarifies the metric definitions employed in the paper’s discussion. F1 is the standard harmonic mean of precision and recall. IoU is the Jaccard overlap $\frac{|A \cap B|}{|A \cup B|}$ for predicted and reference multi-label sets. QWK weights disagreement by squared ordinal distance. These choices are consistent with the heterogeneity of the schema: some properties require distance-aware ordinal comparison, some require exact binary discrimination, and some require set overlap.
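
The three metrics can be sketched in plain Python as follows. This is a textbook-style implementation under stated assumptions, not the paper's evaluation code: ordinal labels are assumed to be pre-encoded as integers from 0 to `n_levels - 1`, two empty label sets are assumed to count as perfect overlap, and a degenerate QWK with zero expected disagreement is assumed to return 1.0.

```python
def f1_binary(predicted: list[bool], reference: list[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum(r and not p for p, r in zip(predicted, reference))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def iou(predicted: set, reference: set) -> float:
    """Jaccard overlap |A intersect B| / |A union B| for two label sets."""
    union = predicted | reference
    if not union:
        return 1.0  # assumption: two empty selections agree fully
    return len(predicted & reference) / len(union)


def quadratic_weighted_kappa(a: list[int], b: list[int], n_levels: int) -> float:
    """Cohen's kappa with disagreement weighted by squared ordinal distance."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = [[0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            num += weight * observed[i][j]               # observed disagreement
            den += weight * hist_a[i] * hist_b[j] / n    # chance disagreement
    return 1.0 if den == 0 else 1.0 - num / den


print(f1_binary([True, True, False, False], [True, False, True, False]))  # 0.5
print(iou({"a", "b"}, {"b", "c"}))
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4))            # 1.0
```

The squared-distance weight is what makes QWK distance-aware: confusing adjacent levels is penalized far less than confusing the two extremes.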

5. Multi-dimensional corpus analysis and curation workflows

The paper’s main substantive claim is that multi-property annotation reveals corpus composition differences that single-score filtering obscures. Three case studies are reported. In German multi-source profiling, FinePDFs shows markedly higher “excellent” content quality than FineWeb-2, at 21.4% versus 2.4%; FinePDFs also has approximately 12× more content flagged with analytical reasoning indicators; and “high” educational value appears in 7.4% of FinePDFs versus 0.6% of FineWeb-2. HPLT 3.0 exhibits more fragments and degraded content than FineWeb-2. In Nemotron-CC quality-tier auditing, higher tiers correlate with better content quality and information density on average, but even the “high” tier includes heavy or pure_marketing commercial bias, thin information density, and incomplete content integrity. In cross-language FineWeb-2 analysis, distributions differ substantially by language for content quality, commercial bias, educational value, and content-type composition, with the largest reported cross-language variations occurring in commercial bias and information density (Idahl et al., 12 Feb 2026).
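
Profiles of this kind reduce to per-source value shares for a given property. The Python sketch below shows one way to compute them; the record layout, the `source` key, the corpus names, and the placeholder level `other` are hypothetical, and the toy numbers are not the paper's data.

```python
from collections import Counter, defaultdict


def value_shares(records: list[dict], prop: str) -> dict[str, float]:
    """Fraction of records taking each value of a single-valued property."""
    counts = Counter(r[prop] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def shares_by_source(records: list[dict], prop: str,
                     source_key: str = "source") -> dict[str, dict[str, float]]:
    """Group records by source corpus, then compute value shares per group."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        grouped[record[source_key]].append(record)
    return {src: value_shares(group, prop) for src, group in grouped.items()}


# Toy records; "other" stands in for any non-excellent quality level.
toy = [
    {"source": "corpus_a", "content_quality": "excellent"},
    {"source": "corpus_a", "content_quality": "other"},
    {"source": "corpus_b", "content_quality": "other"},
    {"source": "corpus_b", "content_quality": "other"},
]
print(shares_by_source(toy, "content_quality"))
```

Running the same function over several properties and sources yields the kind of side-by-side comparison reported in the case studies.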

These results motivate typed, compositional filtering rather than scalar thresholding. The paper gives several example predicates. A high-information, reasoning-rich, safe, and low-commercial-bias corpus can be defined by requiring information_density in $\{\text{dense}, \text{adequate}\}$, reasoning_indicators in $\{\text{analytical}, \text{explanatory}\}$, content_safety in $\{\text{safe}, \text{mild\_concerns}\}$, pii_presence = no_pii, and commercial_bias in $\{\text{none}, \text{minimal}\}$. A pedagogically oriented technical subset can be defined by restricting educational_value to a chosen set of levels, requiring technical_content to include any of a chosen set of technical categories, and requiring content_type to include instructional or technical_documentation. A geography-aware mix can require regional_relevance to include european and country_relevance to include specific countries or supranational values.
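
The first and third predicates above translate directly into boolean functions over an annotation record. The sketch below assumes that single-valued properties are stored as strings and multi-select properties as lists under the snake_case keys used in the text; the function names and the toy records are hypothetical.

```python
def is_high_value_safe(doc: dict) -> bool:
    """High information, reasoning-rich, safe, no PII, low commercial bias."""
    return (
        doc["information_density"] in {"dense", "adequate"}
        and doc["reasoning_indicators"] in {"analytical", "explanatory"}
        and doc["content_safety"] in {"safe", "mild_concerns"}
        and doc["pii_presence"] == "no_pii"
        and doc["commercial_bias"] in {"none", "minimal"}
    )


def is_european(doc: dict) -> bool:
    """Geography-aware check on the multi-select Regional Relevance field."""
    return "european" in doc["regional_relevance"]


# Toy record built only from values named in the text.
good = {
    "information_density": "dense",
    "reasoning_indicators": "analytical",
    "content_safety": "safe",
    "pii_presence": "no_pii",
    "commercial_bias": "minimal",
    "regional_relevance": ["european"],
}
# Same record with a commercial-bias level outside the allowed set.
marketing = {**good, "commercial_bias": "pure_marketing"}

kept = [d for d in (good, marketing) if is_high_value_safe(d) and is_european(d)]
print(len(kept))  # 1
```

Because each clause names one property and its allowed values, a curation decision can be audited or adjusted clause by clause, which a single scalar threshold does not allow.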

No specific numeric thresholds are prescribed. Recommended practice is to work directly with categorical levels and combine properties according to task-specific goals. Because records are keyed to source corpus document IDs, the annotations can be joined back to original text collections without redistributing the text itself. The paper also states that the dataset’s small footprint supports fast, iterative filtering and analysis.
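
Joining annotations back to text by document ID can be sketched as a simple lookup. The key name `id`, the in-memory dictionary of texts, and the example records are assumptions for illustration; the actual ID field and storage layout depend on the source corpus.

```python
from typing import Iterable, Iterator


def join_annotations(texts: dict[str, str], annotations: Iterable[dict],
                     id_key: str = "id") -> Iterator[dict]:
    """Attach source text to each annotation whose document ID is available locally."""
    for annotation in annotations:
        text = texts.get(annotation[id_key])
        if text is not None:
            yield {**annotation, "text": text}


# Hypothetical local copy of a source corpus, keyed by document ID.
texts = {"doc-1": "First document body.", "doc-2": "Second document body."}
annotations = [
    {"id": "doc-1", "pii_presence": "no_pii"},
    {"id": "doc-3", "pii_presence": "no_pii"},  # text not held locally, skipped
]

joined = list(join_annotations(texts, annotations))
print(joined)
```

Filtering on the annotations first and joining afterwards keeps the expensive text lookup limited to the documents that survive the predicates.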

6. Access, licensing, and reproducibility

Propella-Annotations is released at hf.co/datasets/openeurollm/propella-annotations under CC-BY-4.0, with records keyed to source corpus document IDs. The propella-1 models are released at hf.co/collections/ellamind/propella-1 under Apache 2.0, alongside serving configurations and prompts. Structured output serving uses SGLang and llguidance, while large-scale orchestration uses inference-hive for SLURM clusters. The complete rubric is released with the model weights and includes property definitions, enumerations, decision trees, and language-specific guidelines (Idahl et al., 12 Feb 2026).

The reproducibility posture is unusually explicit for a corpus-annotation release. The paper emphasizes the co-release of open-weight models, serving configurations, rubrics, and code. It also notes that schema-conformant JSON is intended to obviate downstream cleaning at the annotation layer. At the same time, some operational details are left unspecified: files, shards, compression, and schema versioning are not explicitly detailed in the paper, although the dataset card on Hugging Face is said to provide usage examples and practical guidance.

Licensing is permissive but differentiated. Model weights are Apache 2.0 and permit commercial use. Annotations are CC-BY-4.0 and therefore require attribution. The combination of open weights and open annotations is central to the release’s claim of practical reproducibility.

Propella-Annotations is primarily a large-scale dataset release, but adjacent work in annotation systems supplies a broader conceptual frame. Textarium presents a web-based environment in which annotation, abstraction, and argumentation co-evolve through parameterized, shareable visualization states serialized into URL hashes; interpretive actions are embedded directly into scrollytold essays as inspectable evidence (Proff et al., 16 Sep 2025). Adamite, a documentation annotation tool for programmers, shows that short, contextual annotations with multi-anchoring, pinning, and typed states can materially affect learning outcomes: readers of annotations completed 67% more of the task, on average, than the baseline (Horvath et al., 2021). AnnoGram reifies annotations as first-class declarative elements in a Grammar of Graphics, using target-and-effect semantics and a placement pipeline resolved against the compiled scene graph (Rahman et al., 6 Jul 2025). “Learning from Imperfect Annotations” proposes an end-to-end framework that merges aggregation with model training, models annotator competence and example difficulty, reports accuracy gains of up to 25%, and matches the best alternative with up to 4× less redundancy (Platanios et al., 2020). This broader literature suggests that Propella-Annotations can be understood not only as a static dataset but also as part of a wider research program on interpretable, structured, and uncertainty-aware annotation.

The current release also states several limitations. Agreement is measured against Gemini-3-Pro as a model-as-judge reference annotator rather than against human adjudication, which risks shared biases with the training labels and may overstate agreement. Because propella-1 ingests labels generated by frontier LLMs, bias inheritance remains a concern, especially in underrepresented languages and content types. The paper notes that multilingual LLM-as-judge consistency can be low, that per-language test samples may be too small to quantify such effects robustly, that the rubric was designed by one team and may not capture every curation-relevant dimension, and that no formal evidence is yet presented showing that multi-property filtering improves downstream training outcomes relative to single-score selection. Confidence calibration, per-property uncertainty scores, and deduplication are also absent from the release; outputs are categorical without confidences (Idahl et al., 12 Feb 2026).

Planned extensions are correspondingly concrete. The paper identifies downstream training experiments comparing multi-property filtering with single-score selection, expansion of languages, properties, and evaluators, tighter multilingual reliability studies, possible confidence modeling per property, integration with transformation and synthetic generation pipelines, community feedback loops through open releases and dataset cards, and incremental updates with additional corpora and languages. A plausible implication is that future evolutions of Propella-Annotations may combine its current corpus-scale JSON schema with more interactive, inspectable, and uncertainty-aware annotation workflows, but the present system is defined above all by its multi-property release format and its role in large-scale LLM data curation.