SumeCzech: Czech News Summarization Corpus
- SumeCzech is a large-scale Czech news summarization corpus containing roughly 1 million article–summary pairs from five major outlets with authentic human-written abstracts.
- The dataset features extensive metadata and automated preprocessing, including named-entity annotations, to support both extractive and abstractive evaluation using metrics like ROUGE and ROUGE_NE.
- It serves as a key benchmark for testing advanced neural architectures and cross-lingual summarization methods in a morphologically rich, medium-resource language context.
The SumeCzech dataset is a large-scale, modern Czech news summarization corpus designed to provide a robust benchmark for research in both monolingual and, by extension, cross-lingual text summarization for morphologically rich, medium-resource languages. Its comprehensive size, broad topical coverage, and reliance on high-quality human-authored abstracts have rendered it foundational in the assessment and development of both extractive and abstractive summarization systems for Czech. SumeCzech is used extensively as a testbed for advanced neural architectures, including multilingual LLMs and entity-aware summarization frameworks, and its associated resources have also catalyzed the creation of derivative datasets and evaluation protocols in related work.
1. Corpus Composition and Structure
SumeCzech, introduced by Straka et al. (2018), consists of approximately 1 million article–summary pairs drawn from five major Czech news outlets: České Noviny, Deník, iDNES, Lidovky, and Novinky.cz (Tran et al., 24 Nov 2025, Tran et al., 14 Aug 2025, Marek et al., 2021). Each record comprises a full-text news article, a human-written abstract (either as a headline or a multi-sentence summary), and rich metadata including URL, headline, abstract, article text, subdomain, section, and publication date. The official dataset split is ~86.5% for training (≈865,000 examples), ~4.5% for validation (≈45,000), and ~4.5% for testing (≈45,000), with the remaining ≈4.5% reserved as an out-of-domain test split (see Section 5). The dataset’s average full-text document length is ≈409 words and the average summary is ≈38 words (Tran et al., 24 Nov 2025, Tran et al., 14 Aug 2025).
The annotation relies exclusively on the original news authors’ summaries; no external rewrites, crowdsourcing, or further manual curation (beyond filtering non-empty and sufficiently long fields) was performed. This preserves naturalistic headline-generation and abstract-writing conventions for Czech news.
| Field | Type/Role | Notes |
|---|---|---|
| url | metadata | Article permalink |
| headline | summary | Often used as a single-sentence summary |
| abstract | summary (multi-sentence) | Human-written |
| text | full article | News body |
| section, date | metadata | Enriched domain and time information |
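Because the corpus is distributed as JSONLines (see Section 5), records with this schema can be read with the standard library alone. The following is a minimal sketch; the file name `sumeczech-train.jsonl` is illustrative, not an official artifact.

```python
import json

def iter_records(path):
    """Yield article-summary records from a SumeCzech-style JSONLines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Illustrative usage; the file name is an assumption.
for record in iter_records("sumeczech-train.jsonl"):
    article = record["text"]        # full news body
    abstract = record["abstract"]   # multi-sentence human-written summary
    headline = record["headline"]   # often used as a single-sentence summary
    print(record["url"], record.get("section"), record.get("date"))
    break
```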
2. Preprocessing, Annotation, and Entity Enrichment
The SumeCzech dataset incorporates several preprocessing steps: automatic Czech language detection, duplicate removal by article text, and the exclusion of records with any empty or extremely short fields (Tran et al., 24 Nov 2025, Tran et al., 14 Aug 2025). No further normalization or tokenization is applied at the data release stage; model pipelines typically apply their own tokenization, such as subword methods motivated by Czech’s high morphological richness.
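The filtering described above can be approximated as follows. This is a simplified sketch, not the authors' released pipeline; the `langdetect` dependency and the word-count thresholds are assumptions.

```python
from langdetect import detect  # third-party package, standing in for the original language filter

MIN_TEXT_WORDS = 30      # illustrative thresholds, not the published ones
MIN_ABSTRACT_WORDS = 5

def keep(record, seen_texts):
    """Approximate the published filters: Czech-only, de-duplicated, no empty or very short fields."""
    text = record.get("text", "")
    abstract = record.get("abstract", "")
    headline = record.get("headline", "")
    if not (text and abstract and headline):
        return False                                  # drop records with empty fields
    if len(text.split()) < MIN_TEXT_WORDS or len(abstract.split()) < MIN_ABSTRACT_WORDS:
        return False                                  # drop extremely short fields
    if detect(text[:1000]) != "cs":                   # Czech language detection on a prefix
        return False
    if text in seen_texts:                            # duplicate removal by article text
        return False
    seen_texts.add(text)
    return True
```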
Named-entity annotation was subsequently introduced in derivative work (Marek et al., 2021) using a SpaCy Czech NER model retrained on the CNEC 2.0 corpus, providing seven entity types in IOB2 tagging: numbers in addresses, geographical names, institutions, media names, artifact names, personal names, and time expressions. This enrichment is extensive (abstracts alone contain ≈2.5 million named-entity tokens in the training set), but it is fully automatic (a single tagger applied once), so no inter-annotator agreement figures are available. These annotations enable novel evaluation metrics such as ROUGE_NE, which assesses named-entity overlap in abstractive outputs.
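A sketch of how this enrichment could be reproduced with SpaCy is shown below. The model name `cs_cnec2_ner` is a placeholder: a Czech NER model retrained on CNEC 2.0 is not bundled with SpaCy and would have to be supplied separately.

```python
import spacy

# Placeholder model name; a Czech NER model trained on CNEC 2.0 must be installed separately.
nlp = spacy.load("cs_cnec2_ner")

def iob2_tags(text):
    """Return (token, IOB2 tag) pairs; the tag inventory depends on the loaded model."""
    doc = nlp(text)
    tags = []
    for token in doc:
        tag = "O" if token.ent_iob_ in ("O", "") else f"{token.ent_iob_}-{token.ent_type_}"
        tags.append((token.text, tag))
    return tags

print(iob2_tags("Prezident navštívil v pondělí Prahu."))
```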
3. Formal Definitions and Evaluation Protocols
Let $N$ denote the total number of document–summary pairs (≈1,000,000), $D = \{d_1, \ldots, d_N\}$ the set of articles, and $S = \{s_1, \ldots, s_N\}$ their corresponding summaries. Let $|d_i|$ and $|s_i|$ represent the length (in words or tokens) of $d_i$ and $s_i$, respectively; the corpus-level averages are $\frac{1}{N}\sum_{i=1}^{N}|d_i| \approx 409$ and $\frac{1}{N}\sum_{i=1}^{N}|s_i| \approx 38$ (Tran et al., 24 Nov 2025).
The primary evaluation metrics are ROUGE-N (for $N \in \{1, 2\}$) and ROUGE-L, computed as “raw” scores with no stemming or lemmatization (i.e., exact token overlap). For a system summary $\hat{s}$ and reference summary $s$, precision, recall, and F1 are

$$\mathrm{P} = \frac{|\mathrm{grams}_N(\hat{s}) \cap \mathrm{grams}_N(s)|}{|\mathrm{grams}_N(\hat{s})|}, \qquad \mathrm{R} = \frac{|\mathrm{grams}_N(\hat{s}) \cap \mathrm{grams}_N(s)|}{|\mathrm{grams}_N(s)|}, \qquad \mathrm{F_1} = \frac{2\,\mathrm{P}\,\mathrm{R}}{\mathrm{P} + \mathrm{R}},$$

where $\mathrm{grams}_N(x)$ denotes the multiset of $N$-grams of $x$ and the intersection respects multiplicities.
ROUGE-L is based on longest-common-subsequence (LCS) statistics and is computed analogously. ROUGE_NE (“Named Entity ROUGE”) restricts the token-level overlap to named-entity tokens (IOB2 tagging) (Marek et al., 2021).
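These raw overlap scores can be sketched directly; the official SumeCzech evaluation uses the authors' released scoring script, so the following is only an illustration of the definitions above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(system_tokens, reference_tokens, n=1):
    """Raw ROUGE-N F1: exact token overlap, no stemming or lemmatization."""
    sys_grams, ref_grams = ngrams(system_tokens, n), ngrams(reference_tokens, n)
    if not sys_grams or not ref_grams:
        return 0.0
    overlap = sum((sys_grams & ref_grams).values())   # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(sys_grams.values())
    recall = overlap / sum(ref_grams.values())
    return 2 * precision * recall / (precision + recall)

def rouge_ne_f1(system_tagged, reference_tagged):
    """ROUGE_NE sketch: unigram overlap restricted to tokens inside named entities (non-'O' tags)."""
    sys_ents = [tok for tok, tag in system_tagged if tag != "O"]
    ref_ents = [tok for tok, tag in reference_tagged if tag != "O"]
    return rouge_n_f1(sys_ents, ref_ents, n=1)
```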
No statistical significance testing is reported for ROUGE scores in major reference works (Tran et al., 24 Nov 2025, Tran et al., 14 Aug 2025).
4. Baselines, Neural Architectures, and State-of-the-Art Results
SumeCzech has served as the principal benchmark for evaluating both extractive and abstractive summarization models for Czech:
- Extractive baselines: First sentence, random sentence, TextRank, and the entity-density heuristic (sentence with highest named-entity density); a minimal sketch of two of these appears after this list.
- Abstractive baselines: Neural sequence-to-sequence models (LSTM/GRU-based, with and without entity features), Transformer models (e.g., mT5, mBART), and large-scale LLMs (Mistral 7B with QLoRA fine-tuning) (Tran et al., 14 Aug 2025, Marek et al., 2021).
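Two of the extractive baselines (first sentence and entity density) are simple enough to sketch directly; the sentence splitter below is a naive stand-in for whatever segmenter the original experiments used.

```python
import re

def split_sentences(text):
    """Naive sentence splitter, used only for this illustration."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def first_sentence_baseline(article_text):
    """Return the first sentence of the article as its summary."""
    sentences = split_sentences(article_text)
    return sentences[0] if sentences else ""

def entity_density_baseline(article_text, tagger):
    """Pick the sentence with the highest share of named-entity tokens.
    `tagger` is any callable returning (token, IOB2 tag) pairs, e.g. the SpaCy sketch above."""
    best, best_density = "", -1.0
    for sentence in split_sentences(article_text):
        tagged = tagger(sentence)
        if not tagged:
            continue
        density = sum(tag != "O" for _, tag in tagged) / len(tagged)
        if density > best_density:
            best, best_density = sentence, density
    return best
```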
An overview of recent results (“ROUGE_raw” F1 scores, test set):
| Model/Baseline | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1 |
|---|---|---|---|
| Mistral 7B (M7B-SC) | 21.2 | 5.7 | 15.5 |
| mT5-base (mT5-SC) | 19.2 | 4.6 | 14.1 |
| HT2A-S (mBART) | 18.2 | 4.6 | 13.5 |
| First-sentence | 14.4 | 0.2 | 0.9 |
| Random | 12.7 | 0.1 | 0.8 |
| TextRank | 13.8 | 0.3 | 0.8 |
| Tensor2Tensor | 11.3 | 0.1 | 0.8 |
The introduction of Mistral 7B fine-tuning achieves a +3.0 point gain in ROUGE-1 F1 over the previous best (HT2A-S) (Tran et al., 14 Aug 2025, Tran et al., 24 Nov 2025).
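The Mistral 7B result relies on QLoRA fine-tuning. A configuration sketch using the Hugging Face transformers and peft libraries follows; the adapter hyperparameters and quantization settings are illustrative assumptions, not the published configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "mistralai/Mistral-7B-v0.1"  # base checkpoint; the SumeCzech fine-tune itself is not reproduced here

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

# Low-rank adapters on the attention projections; values are illustrative, not the published ones.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```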
In entity-aware settings, augmenting Seq2Seq architectures with explicit entity features yields minor gains in out-of-domain robustness (Marek et al., 2021).
5. Dataset Derivatives, Cross-Lingual Extensions, and Related Resources
Derived resources and protocols have been developed based on SumeCzech:
- Named Entity annotations (automatic, seven types) for use in entity-focused summarization and ROUGE_NE evaluation (Marek et al., 2021).
- Cross-lingual/monolingual Wikipedia summarization: The XWikis corpus reuses the “SumeCzech” label for a Wikipedia construction, where, for monolingual cs→cs summarization, the Czech Wikipedia article’s full body forms the “document” and the lead paragraph the “summary.” Filtering is based on document length (250–5,000 tokens) and summary length (20–400 tokens), with average document/summary lengths of 890/65 tokens (Perez-Beltrachini et al., 2022).
- Out-of-domain (OOD) test splits: The Straka et al. (2018) split with ≈4.5% reserved for generalization testing, created via K-means clustering of abstracts (Marek et al., 2021); a clustering sketch follows this list.
- Release and format: SumeCzech is distributed in JSONLines format with article–summary pairs, originally released by the Institute of Formal and Applied Linguistics, Charles University, and also via ELRA/LINDAT under research-use terms. The data schema is minimally prescriptive, centered on the key fields listed in Section 1 (Tran et al., 24 Nov 2025, Tran et al., 14 Aug 2025).
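The clustering behind the OOD split can be illustrated as follows. This sketch uses TF-IDF features and scikit-learn's KMeans; the feature choice, cluster count, and held-out cluster ids are assumptions, not a reproduction of the original split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_abstracts(abstracts, n_clusters=25, ood_cluster_ids=(0,)):
    """Cluster abstracts and flag selected clusters as out-of-domain.
    All parameter values here are illustrative assumptions."""
    features = TfidfVectorizer(max_features=50_000).fit_transform(abstracts)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
    held_out = set(ood_cluster_ids)
    is_ood = [label in held_out for label in labels]
    return labels, is_ood
```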
6. Evaluation Challenges and Open Research Directions
Despite improved automatic metrics, major challenges persist:
- Named-entity fidelity: Even leading models achieve low entity-level overlap (ROUGE_NE F1 ≃ 5–6% in-domain, ≃1% out-of-domain), indicating significant limitations in semantic precision and sensitivity to rare/unseen entities (Marek et al., 2021).
- Limited out-of-domain robustness: Although entity-augmented models realize minor gains, OOD generalization remains an active research issue.
- Lack of significance testing: Gains in ROUGE have not been statistically validated in principal reports, an open methodological issue (Tran et al., 24 Nov 2025, Tran et al., 14 Aug 2025).
- Insufficient semantic evaluation: BERTScore and related metrics have not yet seen systematic adoption; current protocols are confined to string overlap.
- Linguistic complexity: The rich morphology of Czech motivates the use of subword tokenization (e.g., SentencePiece), but vocabulary size and detailed feature statistics remain underreported.
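A minimal sketch of the subword step mentioned in the last point, using the sentencepiece package; the input file name, model prefix, and vocabulary size are assumptions, not reported settings.

```python
import sentencepiece as spm

# Train a subword model on a plain-text dump of SumeCzech article texts.
spm.SentencePieceTrainer.train(
    input="sumeczech_train_texts.txt",
    model_prefix="sumeczech_sp",
    vocab_size=32000,
    character_coverage=0.9995,   # keep rare Czech diacritics in the vocabulary
)

sp = spm.SentencePieceProcessor(model_file="sumeczech_sp.model")
print(sp.encode("Vláda schválila nový rozpočet.", out_type=str))
```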
Recommended future work includes adoption of significance testing, incorporation of semantic evaluation metrics, focused error analysis—especially for named entities and date accuracy—and the extension to single-sentence headline (XSum-style) summarization tasks (Tran et al., 24 Nov 2025, Tran et al., 14 Aug 2025).
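One way such semantic evaluation could be instantiated is BERTScore via the bert-score package; this is a sketch of a possible protocol, not an established SumeCzech practice, and the example sentences are invented.

```python
from bert_score import score  # third-party `bert-score` package

candidates = ["Vláda schválila rozpočet na příští rok."]               # system outputs (invented example)
references = ["Kabinet v pondělí odsouhlasil rozpočet pro příští rok."]  # gold abstracts (invented example)

# lang="cs" selects the package's default multilingual backbone; which backbone
# suits Czech news summaries best remains an open question.
P, R, F1 = score(candidates, references, lang="cs")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```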
7. Impact, Accessibility, and Licensing
SumeCzech is a pivotal resource for both the Czech NLP community and the broader field of multilingual summarization, having enabled the demonstration of state-of-the-art performance both for “medium-resource”, morphologically rich languages and for low-resource domain adaptation scenarios. It has also contributed to methodological advances in entity-aware evaluation and in extractive/abstractive hybrid pipelines.
Dataset access is via the Charles University NLP group and the ELRA/LINDAT repositories. Licensing is not explicitly detailed in recent LLM-evaluation works; users are directed to consult the original dataset publications for terms and restrictions (Tran et al., 24 Nov 2025, Tran et al., 14 Aug 2025).