
Multilingual Genre Classification Dataset

Updated 5 December 2025
  • Multilingual genre classification datasets are curated collections that label texts by genre across various languages.
  • They support cross-lingual transfer learning by providing benchmarks for models in areas like digital humanities and computational linguistics.
  • Inclusion of rich linguistic features such as syntactic parses and metaphor counts can improve the accuracy of genre recognition, although the gains are task-dependent.

A multilingual genre classification dataset is a curated, annotated collection of textual data spanning multiple natural languages, explicitly labeled by genre and intended for the training, evaluation, or benchmarking of models for genre recognition. Such datasets are central to research in computational linguistics, cross-lingual transfer learning, digital humanities, music information retrieval, and historical linguistics. The construction and annotation criteria, label taxonomies, and feature representations vary significantly across available resources, reflecting diverse theoretical notions of genre, application domains, and levels of granularity.

1. Benchmark Datasets: Composition and Scope

Datasets for multilingual genre classification differ substantially in size, language coverage, annotation protocol, and domain:

  • Project Gutenberg Multilingual Literary Dataset: Introduced in "LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics," this dataset comprises ~45,000 sentences annotated for three-way canonical literary genre—drama, poetry, and novel—in six major European languages (EN, FR, DE, IT, ES, PT). Each sentence is mapped to one of three binary classification tasks (Poetry/Novel, Drama/Poetry, Drama/Novel) and enriched with explicit computational linguistic features. The source data are public-domain works (Shi et al., 4 Dec 2025).
  • CLASSLA-web South Slavic Web Corpora: The CLASSLA-web resource consists of 13 billion tokens across 26 million documents in seven South Slavic languages. Genre is annotated at the document level using a nine-way taxonomy (news, legal, promotion, forum, prose/lyrical, etc.), with classification performed automatically using a Transformer-based multilingual model fine-tuned on diverse manually labeled corpora (Ljubešić et al., 19 Mar 2024).
  • ArzEn-MultiGenre: A parallel, segment-aligned dataset of Egyptian Arabic and English, covering song lyrics, novels, and TV/film subtitles. It comprises 25,557 segment pairs, each labeled with one of three genres at the segment level (Al-Sabbagh, 2 Aug 2025).
  • Multilingual Scientific Paragraphs: 833,000 research-paper paragraphs in primarily English and French, classified into acknowledgment, data mention, software/code, or clinical trial mention. Class assignment is entirely rule-based without manual adjudication (Jeangirard, 13 Oct 2025).
  • Music Genre Lyrics Dataset: Over 230,000 bilingual (Portuguese-English) song lyrics assigned multiple non-exclusive music genre labels, used for multi-label cross-lingual classification (Tavares et al., 7 Jan 2025).
  • 19th-century Ottoman Turkish and Russian Literary-Critical Corpus: 2,877 documents with multi-level, multi-label manual annotations organized within a four-level taxonomic hierarchy, built for the study of historical and low-resource languages (Gokceoglu et al., 21 Jul 2024).

These datasets differ in genre concept (literary form, functional category, fan-assigned label), unit of annotation (sentence, segment, document, paragraph), and language family/era.
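The pairwise setup described for the Project Gutenberg dataset can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the function name `build_pairwise_tasks`, the `(text, genre)` record shape, and the example sentences are hypothetical, and the sketch places each sentence in every pair that contains its genre, which is only one possible reading of how sentences are assigned to tasks.

```python
from itertools import combinations

GENRES = ("drama", "poetry", "novel")

def build_pairwise_tasks(records):
    """Group (text, genre) records into one binary task per genre pair."""
    tasks = {}
    for a, b in combinations(GENRES, 2):
        # Keep only sentences whose genre is in this pair; genre b gets label 1.
        tasks[(a, b)] = [
            (text, int(genre == b))
            for text, genre in records
            if genre in (a, b)
        ]
    return tasks

# Hypothetical example sentences, one per genre.
records = [
    ("Enter the steward, carrying a lamp.", "drama"),
    ("O quiet sea, keep still tonight.", "poetry"),
    ("The train left the station an hour late.", "novel"),
]
tasks = build_pairwise_tasks(records)
for pair, examples in tasks.items():
    print(pair, len(examples))
```

Each of the three resulting tasks is a balanced-by-construction binary problem only if the underlying genres are equally represented; otherwise per-task class weights or resampling would be needed.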

2. Annotation Schemes and Taxonomies

Annotation taxonomies for genre are corpus- and domain-dependent:

  • Hierarchical Literary Taxonomy (Gokceoglu et al., 21 Jul 2024): Structured multi-level labels capturing both macro categories (literary vs. cultural discourse) and fine sub-genres (e.g., "poetry," "biography," "travels"), permitting multi-labeling through parallel "type" and "subject" slots.
  • Flat Canonical Genres (Shi et al., 4 Dec 2025, Al-Sabbagh, 2 Aug 2025): Single-level categorical labels—e.g., {drama, poetry, novel} or {song, novel, subtitle}.
  • Functional/Domain Genres (Jeangirard, 13 Oct 2025): Paragraph assignment to "acknowledgment," "data," etc., driven by heuristics, not literary or stylistic features.
  • Web/NLP Genre Taxonomy (Ljubešić et al., 19 Mar 2024): Nine classes reflecting information structure and communicative intent: news, instruction, legal, promotion, forum, opinion, information/explanation, prose/lyrical, and other. An additional "mix" label is assigned to ambiguous or multi-genre cases, i.e. when classifier confidence falls below 0.8.
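The confidence rule behind the "mix" label can be illustrated with a small sketch. The label names follow the listing above and the 0.8 threshold comes from the description; the function names, the use of a softmax over raw logits, and the example logits are assumptions made for illustration rather than details of the CLASSLA-web classifier.

```python
import math

LABELS = ["News", "Instruction", "Legal", "Promotion", "Forum",
          "Opinion", "Information/Explanation", "Prose/Lyrical", "Other"]
MIX_THRESHOLD = 0.8  # confidence below this value yields "Mix"

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def assign_genre(logits, threshold=MIX_THRESHOLD):
    """Return (label, confidence); fall back to "Mix" when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "Mix", probs[best]
    return LABELS[best], probs[best]

confident = assign_genre([10.0, 0, 0, 0, 0, 0, 0, 0, 0])
ambiguous = assign_genre([1.0, 1.0, 0, 0, 0, 0, 0, 0, 0])
print(confident[0], ambiguous[0])
```

A thresholded fallback of this kind trades coverage for label purity: raising the threshold moves more documents into "Mix" and leaves the remaining classes cleaner.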

The existence of both single-label (mutually exclusive) and multi-label (overlapping) schemes reflects divergent theoretical approaches and the realities of genre boundary fuzziness in many corpora.

3. Feature Extraction and Data Representation

Leading datasets offer not only raw texts and labels but also explicit, computationally derived linguistic features to support genre-informed modeling:

  • Project Gutenberg Dataset (Shi et al., 4 Dec 2025):
    • Syntactic Structure: spaCy+Benepar-based constituency parses, summarized by tree depth $d$ and normalized depth-to-length ratio $r = d/|S|$.
    • Metaphor Counts: Token-level metaphor annotation using a fine-tuned RoBERTa, then collapsed to a sentence-level metaphor count $m$.
    • Phonetic Metre: PoetryTools-derived stress vectors (binary patterns for up to 20 syllables), with optional measures such as vowel fraction and metrical regularity.
    • All features are concatenated to base token/sequence embeddings for model input.
  • CLASSLA-web (Ljubešić et al., 19 Mar 2024):
    • Document-level linguistic annotation with CLASSLA-Stanza, providing POS tags, dependencies, and lemmatization.
    • Genre is assigned by an XLM-RoBERTa-based classifier using first-token pooling and gold-mapped training sets.
  • Music Lyrics Dataset (Tavares et al., 7 Jan 2025):
    • sBERT multilingual sentence-level embeddings (dimension 768), mean-pooled at song level.
    • No explicit linguistic features beyond genre indicator variables.
  • Segmented Parallel Datasets (Al-Sabbagh, 2 Aug 2025):
    • Primarily sentence or segment pairs with genre label; no explicit feature engineering.
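The syntactic summary features and the concatenation step described for the Project Gutenberg dataset can be sketched in a few lines. The sketch assumes a Penn-style bracketed parse string as input, takes depth to be the maximum bracket nesting (so part-of-speech nodes count as a level), and takes the sentence length |S| to be the token count; these definitions, the function names, and the toy embedding are assumptions, since the exact conventions are not given here.

```python
def tree_depth(bracketed):
    """Maximum bracket nesting depth of a bracketed constituency parse."""
    depth = max_depth = 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth

def syntax_features(bracketed, tokens):
    """Return tree depth d and the depth-to-length ratio r = d / |S|."""
    d = tree_depth(bracketed)
    r = d / len(tokens) if tokens else 0.0
    return d, r

def augment(embedding, d, r, m):
    """Concatenate depth, ratio and metaphor count onto a base embedding."""
    return list(embedding) + [float(d), r, float(m)]

parse = "(S (NP (DT The) (NN sea)) (VP (VBZ sleeps)))"
tokens = ["The", "sea", "sleeps"]
d, r = syntax_features(parse, tokens)
vector = augment([0.1, -0.2], d, r, m=1)  # toy 2-dimensional embedding
print(d, r, vector)
```

In practice the appended scalars would usually be standardized before concatenation, because raw counts and embedding dimensions live on very different scales.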

A plausible implication is that explicit, linguistically motivated features can enhance classification performance—especially for nuanced distinctions such as poetry/novel—though their utility appears task-dependent.

4. Benchmarking Protocols and Evaluation

Experimental protocols span a range of modeling paradigms and evaluation strategies.

Metrics include standard accuracy, macro/micro F₁, mean average precision (mAP), precision, and recall, as well as more specialized metrics such as $\mathrm{AP}@0.5$, $\mathrm{AR}@0.5$, and $\mathrm{AF}_1@0.5$ for multi-label setups. Stratified splits and bootstrapping are used for robust error estimation.
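The difference between macro- and micro-averaged F₁ is easiest to see on a small single-label example. The sketch below is a plain-Python illustration with hypothetical function names and toy labels; it is not taken from any of the cited evaluation scripts, and it does not cover the thresholded multi-label metrics.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(gold, pred):
    """Macro- and micro-averaged F1 for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    labels = sorted(set(gold) | set(pred))
    # Macro: average the per-class scores, so rare genres weigh as much as common ones.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts first, so frequent genres dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

gold = ["poetry", "poetry", "novel", "drama"]
pred = ["poetry", "novel", "novel", "drama"]
macro, micro = macro_micro_f1(gold, pred)
print(round(macro, 4), micro)
```

For single-label classification the micro-averaged score coincides with accuracy, which is why macro F₁ is the more informative number when genre classes are imbalanced.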

A representative result: adding metrical features to RoBERTa yields a ~2–3% F₁ improvement on poetry/novel separation (Shi et al., 4 Dec 2025), while in other cases, feature augmentation is neutral or negative (notably for drama/novel). In the historical Turkish/Russian corpus, classical BoW features can outperform LLM-based fine-tuning, indicating persistent challenges in LLM application in low-resource, long-text, or complex taxonomic environments (Gokceoglu et al., 21 Jul 2024).

5. Access Modalities, Data Schemas, and Licensing

Dataset accessibility and structure enable reproducibility and adaptation:

  • File formats: Range from plain text (one sentence per line) and CONLL-U (CLASSLA-web) to CSV/JSONL (literary datasets), XLSX (segment-aligned translation datasets), and UTF-8 CSV (scientific paragraphs corpus).
  • Schema: Typically includes unique ID, text, language, genre label(s), split, and any derived feature fields (e.g., tree_depth, metaphor_count, stress_pattern).
  • Licensing: Project Gutenberg-derived and many academic resources are CC-BY-4.0; others may use Apache 2.0 (historical Turkish/Russian) or require attribution under custom research licenses (CLASSLA-web via CLARIN.SI).
  • Practical access: Most resources reside on open repositories such as HuggingFace, CLARIN.SI, Mendeley Data, or university archives. Some (notably, processed music lyrics) are not directly published, encouraging communication with original authors.
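A record following the typical schema above could be loaded from a JSONL line as sketched below. The derived feature names (tree_depth, metaphor_count, stress_pattern) come from the schema description; the JSON key names `id`, `text`, `language`, `genre`, and `split`, the `GenreRecord` class, and the sample line are hypothetical, and real releases may name or nest these fields differently.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GenreRecord:
    id: str
    text: str
    language: str
    genres: List[str]                 # one entry for single-label corpora
    split: str
    tree_depth: Optional[int] = None  # derived features are optional
    metaphor_count: Optional[int] = None
    stress_pattern: Optional[List[int]] = None

def parse_record(line):
    """Parse one JSONL line into a GenreRecord (key names are assumed)."""
    raw = json.loads(line)
    genres = raw["genre"]
    if isinstance(genres, str):
        # Normalize single-label corpora to the multi-label list form.
        genres = [genres]
    return GenreRecord(
        id=raw["id"],
        text=raw["text"],
        language=raw["language"],
        genres=genres,
        split=raw["split"],
        tree_depth=raw.get("tree_depth"),
        metaphor_count=raw.get("metaphor_count"),
        stress_pattern=raw.get("stress_pattern"),
    )

line = ('{"id": "s-0001", "text": "The sea sleeps.", "language": "en", '
        '"genre": "poetry", "split": "train", "tree_depth": 3}')
record = parse_record(line)
print(record.genres, record.tree_depth, record.metaphor_count)
```

Storing genre as a list even for single-label corpora lets the same loader serve both the flat and the multi-label resources discussed earlier.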

6. Domain-Specific and Historical Instances

Domain and era define genre signal and annotation complexity:

  • Web Domain: CLASSLA-web’s genre model is explicitly built for South Slavic web corpora, demonstrating systematic divergence in genre prevalence by economic status—with heavy news dominance in less-developed contexts and more promotional/opinionative material in wealthier web spheres (Ljubešić et al., 19 Mar 2024).
  • Scientific Domain: Automatic section and type detectors enable scalable labeling but introduce error propagation and English language dominance (Jeangirard, 13 Oct 2025).
  • Parallel Literary/Media Domains: Segment-level genre in ArzEn-MultiGenre is mostly dictated by folder organization, with no internal genre subtyping (Al-Sabbagh, 2 Aug 2025).
  • Historic Literary Data: The Ottoman Turkish/Russian corpus demonstrates the feasibility and limitations of multi-level labeling under expert adjudication and the challenges facing LLMs in dealing with complex, low-resource historical data (Gokceoglu et al., 21 Jul 2024).

7. Extension Strategies and Persistent Challenges

Common extension strategies for multilingual genre datasets include:

  • Acquisition of new public-domain or licensed corpora per genre and language (Shi et al., 4 Dec 2025).
  • Application or retraining of parser, stress, and metaphor detection models for new linguistic contexts.
  • Adoption of stratified sampling or data augmentation for class imbalance, especially when certain genres are rare (e.g., legal, lyrical prose in web data).
  • Enhancement of taxonomies to capture genre hierarchies, subject-topic intersections, or hybrid “mix” classes.
  • Expansion to new genres (blogs, social media, interviews), dialects, or scripts (Arabic, Turkish, Ottoman, etc.).
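The stratified sampling strategy mentioned above can be sketched with the standard library alone. The function name, the 20% test fraction, the fixed seed, and the toy records are assumptions chosen for illustration; the sketch also guarantees at least one test item for any genre with two or more documents, which is a design choice rather than a documented protocol.

```python
import random
from collections import defaultdict

def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Split records so each genre keeps roughly the same test proportion."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    train, test = [], []
    for label in sorted(groups):  # sorted for reproducible ordering
        items = groups[label][:]
        rng.shuffle(items)
        if len(items) > 1:
            # Rare genres still contribute at least one test example.
            n_test = max(1, round(len(items) * test_fraction))
        else:
            n_test = 0  # a singleton stays in training
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Toy imbalanced corpus: 10 "news" documents and 5 "legal" documents.
records = [("doc%d" % i, "news") for i in range(10)]
records += [("doc%d" % i, "legal") for i in range(10, 15)]
train, test = stratified_split(records, key=lambda rec: rec[1])
print(len(train), len(test))
```

Splitting per genre rather than over the pooled corpus keeps rare classes such as legal text represented in both partitions, which a purely random split does not guarantee.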

Challenges persist in annotation consistency, cross-lingual comparability, handling of fuzzy/mixed genres, and effective utilization of linguistic features in modern LLM frameworks. Low-resource and historical domains show that classical BoW methods may still be competitive or superior to parameter-heavy LLMs under certain conditions (Gokceoglu et al., 21 Jul 2024). One plausible implication is that model selection and feature design should be tightly coupled to annotation protocols, domain-specific genre theorization, and intended downstream applications.
