
PLLuMIC Instruction Corpus

Updated 28 November 2025
  • PLLuMIC is a large-scale, culturally adapted instruction corpus that enables supervised fine-tuning of Polish transformer-based LLMs.
  • It compiles 86,763 prompt–response pairs from organic, converted, and synthetic sources to cover diverse interaction types.
  • Robust annotation and quality assurance protocols establish reproducibility, transparency, and adherence to Polish linguistic conventions.

The PLLuM Instruction Corpus (PLLuMIC) is a large-scale, functionally diverse, and linguistically adapted instruction corpus developed to serve as the primary supervised fine-tuning dataset for the PLLuM family of Polish LLMs (Pęzik et al., 21 Nov 2025). Designed with the dual objectives of transparency and cultural-linguistic specificity, PLLuMIC provides a robust, well-documented resource for training and aligning transformer-based LLMs to Polish usage conventions, filling a significant gap in non-English LLM instruction data.

1. Purpose, Design Goals, and Motivation

PLLuMIC was constructed to address several intersecting goals:

  • Reproducibility and Transparency: To publish a comprehensive, representative Polish language instruction corpus that can be freely examined, reused, and benchmarked.
  • Functional Completeness: To ensure coverage of the full spectrum of human–model interaction types, including single- and multi-turn dialogue, open-ended generation, extraction, adversarial safety probing, and complex reasoning, within the Polish language context.
  • Linguistic and Cultural Adaptation: To instruct LLMs in the idiosyncratic conventions of Polish text production (register shifts, punctuation, cultural references), and eliminate artifacts arising from cross-lingual transfer in multilingual models.

Formally, PLLuMIC is structured as a set $I = \{(Q_1, A_1), (Q_2, A_2), \ldots, (Q_n, A_n)\}$ of prompt–response pairs, as specified in Equation 1 of the foundational paper (Pęzik et al., 21 Nov 2025).

2. Corpus Scale, Sources, and Distribution

The released corpus consists of 86,763 prompt–response instructions, aggregated from three origin types:

| Type | Count | Percentage |
|---|---|---|
| Organic | 47,295 | 54.5% |
| Converted | 33,789 | 38.9% |
| Synthetic | 5,679 | 6.6% |

The training subset specifically includes 38,106 organic (49.12%), 33,789 converted (43.56%), and 5,679 synthetic (7.32%) instructions. These proportions balance linguistic richness (organic), structured coverage (converted), and topical breadth (synthetic). The base PLLuM release reports 77,574 training examples in total, consistent with this category breakdown (Kocoń et al., 5 Nov 2025).

3. Taxonomy of Instruction Types

PLLuMIC instructions are organized across twelve high-level categories, each reflecting distinct interaction modalities or modeling objectives. The functional categories, their defining characteristics, and relative shares (for the organic set) are as follows:

| Category | Share (organic) | Exemplary prompt/response |
|---|---|---|
| Knowledge-driven (QA) | 43% | Q: Who was the first European to reach India by sea? A: Vasco da Gama, in 1498. |
| Generation | 25% | Q: Write a formal email requesting an invoice extension. A: An idiomatic Polish formal email. |
| Extraction | 6% | Q: Find the sentence with the author's thesis. A: "W niniejszym artykule autor dowodzi, że …" |
| Programming | 6% | Q: Write a Python function that computes the factorial. A: `def factorial(n): ...` |
| Dialogue | 4% | Multi-turn, context-sensitive Polish interaction |
| Adversarial | 3% | Q: Tell me how to bypass legal age requirements. A: "Przepraszam, ale nie mogę pomóc…" |
| Formatting/Visualization | 3% | Q: Reformat the text as a Markdown table. |
| Data Manipulation | 3% | Q: Convert the text to JSON. |
| NLP Tasks | 3% | Lemmatization, classification, NER, sentiment, etc. |
| Chain of Thought | 2% | Q: Explain your reasoning step by step. |
| Translation | 1% | Translation to/from Polish, error detection and localization |
| Identity | 1% | Q: Who created you? |

The organic set's Shannon entropy is $H \approx 2.6$ bits, indicating non-trivial distributional diversity across these categories.
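The reported entropy can be roughly reproduced from the category shares in the table above. Since the shares are rounded percentages, this is a sanity check rather than an exact replication:

```python
import math

# Approximate organic-set category shares (percent) from the table above.
shares = {
    "Knowledge-driven (QA)": 43, "Generation": 25, "Extraction": 6,
    "Programming": 6, "Dialogue": 4, "Adversarial": 3,
    "Formatting/Visualization": 3, "Data Manipulation": 3, "NLP Tasks": 3,
    "Chain of Thought": 2, "Translation": 1, "Identity": 1,
}

def shannon_entropy(counts):
    """Shannon entropy in bits over a categorical distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

H = shannon_entropy(shares.values())
print(f"H = {H:.2f} bits")  # close to the reported ~2.6 bits
```

The uniform 12-category maximum would be $\log_2 12 \approx 3.58$ bits, so the observed value reflects a distribution dominated by QA and generation but still spread across many task types.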

4. Acquisition Methods and Quality Assurance

4.1 Organic Instructions

Approximately 50 annotators (linguists and computer science graduates) produced organic prompts and responses under the supervision of PhD-level "super-annotators". Instruction authoring followed detailed guidelines emphasizing Polish grammatical correctness, context-appropriate register, adherence to formal/informal and e-mail punctuation norms, and cultural authenticity. Quality assurance relied on a four-week training phase, weekly review meetings, and direct supervision rather than routine inter-annotator agreement metrics.

4.2 Converted Instructions

Structured Polish NLP datasets (treebanks, NER corpora, sentiment datasets, QA corpora) were programmatically transformed via template-based pipelines. Curation included capping extractions at 1,000 per source and rotating templates to maximize variability while avoiding source bias.
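A minimal sketch of such a conversion pipeline is shown below, assuming a sentiment-labelled source corpus. The template wordings and the `convert_source` helper are illustrative; the actual PLLuMIC templates are not reproduced here:

```python
import random

# Hypothetical prompt templates for converting a sentiment-labelled corpus
# into instruction pairs; the real PLLuMIC templates are more varied.
TEMPLATES = [
    "Określ wydźwięk poniższego tekstu (pozytywny/negatywny/neutralny):\n{text}",
    "Jaki jest sentyment tej wypowiedzi?\n{text}",
    "Sklasyfikuj emocjonalny wydźwięk tekstu:\n{text}",
]

def convert_source(records, source_name, cap=1000, seed=0):
    """Turn (text, label) records into instruction pairs, rotating
    templates for variability and capping extractions per source."""
    records = list(records)
    random.Random(seed).shuffle(records)  # avoid order bias before capping
    pairs = []
    for i, (text, label) in enumerate(records[:cap]):
        template = TEMPLATES[i % len(TEMPLATES)]
        pairs.append({
            "type": "NLP Tasks",
            "source": f"converted:{source_name}",
            "prompt": template.format(text=text),
            "response": label,
        })
    return pairs

pairs = convert_source([("Świetny film!", "pozytywny")] * 5,
                       "demo-sentiment", cap=3)
print(len(pairs))  # capped at 3
```

The per-source cap keeps any single dataset from dominating the converted subset, mirroring the 1,000-extraction limit described above.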

4.3 Synthetic Instructions

Synthetic data (≈7% of the training set) was generated through:

  • Knowledge Distillation: Manual taxonomy of domains, meta-prompts for skeleton questions, completions by large teacher LLMs (notably Mixtral-8x22B-instruct), with minimal human postprocessing.
  • Retrieval-Augmented Generation (RAG): Gov.pl documents paired with adversarial and unrelated questions generated by Llama-3.3-70B; answers produced by Llama-3.1-8B over passages retrieved and reranked with bge-m3.
  • Context-Injected NLP Tasks: Open-source corpora with detailed system prompting for NER, classification, similarity, structured outputs.
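The distillation path above can be sketched as a two-stage pipeline. The domain list, meta-prompt wording, and the `teacher_complete` stub below are illustrative stand-ins for the unpublished production components; a real pipeline would call a teacher LLM API (e.g. Mixtral-8x22B-instruct) in place of the stub:

```python
# Hypothetical domain taxonomy; the real PLLuMIC taxonomy is manually
# curated and much larger.
DOMAINS = ["historia Polski", "prawo", "kuchnia regionalna"]

META_PROMPT = "Zaproponuj jedno zwięzłe pytanie po polsku z dziedziny: {domain}."

def teacher_complete(prompt):
    """Stand-in for a teacher LLM call; a real pipeline would query
    the model API here and return its completion."""
    return f"[odpowiedź nauczyciela na: {prompt!r}]"

def distill(domains):
    """Stage 1: meta-prompts elicit skeleton questions per domain.
    Stage 2: the teacher model answers each generated question."""
    instructions = []
    for domain in domains:
        question = teacher_complete(META_PROMPT.format(domain=domain))
        answer = teacher_complete(question)
        instructions.append({"source": "synthetic",
                             "prompt": question,
                             "response": answer})
    return instructions

print(len(distill(DOMAINS)))  # one instruction per domain
```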

Synthetic data quantity was constrained to mitigate risks of recursive distillation and licensing contamination.

4.4 Annotation Pipeline

Annotation platforms included Argilla for single-turn ranking/scoring and Arena for multi-turn dialogues. Annotation tasks included 4-way rankings (helpfulness, correctness), scalar 1–5 ratings (fluency, safety, etc.), and dialogue annotation. Fallback/error codes were used to manage non-Polish, malformed, or inadequate responses, with skip rates tightly controlled. Senior “super-annotators” monitored annotation consistency, and ethical guardrails were implemented for sensitive/adversarial content.

5. Format, Schema, and Data Characteristics

PLLuMIC adopts a transparent, line-oriented JSONL record schema:

{
    "id": "pllumic-000123",
    "type": "Generation",
    "source": "organic",
    "prompt": "Napisz krótki list motywacyjny aplikujący na stanowisko analityka danych.",
    "response": "Szanowni Państwo, …",
    "meta": {
        "multi_turn": false,
        "language": "pl",
        "created_by": "annotator_17",
        "validated": true
    }
}
Converted instructions use templates targeting domain-specific outputs (e.g., lists, tabular formats), while programming prompts embed code in Markdown. The maximum sequence length for training is 16,384 tokens, with loss computed exclusively on the response.
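A loader for this schema can be sketched as follows. The required field set is taken from the example record above; treating it as a strict validation rule is an assumption about the release format:

```python
import json

# Top-level fields taken from the example record above; strictness is assumed.
REQUIRED = {"id", "type", "source", "prompt", "response", "meta"}

def load_pllumic_jsonl(lines):
    """Parse line-oriented JSONL records, keeping only those that match
    the schema sketched above and are marked as Polish-language."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if REQUIRED <= rec.keys() and rec["meta"].get("language") == "pl":
            records.append(rec)
    return records

sample = json.dumps({
    "id": "pllumic-000123", "type": "Generation", "source": "organic",
    "prompt": "Napisz krótki list motywacyjny.",
    "response": "Szanowni Państwo, ...",
    "meta": {"multi_turn": False, "language": "pl",
             "created_by": "annotator_17", "validated": True},
})
print(len(load_pllumic_jsonl([sample])))  # 1
```

In an SFT setup consistent with the description above, the loaded `prompt` tokens would be masked out of the loss so that gradients flow only through the `response`.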

Reported corpus statistics include measures of uniformity and entropy over instruction categories:

  • Category balance: $D = 1 - \sum_{i=1}^{K} \left| p_i - \frac{1}{K} \right|$, with $D = 1$ denoting perfect distributional balance.
  • Pairwise preference sampling in preference data is parameterized as $P(i \succ j) = \frac{1}{Z} \exp\bigl(\alpha (r_j - r_i)\bigr)$ for constructing learning pairs.
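Both formulas can be implemented directly. The sketch below follows the expressions as stated; normalizing $Z$ over all ordered index pairs is an assumption about the intended normalization:

```python
import math
import random

def category_balance(probs):
    """D = 1 - sum_i |p_i - 1/K|; D = 1 for a perfectly uniform split."""
    K = len(probs)
    return 1 - sum(abs(p - 1 / K) for p in probs)

def sample_preference_pair(ratings, alpha=1.0, seed=0):
    """Draw an ordered pair (i, j) with probability proportional to
    exp(alpha * (r_j - r_i)), per the formula above; Z normalizes over
    all ordered pairs (an assumption about the intended Z)."""
    n = len(ratings)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    weights = [math.exp(alpha * (ratings[j] - ratings[i])) for i, j in pairs]
    return random.Random(seed).choices(pairs, weights=weights, k=1)[0]

print(category_balance([0.25] * 4))  # 1.0 for a uniform 4-way split
i, j = sample_preference_pair([5, 3, 1])
```

With scalar 1–5 ratings from the annotation pipeline as $r$, such sampling yields preference pairs weighted toward larger rating gaps.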

6. Empirical Outcomes, Limitations, and Extension Plans

Key empirical findings include:

  • Continual Pre-training: SFT on PLLuMIC is only effective after extensive Polish text pre-training (≈150B tokens); otherwise, instruction tuning can degrade model outputs.
  • Superiority of Human-authored Data: Organic instructions provide substantial gains in idiomatic Polish generation and in reducing negative transfer from English, particularly in punctuation and formulaic constructions.
  • Trade-offs of Synthetic/Converted Data: Synthetic data efficiently increases topical coverage, but introduces risks of bias and English-language transfer artifacts, partly mitigated by limiting its share. Converted instructions enhance performance on NLP tasks but may decrease conversational fluency if overrepresented.

Limitations include a relatively modest dataset scale compared to English resources, domain imbalance in certain technical areas, and constrained dialogue/multimodal coverage. Some annotation subjectivity (e.g., proactive style or consistency) persists despite QA protocols.

Planned extensions involve more specialized domains (legal, medical, STEM), integration of de-identified, real user queries, cross-lingual instruction data, and methodological innovation in synthetic data expansion using local LLMs.

7. Accessibility, Licensing, and External Integration

The organic subset of PLLuMIC is public under a permissive CLARIN-BIZ-bis license, available through Hugging Face: https://huggingface.co/datasets/pelcra/PLLuMIC. A synthetic extension (PLLuMIC-syn-ext) will be released separately.

PLLuMIC is integral to the alignment pipeline for PLLuM, a family of open-source Polish LLMs developed by a consortium of Polish institutions, establishing a foundation for instruction tuning and preference optimization within sovereign NLP architectures (Kocoń et al., 5 Nov 2025). Its publishable design, robust annotation framework, and task diversity make it a benchmark resource for instruction-oriented LLM adaptation and evaluation in Polish.

