T-Wix 500k Instruction Corpus
- T-Wix 500k Instruction Corpus is an extensive, bilingual supervised fine-tuning dataset featuring 500,000 instruction-response pairs for Russian LLM development.
- It aggregates data from diverse sources including public datasets, web forums, and synthetic tasks with rigorous deduplication and quality control protocols.
- The corpus supports varied applications from factual answering and chain-of-thought reasoning to long-context summarization and bilingual language modeling.
The T-Wix 500k Instruction Corpus is an open supervised fine-tuning (SFT) dataset comprising 500,000 instruction–response pairs in Russian and English, curated to support the training of Russian-centric LLMs for both general instruction following and explicit multi-step reasoning. Designed for domain balance, linguistic diversity, and rigorous quality control, T-Wix serves as a foundational resource for advancing Russian-language LLMs in factual answering, reasoning-trace generation, and applied language modeling tasks (Stoianov et al., 11 Dec 2025).
1. Corpus Composition and Construction
T-Wix aggregates data from diverse public sources and methodical synthetic generation. The raw collection process began with approximately 14 million general instruction–response pairs gathered from:
- Existing public SFT datasets (including Alpaca-style and ru-adapt resources)
- Web and forum QA threads
- User–assistant dialogues
- Synthetic tasks for coverage completeness
A dedicated long-context subset (8,000–32,000 tokens) was assembled via public domain sources and further augmented using summarization, question answering, and reasoning prompt templates. For bilingual and cross-lingual robustness, roughly 10% of the corpus consists of parallel Russian–English pairs. Additionally, approximately 450,000 open-source English-language reasoning instructions were incorporated from benchmarks such as Open-R1, Nvidia AceReason-Math, and Nemotron, encompassing mathematical, scientific, algorithmic, and code-focused reasoning tasks.
Filtering involved exact-match and embedding-based deduplication (MinHash/LSH), exclusion of instructions with direct or semantic matches to evaluation sets, and domain stratification using thematic tagging (InsTag) over six domains (Math, Code, Science, General Instruction, General Knowledge, Writing) and three cognitive tiers (School, Student, Professor). Reward-model (RM) scoring eliminated the bottom 10% of samples by predicted prompt/completion quality, and an Instruction-Following Difficulty (IFD) filter removed tasks classified as trivially easy (IFD < 0.7) or ambiguous (IFD > 1.0). For reasoning samples, post-translation deduplication, density filtering, solution candidate generation (eight from Qwen3-235B-A22B as teacher, eight from the LLM-in-training as student), RM scoring, and zone-of-proximal-development selection controls were applied.
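The IFD gate above can be illustrated with the standard Instruction-Following Difficulty formulation: the model's perplexity on the response conditioned on the prompt, divided by its perplexity on the response alone. The sketch below is a minimal illustration of that ratio and the 0.7–1.0 band, not the paper's implementation; the loss values passed in would come from a scoring model.

```python
import math

def ifd_score(loss_answer_given_prompt: float, loss_answer_alone: float) -> float:
    """Instruction-Following Difficulty: perplexity of the answer conditioned
    on the prompt, divided by perplexity of the answer alone. A score near 0
    means the prompt makes the answer trivial to predict; above 1, the prompt
    does not help, which suggests an ambiguous instruction."""
    ppl_conditioned = math.exp(loss_answer_given_prompt)
    ppl_unconditioned = math.exp(loss_answer_alone)
    return ppl_conditioned / ppl_unconditioned

def keep_sample(ifd: float, low: float = 0.7, high: float = 1.0) -> bool:
    # Drop trivially easy (IFD < 0.7) and ambiguous (IFD > 1.0) samples,
    # matching the thresholds reported for T-Wix.
    return low <= ifd <= high
```

With per-token cross-entropy losses of 2.0 (conditioned) and 2.2 (unconditioned), the ratio is exp(-0.2) ≈ 0.82, so the sample is kept.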
All instance formatting was standardized to a uniform "<user prompt> → <assistant response>" schema, with multi-turn dialogues flattened into single-turn format by concatenating up to 32k tokens of conversational context. Assistant responses were regenerated using high-capacity teacher LLMs and then RM-filtered for style and correctness (Stoianov et al., 11 Dec 2025).
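The flattening step can be sketched as follows, assuming a simple list-of-turns representation and whitespace token counting as a stand-in for the real tokenizer (both assumptions, not details from the paper):

```python
def flatten_dialogue(turns, max_tokens=32_000):
    """Flatten a multi-turn dialogue into a single (prompt, response) pair:
    the final assistant turn becomes the target response, and preceding turns
    are concatenated into the prompt, keeping the most recent context within
    a token budget. Whitespace splitting stands in for real tokenization."""
    assert turns and turns[-1]["role"] == "assistant"
    response = turns[-1]["content"]
    context, budget = [], max_tokens
    for turn in reversed(turns[:-1]):  # walk backwards to keep recent turns
        n = len(turn["content"].split())
        if n > budget:
            break
        context.append(f'{turn["role"]}: {turn["content"]}')
        budget -= n
    prompt = "\n".join(reversed(context))
    return prompt, response
```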
2. Instruction Schema and Example Types
Instructions in T-Wix span varied pedagogical and application domains:
- Question–Answer (factoid and general knowledge QA)
- Chain-of-thought reasoning (multi-step mathematics, logic, scientific explanations)
- Code and programming tasks (function implementation, debugging, code explanation)
- Summarization and long-context comprehension
- Writing and style transformation (paraphrase, translation, rewriting)
Canonical input–output prompt templates include:
- Single-turn:

```text
Пользователь: <instruction>
Ассистент:
```

("User: <instruction>" / "Assistant:")

- Chain-of-thought:

```text
Вам дано задание: <problem statement>
Пожалуйста, приведите пошаговое рассуждение и финальный ответ.
```

("You are given a task: <problem statement>" / "Please provide step-by-step reasoning and a final answer.")

- Code:

```text
Напишите функцию на Python, которая ...
Ответ:
```

("Write a Python function that ..." / "Answer:")
Concrete examples:
| Instruction Domain | Prompt–Response Example (abridged) |
|---|---|
| General QA | Пользователь: “Кто автор «Евгения Онегина»?” (“Who is the author of Eugene Onegin?”) Ассистент: “Автор — Александр Пушкин.” (“The author is Alexander Pushkin.”) |
| Mathematical Reasoning | Пользователь: “Докажите, что сумма углов треугольника равна 180°.” (“Prove that the angles of a triangle sum to 180°.”) Ассистент: “1) Рассмотрите треугольник... 2) Постройте параллельную линию... Ответ: 180°.” (“1) Consider the triangle... 2) Construct a parallel line... Answer: 180°.”) |
| Code Generation | Пользователь: “Напишите функцию reverse_string(s) на Python.” (“Write a Python function reverse_string(s).”) Ассистент: `def reverse_string(s): return s[::-1]` |
| Long-Context Summarization | Пользователь: “Кратко изложите содержание отрывка…” (“Briefly summarize the passage…”) Ассистент: “В этом отрывке говорится о…” (“This passage discusses…”) |
This schema enables both direct-answering and interpretable reasoning chains (Stoianov et al., 11 Dec 2025).
3. Data Distribution and Corpus Statistics
The T-Wix corpus consists of 500,000 samples, organized as follows:
| Subset | Samples (Count; %) | Primary Domains/Remarks |
|---|---|---|
| General SFT | 468,000 (93.6%) | Math (28%), Code (16%), Science (12%), General QA (24%), Knowledge/Writing (20%) |
| Reasoning SFT | 30,000 (6.0%) | Explicit chain-of-thought reasoning |
| Long-context | 5,000 (1%) | Sequences up to 32k tokens |
| Language mix | 90% Russian, 10% English | Russian focus with bilingual support |
Optional train/validation/test splits follow a 98%/1%/1% stratification by domain. Each record is annotated with domain tags, reward-model and IFD scores, and relevant metadata.
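The domain-stratified 98%/1%/1% split can be sketched in a few lines; the `domain` field name and the per-group shuffle are illustrative assumptions, not details from the paper:

```python
import random
from collections import defaultdict

def stratified_splits(records, ratios=(0.98, 0.01, 0.01), seed=0):
    """Split records into train/validation/test while preserving the domain
    mix: shuffle within each domain group, then carve off the validation and
    test fractions per group. The 'domain' key is an assumed field name."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for r in records:
        by_domain[r["domain"]].append(r)
    train, val, test = [], [], []
    for group in by_domain.values():
        rng.shuffle(group)
        n = len(group)
        n_val, n_test = int(n * ratios[1]), int(n * ratios[2])
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test
```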
4. Tokenization and Preprocessing
Tokenization utilizes a Cyrillic-dense tokenizer derived from Qwen3, replacing 34,000 low-frequency non-Cyrillic tokens with new Cyrillic merges while maintaining a 128k vocabulary. This adaptation reduced the mean tokens per word in T-Wix from 2.70 to 2.26 and increased the proportion of words tokenized in two or fewer tokens from 52% to 65%.
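The two reported metrics (mean tokens per word and share of words split into at most two tokens) can be computed for any tokenizer with a short helper; the character-bigram tokenizer in the usage example below is a toy stand-in, not the Qwen3-derived tokenizer:

```python
def fertility_stats(words, tokenize):
    """Compute mean tokens per word and the share of words tokenized into
    two or fewer tokens. `tokenize` is any callable mapping a word to a
    list of tokens."""
    counts = [len(tokenize(w)) for w in words]
    mean_tokens = sum(counts) / len(counts)
    le2_share = sum(c <= 2 for c in counts) / len(counts)
    return mean_tokens, le2_share

# Toy tokenizer: split a word into 2-character chunks.
bigram = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]
```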
Corpus-wide and Wikipedia statistics are as follows:
| Text Source | Tokens/Word (before→after) | ≤2 Tokens (%) (before→after) |
|---|---|---|
| ruWiki | 3.12 → 2.38 | 38 → 60 |
| T-Wix | 2.70 → 2.26 | 52 → 65 |
Preprocessing steps included Unicode NFC normalization, context-sensitive lowercasing, removal of control/zero-width characters, sentence segmentation and context packing (≤32k tokens), and filtering of samples with invalid UTF-8 or excessively long tokens (>50 characters) (Stoianov et al., 11 Dec 2025).
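A minimal sketch of the character-level steps, covering NFC normalization, control/zero-width character removal, and the long-token filter; the exact character classes and whitespace handling are assumptions:

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def clean_text(text, max_token_len=50):
    """Apply Unicode NFC normalization, strip zero-width and control
    characters (keeping newlines and tabs), and reject texts containing any
    whitespace-delimited token longer than max_token_len. Returns None for
    rejected samples."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(
        ch for ch in text
        if ch not in ZERO_WIDTH
        and (ch in "\n\t" or not unicodedata.category(ch).startswith("C"))
    )
    if any(len(tok) > max_token_len for tok in text.split()):
        return None
    return text
```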
5. Licensing and Data Accessibility
T-Wix 500k is distributed under the Open Data Commons Attribution (ODC-By) license, permitting both academic and commercial utilization with attribution. It is available via the Hugging Face Hub at https://huggingface.co/datasets/t-tech/T-Wix, and can be loaded using:
```python
from datasets import load_dataset

ds = load_dataset("t-tech/T-Wix")
```
Accompanying metadata comprises instruction templates, domain tags, reward-model scores, and IFD assessments. The open licensing and clear data provenance facilitate reproducible research in Russian-language LLM development (Stoianov et al., 11 Dec 2025).
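These annotations make it straightforward to carve out task-specific subsets. The sketch below filters plain record dictionaries by domain tag and reward-model score; the field names `domain` and `rm_score` are assumptions for illustration, and the actual column names should be taken from the dataset card.

```python
def select_subset(records, domain=None, min_rm_score=None):
    """Filter records by annotated metadata. The 'domain' and 'rm_score'
    field names are illustrative placeholders, not confirmed column names."""
    out = []
    for r in records:
        if domain is not None and r.get("domain") != domain:
            continue
        if min_rm_score is not None and r.get("rm_score", float("-inf")) < min_rm_score:
            continue
        out.append(r)
    return out
```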
6. Applications and Research Significance
T-Wix 500k addresses key requirements for training high-quality, instruction-tuned Russian LLMs, supporting both balanced general-purpose SFT and explicit reasoning-task finetuning. Notable applications include:
- Fine-tuning LLMs for Russian instruction-following with strong generalization across factual, reasoning, code, and summarization tasks
- Evaluation of chain-of-thought and explainability benchmarks in Russian, using paired solution traces derived from teacher–student modeling paradigms
- Development of long-context LLM applications (e.g., summarization and document-level QA) leveraging >8k token prompts
- Bilingual model adaptation and benchmarking, given the 10% English sample share
Its construction paradigm provides rigorous filtering, diverse domain representation, and explicit annotation, enabling researchers to probe LLM instruction-following, reasoning generalization, and the effects of data curation on Russian-language LLMs (Stoianov et al., 11 Dec 2025).