T-Wix 500k Instruction Corpus
- T-Wix 500k Instruction Corpus is an extensive, bilingual supervised fine-tuning dataset featuring 500,000 instruction-response pairs for Russian LLM development.
- It aggregates data from diverse sources including public datasets, web forums, and synthetic tasks with rigorous deduplication and quality control protocols.
- The corpus supports varied applications from factual answering and chain-of-thought reasoning to long-context summarization and bilingual language modeling.
The T-Wix 500k Instruction Corpus is an open supervised fine-tuning (SFT) dataset comprising 500,000 instruction–response pairs in Russian and English, curated to support the training of Russian-centric LLMs for both general instruction following and explicit multi-step reasoning. Designed for domain balance, linguistic diversity, and rigorous quality control, T-Wix serves as a foundational resource for advancing Russian-language LLMs in factual answering, reasoning-trace generation, and applied language modeling tasks (Stoianov et al., 11 Dec 2025).
1. Corpus Composition and Construction
T-Wix aggregates data from diverse public sources and methodical synthetic generation. The raw collection process began with approximately 14 million general instruction–response pairs gathered from:
- Existing public SFT datasets (including Alpaca-style and ru-adapt resources)
- Web and forum QA threads
- User–assistant dialogues
- Synthetic tasks for coverage completeness
A dedicated long-context subset (8,000–32,000 tokens) was assembled via public domain sources and further augmented using summarization, question answering, and reasoning prompt templates. For bilingual and cross-lingual robustness, roughly 10% of the corpus consists of parallel Russian–English pairs. Additionally, approximately 450,000 open-source English-language reasoning instructions were incorporated from benchmarks such as Open-R1, Nvidia AceReason-Math, and Nemotron, encompassing mathematical, scientific, algorithmic, and code-focused reasoning tasks.
Filtering involved exact-match and embedding-based deduplication (MinHash/LSH), exclusion of instructions with direct or semantic matches to evaluation sets, and domain stratification using thematic tagging (InsTag) over six domains (Math, Code, Science, General Instruction, General Knowledge, Writing) and three cognitive tiers (School, Student, Professor). Reward-model (RM) scoring eliminated the bottom 10% of samples by predicted prompt/completion quality, and an Instruction-Following Difficulty (IFD) filter removed tasks classified as trivially easy (IFD < 0.7) or ambiguous (IFD > 1.0). For reasoning samples, post-translation deduplication, density filtering, solution candidate generation (eight from Qwen3-235B-A22B as teacher, eight from the LLM-in-training as student), RM scoring, and zone-of-proximal-development selection controls were applied.
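The IFD gate above can be illustrated with the standard Instruction-Following Difficulty formulation: the model's perplexity on the response conditioned on the prompt, divided by its perplexity on the response alone. The sketch below is a minimal illustration of that ratio and the 0.7–1.0 band, not the paper's implementation; the loss values passed in would come from a scoring model.

```python
import math

def ifd_score(loss_answer_given_prompt: float, loss_answer_alone: float) -> float:
    """Instruction-Following Difficulty: perplexity of the answer conditioned
    on the prompt, divided by perplexity of the answer alone. A score near 0
    means the prompt makes the answer trivial to predict; above 1, the prompt
    does not help, which suggests an ambiguous instruction."""
    ppl_conditioned = math.exp(loss_answer_given_prompt)
    ppl_unconditioned = math.exp(loss_answer_alone)
    return ppl_conditioned / ppl_unconditioned

def keep_sample(ifd: float, low: float = 0.7, high: float = 1.0) -> bool:
    # Drop trivially easy (IFD < 0.7) and ambiguous (IFD > 1.0) samples,
    # matching the thresholds reported for T-Wix.
    return low <= ifd <= high
```

With per-token cross-entropy losses of 2.0 (conditioned) and 2.2 (unconditioned), the ratio is exp(-0.2) ≈ 0.82, so the sample is kept.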
All instance formatting was standardized to a uniform "<user prompt> → <assistant response>" schema, with multi-turn dialogues flattened into single-turn format by concatenating up to 32k tokens of conversational context. Assistant responses were regenerated using high-capacity teacher LLMs and then RM-filtered for style and correctness (Stoianov et al., 11 Dec 2025).
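The flattening step can be sketched as follows, assuming a simple list-of-turns representation and whitespace token counting as a stand-in for the real tokenizer (both assumptions, not details from the paper):

```python
def flatten_dialogue(turns, max_tokens=32_000):
    """Flatten a multi-turn dialogue into a single (prompt, response) pair:
    the final assistant turn becomes the target response, and preceding turns
    are concatenated into the prompt, keeping the most recent context within
    a token budget. Whitespace splitting stands in for real tokenization."""
    assert turns and turns[-1]["role"] == "assistant"
    response = turns[-1]["content"]
    context, budget = [], max_tokens
    for turn in reversed(turns[:-1]):  # walk backwards to keep recent turns
        n = len(turn["content"].split())
        if n > budget:
            break
        context.append(f'{turn["role"]}: {turn["content"]}')
        budget -= n
    prompt = "\n".join(reversed(context))
    return prompt, response
```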
2. Instruction Schema and Example Types
Instructions in T-Wix span varied pedagogical and application domains:
- Question–Answer (factoid and general knowledge QA)
- Chain-of-thought reasoning (multi-step mathematics, logic, scientific explanations)
- Code and programming tasks (function implementation, debugging, code explanation)
- Summarization and long-context comprehension
- Writing and style transformation (paraphrase, translation, rewriting)
Canonical input–output prompt templates include:
- Single-turn:

```text
Пользователь: <instruction>
Ассистент:
```

("User: <instruction>" / "Assistant:")

- Chain-of-thought:

```text
Вам дано задание: <problem statement>
Пожалуйста, приведите пошаговое рассуждение и финальный ответ.
```

("You are given a task: <problem statement>" / "Please provide step-by-step reasoning and a final answer.")

- Code:

```text
Напишите функцию на Python, которая ...
Ответ:
```

("Write a Python function that ..." / "Answer:")
Concrete examples:
| Instruction Domain | Prompt–Response Example (abridged) |
|---|---|
| General QA | Пользователь: “Кто автор «Евгения Онегина»?” (“Who is the author of Eugene Onegin?”) Ассистент: “Автор — Александр Пушкин.” (“The author is Alexander Pushkin.”) |
| Mathematical Reasoning | Пользователь: “Докажите, что сумма углов треугольника равна 180°.” (“Prove that the angles of a triangle sum to 180°.”) Ассистент: “1) Рассмотрите треугольник... 2) Постройте параллельную линию... Ответ: 180°.” (“1) Consider the triangle... 2) Construct a parallel line... Answer: 180°.”) |
| Code Generation | Пользователь: “Напишите функцию reverse_string(s) на Python.” (“Write a Python function reverse_string(s).”) Ассистент: `def reverse_string(s): return s[::-1]` |
| Long-Context Summarization | Пользователь: “Кратко изложите содержание отрывка…” (“Briefly summarize the passage…”) Ассистент: “В этом отрывке говорится о…” (“This passage discusses…”) |
This schema enables both direct-answering and interpretable reasoning chains (Stoianov et al., 11 Dec 2025).
3. Data Distribution and Corpus Statistics
The T-Wix corpus consists of 500,000 samples, organized as follows:
| Subset | Samples (Count; %) | Primary Domains/Remarks |
|---|---|---|
| General SFT | 468,000 (93.6%) | Math (28%), Code (16%), Science (12%), General QA (24%), Knowledge/Writing (20%) |
| Reasoning SFT | 30,000 (6.0%) | Explicit chain-of-thought reasoning |
| Long-context | 5,000 (1%) | Sequences up to 32k tokens |
| Language mix | 90% Russian, 10% English | Russian focus with bilingual support |
Optional train/validation/test splits follow a 98%/1%/1% stratification by domain. Each record is annotated with domain tags, reward-model and IFD scores, and relevant metadata.
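The domain-stratified 98%/1%/1% split can be sketched in a few lines; the `domain` field name and the per-group shuffle are illustrative assumptions, not details from the paper:

```python
import random
from collections import defaultdict

def stratified_splits(records, ratios=(0.98, 0.01, 0.01), seed=0):
    """Split records into train/validation/test while preserving the domain
    mix: shuffle within each domain group, then carve off the validation and
    test fractions per group. The 'domain' key is an assumed field name."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for r in records:
        by_domain[r["domain"]].append(r)
    train, val, test = [], [], []
    for group in by_domain.values():
        rng.shuffle(group)
        n = len(group)
        n_val, n_test = int(n * ratios[1]), int(n * ratios[2])
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test
```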
4. Tokenization and Preprocessing
Tokenization utilizes a Cyrillic-dense tokenizer derived from Qwen3, replacing 34,000 low-frequency non-Cyrillic tokens with new Cyrillic merges while maintaining a 128k vocabulary. This adaptation reduced the mean tokens per word in T-Wix from 2.70 to 2.26 and increased the proportion of words tokenized in two or fewer tokens from 52% to 65%.
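The two reported metrics (mean tokens per word and share of words split into at most two tokens) can be computed for any tokenizer with a short helper; the character-bigram tokenizer in the usage example below is a toy stand-in, not the Qwen3-derived tokenizer:

```python
def fertility_stats(words, tokenize):
    """Compute mean tokens per word and the share of words tokenized into
    two or fewer tokens. `tokenize` is any callable mapping a word to a
    list of tokens."""
    counts = [len(tokenize(w)) for w in words]
    mean_tokens = sum(counts) / len(counts)
    le2_share = sum(c <= 2 for c in counts) / len(counts)
    return mean_tokens, le2_share

# Toy tokenizer: split a word into 2-character chunks.
bigram = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]
```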
Corpus-wide and Wikipedia statistics are as follows:
| Text Source | Tokens/Word (before→after) | ≤2 Tokens (%) (before→after) |
|---|---|---|
| ruWiki | 3.12 → 2.38 | 38 → 60 |
| T-Wix | 2.70 → 2.26 | 52 → 65 |
Preprocessing steps included Unicode NFC normalization, context-sensitive lowercasing, removal of control/zero-width characters, sentence segmentation and context packing (≤32k tokens), and filtering of samples with invalid UTF-8 or excessively long tokens (>50 characters) (Stoianov et al., 11 Dec 2025).
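A minimal sketch of the character-level steps, covering NFC normalization, control/zero-width character removal, and the long-token filter; the exact character classes and whitespace handling are assumptions:

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def clean_text(text, max_token_len=50):
    """Apply Unicode NFC normalization, strip zero-width and control
    characters (keeping newlines and tabs), and reject texts containing any
    whitespace-delimited token longer than max_token_len. Returns None for
    rejected samples."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(
        ch for ch in text
        if ch not in ZERO_WIDTH
        and (ch in "\n\t" or not unicodedata.category(ch).startswith("C"))
    )
    if any(len(tok) > max_token_len for tok in text.split()):
        return None
    return text
```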
5. Licensing and Data Accessibility
T-Wix 500k is distributed under the Open Data Commons Attribution (ODC-By) license, permitting both academic and commercial utilization with attribution. It is available via the Hugging Face Hub at https://huggingface.co/datasets/t-tech/T-Wix, and can be loaded using:
```python
from datasets import load_dataset

ds = load_dataset("t-tech/T-Wix")
```
Accompanying metadata comprises instruction templates, domain tags, reward-model scores, and IFD assessments. The open licensing and clear data provenance facilitate reproducible research in Russian-language LLM development (Stoianov et al., 11 Dec 2025).
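These annotations make it straightforward to carve out task-specific subsets. The sketch below filters plain record dictionaries by domain tag and reward-model score; the field names `domain` and `rm_score` are assumptions for illustration, and the actual column names should be taken from the dataset card.

```python
def select_subset(records, domain=None, min_rm_score=None):
    """Filter records by annotated metadata. The 'domain' and 'rm_score'
    field names are illustrative placeholders, not confirmed column names."""
    out = []
    for r in records:
        if domain is not None and r.get("domain") != domain:
            continue
        if min_rm_score is not None and r.get("rm_score", float("-inf")) < min_rm_score:
            continue
        out.append(r)
    return out
```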
6. Applications and Research Significance
T-Wix 500k addresses key requirements for training high-quality, instruction-tuned Russian LLMs, supporting both balanced general-purpose SFT and explicit reasoning-task finetuning. Notable applications include:
- Fine-tuning LLMs for Russian instruction-following with strong generalization across factual, reasoning, code, and summarization tasks
- Evaluation of chain-of-thought and explainability benchmarks in Russian, using paired solution traces derived from teacher–student modeling paradigms
- Development of long-context LLM applications (e.g., summarization and document-level QA) leveraging >8k token prompts
- Bilingual model adaptation and benchmarking, given the 10% English sample share
Its construction paradigm provides rigorous filtering, diverse domain representation, and explicit annotation, enabling researchers to probe LLM instruction-following, reasoning generalization, and the effects of data curation on Russian-language LLMs (Stoianov et al., 11 Dec 2025).