T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

Published 11 Dec 2025 in cs.CL | (2512.10430v1)

Abstract: We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.


Authors (15)

Summary

The paper demonstrates that a targeted Cyrillic tokenizer and specialized reasoning SFT, combined with advanced speculative decoding, markedly improve token efficiency and reasoning performance in Russian LLMs.
It introduces the T-Wix corpus and rigorous reward-guided training that significantly boost performance metrics, outperforming comparable open and proprietary models on multiple benchmarks.
The system’s interactive playground and deployment on NVIDIA H100 validate T-pro 2.0’s low-latency, efficient inference capabilities for research and production applications.

T-pro 2.0: Design, Training, and Evaluation of an Efficient Russian Hybrid-Reasoning Model

Introduction and Motivation

T-pro 2.0 addresses a significant gap in the Russian-language LLM landscape by delivering an accessible, open-weight model capable of both direct answering and stepwise reasoning-trace generation. Prior LLMs for Russian were either proprietary or derived from multilingual foundations with suboptimal adaptation for Cyrillic-heavy content, resulting in degraded token efficiency and lower reasoning performance. T-pro 2.0 targets these bottlenecks along three axes: a Russian-centric tokenizer, a large-scale reasoning-focused supervised corpus, and a speculative decoding pipeline built for reduced inference latency.

Figure 1: Screenshot of the system demo of the T-pro 2.0 EAGLE.

System Architecture and Interactive Playground

T-pro 2.0 is publicly deployed as an interactive, research-oriented demo, supporting both reasoning ("think") and non-reasoning modes. The Web UI provides real-time side-by-side completion comparison (defaulting to Qwen3-32B-Instruct as a baseline), exposure of all inference parameters, and live telemetry (tokens per second, speculative acceptance rate, etc.), supporting reproducibility and granular analysis of model behavior and latency-performance trade-offs.

Backend inference is served via SGLang: T-pro 2.0 is paired with an EAGLE-style speculative sampling pipeline comprising a lightweight single-layer draft model and a 32B verifier, while the baseline uses standard AR decoding. Each endpoint runs on an NVIDIA H100, supporting interactive and batch workloads for cross-model evaluation at low latency.

Tokenization and Data Pipeline

A key change in T-pro 2.0 is the replacement of 34K low-frequency non-Cyrillic entries in the Qwen3 vocabulary with Cyrillic tokens. This substitution substantially reduces the average number of tokens per word for Russian and seven other Cyrillic-script languages, without degrading English tokenization efficiency.

Figure 2: T-pro design highlights Cyrillic-centric tokenizer architecture, optimizing token efficiency for Russian.

Comparative analysis shows the share of Russian words split into ≤2 tokens increases from 38% (Qwen3) to 60% (T-pro 2.0) on Wikipedia. The compression generalizes to all tested Cyrillic languages except Kazakh, placing T-pro 2.0 at or near SOTA in Cyrillic token segmentation among open models.
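The two segmentation metrics discussed here (average tokens per word and the share of words split into at most two tokens) can be computed as in the following minimal sketch. The `toy_tokenize` function is a hypothetical stand-in that cuts words into fixed-width chunks; it is not the T-pro or Qwen3 tokenizer, and any callable that maps a word to a list of pieces could be substituted.

```python
from typing import Callable, Iterable, List, Tuple


def segmentation_stats(
    words: Iterable[str],
    tokenize: Callable[[str], List[str]],
    max_pieces: int = 2,
) -> Tuple[float, float]:
    """Return (mean tokens per word, share of words split into <= max_pieces tokens)."""
    counts = [len(tokenize(w)) for w in words]
    if not counts:
        raise ValueError("no words given")
    mean_tokens = sum(counts) / len(counts)
    share = sum(c <= max_pieces for c in counts) / len(counts)
    return mean_tokens, share


def toy_tokenize(word: str, piece_len: int = 4) -> List[str]:
    # Hypothetical stand-in tokenizer: fixed-width character chunks.
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


words = ["кот", "молоко", "программирование"]
mean_tokens, share = segmentation_stats(words, toy_tokenize)
```

Running the same function with two different tokenizers over the same word list gives a like-for-like comparison of the kind reported above.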

Supervised Fine-Tuning (T-Wix)

The T-Wix SFT corpus comprises ∼500K samples, combining 468K general instructions with 30K distilled reasoning traces across math, science, code, and open-ended tasks. Strict domain/difficulty balancing and comprehensive deduplication (both semantic and exact-match) are employed at all stages, including MinHash-based filtering against benchmarks to eliminate leakage.
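MinHash-based filtering against benchmarks can be illustrated with a small standard-library sketch. The shingle size, the number of hash functions, and the 0.7 similarity threshold are assumptions chosen for illustration; the source does not state the values used for T-Wix.

```python
import hashlib
from typing import Iterable, List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Word n-grams of the lowercased text (the whole text if it has at most n words)."""
    tokens = text.lower().split()
    if len(tokens) <= n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash(items: Set[str], num_hashes: int = 64) -> List[int]:
    """One minimum per salted hash function; equal minima indicate overlapping sets."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def leaks(sample: str, benchmark_texts: Iterable[str], threshold: float = 0.7) -> bool:
    """True if the sample is a near-duplicate of any benchmark item (assumed threshold)."""
    sig = minhash(shingles(sample))
    return any(
        estimated_jaccard(sig, minhash(shingles(b))) >= threshold
        for b in benchmark_texts
    )


benchmark = ["Найдите все натуральные числа n для которых n в квадрате плюс один делится на пять"]
```

At corpus scale, signatures would normally be precomputed and indexed (for example with locality-sensitive hashing) rather than compared pairwise as done here.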

Figure 3: Token distribution in T-Wix, illustrating domain and difficulty balancing for SFT.

Reward model (RM) filtering and on-policy DPO preference optimization further refine alignment: 16 completions per instruction are scored to select contrastive pairs. In the reasoning split, reward-guided data selection deliberately targets a "zone of proximal development," retaining only samples with moderate teacher-student reward gaps and low intra-sample RM variance, so that the retained traces are easier to distill.
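One way to read the "zone of proximal development" filter is as a pair of thresholds on per-sample reward statistics. The sketch below is an interpretation, not the authors' procedure: the `Candidate` structure, the gap bounds, the variance cap, and the choice to measure variance over the teacher's RM scores are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean, pvariance
from typing import List


@dataclass
class Candidate:
    prompt: str
    teacher_rewards: List[float]  # RM scores of teacher completions for this prompt
    student_rewards: List[float]  # RM scores of student completions for this prompt


def in_zpd(
    c: Candidate,
    min_gap: float = 0.5,   # assumed: below this the student has little to learn
    max_gap: float = 2.0,   # assumed: above this the sample is too hard to distill
    max_var: float = 0.25,  # assumed cap on RM variance within the sample
) -> bool:
    """Keep samples with a moderate teacher-student gap and stable RM scores."""
    gap = mean(c.teacher_rewards) - mean(c.student_rewards)
    variance = pvariance(c.teacher_rewards)
    return min_gap <= gap <= max_gap and variance <= max_var


pool = [
    Candidate("moderate gap", [2.0, 2.0], [1.0, 1.0]),
    Candidate("gap too large", [4.0, 4.0], [1.0, 1.0]),
    Candidate("noisy rewards", [0.0, 4.0], [1.0, 1.0]),
]
kept = [c.prompt for c in pool if in_zpd(c)]
```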

Model Training and Adaptation

Midtraining and Tokenizer Adaptation

Instructional midtraining is performed for 40B tokens on a primarily Russian/English instruction mix. Empirical results indicate most adaptation to the dense Russian tokenizer occurs within the first ∼1k steps (∼4B tokens):

Figure 4: Midtraining loss, showing rapid adaptation to the new tokenizer followed by a plateau.

Ablations show that instruction-only midtraining, which drops inherited raw pretraining content in favor of exclusively instruction-formatted examples, produces superior downstream reasoning performance. On ruAIME’24, for instance, accuracy increases from 0.60 (mixed) to 0.67 (instruction-only). The model with the T-pro tokenizer marginally surpasses the original Qwen3 on MERA (0.574 vs. 0.560 at 8B scale), indicating that the denser Cyrillic segmentation incurs no loss of general capability.

Reward Model and Preference Training

A Qwen3-32B-based RM is trained with a Bradley–Terry objective using synthetic tournament-based annotations (knockout tournaments with pairwise comparisons). On Arena-Hard Best-of-N, this RM achieves a ΔBoN ("best–worst") of 22.21, decisively outperforming comparable open multilingual reward models, reflecting superior preference discrimination on real user queries.
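A Bradley–Terry objective is commonly written as the negative log-sigmoid of the score difference between the preferred and the rejected response. The sketch below assumes that standard form on scalar scores; the function names are illustrative, and the source does not describe batching or regularization details.

```python
import math
from typing import Iterable, Tuple


def bradley_terry_loss(score_winner: float, score_loser: float) -> float:
    """-log sigmoid(score_winner - score_loser), computed in a numerically stable way."""
    diff = score_winner - score_loser
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def mean_pairwise_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average loss over (winner_score, loser_score) pairs, e.g. from tournament matches."""
    losses = [bradley_terry_loss(w, l) for w, l in pairs]
    if not losses:
        raise ValueError("no pairs given")
    return sum(losses) / len(losses)
```

The loss equals log 2 when the two scores are equal and approaches zero as the winner's score exceeds the loser's by a growing margin.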

On-policy DPO is conducted over 100K high-contrast pairs, with a 90:10 split between general and reasoning instructions, enhancing coherence, alignment, and stepwise reasoning structure under realistic sampling.
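The pair construction and the preference loss can be sketched as follows. Selecting the highest- and lowest-scored completion and requiring a minimum score margin is an assumed reading of "high-contrast pairs"; the loss is the standard DPO formulation on sequence log-probabilities, with `beta` and `min_margin` chosen arbitrarily for illustration.

```python
import math
from typing import List, Optional, Tuple


def select_pair(
    completions: List[str],
    scores: List[float],
    min_margin: float = 1.0,  # assumed contrast threshold, not from the source
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) as the best- and worst-scored completion, or None."""
    if len(completions) != len(scores) or len(scores) < 2:
        raise ValueError("need at least two scored completions")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] - scores[worst] < min_margin:
        return None  # not contrastive enough to be a useful preference pair
    return completions[best], completions[worst]


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss: -log sigmoid(beta * (policy log-ratio gap vs. the reference))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In an on-policy setup the completions passed to `select_pair` would be sampled from the current policy and scored by the reward model before each pair is used for training.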

Speculative Decoding and Inference Efficiency

T-pro 2.0 integrates an EAGLE v2 speculative decoding module. The draft model (single Llama2-based layer with FR-Spec augmentation) is jointly trained for hidden-state reconstruction and token distribution alignment using KL divergence. Deployment with SGLang dynamic draft trees achieves, at temperature 0.8, an average speedup of 1.85× (standard mode) and 1.83× (reasoning mode) over standard AR decoding in interactive serving, with acceptance lengths above 3 tokens per verification. STEM domains exhibit greater acceleration (1.99×) than humanities (1.62×), paralleling more predictable token transitions in technical content.
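The draft-then-verify loop behind speculative decoding can be shown with a deliberately simplified toy. This sketch uses greedy decoding and a single chain of drafted tokens; EAGLE itself drafts from the verifier's hidden states and, in SGLang, uses dynamic draft trees with sampling, none of which is modeled here. The two `*_next` functions are hypothetical stand-ins for the draft model and the 32B verifier.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    prefix: List[int], draft_next: NextToken, target_next: NextToken, k: int = 4
) -> List[int]:
    """Draft k tokens, keep the longest prefix the verifier agrees with, add one verifier token."""
    drafted: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    emitted: List[int] = []
    ctx = list(prefix)
    # A real system checks all drafted positions in one batched verifier pass.
    for token in drafted:
        expected = target_next(ctx)
        if expected != token:
            emitted.append(expected)  # first mismatch: take the verifier's token and stop
            return emitted
        emitted.append(token)
        ctx.append(token)
    emitted.append(target_next(ctx))  # all drafts accepted: the verifier adds a bonus token
    return emitted


def generate(
    prefix: List[int], draft_next: NextToken, target_next: NextToken, n_tokens: int, k: int = 4
) -> Tuple[List[int], float]:
    """Return the generated sequence and the mean tokens emitted per verification step."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[: len(prefix) + n_tokens], n_tokens / steps


# Toy models: the verifier counts modulo 10; the draft is wrong after token 3.
target_next = lambda ctx: (ctx[-1] + 1) % 10
draft_next = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

tokens, tokens_per_step = generate([0], draft_next, target_next, n_tokens=12)
```

Under greedy decoding the output is identical to decoding with the verifier alone; the gain comes from emitting several tokens per verifier pass, which is why more predictable text tends to accelerate more.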

Benchmark Performance

Russian-Language Evaluation

T-pro 2.0 demonstrates competitive or superior performance relative to strong open and closed models at the 27B–32B scale:

General Knowledge: MERA 0.66, ruMMLU-Pro 0.697 (near GPT-4o at 0.714, above RuadaptQwen3-32B-Instruct)
Dialogue: Arena Hard Ru 91.1, WildChat Hard Ru 72.6, Arena Hard 2 (HP/CW) 53.5/64.2. All of these scores surpass equivalent-scale open and proprietary Russian models, consistently across both direct answering and reasoning modes.

Russian Reasoning

T-Math Benchmark: 0.541
ruAIME 2024/2025: 0.704 / 0.646
ruMATH-500: 0.94
Vikhr Math: 0.799

Scores on T-Math, a 331-problem collection from Russian olympiads, empirically differentiate T-pro 2.0 from adaptation-based and translation-based alternatives, and show that the benchmark is not yet saturated: no model exceeds 0.75 pass@1.

English Generalization

Despite its Russian-centric innovations, T-pro 2.0 maintains strong English proficiency:

AIME 2024: 0.765
MATH-500: 0.966
GPQA Diamond: 0.641

Results on the English splits remain within a narrow margin of, or exceed, comparable SFT or distilled backbones (Qwen3-32B, DeepSeek-R1-Distill-Qwen-32B).

Implications and Future Directions

T-pro 2.0 underscores the significance of dedicated tokenizer adaptation and instruction-focused midtraining for non-English LLMs, especially in morphologically rich and undersupported languages. The model’s open weights, dataset, and inference stack establish a reproducible, extensible baseline for further research, while the T-Math and T-Wix resources set new standards for rigorous evaluation and SFT corpus construction in Russian.

The efficient speculative decoding pipeline facilitates the deployment of large autoregressive models in interactive and production contexts without substantial architectural rework.

Potential future work includes:

Integration of agentic and tool-use abilities (e.g., function calling, toolformer-style augmentation)
Empirical validation and optimization for 128k-token contexts, leveraging RoPE scaling
Refinement of online/offline RL for alignment and robustness on out-of-distribution reasoning tasks
Draft model quantization and EAGLE v3 upgrades for further inference acceleration

Conclusion

T-pro 2.0 represents a technically rigorous solution for efficient, reasoning-capable Russian-language LLMs that balances architectural adaptation, corpus quality, and advanced inference methods. The system contributes open, high-quality models and datasets, together with a transparent evaluation playground. The results provide clear evidence that dedicated tokenizer engineering, comprehensive reasoning SFT, and practical inference optimization are essential for scaling LLM capabilities beyond English, and lay the groundwork for further advances in the Russian-language LLM ecosystem.

