
T-pro 2.0: Russian LLM for Hybrid Reasoning

Updated 12 December 2025
  • T-pro 2.0 is an open-weight Russian LLM that supports dual generation modes, offering both direct answers and explicit chain-of-thought reasoning in a single checkpoint.
  • It features a Cyrillic-dense tokenizer that reduces average tokens per word, enhancing efficiency in processing Russian and other Slavic languages.
  • The system uses an EAGLE speculative decoding pipeline to achieve low-latency inference, delivering speedups of roughly 1.6–2.25× across diverse benchmarks.

T-pro 2.0 is an open-weight Russian LLM optimized for efficient hybrid reasoning and inference, released with full model weights, fine-tuning datasets, benchmarking resources, and an inference accelerator pipeline. The system leverages a heavily customized pipeline—including a Cyrillic-dense tokenizer and EAGLE-style speculative decoding—to enable practical, low-latency deployment for both direct answering and explicit reasoning-trace generation. Developed on top of the Qwen3-32B decoder-only architecture and released with permissive licenses, T-pro 2.0 provides a reproducible research platform for Russian-language LLM reasoning, benchmarking, and application development (Stoianov et al., 11 Dec 2025).

1. Model Architecture, Training Regimes, and Tokenizer

T-pro 2.0 is based on the Qwen3-32B decoder-only transformer, retaining its parameterization (≈32B parameters, 64 layers, hidden size 5120, grouped-query attention with 64 query and 8 key-value heads, native 32k-token context, RoPE-scalable to 128k). All training stages—midtraining, supervised fine-tuning (SFT), and preference alignment—are conducted at 32k-token context using Adam/AdamW optimizers (cosine LR decay, fully sharded FSDP, BF16 precision, gradient clipping) on NVIDIA H100 hardware.

Instructional midtraining employs 40B tokens, partitioned as 49% Russian, 36% English, 5.5% parallel (en–ru), and 9.3% code data, with domain balancing (34.6% reasoning, 28.8% general QA, 16.2% math, and others). Dataset cleaning uses MinHash deduplication (including decontamination against benchmarks) and InsTag semantic filters.
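The MinHash-based deduplication step can be illustrated with a minimal, self-contained sketch (not the paper's pipeline): near-duplicate documents are flagged when their MinHash signatures agree on most slots, which approximates the Jaccard similarity of their shingle sets.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Character n-gram shingles of a document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(sh: set, num_perm: int = 64) -> list:
    """For each of num_perm seeded hash functions, keep the minimum
    hash over the shingle set (the classic MinHash signature)."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in sh))
    return sig

def estimated_jaccard(a: list, b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc1 = "T-pro 2.0 is an open-weight Russian LLM for hybrid reasoning."
doc2 = "T-pro 2.0 is an open-weight Russian LLM for hybrid reasoning!"
sig1, sig2 = minhash_signature(shingles(doc1)), minhash_signature(shingles(doc2))
print(estimated_jaccard(sig1, sig2))  # near-duplicates -> similarity close to 1
```

A production pipeline would bucket signatures with locality-sensitive hashing instead of comparing all pairs, but the similarity estimate is the same.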

A crucial innovation is the Cyrillic-dense tokenizer: derived from Qwen3's vocabulary, 34k low-frequency non-Cyrillic tokens are substituted with 34k high-frequency Cyrillic merges drawn from Qwen3, RuAdapt, cl100k_base, and mGPT. This reduces average tokens/word from 3.12→2.38 (ruWiki) and 2.70→2.26 (T-Wix), substantially increasing the share of “≤2 token” words and yielding similar compression benefits for other Slavic languages.
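The reported tokens-per-word metric is straightforward to measure. Below is a minimal sketch using stand-in tokenizers for illustration only; the real comparison would load the released T-pro 2.0 tokenizer and the Qwen3 baseline from Hugging Face.

```python
def avg_tokens_per_word(tokenize, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word —
    the compression metric reported for the Cyrillic-dense tokenizer."""
    words = text.split()
    return sum(len(tokenize(w)) for w in words) / len(words)

# Stand-in tokenizers (hypothetical), chosen to bracket the behaviour:
def char_pairs(word):
    """Coarse baseline: split a word into 2-character pieces."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def whole_word(word):
    """Idealized dense tokenizer: one token per word."""
    return [word]

text = "Москва это столица России"  # "Moscow is the capital of Russia"
print(avg_tokens_per_word(char_pairs, text))   # 3.0
print(avg_tokens_per_word(whole_word, text))   # 1.0
```

With the actual tokenizers, this metric is what yields the 3.12→2.38 (ruWiki) and 2.70→2.26 (T-Wix) figures above.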

Supervised fine-tuning uses the T-Wix corpus: ~500k instructions, comprising 468k general and 30k explicit reasoning tasks. The general subset is filtered for diversity and complexity (across math, code, science, etc.), while the reasoning set is derived from English seed tasks, translated, augmented with teacher traces from Qwen3-235B, and pruned for "zone of proximal development" alignment. SFT runs for 2 epochs at 32k context with Adam (LR 1e-6), cosine LR decay, warmup ratio 0.1, BF16, and batch size 32.

Preference tuning via Direct Preference Optimization (DPO) is run on 100k instructions, using model-sampled completions to form best/worst pairs, with β=0.5, AdamW (LR 1e-7), and 1 epoch. This improves alignment for both direct and reasoning answers without introducing distribution shift.

2. Hybrid Reasoning Modes and Behavioral Control

T-pro 2.0 uniquely supports dual generation modes in a single checkpoint. In standard mode, the model outputs direct, concise answers. When a “chain-of-thought” is explicitly requested or a “think” flag is set, it generates stepwise, teacher-guided reasoning traces. Both direct and traced samples are present in SFT and DPO data, enabling robust switching between behaviors without separate weights.

At inference, toggling a reasoning mode flag enables transparent switching between direct and explicit reasoning generation. This is exploited in both API and web demo deployments.
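A minimal sketch of what the request-side toggle might look like, assuming the checkpoint follows Qwen3's `enable_thinking` chat-template convention (an assumption; the token budgets are illustrative). Because both behaviors live in one checkpoint, only the request changes, never the weights.

```python
def build_request(prompt: str, think: bool) -> dict:
    """Build generation kwargs for direct vs. explicit-reasoning mode.
    `enable_thinking` follows the Qwen3 chat-template convention, which
    T-pro 2.0 is assumed here to inherit from its base model."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": think},
        # Reasoning traces need a larger budget than direct answers.
        "max_new_tokens": 2048 if think else 256,
    }

direct = build_request("Вычисли 17*23.", think=False)  # "Compute 17*23."
traced = build_request("Вычисли 17*23.", think=True)
print(direct["chat_template_kwargs"], traced["chat_template_kwargs"])
```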

3. Speculative Inference: EAGLE Pipeline and Acceleration

Low-latency inference utilizes an adapted EAGLE speculative-decoding protocol. The draft model is a one-layer Llama-2-style decoder (with an FR-Spec head) trained to minimize the composite loss

$$\mathcal{L} = \lambda_{\text{MAE}} \|\mathbf{h}_\phi - \mathbf{h}_\theta\|_1 + \lambda_{\text{MSE}} \|\mathbf{h}_\phi - \mathbf{h}_\theta\|_2^2 + \lambda_{\text{KL}}\, D_{\mathrm{KL}}(p_\phi \,\|\, p_\theta),$$

whose terms penalize mismatches in hidden states and next-token distributions between the draft ($\phi$) and target ($\theta$) models. Training is performed for 4 epochs on 8 × H100 (BF16, TF32).
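The composite objective can be sketched in plain Python on toy vectors; the λ weights and inputs are illustrative, not the paper's values.

```python
import math

def composite_draft_loss(h_draft, h_target, p_draft, p_target,
                         l_mae=1.0, l_mse=1.0, l_kl=1.0):
    """Composite draft-training objective: L1 and squared-L2 terms on
    hidden states, plus KL(p_draft || p_target) on next-token
    distributions, combined with weights lambda."""
    mae = sum(abs(a - b) for a, b in zip(h_draft, h_target))
    mse = sum((a - b) ** 2 for a, b in zip(h_draft, h_target))
    kl = sum(p * math.log(p / q) for p, q in zip(p_draft, p_target) if p > 0)
    return l_mae * mae + l_mse * mse + l_kl * kl

loss = composite_draft_loss(
    h_draft=[0.1, -0.2], h_target=[0.0, -0.1],
    p_draft=[0.7, 0.3], p_target=[0.6, 0.4],
)
print(round(loss, 4))  # 0.2416
```

When the draft exactly matches the target in both hidden states and token distribution, every term vanishes and the loss is zero.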

The speculative decoding algorithm samples $k$ tokens from the draft model, checks their acceptability via likelihood ratios against the frozen 32B verifier, and falls back to the main model on rejection, yielding

$$\text{Speedup} = \frac{1 + k}{\rho k + 1},$$

with $\rho$ the acceptance rate. This dynamic tree protocol is integrated through SGLang to serve all production endpoints.
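The speedup expression can be evaluated directly; the values of $k$ and $\rho$ below are illustrative only, with $\rho$ used exactly as defined above.

```python
def speculative_speedup(k: int, rho: float) -> float:
    """Speedup from the stated formula: (1 + k) / (rho * k + 1),
    for draft length k and rate rho."""
    return (1 + k) / (rho * k + 1)

# Illustrative parameter sweep (values not taken from the paper):
for rho in (0.2, 0.5, 0.8):
    print(f"k=4, rho={rho}: {speculative_speedup(4, rho):.2f}x")
```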

Empirically, at batch size 1 and temperature 0.8, speedups range from 1.57× to 2.25× across benchmarks (see below).

4. Evaluation: Benchmarks and Domain Performance

T-pro 2.0 is evaluated on a broad suite of Russian and English benchmarks:

  • T-Math (331 All-Russian and Moscow math olympiad problems, pass@1, numeric answer): 0.541 (Qwen3-32B baseline 0.529; strongest reference models: o4-mini-high 0.73, DeepSeek-R1 0.71).
  • Russian knowledge and reasoning:
    • MERA: 0.66 (Qwen3 0.582, YandexGPT-5 0.642)
    • ruMMLU-Pro: 0.697 (Qwen3 0.677, GPT-4o 0.714)
    • Dialogue (Arena Hard Ru): 91.1/90.36 (reasoning/non-reasoning; Qwen3 83.95/84.66)
    • ruAIME 2024/25: 0.704/0.646 (DeepSeek-V3 0.319/0.285; GPT-4o 0.090/0.069)
    • ruMATH-500: 0.94 (Qwen3 0.938)
  • English benchmarks:
    • AIME 2024/25: 0.765/0.679 (Qwen3 0.808/0.725)
    • MATH-500: 0.966 (Qwen3 0.961)
    • GPQA Diamond: 0.641 (Qwen3 0.668)
    • LiveCodeBench: 0.556 (Qwen3 0.546)

Speculative decoding (mean acceptance length ≈3.39 tokens per draft step) nearly doubles throughput in STEM (from ~108 to ~216 tokens/s), with humanities, code, and reasoning tasks also accelerated.

5. Released Resources and Ecosystem

The T-pro 2.0 release is fully reproducible and extensible, with all essential artifacts published:

| Resource | Type | Location | License |
|---|---|---|---|
| T-pro 2.0 | Model weights | https://huggingface.co/t-tech/T-pro-it-2.0 | Apache-2.0 |
| EAGLE weights | Draft model | https://huggingface.co/t-tech/T-pro-it-2.0-eagle | Apache-2.0 |
| T-Wix | 500k SFT dataset | https://huggingface.co/datasets/t-tech/T-Wix | ODC-By |
| T-Math | Benchmark | https://huggingface.co/datasets/t-tech/T-math | Apache-2.0 |

Code usage is demonstrated via standard Hugging Face transformers or SGLang for speculative decoding:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("t-tech/T-pro-it-2.0", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("t-tech/T-pro-it-2.0", torch_dtype=torch.bfloat16)

# Direct-answer mode
out = model.generate(
    **tokenizer("Расскажи анекдот.", return_tensors="pt"),  # "Tell a joke."
    max_new_tokens=128, do_sample=True, temperature=0.7
)

# Explicit reasoning request
prompt = "Вычисли: 17*23. Покажи ход решения."  # "Compute 17*23. Show your work."
out = model.generate(
    **tokenizer(prompt, return_tensors="pt"),
    max_new_tokens=256, do_sample=True, temperature=0.8
)
For EAGLE speculative decoding via SGLang:

from sglang import SpeculativeDecoder

spec_decoder = SpeculativeDecoder.from_pretrained(
    "t-tech/T-pro-it-2.0-eagle",
    max_draft_length=4, temperature=0.8
)
# "Explain the principle of relativity."
response = spec_decoder.generate("Объясни принцип относительности.", max_new_tokens=200)
A public web demo at http://t-pro-2-0.streamlit.app provides model comparisons, reasoning mode toggling, telemetry (latency, tokens/sec), and evaluation across multiple domains.

6. Significance, Impact, and Deployment

T-pro 2.0 establishes a new open baseline for Russian hybrid-reasoning LLMs, distinguished by:

  • Dual-mode reasoning generation fully integrated into a single checkpoint
  • Efficient, Cyrillic-optimized tokenization (with measured compression and token count reductions)
  • Latency mitigation for large models via EAGLE speculative decoding, with up to 2× speedup in practical inference
  • High performance on Russian benchmarks, competitive English math/code metrics, and clear documentation of training, evaluation, and tokenizer design
  • Open weights, datasets, fine-tuning, and inference code, facilitating research, transfer learning, and rapid adaptation

The openly available resources and architectural transparency enable rigorous study of Russian-language LLM reasoning, tokenization effects, and inference acceleration. A plausible implication is that release of T-pro 2.0 lowers barriers for subsequent research on Slavic language LLMs and benchmarking, and facilitates end-user applications requiring low-latency, high-accuracy, and controllable reasoning (Stoianov et al., 11 Dec 2025).
