Sherkala-Chat (8B): Open Kazakh LLM

Updated 27 February 2026
  • Sherkala-Chat (8B) is an instruction-tuned large language model designed for Kazakh, with strong multilingual capabilities in English and Russian.
  • It incorporates architectural modifications such as an extended vocabulary and multilingual embedding initialization to reduce tokenization cost and improve text representation.
  • Its training regimen uses a large, balanced multilingual corpus and rigorous safety alignment to boost performance and mitigate potential harms.

Sherkala-Chat (8B), formally Llama-3.1-Sherkala-8B-Chat, is an instruction-tuned, open-weight LLM for the Kazakh language. Developed as a derivative of Llama-3.1-8B, the model features dedicated architectural, data-centric, and safety alignment modifications to facilitate high-fidelity language modeling for Kazakh, while maintaining robust performance in English and Russian. Its deployment addresses the historic underrepresentation of Kazakh in contemporary LLM research and provides an open resource for both scientific investigation and real-world applications (Koto et al., 3 Mar 2025).

1. Model Architecture and Parameterization

Sherkala-Chat (8B) is a decoder-only transformer with 8 billion parameters (θ ∈ ℝ^{8×10⁹}), inheriting a dense, causal-attention architecture from Llama-3.1-8B, with 32 attention heads in each of its 32 decoder blocks. Two critical modifications differentiate Sherkala-Chat from its foundation model. First, its vocabulary was extended by roughly 25% (128,256 → 159,766 BPE tokens) through the addition of the most frequent Kazakh, Russian, and Turkish tokens. This adjustment reduced the Kazakh fertility rate (average tokens per word) from 4.73 to 2.04, cutting the tokenization cost of Kazakh text by more than half. Second, the new tokens' embeddings were initialized via WECHSEL-style multilingual initialization: for both the input and output layers, each new token's embedding is the average of its top-5 semantically similar base-token embeddings, with similarity determined by OpenAI's text-embedding-3-large. These changes significantly enhance the model's capacity to represent and generate Kazakh text (Koto et al., 3 Mar 2025).
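The averaging step of this initialization can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the similarity matrix here is random toy data, whereas the actual similarities came from text-embedding-3-large.

```python
import numpy as np

def init_new_embeddings(base_emb, new_token_sims, k=5):
    """Initialize embeddings for newly added tokens by averaging the
    embeddings of their top-k most similar base-vocabulary tokens.

    base_emb:       (V_base, d) embedding matrix of the original vocabulary.
    new_token_sims: (V_new, V_base) similarity of each new token to every
                    base token (e.g. cosine similarity from an external
                    embedding model).
    """
    # Indices of the k most similar base tokens for each new token.
    topk = np.argsort(new_token_sims, axis=1)[:, -k:]
    # base_emb[topk] has shape (V_new, k, d); average over the k neighbors.
    return base_emb[topk].mean(axis=1)

# Toy example: 4 base tokens in 3-d space, 2 new tokens, top-2 averaging.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 3))
sims = rng.random(size=(2, 4))
new_emb = init_new_embeddings(base, sims, k=2)
```

The same matrix would be used for both the input embedding table and the output (unembedding) projection, per the description above.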

2. Training Data and Pretraining Regimen

Continued pretraining was performed on a rigorously curated 45.3 billion-token multilingual corpus. Data was sourced and balanced as follows:

| Language Group    | Token Count (×10⁹) | Proportion |
|-------------------|--------------------|------------|
| Kazakh            | 19.45              | 43.0%      |
| English           | 19.45              | 43.0%      |
| Russian + Turkish | 6.40               | 14.1%      |
| Total             | 45.3               | 100%       |

Sources included open Kazakh text (news, wikis, educational corpora, code, synthetic translations) and high-quality academic content in English, Russian, and Turkish. Preprocessing comprised Unicode and HTML normalization, language-specific filtering, script normalization, removal of URLs/citations/JS, and fuzzy deduplication using locality-sensitive hashing. Ablation studies demonstrated the optimality of a 3 : 1 : 3 mix among Kazakh, Russian+Turkish, and English, preventing catastrophic forgetting of major languages while maximizing Kazakh representation. This suggests a highly data-driven approach to balancing minority and majority language competencies (Koto et al., 3 Mar 2025).
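The fuzzy-deduplication step can be illustrated with a minimal MinHash-plus-banding sketch in pure Python. The shingle size, signature length, and banding parameters below are illustrative choices, not the paper's reported configuration.

```python
import hashlib
import re

def shingles(text, n=3):
    """Character n-grams of a lightly normalized string."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(sh, num_perm=64):
    """MinHash signature: for each of num_perm seeded hash functions,
    keep the minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in sh))
    return sig

def near_duplicates(docs, bands=16, rows=4, num_perm=64):
    """LSH banding: documents whose signatures agree on any full band
    become near-duplicate candidates."""
    buckets, pairs = {}, set()
    sigs = [minhash(shingles(d), num_perm) for d in docs]
    for i, sig in enumerate(sigs):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            for j in buckets.setdefault(key, []):
                pairs.add((j, i))
            buckets[key].append(i)
    return pairs

docs = ["The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumps over the lazy dog!",
        "Completely unrelated sentence about language models."]
pairs = near_duplicates(docs)
```

Here the first two documents collide in at least one band and are flagged as candidates, while the unrelated third document is not; a production pipeline would then verify candidates and drop one copy.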

3. Instruction Tuning and Fine-Tuning Protocol

Supervised instruction fine-tuning (SFT) converted the pretrained base into a chat assistant. The instruction datasets comprised:

| Language       | Dialogue Pairs (M) | Notes |
|----------------|--------------------|-------|
| Kazakh         | 3.5                | mostly paraphrased, with cultural/government Q&As |
| English        | 3.8                | sourced from Tulu-3, Jais, and proprietary sources |
| Russian        | 0.26               | from Grandmaster-PRO-MAX |
| Safety prompts | 0.3                | 200K Kazakh, 100K English; adversarial and over-refusal |

Fine-tuning involved formatting dialogues in the Llama-3.1 chat template and training for three epochs over approximately 2.79 billion unique tokens using a batch size of 120 sequences, a peak learning rate of 7.5 × 10⁻⁵, 1% linear warm-up, and cosine decay to 1.5 × 10⁻⁶, minimizing the cross-entropy loss

\mathcal{L}_{\mathrm{CE}} = -\sum_{t}\log p_\theta(y_t \mid y_{<t}, x).
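A minimal numerical illustration of this loss, with toy logits over a 5-token vocabulary rather than the model's actual vocabulary:

```python
import numpy as np

def sequence_cross_entropy(logits, targets):
    """L_CE = -sum_t log p(y_t | y_<t, x), with p given by a softmax
    over the vocabulary at each target position."""
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick out the log-probability of each target token and sum.
    return -log_probs[np.arange(len(targets)), targets].sum()

# Three target positions; the correct token has the largest logit each time.
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 4.0, 0.1]])
targets = np.array([0, 1, 3])
loss = sequence_cross_entropy(logits, targets)
```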

No curriculum learning was necessary as SFT sufficed for alignment across Kazakh, English, and Russian (Koto et al., 3 Mar 2025).
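The learning-rate schedule described above can be sketched as follows; the total step count is a placeholder, since the paper reports only epochs, token count, and batch size.

```python
import math

def sft_lr(step, total_steps, peak=7.5e-5, floor=1.5e-6, warmup_frac=0.01):
    """Linear warm-up over the first 1% of steps, then cosine decay
    from the peak learning rate down to the floor."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at the start, end of warm-up, midpoint, and final step.
lrs = [sft_lr(s, 10_000) for s in (0, 100, 5_000, 10_000)]
```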

4. Safety Alignment and Harm Mitigation

Sherkala-Chat’s safety module was trained using a multilingual "Do-Not-Answer" dataset covering 17 harm types (including information hazards, malicious misuse, discrimination, misinformation, and region-specific sensitivities). The dataset included 100K adversarial attack prompts, generated via eight established techniques, and was further expanded through human- and LLM-augmented Kazakh translations. Safety-focused dialogues were interleaved at a 95% ratio during instruction fine-tuning, encouraging both refusals and nuanced responses. The release criteria required that model responses meet the safety threshold

S_{\mathrm{safety}} = \frac{\#\,\text{harmless ratings}}{\#\,\text{total prompts}} \ge 0.8,

as determined by rule-based filtering or a lightweight BERT-style safety classifier. Evaluation by GPT-4o produced safety scores of 91.9% (Kazakh), 85.1% (Russian), and 96.0% (English), corroborated by human validation with over 90% agreement (Koto et al., 3 Mar 2025). This suggests effective cross-lingual safety alignment.
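The release gate reduces to a simple ratio check, sketched here with illustrative rating labels standing in for the classifier's output:

```python
def safety_score(ratings, threshold=0.8):
    """S_safety = (# harmless ratings) / (# total prompts); the release
    gate requires S_safety >= threshold."""
    score = sum(1 for r in ratings if r == "harmless") / len(ratings)
    return score, score >= threshold

# Toy example: 9 of 10 responses judged harmless -> score 0.9, gate passes.
score, passed = safety_score(["harmless"] * 9 + ["harmful"])
```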

5. Evaluation, Benchmarks, and Emerging Capabilities

Sherkala-Chat (8B) was benchmarked against contemporary open models—BLOOM (7.1B), Qwen-2.5 (7B), mGPT (13B), Llama-3.1-Instruct (8B), Llama-3.1-KazLLM-1.0 (8B), Irbis-v0.1 (7B)—on knowledge, reasoning, and misinformation tasks across Kazakh, Russian, and English. Notable Kazakh zero-shot accuracy results include:

| Model                     | AVG  | KazMMLU | MMLU | BoolQ | SIQA | HellaSwag |
|---------------------------|------|---------|------|-------|------|-----------|
| Llama-3.1 (8B)            | 39.8 | 38.3    | 31.3 | 63.7  | 38.1 | 37.8      |
| Llama-3.1-KazLLM-1.0 (8B) | 43.7 | 37.0    | 31.5 | 69.8  | 44.7 | 46.0      |
| Sherkala-Chat (8B)        | 47.6 | 41.4    | 34.6 | 75.8  | 48.1 | 55.2      |

In English, Sherkala-Chat (8B) achieved an average test accuracy of 59.1%, closely trailing Qwen-2.5-Instruct (62.1%) and Llama-3.1-Instruct (60.1%) across MMLU, PIQA, BoolQ, ARC, OpenBookQA, TruthfulQA, and CrowS-Pairs. In Russian, its average was 32.0%, surpassing Llama-3.1-Instruct (31.5%) and narrowing the gap to Qwen-2.5-Instruct (38.5%). Open-ended text generation, assessed by GPT-4o on the Vicuna-80 and MT-80 tasks, yielded Kazakh ratings of 5.99 ± 2.73 (MT) and 7.39 ± 1.89 (Vicuna) out of 10, exceeding both Llama-3.1-Instruct and KazLLM-1.0 and rivaling proprietary baselines in detail, relevance, and grammaticality. This establishes Sherkala-Chat (8B) as the open-weight state of the art for Kazakh (Koto et al., 3 Mar 2025).

6. Capabilities, Limitations, and Language Behaviors

Sherkala-Chat demonstrates strong factuality and reasoning in Kazakh, doubling the performance of prior open models on high-school MMLU and commonsense reasoning. Enhanced vocabulary and embedding initialization produce coherent, context-aware text with low fertility, supporting longer dialogues and lower latency. Targeted safety fine-tuning enables robust handling of sensitive Kazakh-specific topics.

However, there are persistent limitations: the model occasionally hallucinates rare Kazakh names or references fictitious sources, especially under time constraints; some responses reveal mild English-centric biases and preferentially employ English cultural analogies in multilingual settings; and the model’s proficiency in lower-resource Turkic dialects beyond Kazakh and Russian remains limited (Koto et al., 3 Mar 2025). A plausible implication is that further data curation or language-adaptive scaling would be necessary for broader Turkic coverage.

7. Release Strategy and Use Cases

Sherkala-Chat (8B) is released under a CC-BY-NC-SA 4.0 license on the Hugging Face platform (inceptionai/Llama-3.1-Sherkala-8B-Chat), complete with Transformers library compatibility and reference APIs supporting up to 8K context tokens, temperature tuning, and safety filtering. Notable real-world applications include government and civic chatbots delivering public service information in Kazakh, educational tutors for secondary education, bilingual customer support bridging Kazakh and English, summarization of cultural heritage documents, and cross-lingual content moderation for Kazakh social media. Through comprehensive benchmarks, transparent methodology, and open weights, Sherkala-Chat (8B) is positioned as a foundation for ongoing research and deployment in Kazakh-language environments (Koto et al., 3 Mar 2025).
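In practice one would call `tokenizer.apply_chat_template` from the Transformers library; purely as an illustrative sketch, the Llama-3.1 chat format that Sherkala-Chat inherits can also be assembled by hand. The example question ("Which city is the capital of Kazakhstan?") is in Kazakh.

```python
def build_llama31_prompt(messages):
    """Format a chat history with the Llama-3.1 special tokens, ending
    with an open assistant header so the model generates the reply."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                     f"{m['content']}<|eot_id|>")
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_llama31_prompt([
    {"role": "system", "content": "You are a helpful Kazakh assistant."},
    {"role": "user", "content": "Қазақстанның астанасы қай қала?"},
])
```

Using the tokenizer's own template is preferred in deployment, since it stays in sync with the model's special-token IDs and any template revisions shipped with the checkpoint.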
