Mi:dm 2.0: Korea-Centric Bilingual LLM

Updated 21 January 2026
  • Mi:dm 2.0 is a Korea-centric bilingual LLM that uniquely integrates Korean linguistic specificity and cultural nuances with modern Transformer technology.
  • Its architecture is a decoder-only Transformer with SiLU activations and grouped-query attention, and it supports a 32K-token context window in both the Base and Mini variants.
  • A custom tokenizer, curated data pipeline, and domain-aware curriculum enable superior performance on Korean-centric benchmarks and diverse application domains.

Mi:dm 2.0 is a Korea-centric bilingual LLM engineered to integrate Korean cultural and linguistic specificity with state-of-the-art general-purpose LLM technology. Developed to address the systemic underrepresentation and poor curation of Korean data in existing models, Mi:dm 2.0 offers both architectural and data-centric innovations. It is released in two configurations, Base (11.5B parameters) and Mini (2.3B parameters), both built on a decoder-only Transformer backbone and equipped to process Korean and English with nuanced cultural comprehension, making it especially suitable for domestic industry, public service, and academic applications (Shin et al., 14 Jan 2026).

1. Model Architecture and Scaling Strategy

Mi:dm 2.0 is a decoder-only Transformer with SiLU activations, rotary position embeddings (RoPE, base frequency $8\times10^6$), and grouped-query attention (GQA), and it supports a 32K-token context window. Both the Base and Mini configurations comprise 48 layers and 32 attention heads, differing primarily in hidden and feedforward dimensions:

| Variant | Layers ($L$) | $d_\mathrm{model}$ | $d_\mathrm{ff}$ | #Heads | Vocab size ($\lvert V\rvert$) |
|---|---|---|---|---|---|
| Base (11.5B) | 48 | 4096 | 14,336 | 32 | 131,384 |
| Mini (2.3B) | 48 | 1792 | 4,608 | 32 | 131,392 |

The parameter count is given by

$$N_\mathrm{params} = L\left(2d^2 + 4d\,d_\mathrm{ff}\right) + d\times|V| + d\times|V|$$

where $L$ is the number of layers, $d$ the model dimension, $d_\mathrm{ff}$ the MLP dimension, and $|V|$ the vocabulary size.
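As a quick arithmetic check, the formula can be evaluated directly on the two configurations from the table. The sketch below is illustrative only: the `Config` class and `stated_param_count` function are hypothetical names, and reading the two $d\times|V|$ terms as an input embedding plus an output head is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    layers: int   # L
    d_model: int  # d
    d_ff: int     # d_ff
    vocab: int    # |V|


def stated_param_count(cfg: Config) -> int:
    """Evaluate N = L(2d^2 + 4 d d_ff) + 2 d |V| exactly as written in the text."""
    per_layer = 2 * cfg.d_model**2 + 4 * cfg.d_model * cfg.d_ff
    # Assumed reading: one d x |V| term for the input embedding, one for the output head.
    embeddings = 2 * cfg.d_model * cfg.vocab
    return cfg.layers * per_layer + embeddings


# Values copied from the configuration table.
BASE = Config("Base", 48, 4096, 14_336, 131_384)
MINI = Config("Mini", 48, 1792, 4_608, 131_392)

if __name__ == "__main__":
    for cfg in (BASE, MINI):
        print(f"{cfg.name}: {stated_param_count(cfg) / 1e9:.2f}B parameters")
```

Evaluated this way, the formula gives roughly 2.36B for Mini and roughly 13.96B for Base, so it appears to be a coarse estimate rather than an exact reproduction of the headline 11.5B figure.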

Mi:dm 2.0 Base uses “Depth-Up Scaling” (DuS), starting from a 32-layer, ~8B-parameter backbone: 16 layers drawn from layers 7–29, where the cosine similarity between consecutive layer embeddings exceeds $0.99$, are duplicated, yielding a 48-layer, 11.5B-parameter model. This exploits “representational islands” to achieve parameter efficiency without full retraining.
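The layer-selection step can be pictured with a small sketch. This is not the authors' procedure: the function name `depth_up_scale`, the rule of taking the highest-scoring candidates, the placement of each copy directly after its original, and the synthetic similarity scores are all assumptions made for illustration.

```python
def depth_up_scale(num_layers, similarity, lo, hi, threshold, budget):
    """Return a layer order in which `budget` selected layers appear twice.

    similarity[i] is a (hypothetical) cosine similarity between the hidden
    states leaving layer i and layer i + 1. Layers in [lo, hi] whose score
    exceeds `threshold` are candidates; the highest-scoring ones are duplicated.
    """
    candidates = [i for i in range(lo, hi + 1) if similarity[i] > threshold]
    if len(candidates) < budget:
        raise ValueError("not enough layers above the similarity threshold")
    ranked = sorted(candidates, key=lambda i: similarity[i], reverse=True)
    chosen = set(ranked[:budget])

    order = []
    for i in range(num_layers):
        order.append(i)
        if i in chosen:
            order.append(i)  # the copy would be initialised from layer i's weights
    return order


# Synthetic scores for a 32-layer stack: high similarity only inside layers 7..29.
scores = [0.991 + 0.0001 * i if 7 <= i <= 29 else 0.90 for i in range(32)]
order = depth_up_scale(32, scores, lo=7, hi=29, threshold=0.99, budget=16)
print(len(order), order)
```

With 32 original layers and a budget of 16 duplicates, the returned order has 48 entries, matching the depth reported for Base.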

2. Data Pipeline and Curriculum Learning

The data pipeline is engineered for balanced coverage and high quality, emphasizing Korean-centric content and domain representation:

  • Organic data (~85.7%): Sourced from Common Crawl through an eight-stage Korean-specific preprocessing pipeline (TF–IDF deduplication, punctuation heuristics, n-gram filtering, Unicode fixes, ensemble quality and toxicity classifiers, line deduplication, and PII anonymization), as well as from licensed news, books, Korean government documents, academic texts, AIHub, and NIKL datasets.
  • Synthetic augmentation (~14%): Enriches underrepresented fields (STEM, applied science, culture) using high-fidelity translations, textbook-style Korean chain-of-thought (“LongCoT”) data, two-stage LLM regeneration (topic extraction + rewriting), and Koreanized entrance-exam style data from English sources.
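Two of the preprocessing stages named above, line deduplication and PII anonymization, are simple enough to sketch. The version below is a toy illustration under stated assumptions: the regular expressions, the placeholder tags `<EMAIL>` and `<PHONE>`, and the function names are hypothetical and are not taken from the Mi:dm 2.0 pipeline.

```python
import re

# Hypothetical patterns: a loose e-mail matcher and a Korean mobile-number matcher.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b01[016789]-?\d{3,4}-?\d{4}\b")


def dedup_lines(text: str) -> str:
    """Drop non-empty lines that repeat an earlier line (ignoring outer whitespace)."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and mobile numbers with placeholder tags."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


if __name__ == "__main__":
    raw = "공지사항\n문의: kim@example.com, 010-1234-5678\n공지사항"
    print(mask_pii(dedup_lines(raw)))
```

A production pipeline would need far broader PII coverage (names, resident registration numbers, addresses) and fuzzy rather than exact duplicate detection; the sketch only shows where such steps sit in the flow.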

A domain-aware curriculum is adopted: each document is tagged by a classifier into one of six top-level domains (Humanity, STEM, Applied Science, Health & Food, Life & Culture, ETC). Stage 2 pre-training interleaves data sources via a linear schedule:

$$\alpha_\mathrm{org}(t) = 1 - \frac{t}{T}, \quad \alpha_\mathrm{syn}(t) = \frac{t}{T}, \quad 0 \le t \le T$$

This fosters general linguistic grounding in early steps and targeted domain skills in later training.
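The linear schedule translates directly into a per-step sampling rule. The following is a minimal sketch: the function names and the idea of drawing one source per batch with probability $\alpha_\mathrm{org}(t)$ are assumptions, since the source only specifies the mixing weights.

```python
import random


def mixing_weights(t: int, total_steps: int) -> tuple[float, float]:
    """Return (alpha_org, alpha_syn) for step t under the linear schedule."""
    if not 0 <= t <= total_steps:
        raise ValueError("t must satisfy 0 <= t <= T")
    alpha_syn = t / total_steps
    return 1.0 - alpha_syn, alpha_syn


def sample_source(t: int, total_steps: int, rng: random.Random) -> str:
    """Pick the data source for one batch (assumed sampling rule)."""
    alpha_org, _ = mixing_weights(t, total_steps)
    return "organic" if rng.random() < alpha_org else "synthetic"


if __name__ == "__main__":
    rng = random.Random(0)
    for t in (0, 250, 500, 750, 1000):
        print(t, mixing_weights(t, 1000), sample_source(t, 1000, rng))
```

At $t = 0$ every batch is organic, at $t = T$ every batch is synthetic, and the two weights always sum to one in between.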

3. Tokenizer and Morphological Adaptation

A custom byte-pair encoding (BPE) tokenizer is constructed on the full pretraining corpus to address the agglutinative, morphologically rich character of Korean:

  • Joint Korean-English training with morphological pre-segmentation for Korean, splitting affixes and compound words.
  • Vocab size: ~131.4K tokens, with an out-of-vocabulary (OOV) rate below $0.1\%$ on a held-out set:

$$\mathrm{OOV~rate} = 1 - \frac{\#\{\text{tokens in }V\}}{\#\{\text{total tokens in test}\}} < 0.001.$$

  • Token compression: Achieves roughly 10–20% better compression (fewer tokens per character) than standard GPT-style Korean tokenizers, translating to reduced memory and computation costs.
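Both tokenizer metrics above are easy to compute once a tokenizer is available. The sketch below uses toy stand-ins (whitespace splitting and a three-word vocabulary) purely to show the arithmetic; the function names are hypothetical and nothing here reflects the actual Mi:dm 2.0 tokenizer.

```python
from typing import Callable, Iterable, Sequence


def oov_rate(tokens: Sequence[str], vocab: Iterable[str]) -> float:
    """OOV rate = 1 - (#tokens found in the vocabulary / #tokens in the test set)."""
    if not tokens:
        raise ValueError("token sequence must be non-empty")
    vocab_set = set(vocab)
    in_vocab = sum(1 for tok in tokens if tok in vocab_set)
    return 1.0 - in_vocab / len(tokens)


def tokens_per_char(text: str, tokenize: Callable[[str], Sequence[str]]) -> float:
    """Tokens emitted per input character; lower means better compression."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokenize(text)) / len(text)


if __name__ == "__main__":
    toy_vocab = {"한국어", "모델", "입니다"}
    test_tokens = ["한국어", "모델", "입니다", "zzz"]
    print(oov_rate(test_tokens, toy_vocab))         # one of four tokens is unknown
    print(tokens_per_char("한국어 모델", str.split))  # whitespace tokenizer as a stand-in
```

Comparing `tokens_per_char` for two tokenizers on the same held-out text gives the kind of relative compression figure quoted in the list.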

4. Training Regimen and Compute Infrastructure

Mi:dm 2.0 was trained on Microsoft Azure CycleCloud with NVIDIA H100 GPUs using mixed-precision (FP16). Key hyperparameters include:

  • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-8}$)
  • LR schedule: Linear warm-up (1% of steps), constant plateau, and linear decay during the final 10% of steps, with $\eta_{\max} = 3 \times 10^{-4}$.
  • Batching: Token-based micro-batches (~1M tokens per GPU), facilitating efficient tensor/data/pipeline parallelism.
  • Context tuning: Additional 2K steps at constant LR ($1 \times 10^{-5}$) with sequences up to 65K tokens to fully exploit extended RoPE.
  • Compute: Estimated $1.74 \times 10^{23}$ FLOPs (Base) and $4.57 \times 10^{21}$ FLOPs (Mini).
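The warm-up, plateau, and decay schedule in the list above can be written as a single function of the step index. This is a sketch under assumptions the source does not settle: the warm-up is taken to ramp from `peak_lr / warmup_steps` up to the peak, and the decay is taken to end at `final_lr = 0.0`. The function name and parameters are hypothetical.

```python
def learning_rate(step: int, total_steps: int, peak_lr: float = 3e-4,
                  warmup_frac: float = 0.01, decay_frac: float = 0.10,
                  final_lr: float = 0.0) -> float:
    """Piecewise-linear schedule: warm-up, constant plateau, linear decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    decay_steps = max(1, round(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Assumed warm-up shape: linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    # Assumed decay target: final_lr (0.0 by default) at step == total_steps.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (final_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 9, 500, 900, 950, 1000):
        print(s, learning_rate(s, total_steps=1000))
```

The same function can be wrapped in a framework scheduler by dividing its output by `peak_lr` to obtain a multiplicative factor.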

5. Benchmarking and Empirical Performance

Mi:dm 2.0 reports state-of-the-art or near state-of-the-art performance on multiple Korean-centric and bilingual benchmarks, summarized here:

| Model | KMMLU (5-shot EM) | HAERAE (3-shot acc) |
|---|---|---|
| Mi:dm 2.0 Base | 47.67% | 78.19% |
| Mi:dm 2.0 Mini | 32.35% | 52.80% |
| Exaone-3.5-7.8B | 40.70% | 60.30% |
| Qwen3-14B (EN) | 82.70% | 75.40% |

On proprietary KT benchmarks:

  • Ko-Pragmatics, Ko-Referential, Ko-Sovereign: Mi:dm 2.0 Base achieves 70.8% on Ko-Referential-Hard and 53.0% on Ko-Sovereign comprehension, surpassing both domestic and international peers by up to +17 percentage points in the most challenging inference tasks.

6. Cultural Alignment and Application Domains

Mi:dm 2.0 encodes values, reasoning patterns, and societal norms specific to Korean society:

  • Hierarchical domain taxonomy reflects the structure of Korean culture (history, folk traditions, honorifics).
  • Synthetic narratives and training samples are curated to represent authentic Korean scenarios, including exam-style questions, textbook content, and bureaucratic/legal language.
  • Safety and tone controls enforce appropriate use of politeness (yo-form, ha-form) and prevent departures from local ethical norms.

Principal use cases include:

  • Industry chatbots for telecommunications and finance, tuned to Korean customer service etiquette.
  • Automated tutors for Korean humanities, exam prep, and literature.
  • Public service virtual assistants (“민원 안내 챗봇”) fluent in legal and administrative language.
  • Korean-English cross-lingual tasks, and use as a foundation model within broader NLP stacks.

7. Licensing and Availability

Mi:dm 2.0 is released under the MIT license to encourage unrestricted research and commercial adoption within the Korean AI ecosystem. Models are accessible via the HuggingFace Hub at https://huggingface.co/K-intelligence; technical correspondence is directed to the developers at [email protected]. This collaborative approach supports not only immediate deployment across Korean industry, education, and public services, but also the broader agenda of fostering “K-intelligence” (Shin et al., 14 Jan 2026).
