
Kanana: Compute-efficient Bilingual Language Models (2502.18934v3)

Published 26 Feb 2025 in cs.CL and cs.LG

Abstract: We introduce Kanana, a series of bilingual LLMs that demonstrate exceptional performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high-quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for LLM adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters, with the 2.1B models (base, instruct, embedding) publicly released to promote research on Korean LLMs.
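The abstract names depth up-scaling among the compute-saving pre-training techniques. The general idea, as popularized by SOLAR-style up-scaling, is to deepen an already-trained decoder by duplicating a contiguous block of its layers and then continuing pre-training, rather than training the larger model from scratch. The sketch below is a minimal illustration of that layer-duplication step for a LLaMA-style Hugging Face model; the `overlap` split and the attribute paths are assumptions for illustration, not the paper's exact recipe.

```python
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

def depth_upscale(model, overlap: int):
    """Deepen a LLaMA-style decoder by concatenating two overlapping copies
    of its layer stack: layers[:-overlap] followed by layers[overlap:].
    For a 32-layer model with overlap=8 this yields 48 layers; the up-scaled
    model is then meant to be further pre-trained."""
    layers = model.model.layers  # nn.ModuleList of decoder blocks (LLaMA-style)
    n = len(layers)
    assert 0 < overlap < n
    first = [copy.deepcopy(layer) for layer in layers[: n - overlap]]
    second = [copy.deepcopy(layer) for layer in layers[overlap:]]
    model.model.layers = nn.ModuleList(first + second)
    model.config.num_hidden_layers = len(model.model.layers)
    return model

# Illustrative usage (the checkpoint name is a placeholder, not from the paper):
# base = AutoModelForCausalLM.from_pretrained("some-org/some-8b-base")
# bigger = depth_upscale(base, overlap=8)
```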

Summary

  • The paper introduces Kanana, bilingual models achieving over 11% compute savings in pre-training through staged training, up-scaling, and iterative pruning techniques.
  • Kanana uses rigorously filtered bilingual datasets to achieve competitive performance on benchmarks like MMLU, KMMLU, and HAE-RAE, demonstrating particular strength in Korean-specific tasks.
  • Kanana utilizes a comprehensive post-training pipeline, including SFT and preference optimization, for robust performance across diverse NLP applications and domain-specific tasks.
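The last bullet above mentions preference optimization as part of post-training. The report describes offline and online preference optimization without the exact objective being reproduced on this page, so the following is a minimal sketch of a standard DPO-style offline preference loss, assuming sequence-level log-probabilities have already been computed for the chosen and rejected responses under both the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss over per-sequence log-probabilities (one value per pair).

    Encourages the policy to raise the likelihood of the chosen response
    relative to the rejected one, measured against a frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == softplus(-x); softplus is the numerically stable form.
    return F.softplus(-margin).mean()

# Illustrative usage with random numbers standing in for real log-probs:
# loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```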

This paper presents Kanana, a bilingual LLM family that emphasizes compute efficiency through innovative training and adaptation techniques.

  • It reduces pre-training compute by employing staged pre-training, depth up-scaling, and iterative pruning/distillation, saving over 11% of resources compared to training from scratch.
  • It leverages rigorously filtered bilingual datasets to achieve competitive performance on benchmarks such as MMLU, KMMLU, and HAE-RAE while excelling in Korean-specific tasks.
  • Its comprehensive post-training pipeline—including supervised fine-tuning, offline and online preference optimization, and domain-specific adaptations for embedding, retrieval, and function calling—ensures robust performance across diverse NLP applications.
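The last bullet mentions adaptations for embedding, retrieval-augmented generation, and function calling. As a generic illustration of the retrieval step only (not the paper's specific pipeline), the sketch below ranks candidate passages by cosine similarity between embeddings and prepends the top hits to the prompt; the encoder model id, prompt wording, and top-k choice are placeholders, not details taken from the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder encoder id -- swap in whichever embedding model you actually use.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    """Return the k passages most similar to the query under cosine similarity."""
    q = encoder.encode([query], normalize_embeddings=True)   # shape (1, d)
    p = encoder.encode(passages, normalize_embeddings=True)  # shape (n, d)
    scores = (p @ q.T).ravel()  # cosine similarity, since embeddings are normalized
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

def build_rag_prompt(query: str, passages: list[str], k: int = 3) -> str:
    """Prepend the retrieved passages to the user query as plain-text context."""
    context = "\n\n".join(retrieve(query, passages, k))
    return f"Use the following context to answer.\n\n{context}\n\nQuestion: {query}"
```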