LiveChatBench: Chat Translation Benchmark
- LiveChatBench is a benchmark for real-time, on-device Korean–English chat translation that handles colloquial, meme-rich internet language.
- It utilizes a curated dataset from 30 million chat messages with synthetic parallel data and evaluates models via BLEU, ChrF++, and FSP metrics.
- The benchmark provides actionable insights on model adaptation using LoRA and offers detailed resource profiling for deployment under mobile hardware constraints.
LiveChatBench is a benchmark for real-time, on-device translation of live-stream chat messages, constructed to facilitate performance evaluation and deployment studies of large language models (LLMs) under realistic mobile resource constraints. LiveChatBench specifically targets Korean–English chat translation in live-streaming contexts, with a strong emphasis on colloquial, meme-rich internet language and hardware-aware evaluation. Its development and application address both domain-specific translation challenges and the operational limits inherent to on-device, real-time AI inference (Park et al., 6 Jan 2026).
1. Benchmark Construction and Dataset Properties
LiveChatBench was curated from approximately 30 million Korean chat messages collected from the SOOP live-streaming platform. Pre-filtering removed “uninformative” content (e.g., emoticon-only or repeated-character messages) and discarded messages longer than 50 characters, consistent with the observed length distribution (99.03% of messages are under 50 characters).
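The pre-filtering step can be sketched as a simple predicate. The 50-character cutoff comes from the text; the exact patterns for "uninformative" content are not published, so the regexes below are illustrative assumptions:

```python
import re

MAX_LEN = 50  # per the paper: 99.03% of messages are under 50 characters

# Illustrative filter rules; the paper's exact criteria are not specified here.
SYMBOL_ONLY = re.compile(r"^[\W_]+$")      # punctuation/emoticon-symbol-only messages
REPEATED_CHAR = re.compile(r"(.)\1{4,}")   # the same character 5+ times in a row

def keep_message(msg: str) -> bool:
    """Return True if a chat message survives pre-filtering."""
    msg = msg.strip()
    if not msg or len(msg) > MAX_LEN:
        return False                        # empty or over the length cutoff
    if SYMBOL_ONLY.match(msg):
        return False                        # no lexical content to translate
    if REPEATED_CHAR.search(msg):
        return False                        # spam-like character repetition
    return True

msgs = ["ㅋㅋㅋㅋㅋㅋㅋ", "!!!", "오늘 방송 너무 재밌다", "a" * 60]
kept = [m for m in msgs if keep_message(m)]
```

In this toy run only the ordinary sentence survives; the laughter spam, symbol-only message, and over-length message are dropped.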
Synthetic parallel data were created by translating chat messages to English using GPT-5.1, augmented with a manually built slang/meme dictionary of 656 Korean Internet-specific terms and meme phrases. The pipeline for translation involved:
- Retrieval of relevant glossary entries via a hybrid BM25/LLM extractor (hybrid micro-recall: 0.8107).
- Glossary injection in prompts to GPT-5.1 for enhanced translation of domain-specific terms.
- Professional annotators validating, correcting, and recording background knowledge requirements for each output.
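The hybrid retrieval step can be illustrated with a from-scratch BM25 scorer whose candidate set is unioned with terms from an LLM extractor (stubbed here as the `llm_terms` argument). The three glossary entries, the character-level tokenization, and all scores are toy assumptions, not the paper's data:

```python
import math
from collections import Counter

# Toy glossary; the real dictionary has 656 Korean slang/meme entries.
GLOSSARY = {
    "ㄱㅇㄷ": "gap-i-da (amazing / insane play)",
    "킹받네": "king-batne (infuriating, in a funny way)",
    "ㅇㅈ": "in-jeong (acknowledged / agreed)",
}

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Minimal BM25: score each tokenized doc against the query tokens."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def retrieve_glossary(message, llm_terms=(), top_k=2):
    """Hybrid retrieval: union of BM25 top-k hits and LLM-extracted terms."""
    terms = list(GLOSSARY)
    docs = [list(t) for t in terms]                 # character tokens suit Korean slang
    scores = bm25_scores(list(message), docs)
    ranked = sorted(zip(terms, scores), key=lambda x: -x[1])[:top_k]
    bm25_hits = {t for t, s in ranked if s > 0}
    return bm25_hits | (set(llm_terms) & set(terms))

hits = retrieve_glossary("와 ㄱㅇㄷ 킹받네", llm_terms=["ㅇㅈ"])
```

The union is what lifts recall: BM25 catches surface-form matches, while the extractor can recover paraphrased or obfuscated slang that shares no characters with its glossary key.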
From the synthetic corpus (~1.5 million pairs), 1,000 high-quality Korean–English sentence pairs were selected and human-aligned to create LiveChatBench. The resulting dataset is characterized by:
- 1,000 parallel sentences.
- Average Korean-side length: 13.98 tokens.
- Slang-heavy, colloquial vocabulary structure.
- Explicit alignment to real-world chat phenomena, e.g., memes and contemporary internet slang.
2. Annotation Pipeline and Knowledge Injection
A central feature of LiveChatBench is the explicit modeling and handling of internet slang and memes, which pose unique translation difficulties. A 656-entry glossary was compiled to support both synthetic translation and later annotation. For each message, relevant glossary terms were retrieved using BM25, an LLM-based entity extractor, and a hybrid scheme to maximize recall.
Translation was performed by prompting GPT-5.1 with background knowledge (i.e., relevant glossary snippets) for each instance. Annotators then assessed translation quality and recorded additional knowledge requirements. This process yielded a gold-standard parallel corpus containing challenging, context-dependent expressions prevalent in live chats.
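Glossary injection amounts to prepending the retrieved entries as background knowledge before the message. The prompt wording below is an invented illustration, not the paper's actual template:

```python
def build_translation_prompt(message: str, glossary_hits: dict) -> str:
    """Assemble a translation prompt with injected background knowledge.

    glossary_hits maps a slang term to its English gloss; the prompt
    wording is a hypothetical stand-in for the paper's template.
    """
    lines = ["Translate the Korean live-chat message into natural English."]
    if glossary_hits:
        lines.append("Background knowledge (slang/meme glossary):")
        lines += [f"- {term}: {gloss}" for term, gloss in glossary_hits.items()]
    lines.append(f"Message: {message}")
    lines.append("Translation:")
    return "\n".join(lines)

prompt = build_translation_prompt(
    "킹받네 진짜",
    {"킹받네": "infuriating (in a humorous, exaggerated way)"},
)
```

The same builder serves both pipeline stages: synthetic data generation with GPT-5.1 and the annotators' review, which records which glossary entries a correct translation actually required.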
3. Evaluation Metrics and Resource Profiling Methodology
LiveChatBench defines both translation quality metrics and detailed device-level resource metrics:
- Translation Metrics:
- BLEU (up to 4-gram): $\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$, with uniform weights $w_n = 1/4$ and brevity penalty $BP = \min\left(1,\, e^{1 - r/c}\right)$.
- ChrF++: Character n-gram F-score (character n-grams up to order 6 plus word n-grams up to order 2, $\beta = 2$).
- Focus Sentence Prompting (FSP): GPT-5.1 provides categorical error analysis and a score (range 0–100).
- Resource Metrics:
- CPU utilization (% of all available cores during inference).
- Device temperature (mean °C during inference via on-chip thermal sensors).
- Time-to-First-Token (TTFT) and End-to-End Runtime (full sentence generation latency).
- Memory footprint, broken into model weights, key–value (KV) cache, and compute buffers.
- Energy per token; e.g., Gemma-3-270M: 0.027 J/token; Gemma-3-1B: 0.10 J/token.
All resource metrics are reported for both CPU-only and mobile GPU-accelerated scenarios, with explicit guidance for profiling under realistic deployment conditions (background tasks, varying loads).
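The BLEU metric above can be sketched in a few lines. This is a generic smoothed sentence-level BLEU with uniform weights and the brevity penalty, not the paper's exact scorer (which likely relies on a standard toolkit such as sacreBLEU); add-one smoothing is an assumption to keep short chat messages from collapsing to zero:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Smoothed sentence BLEU up to 4-grams, uniform weights, brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())       # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n  # smoothed p_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))       # brevity penalty
    return bp * math.exp(log_prec)

perfect = bleu("that was an insane play", "that was an insane play")
partial = bleu("hello", "that was an insane play")
```

An identical hypothesis scores 1.0; a short unrelated one is penalized by both the n-gram precisions and the brevity penalty.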
4. Model Selection, Domain Adaptation, and Deployment
LiveChatBench was used to evaluate a series of models:
- MLKit (Google baseline).
- Gemma-3-270M (270M parameters).
- Qwen3-0.6B (600M parameters).
- Gemma-3-1B (1B parameters).
- Commercial comparators: Google Translate API, GPT-5.1 (cloud LLM).
On-device models were adapted with Low-Rank Adapters (LoRA); the adapter rank, scaling factor, and target modules follow the paper's configuration. Models were trained on 1.5 million synthetic chat pairs with the slang dictionary injected for improved lexical fidelity. Optimization used AdamW with dropout 0.05, warmup ratio 0.03, and a maximum sequence length of 256.
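LoRA keeps on-device adaptation cheap because each adapted weight matrix only gains a rank-$r$ factorization. A quick parameter-count sketch, using a hypothetical rank of 16 and an illustrative hidden size (not the paper's configuration), shows how small the adapter is:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Extra trainable parameters LoRA adds to one (d_out x d_in) weight matrix:
    A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) in total."""
    return r * (d_in + d_out)

# Hypothetical setup: rank-16 adapters on square attention projections with
# hidden size 640 -- illustrative numbers, not the paper's actual config.
hidden, r = 640, 16
per_matrix = lora_params(hidden, hidden, r)   # 16 * (640 + 640) = 20,480
full_matrix = hidden * hidden                 # 409,600 parameters if trained fully
fraction = per_matrix / full_matrix           # LoRA trains only ~5% of this matrix
```

Even at this small model scale, the adapter touches a few percent of each matrix's parameters, which is what makes fine-tuning and shipping domain-adapted weights to phones tractable.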
5. Experimental Results: Quality and Resource Trade-Offs
Translation performance on LiveChatBench demonstrated that, with LoRA fine-tuning and domain-adapted data, on-device models approach the quality of large cloud LLMs:
| Method | BLEU | ChrF++ | FSP |
|---|---|---|---|
| MLKit | 0.0489 | 17.44 | 28.26 |
| Google Translate API | 0.1678 | 34.30 | 59.73 |
| GPT-5.1 | 0.2679 | 45.26 | 70.22 |
| Gemma-3-270M (ft) | 0.2485 | 43.94 | 62.93 |
| Qwen3-0.6B (ft) | 0.2689 | 45.75 | 65.97 |
| Gemma-3-1B (ft) | 0.2978 | 48.44 | 67.87 |
Smaller models (e.g., Gemma-3-270M) with LoRA fine-tuning achieve BLEU within 0.02 of GPT-5.1 at ~5× lower latency and ~3× lower energy per token. Larger models provide modest BLEU/ChrF++ improvements but incur significant resource costs (e.g., 3–5× higher energy, up to 60% CPU in the absence of GPU acceleration).
Time-to-First-Token (TTFT) and comprehensive runtime studies highlighted that, on modern GPU-equipped mobile devices, Gemma-3-270M achieves TTFT under 40 ms with low CPU utilization (~5%). On older or CPU-only devices, Gemma-3-270M remains operational, while larger models may exceed memory or thermal constraints, resulting in out-of-memory (OOM) or thermal throttling.
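TTFT and end-to-end runtime are straightforward to measure around any streaming decoder. A minimal sketch, with a stand-in generator instead of a real on-device model:

```python
import time

def measure_ttft_and_runtime(generate_tokens):
    """Wrap a token generator; return (TTFT, total runtime, token count) in seconds."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate_tokens():
        if ttft is None:
            ttft = time.perf_counter() - start   # latency to the first token
        n_tokens += 1
    total = time.perf_counter() - start          # full sentence generation latency
    return ttft, total, n_tokens

# Stand-in decoder simulating streamed output (no real model involved).
def fake_decoder():
    for tok in ["That", "was", "insane"]:
        yield tok

ttft, total, n = measure_ttft_and_runtime(fake_decoder)
```

Measured this way on-device, TTFT isolates prefill cost while the total runtime folds in per-token decode speed, which is what the CPU-only versus GPU-accelerated comparisons report.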
6. Deployment Guidelines and Insights
Empirical findings establish key practical recommendations:
- Model selection: For GPU-equipped devices (Apple Family 7+/modern Snapdragon), deploy Gemma-3-270M for optimal latency/energy/quality; on CPU-only or legacy hardware, select the smallest viable model to avoid thermal throttling and OOM.
- Domain adaptation: LoRA fine-tuning on domain-specific data with injected slang dictionary is essential for chat translation quality.
- Resource management: Profile inference under live conditions, actively monitoring CPU%, temperature, and latency; implement adaptive fallback strategies to cloud or smaller models as necessary.
- Energy efficiency: At 0.027 J/token (Gemma-3-270M), a typical smartphone battery (65,376 J) supports ~2.42 million tokens per charge, while larger models (1B) reduce this by a factor of ~3.7.
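The battery arithmetic in the energy-efficiency guideline is a one-line calculation, reproduced here with the figures from the text:

```python
def tokens_per_charge(battery_joules: float, joules_per_token: float) -> float:
    """How many tokens one battery charge sustains at a given energy cost."""
    return battery_joules / joules_per_token

BATTERY_J = 65_376                             # typical smartphone battery, per the text
small = tokens_per_charge(BATTERY_J, 0.027)    # Gemma-3-270M: ~2.42M tokens/charge
large = tokens_per_charge(BATTERY_J, 0.10)     # Gemma-3-1B:   ~0.65M tokens/charge
ratio = small / large                          # ~3.7x fewer tokens for the 1B model
```

The ratio depends only on the per-token energies (0.10 / 0.027 ≈ 3.7), so it holds regardless of the assumed battery capacity.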
Domain mismatch limits generalization to broader translation tasks (FLORES-200/WMT24++): chat translation involves shorter, slang-dense utterances whose distribution differs markedly from the text of standard benchmarks.
7. Significance and Future Directions
LiveChatBench provides the first public, real-world benchmark focused on chat translation within live-streaming, enabling reproducible mobile evaluation protocols and model selection criteria under strict device constraints. It complements corpus-centric datasets and workflows such as LiveChat (Gao et al., 2023) by supplying infrastructure for evaluating mobile real-time translation systems.
A plausible implication is that the integration of domain-specific glossaries and parameter-efficient adaptation (such as LoRA) materially closes the gap between on-device models and cloud LLMs for targeted chat domains, while drawing attention to operational limitations including heterogeneity in device hardware, resource balancing, and fallback mechanisms. Further research may address multi-modal context and cross-platform generalization to more languages and live-streaming platforms, as well as the refinement of metrics for multi-party, conversational integrity.