LiveChatBench: Chat Translation Benchmark
- LiveChatBench is a benchmark for real-time, on-device Korean–English chat translation that handles colloquial, meme-rich internet language.
- It utilizes a curated dataset from 30 million chat messages with synthetic parallel data and evaluates models via BLEU, ChrF++, and FSP metrics.
- The benchmark provides actionable insights on model adaptation using LoRA and offers detailed resource profiling for deployment under mobile hardware constraints.
LiveChatBench is a benchmark for real-time, on-device translation of live-stream chat messages, constructed to facilitate performance evaluation and deployment studies of large language models (LLMs) under realistic mobile resource constraints. LiveChatBench specifically targets Korean–English chat translation in live-streaming contexts, with a strong emphasis on colloquial, meme-rich internet language and hardware-aware evaluation. Its development and application address both domain-specific translation challenges and the operational limits inherent to on-device, real-time AI inference (Park et al., 6 Jan 2026).
1. Benchmark Construction and Dataset Properties
LiveChatBench was curated from approximately 30 million Korean chat messages collected from the SOOP live-streaming platform. Pre-filtering removed “uninformative” content (e.g., emoticon-only or repeated-character messages) and discarded messages longer than 50 characters, consistent with the observed length distribution (99.03% of messages are under 50 characters).
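The pre-filtering step can be sketched as a simple predicate. The 50-character cutoff comes from the text; the exact patterns for "uninformative" content are not published, so the regexes below are illustrative assumptions:

```python
import re

MAX_LEN = 50  # per the paper: 99.03% of messages are under 50 characters

# Illustrative filter rules; the paper's exact criteria are not specified here.
SYMBOL_ONLY = re.compile(r"^[\W_]+$")      # punctuation/emoticon-symbol-only messages
REPEATED_CHAR = re.compile(r"(.)\1{4,}")   # the same character 5+ times in a row

def keep_message(msg: str) -> bool:
    """Return True if a chat message survives pre-filtering."""
    msg = msg.strip()
    if not msg or len(msg) > MAX_LEN:
        return False                        # empty or over the length cutoff
    if SYMBOL_ONLY.match(msg):
        return False                        # no lexical content to translate
    if REPEATED_CHAR.search(msg):
        return False                        # spam-like character repetition
    return True

msgs = ["ㅋㅋㅋㅋㅋㅋㅋ", "!!!", "오늘 방송 너무 재밌다", "a" * 60]
kept = [m for m in msgs if keep_message(m)]
```

In this toy run only the ordinary sentence survives; the laughter spam, symbol-only message, and over-length message are dropped.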
Synthetic parallel data were created by translating chat messages to English using GPT-5.1, augmented with a manually built slang/meme dictionary of 656 Korean Internet-specific terms and meme phrases. The pipeline for translation involved:
- Retrieval of relevant glossary entries via a hybrid BM25/LLM extractor (hybrid micro-recall: 0.8107).
- Glossary injection in prompts to GPT-5.1 for enhanced translation of domain-specific terms.
- Professional annotators validating, correcting, and recording background knowledge requirements for each output.
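The hybrid retrieval step can be illustrated with a from-scratch BM25 scorer whose candidate set is unioned with terms from an LLM extractor (stubbed here as the `llm_terms` argument). The three glossary entries, the character-level tokenization, and all scores are toy assumptions, not the paper's data:

```python
import math
from collections import Counter

# Toy glossary; the real dictionary has 656 Korean slang/meme entries.
GLOSSARY = {
    "ㄱㅇㄷ": "gap-i-da (amazing / insane play)",
    "킹받네": "king-batne (infuriating, in a funny way)",
    "ㅇㅈ": "in-jeong (acknowledged / agreed)",
}

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Minimal BM25: score each tokenized doc against the query tokens."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def retrieve_glossary(message, llm_terms=(), top_k=2):
    """Hybrid retrieval: union of BM25 top-k hits and LLM-extracted terms."""
    terms = list(GLOSSARY)
    docs = [list(t) for t in terms]                 # character tokens suit Korean slang
    scores = bm25_scores(list(message), docs)
    ranked = sorted(zip(terms, scores), key=lambda x: -x[1])[:top_k]
    bm25_hits = {t for t, s in ranked if s > 0}
    return bm25_hits | (set(llm_terms) & set(terms))

hits = retrieve_glossary("와 ㄱㅇㄷ 킹받네", llm_terms=["ㅇㅈ"])
```

The union is what lifts recall: BM25 catches surface-form matches, while the extractor can recover paraphrased or obfuscated slang that shares no characters with its glossary key.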
From the synthetic corpus (~1.5 million pairs), 1,000 high-quality Korean–English sentence pairs were selected and human-aligned to create LiveChatBench. The resulting dataset is characterized by:
- 1,000 parallel sentences.
- Average Korean-side length: 13.98 tokens.
- Slang-heavy, colloquial vocabulary structure.
- Explicit alignment to real-world chat phenomena, e.g., memes and contemporary internet slang.
2. Annotation Pipeline and Knowledge Injection
A central feature of LiveChatBench is the explicit modeling and handling of internet slang and memes, which pose unique translation difficulties. A 656-entry glossary was compiled to support both synthetic translation and later annotation. For each message, relevant glossary terms were retrieved using BM25, an LLM-based entity extractor, and a hybrid scheme to maximize recall.
Translation was performed by prompting GPT-5.1 with background knowledge (i.e., relevant glossary snippets) for each instance. Annotators then assessed translation quality and recorded additional knowledge requirements. This process yielded a gold-standard parallel corpus containing challenging, context-dependent expressions prevalent in live chats.
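Glossary injection amounts to prepending the retrieved entries as background knowledge before the message. The prompt wording below is an invented illustration, not the paper's actual template:

```python
def build_translation_prompt(message: str, glossary_hits: dict) -> str:
    """Assemble a translation prompt with injected background knowledge.

    glossary_hits maps a slang term to its English gloss; the prompt
    wording is a hypothetical stand-in for the paper's template.
    """
    lines = ["Translate the Korean live-chat message into natural English."]
    if glossary_hits:
        lines.append("Background knowledge (slang/meme glossary):")
        lines += [f"- {term}: {gloss}" for term, gloss in glossary_hits.items()]
    lines.append(f"Message: {message}")
    lines.append("Translation:")
    return "\n".join(lines)

prompt = build_translation_prompt(
    "킹받네 진짜",
    {"킹받네": "infuriating (in a humorous, exaggerated way)"},
)
```

The same builder serves both pipeline stages: synthetic data generation with GPT-5.1 and the annotators' review, which records which glossary entries a correct translation actually required.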
3. Evaluation Metrics and Resource Profiling Methodology
LiveChatBench defines both translation quality metrics and detailed device-level resource metrics:
- Translation Metrics:
- BLEU (up to 4-gram): $\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$, with uniform weights $w_n = 1/4$ and brevity penalty $BP = \min\left(1,\, e^{1 - r/c}\right)$.
- ChrF++: Character n-gram F-score (character n-grams up to order 6 plus word n-grams up to order 2, $\beta = 2$).
- Focus Sentence Prompting (FSP): GPT-5.1 provides categorical error analysis and a score (range 0–100).
- Resource Metrics:
- CPU utilization (% of all available cores during inference).
- Device temperature (mean °C during inference via on-chip thermal sensors).
- Time-to-First-Token (TTFT) and End-to-End Runtime (full sentence generation latency).
- Memory footprint, broken into model weights, key–value (KV) cache, and compute buffers.
- Energy per token; e.g., Gemma-3-270M: 0.027 J/token; Gemma-3-1B: 0.10 J/token.
All resource metrics are reported for both CPU-only and mobile GPU-accelerated scenarios, with explicit guidance for profiling under realistic deployment conditions (background tasks, varying loads).
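The BLEU metric above can be sketched in a few lines. This is a generic smoothed sentence-level BLEU with uniform weights and the brevity penalty, not the paper's exact scorer (which likely relies on a standard toolkit such as sacreBLEU); add-one smoothing is an assumption to keep short chat messages from collapsing to zero:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Smoothed sentence BLEU up to 4-grams, uniform weights, brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())       # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n  # smoothed p_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))       # brevity penalty
    return bp * math.exp(log_prec)

perfect = bleu("that was an insane play", "that was an insane play")
partial = bleu("hello", "that was an insane play")
```

An identical hypothesis scores 1.0; a short unrelated one is penalized by both the n-gram precisions and the brevity penalty.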
4. Model Selection, Domain Adaptation, and Deployment
LiveChatBench was used to evaluate a series of models:
- MLKit (Google baseline).
- Gemma-3-270M (270M parameters).
- Qwen3-0.6B (600M parameters).
- Gemma-3-1B (1B parameters).
- Commercial comparators: Google Translate API, GPT-5.1 (cloud LLM).
On-device models were adapted with Low-Rank Adapters (LoRA); the adapter rank, scaling factor, and target modules follow the paper's configuration. Models were trained on 1.5 million synthetic chat pairs with the slang dictionary injected for improved lexical fidelity. Optimization used AdamW with dropout 0.05, warmup ratio 0.03, and a maximum sequence length of 256.
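LoRA keeps on-device adaptation cheap because each adapted weight matrix only gains a rank-$r$ factorization. A quick parameter-count sketch, using a hypothetical rank of 16 and an illustrative hidden size (not the paper's configuration), shows how small the adapter is:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Extra trainable parameters LoRA adds to one (d_out x d_in) weight matrix:
    A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) in total."""
    return r * (d_in + d_out)

# Hypothetical setup: rank-16 adapters on square attention projections with
# hidden size 640 -- illustrative numbers, not the paper's actual config.
hidden, r = 640, 16
per_matrix = lora_params(hidden, hidden, r)   # 16 * (640 + 640) = 20,480
full_matrix = hidden * hidden                 # 409,600 parameters if trained fully
fraction = per_matrix / full_matrix           # LoRA trains only ~5% of this matrix
```

Even at this small model scale, the adapter touches a few percent of each matrix's parameters, which is what makes fine-tuning and shipping domain-adapted weights to phones tractable.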
5. Experimental Results: Quality and Resource Trade-Offs
Translation performance on LiveChatBench demonstrated that, with LoRA fine-tuning and domain-adapted data, on-device models approach the quality of large cloud LLMs:
| Method | BLEU | ChrF++ | FSP |
|---|---|---|---|
| MLKit | 0.0489 | 17.44 | 28.26 |
| Google Translate API | 0.1678 | 34.30 | 59.73 |
| GPT-5.1 | 0.2679 | 45.26 | 70.22 |
| Gemma-3-270M (ft) | 0.2485 | 43.94 | 62.93 |
| Qwen3-0.6B (ft) | 0.2689 | 45.75 | 65.97 |
| Gemma-3-1B (ft) | 0.2978 | 48.44 | 67.87 |
Smaller models (e.g., Gemma-3-270M) with LoRA fine-tuning achieve BLEU within 0.02 of GPT-5.1 at ~5× lower latency and ~3× lower energy per token. Larger models provide modest BLEU/ChrF++ improvements but incur significant resource costs (e.g., 3–5× higher energy, up to 60% CPU in the absence of GPU acceleration).
Time-to-First-Token (TTFT) and comprehensive runtime studies highlighted that, on modern GPU-equipped mobile devices, Gemma-3-270M achieves TTFT under 40 ms with low CPU utilization (~5%). On older or CPU-only devices, Gemma-3-270M remains operational, while larger models may exceed memory or thermal constraints, resulting in out-of-memory (OOM) or thermal throttling.
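TTFT and end-to-end runtime are straightforward to measure around any streaming decoder. A minimal sketch, with a stand-in generator instead of a real on-device model:

```python
import time

def measure_ttft_and_runtime(generate_tokens):
    """Wrap a token generator; return (TTFT, total runtime, token count) in seconds."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate_tokens():
        if ttft is None:
            ttft = time.perf_counter() - start   # latency to the first token
        n_tokens += 1
    total = time.perf_counter() - start          # full sentence generation latency
    return ttft, total, n_tokens

# Stand-in decoder simulating streamed output (no real model involved).
def fake_decoder():
    for tok in ["That", "was", "insane"]:
        yield tok

ttft, total, n = measure_ttft_and_runtime(fake_decoder)
```

Measured this way on-device, TTFT isolates prefill cost while the total runtime folds in per-token decode speed, which is what the CPU-only versus GPU-accelerated comparisons report.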
6. Deployment Guidelines and Insights
Empirical findings establish key practical recommendations:
- Model selection: For GPU-equipped devices (Apple Family 7+/modern Snapdragon), deploy Gemma-3-270M for optimal latency/energy/quality; on CPU-only or legacy hardware, select the smallest viable model to avoid thermal throttling and OOM.
- Domain adaptation: LoRA fine-tuning on domain-specific data with injected slang dictionary is essential for chat translation quality.
- Resource management: Profile inference under live conditions, actively monitoring CPU%, temperature, and latency; implement adaptive fallback strategies to cloud or smaller models as necessary.
- Energy efficiency: At 0.027 J/token (Gemma-3-270M), a typical smartphone battery (65,376 J) supports ~2.42 million tokens per charge, while larger models (1B) reduce this by a factor of ~3.7.
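The battery arithmetic in the energy-efficiency guideline is a one-line calculation, reproduced here with the figures from the text:

```python
def tokens_per_charge(battery_joules: float, joules_per_token: float) -> float:
    """How many tokens one battery charge sustains at a given energy cost."""
    return battery_joules / joules_per_token

BATTERY_J = 65_376                             # typical smartphone battery, per the text
small = tokens_per_charge(BATTERY_J, 0.027)    # Gemma-3-270M: ~2.42M tokens/charge
large = tokens_per_charge(BATTERY_J, 0.10)     # Gemma-3-1B:   ~0.65M tokens/charge
ratio = small / large                          # ~3.7x fewer tokens for the 1B model
```

The ratio depends only on the per-token energies (0.10 / 0.027 ≈ 3.7), so it holds regardless of the assumed battery capacity.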
Domain mismatch limits generalization to broader translation tasks (FLORES-200/WMT24++): chat translation involves shorter, slang-dense utterances whose distribution differs markedly from the text of standard benchmarks.
7. Significance and Future Directions
LiveChatBench provides the first public, real-world benchmark focused on chat translation within live-streaming, enabling reproducible mobile evaluation protocols and model selection criteria under strict device constraints. It complements corpus-centric datasets and workflows such as LiveChat (Gao et al., 2023) by supplying infrastructure for evaluating mobile real-time translation systems.
A plausible implication is that the integration of domain-specific glossaries and parameter-efficient adaptation (such as LoRA) materially closes the gap between on-device models and cloud LLMs for targeted chat domains, while drawing attention to operational limitations including heterogeneity in device hardware, resource balancing, and fallback mechanisms. Further research may address multi-modal context and cross-platform generalization to more languages and live-streaming platforms, as well as the refinement of metrics for multi-party, conversational integrity.