
LiveChatBench: Chat Translation Benchmark

Updated 8 January 2026
  • LiveChatBench is a benchmark for real-time, on-device Korean–English chat translation that handles colloquial, meme-rich internet language.
  • It is built from a curated dataset of 30 million chat messages with synthetic parallel data and evaluates models via BLEU, ChrF++, and FSP (Focus Sentence Prompting) metrics.
  • The benchmark provides actionable insights on model adaptation using LoRA and offers detailed resource profiling for deployment under mobile hardware constraints.

LiveChatBench is a benchmark for real-time, on-device translation of live-stream chat messages, constructed to facilitate performance evaluation and deployment studies of large language models (LLMs) under realistic mobile resource constraints. LiveChatBench specifically targets Korean–English chat translation in live-streaming contexts, with a strong emphasis on colloquial, meme-rich internet language and hardware-aware evaluation. Its development and application address both domain-specific translation challenges and the operational limits inherent to on-device, real-time AI inference (Park et al., 6 Jan 2026).

1. Benchmark Construction and Dataset Properties

LiveChatBench was curated from approximately 30 million Korean chat messages collected from the SOOP live-streaming platform. Pre-filtering removed uninformative content (e.g., emoticons, repeated characters) and discarded messages longer than 50 characters, in line with the observed length distribution (99.03% of messages are under 50 characters).
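
The filtering step above can be sketched as a simple predicate. This is a hypothetical approximation: the paper does not specify its exact emoticon and repetition rules, so the regexes below are illustrative stand-ins.

```python
import re

MAX_LEN = 50  # the paper keeps messages of at most 50 characters (99.03% of traffic)

# Illustrative rules: drop symbol-only messages and long repeated-character runs.
EMOTICON_ONLY = re.compile(r"^[\W_]+$")   # no letters or digits at all
REPEAT_RUN = re.compile(r"(.)\1{3,}")     # 4+ consecutive repeats of one character

def keep_message(msg: str, max_len: int = MAX_LEN) -> bool:
    """Return True if a chat message survives the uninformative-content filter."""
    msg = msg.strip()
    if not msg or len(msg) > max_len:
        return False
    if EMOTICON_ONLY.match(msg):          # e.g. "!!!" or "???"
        return False
    if REPEAT_RUN.search(msg):            # e.g. "ㅋㅋㅋㅋㅋㅋ"
        return False
    return True
```

In a real pipeline the repetition rule would likely normalize runs rather than drop the whole message, but a hard filter keeps the sketch simple.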

Synthetic parallel data were created by translating chat messages to English using GPT-5.1, augmented with a manually built slang/meme dictionary of 656 Korean Internet-specific terms and meme phrases. The pipeline for translation involved:

  • Retrieval of relevant glossary entries via a hybrid BM25/LLM extractor (hybrid micro-recall: 0.8107).
  • Glossary injection in prompts to GPT-5.1 for enhanced translation of domain-specific terms.
  • Professional annotators validating, correcting, and recording background knowledge requirements for each output.
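
A simplified version of this glossary-injection pipeline might look like the following, with plain substring matching standing in for the hybrid BM25/LLM retriever. Function names and prompt wording are illustrative, not the paper's actual prompts.

```python
def retrieve_glossary(message: str, glossary: dict, top_k: int = 3) -> list:
    """Return up to top_k glossary entries whose term appears in the message.

    A deliberately naive stand-in for the hybrid BM25/LLM extractor
    (hybrid micro-recall 0.8107 in the paper).
    """
    hits = [(term, gloss) for term, gloss in glossary.items() if term in message]
    return hits[:top_k]

def build_prompt(message: str, glossary: dict) -> str:
    """Inject retrieved glossary entries as background knowledge for the translator LLM."""
    hits = retrieve_glossary(message, glossary)
    background = "\n".join(f"- {term}: {gloss}" for term, gloss in hits) or "(none)"
    return (
        "Translate the Korean chat message to English.\n"
        f"Background knowledge:\n{background}\n"
        f"Message: {message}\n"
        "Translation:"
    )
```

For example, a message containing a glossary term gets that term's definition injected, so the translator sees the slang's meaning rather than guessing from surface form.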

From the synthetic corpus (~1.5 million pairs), 1,000 high-quality Korean–English sentence pairs were selected and human-aligned to create LiveChatBench. The resulting dataset is characterized by:

  • 1,000 parallel sentences.
  • Average Korean-side length: 13.98 tokens.
  • Slang-heavy, colloquial vocabulary structure.
  • Explicit alignment to real-world chat phenomena, e.g., memes and contemporary internet slang.

2. Annotation Pipeline and Knowledge Injection

A central feature of LiveChatBench is the explicit modeling and handling of internet slang and memes, which pose unique translation difficulties. A 656-entry glossary was compiled to support both synthetic translation and later annotation. For each message, relevant glossary terms were retrieved using BM25, an LLM-based entity extractor, and a hybrid scheme to maximize recall.

Translation was performed by prompting GPT-5.1 with background knowledge (i.e., relevant glossary snippets) for each instance. Annotators then assessed translation quality and recorded additional knowledge requirements. This process yielded a gold-standard parallel corpus containing challenging, context-dependent expressions prevalent in live chats.

3. Evaluation Metrics and Resource Profiling Methodology

LiveChatBench defines both translation quality metrics and detailed device-level resource metrics:

  • Translation Metrics:
    • BLEU (up to 4-gram): $BLEU = BP \cdot \exp\left( \sum_{n=1}^{4} w_n \log p_n \right)$, with uniform weights $w_n = 0.25$ and brevity penalty $BP$.
    • ChrF++: character n-gram F-score ($n = 6$).
    • Focus Sentence Prompting (FSP): GPT-5.1 provides categorical error analysis and a score (range 0–100).
  • Resource Metrics:
    • CPU utilization (% of all available cores during inference).
    • Device temperature (mean °C during inference via on-chip thermal sensors).
    • Time-to-First-Token (TTFT) and End-to-End Runtime (full sentence generation latency).
    • Memory footprint, broken into model weights, key–value (KV) cache, and compute buffers.
    • Energy per token; e.g., Gemma-3-270M: 0.027 J/token; Gemma-3-1B: 0.10 J/token.
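
The BLEU formula above can be computed directly. Below is a minimal single-reference implementation without smoothing; production evaluation would typically use a tool such as sacreBLEU instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU = BP * exp(sum_n w_n log p_n) with uniform w_n = 1/max_n."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0  # any zero precision zeroes the geometric mean (no smoothing)
        log_p += (1.0 / max_n) * math.log(clipped / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_p)
```

On short chat messages, unsmoothed 4-gram BLEU is harsh (a single missing 4-gram zeroes the score), which is one reason the benchmark also reports ChrF++ and FSP.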

All resource metrics are reported for both CPU-only and mobile GPU-accelerated scenarios, with explicit guidance for profiling under realistic deployment conditions (background tasks, varying loads).
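
TTFT and end-to-end runtime can be measured generically over any streaming decode loop. A minimal sketch, assuming the model exposes generated tokens as an iterator (the function name and return format are illustrative, not the paper's harness):

```python
import time

def profile_generation(token_stream):
    """Measure TTFT and end-to-end runtime over a token iterator.

    TTFT = latency to the first token; runtime = full sentence generation,
    matching the metric definitions above.
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        n_tokens += 1
    runtime = time.perf_counter() - start
    return {"ttft_s": ttft, "runtime_s": runtime, "tokens": n_tokens}
```

CPU utilization, temperature, and energy require platform-specific counters (e.g., on-chip thermal sensors), so they are omitted from this sketch.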

4. Model Selection, Domain Adaptation, and Deployment

LiveChatBench was used to evaluate a series of models:

  • MLKit (Google baseline).
  • Gemma-3-270M (270M parameters).
  • Qwen3-0.6B (600M parameters).
  • Gemma-3-1B (1B parameters).
  • Commercial comparators: Google Translate API, GPT-5.1 (cloud LLM).

On-device models were adapted with Low-Rank Adaptation (LoRA), using rank $r = 64$, scaling $\alpha = 32$, and targeting the modules $\{q, k, v, o, gate, up\_proj, down\_proj\}$. Models were trained on 1.5 million synthetic chat pairs with the slang dictionary injected for improved lexical fidelity. Optimization used AdamW with learning rate $2 \times 10^{-4}$, dropout 0.05, warmup ratio 0.03, and a maximum sequence length of 256.
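
The LoRA update itself is simple: each frozen weight $W$ is augmented with a low-rank product scaled by $\alpha/r$. A minimal NumPy sketch with tiny illustrative dimensions (the paper applies $r = 64$, $\alpha = 32$ inside a full transformer via a PEFT framework, not this toy class):

```python
import numpy as np

class LoRALinear:
    """Toy LoRA linear layer: y = W x + (alpha / r) * B (A x)."""

    def __init__(self, d_out, d_in, r=64, alpha=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))    # frozen base weight
        self.A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, r))                   # trainable up-projection,
                                                        # zero-init => no change at start
        self.scale = alpha / r

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

Because B starts at zero, the adapted layer initially reproduces the base model exactly; training moves only A and B, which is what keeps the fine-tuned checkpoints small enough for on-device deployment.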

5. Experimental Results: Quality and Resource Trade-Offs

Translation performance on LiveChatBench demonstrated that, with LoRA fine-tuning and domain-adapted data, on-device models approach the quality of large cloud LLMs:

| Method | BLEU | ChrF++ | FSP |
|---|---|---|---|
| MLKit | 0.0489 | 17.44 | 28.26 |
| Google Translate API | 0.1678 | 34.30 | 59.73 |
| GPT-5.1 | 0.2679 | 45.26 | 70.22 |
| Gemma-3-270M (ft) | 0.2485 | 43.94 | 62.93 |
| Qwen3-0.6B (ft) | 0.2689 | 45.75 | 65.97 |
| Gemma-3-1B (ft) | 0.2978 | 48.44 | 67.87 |

Smaller models (e.g., Gemma-3-270M) with LoRA fine-tuning achieve BLEU within 0.02 of GPT-5.1 at ~5× lower latency and ~3× lower energy per token. Larger models provide modest BLEU/ChrF++ improvements but incur significant resource costs (e.g., 3–5× higher energy, up to 60% CPU in the absence of GPU acceleration).

Time-to-First-Token (TTFT) and comprehensive runtime studies highlighted that, on modern GPU-equipped mobile devices, Gemma-3-270M achieves TTFT under 40 ms with low CPU utilization (~5%). On older or CPU-only devices, Gemma-3-270M remains operational, while larger models may exceed memory or thermal constraints, resulting in out-of-memory (OOM) or thermal throttling.

6. Deployment Guidelines and Insights

Empirical findings establish key practical recommendations:

  • Model selection: For GPU-equipped devices (Apple Family 7+/modern Snapdragon), deploy Gemma-3-270M for optimal latency/energy/quality; on CPU-only or legacy hardware, select the smallest viable model to avoid thermal throttling and OOM.
  • Domain adaptation: LoRA fine-tuning on domain-specific data with injected slang dictionary is essential for chat translation quality.
  • Resource management: Profile inference under live conditions, actively monitoring CPU%, temperature, and latency; implement adaptive fallback strategies to cloud or smaller models as necessary.
  • Energy efficiency: At 0.027 J/token (Gemma-3-270M), a typical smartphone battery (65,376 J) supports ~2.42 million tokens per charge, while larger models (1B) reduce this by a factor of ~3.7.
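
The battery arithmetic in the last bullet is easy to verify directly from the reported per-token energies:

```python
BATTERY_J = 65_376  # typical smartphone battery capacity (J), as stated above

def tokens_per_charge(energy_per_token_j, battery_j=BATTERY_J):
    """Tokens generable on one full charge at a given energy per token."""
    return battery_j / energy_per_token_j

gemma_270m = tokens_per_charge(0.027)  # ~2.42 million tokens
gemma_1b = tokens_per_charge(0.10)     # ~0.65 million tokens
ratio = gemma_270m / gemma_1b          # ~3.7x fewer tokens for the 1B model
```

This reproduces both figures in the text: 65,376 J / 0.027 J/token ≈ 2.42 million tokens, and 0.10 / 0.027 ≈ 3.7.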

Domain mismatch limits generalization to broader translation tasks (FLORES-200, WMT24++): chat translation involves shorter, slang-dense utterances that differ markedly from the text found in those standard benchmarks.

7. Significance and Future Directions

LiveChatBench provides the first public, real-world benchmark focused on chat translation within live-streaming, enabling reproducible mobile evaluation protocols and model selection criteria under strict device constraints. It complements corpus-centric datasets and workflows such as LiveChat (Gao et al., 2023) by supplying infrastructure for evaluating mobile real-time translation systems.

A plausible implication is that the integration of domain-specific glossaries and parameter-efficient adaptation (such as LoRA) materially closes the gap between on-device models and cloud LLMs for targeted chat domains, while drawing attention to operational limitations including heterogeneity in device hardware, resource balancing, and fallback mechanisms. Further research may address multi-modal context and cross-platform generalization to more languages and live-streaming platforms, as well as the refinement of metrics for multi-party, conversational integrity.
