MultiLexNorm Benchmark & Extensions

Updated 30 January 2026
  • MultiLexNorm is a unified framework that maps non-standard social media tokens to their canonical forms across multiple languages and scripts.
  • It introduces detailed error taxonomies and baseline systems, showcasing state-of-the-art approaches like byte-level ByT5 and modular LLM pipelines.
  • MultiLexNorm++ extends the benchmark to diverse Asian languages, uncovering challenges such as script-specific errors and the need for tailored modeling strategies.

The MultiLexNorm benchmark is a unified evaluation framework for lexical normalization in social media data, initially designed to address noisy, informal language by mapping non-standard tokens to their canonical forms across multiple languages and scripts. Originating with Latin-script Indo-European datasets, MultiLexNorm has recently been extended via MultiLexNorm++ to include typologically diverse Asian languages in distinct scripts, revealing novel challenges for both modeling and evaluation. Comprehensive metrics, error taxonomies, and competitive model baselines define the landscape, with state-of-the-art (SOTA) results driven by byte-level models for Latin scripts and modular LLM pipelines for broader multilingual coverage (Samuel et al., 2021, Buaphet et al., 23 Jan 2026).

1. Benchmark Composition and Data

MultiLexNorm consists of social media datasets sourced from platforms such as Twitter, Instagram, Facebook, DC Inside, and WRIME, spanning multiple languages and scripts. The original 12 datasets (covering 11 languages, including two code-switched sets) are predominantly Indo-European with the Latin alphabet. These datasets feature word-level normalization targets, requiring systems to convert informal or noisy input into canonical text, sometimes involving splits/merges or language-specific edits.

MultiLexNorm++ expands coverage to five Asian languages—Indonesian (Latin script), Japanese (Kanji+Hiragana+Katakana), Korean (Hangul), Thai (Thai script), and Vietnamese (Latin+diacritics). The extended benchmark introduces script-specific error sources (e.g., Thai tone-mark omissions, Korean jamo variants), sociolinguistic phenomena, and diverse token normalization densities (e.g., Indonesian: 47.5% non-standard, Thai: 4.7% normalized) (Buaphet et al., 23 Jan 2026).

| Dataset | Language | Script | Tokens | % Non-standard / Normalized |
|------------|----------|-------------------------|---------|-----------------------------|
| Danish | DA | Latin | 11,816 | 9.25% |
| German | DE | Latin | 25,157 | 17.96% |
| English | EN | Latin | 73,806 | 6.9% |
| Indonesian | ID | Latin | 48,716 | 47.5% |
| Japanese | JA | Kanji+Hiragana+Katakana | 95,411 | 7% |
| Korean | KO | Hangul | 16,618 | 7.5% |
| Thai | TH | Thai | 169,751 | 4.7% |
| Vietnamese | VI | Latin+diacritics | 128,685 | 15.98% |

Annotation is performed in two passes by native speakers, emphasizing word-level corrections. Inter-annotator agreement (κ) ranges from 0.88 to 0.92 on sampled Asian datasets.

2. Task Definition and Error Taxonomy

The core MultiLexNorm task is defined as mapping an input token sequence $x = (x_1, \dots, x_n)$ to an output $y = (y_1, \dots, y_n)$, where each $y_i$ is either the canonical form of $x_i$ (if $x_i$ is non-standard) or $x_i$ itself (if already standard).
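This token-level mapping can be sketched as a minimal lookup-based normalizer; the replacement table and example tokens below are illustrative toys, not benchmark data:

```python
# Hedged sketch of the MultiLexNorm task interface: each input token x_i is
# mapped either to its canonical form (if non-standard) or to itself.
def normalize(tokens, lexicon):
    """Return y_i = lexicon[x_i] for non-standard tokens, else y_i = x_i."""
    return [lexicon.get(tok, tok) for tok in tokens]

lexicon = {"g2g": "got to go", "u": "you"}   # toy replacement table
x = ["g2g", "see", "u", "tmrw"]
y = normalize(x, lexicon)
# "tmrw" is left unchanged because the toy lexicon has no entry for it
```

Note that real systems must also handle 1-to-many mappings (splits, as in "g2g", represented here as a multi-word string) and context-dependent choices a static lexicon cannot resolve.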

Normalization types encompass:

  • Orthographic errors (deletion, insertion, substitution, transposition)
  • Accent mark restoration/removal
  • Casing (sentence-initial, named entities)
  • Apostrophe edits
  • Colloquial/nonstandard forms (“g2g”→“got to go”, vowel omission, repeated characters)
  • Language-specific expansions/reversals (e.g., Indonesian “laki2nya”→“laki-lakinya”)
  • Script-specific errors (e.g., Thai tone marks, Korean jamo compositionality)

MultiLexNorm++ introduces error subcategories for Asian scripts: dialectal/phonetically-driven variants, foreign loanword transliterations, and script-native phenomena (e.g., non-standard kana use in Japanese) (Buaphet et al., 23 Jan 2026).

3. Evaluation Metrics and Methodologies

Intrinsic evaluation leverages Error-Reduction Rate (ERR), which normalizes raw accuracy against a leave-as-is (LAI) baseline:

$$\mathrm{ERR} = \frac{A_\mathrm{sys} - A_\mathrm{lai}}{1 - A_\mathrm{lai}}$$

where $A_\mathrm{sys}$ is system-level accuracy and $A_\mathrm{lai}$ is the accuracy of the leave-as-is baseline. Macro-averaging ERR across languages yields the final ranking.
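A minimal sketch of the ERR computation and its macro-averaging (the accuracy values used below are illustrative, not benchmark results):

```python
# Error-Reduction Rate: system accuracy normalized against the
# leave-as-is (LAI) baseline accuracy.
def err(a_sys, a_lai):
    """ERR = (A_sys - A_lai) / (1 - A_lai)."""
    return (a_sys - a_lai) / (1.0 - a_lai)

def macro_err(per_language_err):
    """Macro-average ERR across languages, as used for the final ranking."""
    return sum(per_language_err) / len(per_language_err)
```

A system that matches the LAI baseline scores 0, a perfect system scores 1, and a system that performs below the baseline scores negative — which is how negative ERR values for Korean arise later in this article.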

Precision ($P$) and recall ($R$) on normalized tokens are combined into the F1 score:

$$F_1 = \frac{2 P R}{P + R}$$

Additional metrics include cross-entropy loss for models optimized via likelihood.
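These token-level metrics can be sketched as follows, assuming one common convention for normalization F1: precision counts over tokens the system changed, recall over tokens the gold standard changed (the exact counting convention is an assumption here):

```python
def norm_f1(src, gold, pred):
    """Precision/recall/F1 on normalized tokens (token-level convention)."""
    tp = sum(1 for x, g, p in zip(src, gold, pred) if g != x and p == g)
    sys_changed = sum(1 for x, p in zip(src, pred) if p != x)
    gold_changed = sum(1 for x, g in zip(src, gold) if g != x)
    precision = tp / sys_changed if sys_changed else 0.0
    recall = tp / gold_changed if gold_changed else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: the system fixes "u" but misses "tmrw".
p, r, f = norm_f1(["u", "see", "tmrw"],
                  ["you", "see", "tomorrow"],
                  ["you", "see", "tmrw"])
```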

Extrinsic evaluation involves dependency parsing: normalized corpora are parsed with MaChAmp using Universal Dependencies (UD) treebanks. Parsing quality is measured by Labeled Attachment Score (LAS), macro-averaged over test sets for each submission (Samuel et al., 2021).

4. Baseline Systems and State-of-the-Art Models

MultiLexNorm has established several baselines:

  • LAI (Leave-As-Is): unchanged input tokens.
  • MFR (Most Frequent Replacement): majority-vote replacement per token type.
  • MoNoise: hybrid (Aspell, FastText, handcrafted heuristics).
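The MFR baseline can be sketched in a few lines; the training pairs below are toy examples, not benchmark data:

```python
from collections import Counter, defaultdict

# Most Frequent Replacement: for each source token type, pick its most
# frequent gold normalization observed in training; unseen tokens are
# left unchanged (falling back to leave-as-is behavior).
def train_mfr(pairs):
    counts = defaultdict(Counter)
    for src, gold in pairs:
        counts[src][gold] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}

def apply_mfr(tokens, table):
    return [table.get(tok, tok) for tok in tokens]
```

Despite its simplicity, majority-vote replacement is a strong baseline on datasets where a small set of frequent non-standard forms accounts for most normalizations.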

ÚFAL’s ByT5-based approach established initial SOTA on MultiLexNorm:

  • ByT5 (byte-level T5, 12 layers, hidden size 768, ~300M params) fine-tuned on synthetic and authentic normalization data.
  • Synthetic pre-training involves noise induction from Wikipedia, parameterized by corpus-observed edit probabilities (e.g., accent removal/addition, apostrophe manipulation).
  • Input is fed word-by-word wrapped in sentinel tokens; output may be single-token or multi-token expansion.
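The noise-induction idea can be illustrated with a small sketch: clean text is corrupted with probability-parameterized edits. Only two edit types (accent stripping and character repetition) are shown, and the probabilities are placeholders; the actual pipeline estimates per-edit probabilities from corpus statistics and covers more edit types:

```python
import random
import unicodedata

def strip_accents(word):
    """Remove combining diacritics, e.g. 'café' -> 'cafe'."""
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def noisify(tokens, p_accent=0.5, p_repeat=0.1, rng=None):
    """Corrupt clean tokens into synthetic 'social media' input."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        if rng.random() < p_accent:
            tok = strip_accents(tok)
        if rng.random() < p_repeat and tok:
            tok = tok + tok[-1] * 2          # e.g. "cool" -> "coolll"
        out.append(tok)
    return out
```

Pairing each corrupted sequence with its clean original yields unlimited synthetic (noisy, canonical) training data for pre-training.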

Performance summary (macro-ERR over the 12 original languages):

| Team | Avg ERR (%) | Top Single Language |
|------------------|-------------|---------------------|
| ByT5 (ensemble) | 67.3 | Slovenian (80.1) |
| HEL-LJU | 53.6 | Slovenian (67.0) |
| MoNoise | 49.0 | English (74.3) |
| MFR | 38.4 | English (64.9) |
| LAI | 0.0 | – |

ByT5 demonstrated robust accuracy on Latin scripts (ERR boosts up to +25.6 over mT5), with further gains via synthetic pre-training (+5.7), train+dev fine-tuning (+1.3), and ensemble averaging (+1.1) (Samuel et al., 2021).

In MultiLexNorm++, ByT5’s performance on Asian scripts degrades (e.g., ERR < 0 on Korean), necessitating new architectures:

  • Modular pipeline: (i) token-level detection (XLM-R), (ii) low-entropy dictionary lookup, (iii) in-context LLM normalization (Gemma-3-27B-it, Llama-3-70B, Qwen2.5-72B, GPT-4o).
  • Prompted in-context learning with few-shot examples; greedy or beam decoding.
  • No supervised fine-tuning; relies on model generalization and prompt engineering (Buaphet et al., 23 Jan 2026).
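The three-stage pipeline above can be sketched schematically, with the detector and LLM stubbed out; real systems would plug in an XLM-R token classifier and an instruction-tuned LLM, and the dictionary contents here are hypothetical:

```python
# Schematic of the modular pipeline: detection -> dictionary lookup -> LLM.
def normalize_pipeline(tokens, detect, dictionary, llm_normalize):
    out = []
    for tok in tokens:
        if not detect(tok):                 # stage (i): token-level detection
            out.append(tok)
        elif tok in dictionary:             # stage (ii): low-entropy lookup
            out.append(dictionary[tok])
        else:                               # stage (iii): in-context LLM call
            out.append(llm_normalize(tok))
    return out

# Toy stubs standing in for the learned components.
detect = lambda t: t in {"u", "tmrw"}
dictionary = {"u": "you"}
llm = lambda t: {"tmrw": "tomorrow"}.get(t, t)
```

Routing high-confidence cases through the dictionary keeps the expensive LLM call reserved for genuinely ambiguous tokens.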

5. Experimental Results

On the original MultiLexNorm 12-language suite, ByT5 (ensemble) remains SOTA, achieving 67.3% macro-ERR and 64.17% LAS in dependency parsing. For the Asian languages added in MultiLexNorm++, ERR scores (%) are as follows:

| Language | ByT5 (ÚFAL) | GPT-4o | LLaMA3 | Qwen2.5 | Gemma3 |
|------------|-------------|--------|--------|---------|--------|
| Indonesian | 67.94 | 68.35 | 63.24 | 64.73 | 66.45 |
| Japanese | 39.24 | 22.52 | 16.27 | 20.34 | 14.66 |
| Korean | –2.38 | –8.73 | –14.82 | –6.88 | –8.73 |
| Thai | 6.90 | 43.95 | 40.39 | 42.23 | 43.09 |
| Vietnamese | 77.35 | 80.04 | 76.41 | 77.79 | 77.57 |

LLM-based pipelines are substantially more robust than ByT5 on Thai and edge it out on Indonesian and Vietnamese, but ByT5 retains the lead on Japanese, and all systems score negative ERR on Korean; a plausible implication is the need for script-specific representations and training.

6. Error Analysis and Limitations

Error breakdowns on new languages reveal:

  • Wrong normalization choice dominates errors.
  • Over-normalization (unwarranted token substitution) and under-normalization (missed corrections) are frequent.
  • Detection misses (tokens not flagged for normalization in upstream sequence labeling).

Manual analysis indicates remaining challenges in spelling, phonetic variants, loanword transliterations, abbreviations, and informal slang. LLMs exhibit differential strengths on abbreviations and loanwords (Qwen2.5) but struggle on phonetic and dialectal errors.

Limitations of current SOTA include:

  • Byte-level modeling degrades substantially on non-Latin scripts, evidenced by negative ERR on Korean.
  • Efficiency constraints, as ByT5 encodes each token independently (≈90 words/s on RTX 3090).
  • Synthetic pre-training relies on accurately estimating per-language error probabilities, which can be prohibitive for low-resource settings.

7. Open Problems and Future Directions

Key directions for further research highlighted by MultiLexNorm++ include:

  • Script-specific model architectures (e.g., sub-character, stroke-level representations for languages such as Japanese, Korean, Thai).
  • Multilingual pretraining on naturally "noisy" social media data, replacing synthetic noise induction.
  • Joint training for normalization plus downstream tasks (NER, parsing), to improve overall robustness.
  • Expanding the normalization task to sentence-level or document-level mappings to capture context-dependent splits/merges.

MultiLexNorm and MultiLexNorm++ collectively demonstrate the necessity of adaptive modeling, sophisticated error taxonomies, and targeted augmentation for robust lexical normalization in multilingual and multi-script social media data (Samuel et al., 2021, Buaphet et al., 23 Jan 2026).
