MultiLexNorm Benchmark & Extensions

Updated 30 January 2026
  • MultiLexNorm is a unified framework that maps non-standard social media tokens to their canonical forms across multiple languages and scripts.
  • It introduces detailed error taxonomies and baseline systems, showcasing state-of-the-art approaches like byte-level ByT5 and modular LLM pipelines.
  • MultiLexNorm++ extends the benchmark to diverse Asian languages, uncovering challenges such as script-specific errors and the need for tailored modeling strategies.

The MultiLexNorm benchmark is a unified evaluation framework for lexical normalization in social media data, initially designed to address noisy, informal language by mapping non-standard tokens to their canonical forms across multiple languages and scripts. Originating with Latin-script Indo-European datasets, MultiLexNorm has recently been extended via MultiLexNorm++ to include typologically diverse Asian languages in distinct scripts, revealing novel challenges for both modeling and evaluation. Comprehensive metrics, error taxonomies, and competitive model baselines define the landscape, with state-of-the-art (SOTA) results driven by byte-level models for Latin scripts and modular LLM pipelines for broader multilingual coverage (Samuel et al., 2021, Buaphet et al., 23 Jan 2026).

1. Benchmark Composition and Data

MultiLexNorm consists of social media datasets sourced from platforms such as Twitter, Instagram, Facebook, DC Inside, and WRIME, spanning multiple languages and scripts. The original 12 datasets (covering 11 languages, including two code-switched sets) are predominantly Indo-European with the Latin alphabet. These datasets feature word-level normalization targets, requiring systems to convert informal or noisy input into canonical text, sometimes involving splits/merges or language-specific edits.

MultiLexNorm++ expands coverage to five Asian languages—Indonesian (Latin script), Japanese (Kanji+Hiragana+Katakana), Korean (Hangul), Thai (Thai script), and Vietnamese (Latin+diacritics). The extended benchmark introduces script-specific error sources (e.g., Thai tone-mark omissions, Korean jamo variants), sociolinguistic phenomena, and diverse token normalization densities (e.g., Indonesian: 47.5% non-standard, Thai: 4.7% normalized) (Buaphet et al., 23 Jan 2026).

| Dataset | Language | Script | Tokens | % Non-standard / Normalized |
|------------|----------|-------------------------|---------|-----------------------------|
| Danish | DA | Latin | 11,816 | 9.25% |
| German | DE | Latin | 25,157 | 17.96% |
| English | EN | Latin | 73,806 | 6.9% |
| Indonesian | ID | Latin | 48,716 | 47.5% |
| Japanese | JA | Kanji+Hiragana+Katakana | 95,411 | 7% |
| Korean | KO | Hangul | 16,618 | 7.5% |
| Thai | TH | Thai | 169,751 | 4.7% |
| Vietnamese | VI | Latin+diacritics | 128,685 | 15.98% |

Annotation is performed in two passes by native speakers, emphasizing word-level corrections. Inter-annotator agreement (κ) ranges from 0.88 to 0.92 on sampled Asian datasets.

2. Task Definition and Error Taxonomy

The core MultiLexNorm task is defined as mapping an input token sequence $x = (x_1, \dots, x_n)$ to an output $y = (y_1, \dots, y_n)$, where each $y_i$ is either the canonical form of $x_i$ (if $x_i$ is non-standard) or $x_i$ itself (if already standard).
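This token-level mapping can be sketched as a minimal lookup-based normalizer; the replacement table and example tokens below are illustrative toys, not benchmark data:

```python
# Hedged sketch of the MultiLexNorm task interface: each input token x_i is
# mapped either to its canonical form (if non-standard) or to itself.
def normalize(tokens, lexicon):
    """Return y_i = lexicon[x_i] for non-standard tokens, else y_i = x_i."""
    return [lexicon.get(tok, tok) for tok in tokens]

lexicon = {"g2g": "got to go", "u": "you"}   # toy replacement table
x = ["g2g", "see", "u", "tmrw"]
y = normalize(x, lexicon)
# "tmrw" is left unchanged because the toy lexicon has no entry for it
```

Note that real systems must also handle 1-to-many mappings (splits, as in "g2g", represented here as a multi-word string) and context-dependent choices a static lexicon cannot resolve.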

Normalization types encompass:

  • Orthographic errors (deletion, insertion, substitution, transposition)
  • Accent mark restoration/removal
  • Casing (sentence-initial, named entities)
  • Apostrophe edits
  • Colloquial/nonstandard forms (“g2g”→“got to go”, vowel omission, repeated characters)
  • Language-specific expansions/reversals (e.g., Indonesian “laki2nya”→“laki-lakinya”)
  • Script-specific errors (e.g., Thai tone marks, Korean jamo compositionality)

MultiLexNorm++ introduces error subcategories for Asian scripts: dialectal/phonetically-driven variants, foreign loanword transliterations, and script-native phenomena (e.g., non-standard kana use in Japanese) (Buaphet et al., 23 Jan 2026).

3. Evaluation Metrics and Methodologies

Intrinsic evaluation leverages Error-Reduction Rate (ERR), which normalizes raw accuracy against a leave-as-is (LAI) baseline:

$$\mathrm{ERR} = \frac{A_\mathrm{sys} - A_\mathrm{lai}}{1 - A_\mathrm{lai}}$$

where $A_\mathrm{sys}$ is system-level accuracy and $A_\mathrm{lai}$ is the accuracy of the leave-as-is baseline. Macro-averaging ERR across languages yields the final ranking.
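A minimal sketch of the ERR computation and its macro-averaging (the accuracy values used below are illustrative, not benchmark results):

```python
# Error-Reduction Rate: system accuracy normalized against the
# leave-as-is (LAI) baseline accuracy.
def err(a_sys, a_lai):
    """ERR = (A_sys - A_lai) / (1 - A_lai)."""
    return (a_sys - a_lai) / (1.0 - a_lai)

def macro_err(per_language_err):
    """Macro-average ERR across languages, as used for the final ranking."""
    return sum(per_language_err) / len(per_language_err)
```

A system that matches the LAI baseline scores 0, a perfect system scores 1, and a system that performs below the baseline scores negative — which is how negative ERR values for Korean arise later in this article.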

Precision ($P$) and recall ($R$) on normalized tokens are combined into the F1 score:

$$F_1 = \frac{2 P R}{P + R}$$

Additional metrics include cross-entropy loss for models optimized via likelihood.
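These token-level metrics can be sketched as follows, assuming one common convention for normalization F1: precision counts over tokens the system changed, recall over tokens the gold standard changed (the exact counting convention is an assumption here):

```python
def norm_f1(src, gold, pred):
    """Precision/recall/F1 on normalized tokens (token-level convention)."""
    tp = sum(1 for x, g, p in zip(src, gold, pred) if g != x and p == g)
    sys_changed = sum(1 for x, p in zip(src, pred) if p != x)
    gold_changed = sum(1 for x, g in zip(src, gold) if g != x)
    precision = tp / sys_changed if sys_changed else 0.0
    recall = tp / gold_changed if gold_changed else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: the system fixes "u" but misses "tmrw".
p, r, f = norm_f1(["u", "see", "tmrw"],
                  ["you", "see", "tomorrow"],
                  ["you", "see", "tmrw"])
```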

Extrinsic evaluation involves dependency parsing: normalized corpora are parsed with MaChAmp using Universal Dependencies (UD) treebanks. Parsing quality is measured by Labeled Attachment Score (LAS), macro-averaged over test sets for each submission (Samuel et al., 2021).

4. Baseline Systems and State-of-the-Art Models

MultiLexNorm has established several baselines:

  • LAI (Leave-As-Is): unchanged input tokens.
  • MFR (Most Frequent Replacement): majority-vote replacement per token type.
  • MoNoise: hybrid (Aspell, FastText, handcrafted heuristics).
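The MFR baseline can be sketched in a few lines; the training pairs below are toy examples, not benchmark data:

```python
from collections import Counter, defaultdict

# Most Frequent Replacement: for each source token type, pick its most
# frequent gold normalization observed in training; unseen tokens are
# left unchanged (falling back to leave-as-is behavior).
def train_mfr(pairs):
    counts = defaultdict(Counter)
    for src, gold in pairs:
        counts[src][gold] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}

def apply_mfr(tokens, table):
    return [table.get(tok, tok) for tok in tokens]
```

Despite its simplicity, majority-vote replacement is a strong baseline on datasets where a small set of frequent non-standard forms accounts for most normalizations.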

ÚFAL’s ByT5-based approach established initial SOTA on MultiLexNorm:

  • ByT5 (byte-level T5, 12 layers, hidden size 768, ~300M params) fine-tuned on synthetic and authentic normalization data.
  • Synthetic pre-training involves noise induction from Wikipedia, parameterized by corpus-observed edit probabilities (e.g., accent removal/addition, apostrophe manipulation).
  • Input is fed word-by-word wrapped in sentinel tokens; output may be single-token or multi-token expansion.
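The noise-induction idea can be illustrated with a small sketch: clean text is corrupted with probability-parameterized edits. Only two edit types (accent stripping and character repetition) are shown, and the probabilities are placeholders; the actual pipeline estimates per-edit probabilities from corpus statistics and covers more edit types:

```python
import random
import unicodedata

def strip_accents(word):
    """Remove combining diacritics, e.g. 'café' -> 'cafe'."""
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def noisify(tokens, p_accent=0.5, p_repeat=0.1, rng=None):
    """Corrupt clean tokens into synthetic 'social media' input."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        if rng.random() < p_accent:
            tok = strip_accents(tok)
        if rng.random() < p_repeat and tok:
            tok = tok + tok[-1] * 2          # e.g. "cool" -> "coolll"
        out.append(tok)
    return out
```

Pairing each corrupted sequence with its clean original yields unlimited synthetic (noisy, canonical) training data for pre-training.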

Performance summary (macro-ERR over the 12 original languages):

| Team | Avg ERR (%) | Top Single Language |
|------------------|-------------|---------------------|
| ByT5 (ensemble) | 67.3 | Slovenian (80.1) |
| HEL-LJU | 53.6 | Slovenian (67.0) |
| MoNoise | 49.0 | English (74.3) |
| MFR | 38.4 | English (64.9) |
| LAI | 0.0 | – |

ByT5 demonstrated robust accuracy on Latin scripts (ERR boosts up to +25.6 over mT5), with further gains via synthetic pre-training (+5.7), train+dev fine-tuning (+1.3), and ensemble averaging (+1.1) (Samuel et al., 2021).

In MultiLexNorm++, ByT5’s performance on Asian scripts degrades (e.g., ERR < 0 on Korean), necessitating new architectures:

  • Modular pipeline: (i) token-level detection (XLM-R), (ii) low-entropy dictionary lookup, (iii) in-context LLM normalization (Gemma-3-27B-it, Llama-3-70B, Qwen2.5-72B, GPT-4o).
  • Prompted in-context learning with few-shot examples; greedy or beam decoding.
  • No supervised fine-tuning; relies on model generalization and prompt engineering (Buaphet et al., 23 Jan 2026).
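The three-stage pipeline above can be sketched schematically, with the detector and LLM stubbed out; real systems would plug in an XLM-R token classifier and an instruction-tuned LLM, and the dictionary contents here are hypothetical:

```python
# Schematic of the modular pipeline: detection -> dictionary lookup -> LLM.
def normalize_pipeline(tokens, detect, dictionary, llm_normalize):
    out = []
    for tok in tokens:
        if not detect(tok):                 # stage (i): token-level detection
            out.append(tok)
        elif tok in dictionary:             # stage (ii): low-entropy lookup
            out.append(dictionary[tok])
        else:                               # stage (iii): in-context LLM call
            out.append(llm_normalize(tok))
    return out

# Toy stubs standing in for the learned components.
detect = lambda t: t in {"u", "tmrw"}
dictionary = {"u": "you"}
llm = lambda t: {"tmrw": "tomorrow"}.get(t, t)
```

Routing high-confidence cases through the dictionary keeps the expensive LLM call reserved for genuinely ambiguous tokens.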

5. Experimental Results

On the original MultiLexNorm 12-language suite, ByT5 (ensemble) remains SOTA, achieving 67.3% macro-ERR and 64.17% LAS in dependency parsing. For the Asian languages added in MultiLexNorm++, ERR scores (%) are as follows:

| Language | ByT5 (ÚFAL) | GPT-4o | LLaMA3 | Qwen2.5 | Gemma3 |
|------------|-------------|--------|--------|---------|--------|
| Indonesian | 67.94 | 68.35 | 63.24 | 64.73 | 66.45 |
| Japanese | 39.24 | 22.52 | 16.27 | 20.34 | 14.66 |
| Korean | –2.38 | –8.73 | –14.82 | –6.88 | –8.73 |
| Thai | 6.90 | 43.95 | 40.39 | 42.23 | 43.09 |
| Vietnamese | 77.35 | 80.04 | 76.41 | 77.79 | 77.57 |

LLM-based pipelines are substantially more robust than ByT5 on Thai and edge it out on Indonesian and Vietnamese, but ByT5 retains the lead on Japanese, and all systems score negative ERR on Korean; a plausible implication is the need for script-specific representations and training.

6. Error Analysis and Limitations

Error breakdowns on new languages reveal:

  • Wrong normalization choice dominates errors.
  • Over-normalization (unwarranted token substitution) and under-normalization (missed corrections) are frequent.
  • Detection misses (tokens not flagged for normalization in upstream sequence labeling).

Manual analysis indicates remaining challenges in spelling, phonetic variants, loanword transliterations, abbreviations, and informal slang. LLMs exhibit differential strengths on abbreviations and loanwords (Qwen2.5) but struggle on phonetic and dialectal errors.

Limitations of current SOTA include:

  • Byte-level modeling degrades substantially on non-Latin scripts, evidenced by negative ERR on Korean.
  • Efficiency constraints, as ByT5 encodes each token independently (≈90 words/s on RTX 3090).
  • Synthetic pre-training relies on accurately estimating per-language error probabilities, which can be prohibitive for low-resource settings.

7. Open Problems and Future Directions

Key directions for further research highlighted by MultiLexNorm++ include:

  • Script-specific model architectures (e.g., sub-character, stroke-level representations for languages such as Japanese, Korean, Thai).
  • Multilingual pretraining on naturally "noisy" social media data, replacing synthetic noise induction.
  • Joint training for normalization plus downstream tasks (NER, parsing), to improve overall robustness.
  • Expanding the normalization task to sentence-level or document-level mappings to capture context-dependent splits/merges.

MultiLexNorm and MultiLexNorm++ collectively demonstrate the necessity of adaptive modeling, sophisticated error taxonomies, and targeted augmentation for robust lexical normalization in multilingual and multi-script social media data (Samuel et al., 2021, Buaphet et al., 23 Jan 2026).
