Exploring Robustness of Multilingual LLMs on Real-World Noisy Data

Published 14 Jan 2025 in cs.CL | (2501.08322v1)

Abstract: LLMs are trained on Web data that might contain spelling errors made by humans. But do they become robust to similar real-world noise? In this paper, we investigate the effect of real-world spelling mistakes on the performance of 9 LLMs, with parameters ranging from 0.2B to 13B, in 3 different NLP tasks, namely Natural Language Inference (NLI), Named Entity Recognition (NER), and Intent Classification (IC). We perform our experiments on 6 different languages and build a dictionary of real-world noise for them using the Wikipedia edit history. We show that the performance gap of the studied models on the clean and noisy test data, averaged across all the datasets and languages, ranges from 2.3 to 4.3 absolute percentage points. In addition, mT5 models, in general, show more robustness compared to BLOOM, Falcon, and BERT-like models. In particular, mT5 (13B) was the most robust on average overall, across the 3 tasks, and in 4 of the 6 languages.

Summary

  • The paper investigates the robustness of multilingual LLMs to real-world spelling errors across NLI, NER, and IC tasks in six languages using the WikiTypo corpus.
  • Findings indicate performance gaps of 2.3-4.3 percentage points between clean and noisy data, with mT5 models, especially the 13B version, exhibiting greater robustness than BLOOM and Falcon.
  • All models demonstrate vulnerability to noise, with robustness influenced by model size, architecture, and task, and the study highlights limitations regarding model size, language coverage, and noise types evaluated.

The paper investigates the robustness of multilingual LLMs to real-world spelling errors across multiple languages and tasks. The study evaluates nine LLMs, ranging from 0.2B to 13B parameters, on three NLP (Natural Language Processing) tasks: NLI (Natural Language Inference), NER (Named Entity Recognition), and Intent Classification (IC). The experiments are conducted on six languages, using a novel dictionary of real-world noise called WikiTypo, which is built from Wikipedia edit history.

The authors address three primary research questions:

  1. Are larger models more robust to real-world noisy data than smaller models?
  2. Are different tasks equally sensitive to real-world noise?
  3. How does model performance differ from English to other languages under noise?

The findings indicate that the performance gap between clean and noisy test data ranges from 2.3 to 4.3 absolute percentage points. The mT5 models generally exhibit greater robustness compared to BLOOM, Falcon, and BERT-like models. Specifically, the mT5 (13B) model demonstrates the highest average robustness across tasks and in most of the tested languages.

The study constructs the WikiTypo corpus by parsing Wikipedia revisions and extracting pairs of words with a Levenshtein edit distance of 1, excluding pairs with special characters or fewer than two characters. The resulting dictionary contains a varying number of typos for each language: English (9370), German (15000), Spanish (8200), French (14900), Hindi (3300), and Turkish (4060). For NER, the NLPAug library is used to generate noisy data due to the limited number of misspelled proper nouns in the WikiTypo corpus. To create noisy test sets, words in a sentence are randomly replaced with their incorrect versions from the noise dictionary, with hyperparameters r=0.2 (ratio of the sentence to be changed) and m=4 (maximum number of augmented words).
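The filtering and injection steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the rounding of the replacement budget, and the alphabetic-only filter are assumptions where the summary leaves details unspecified.

```python
import random

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def keep_pair(correct: str, typo: str) -> bool:
    """WikiTypo-style filter: edit distance exactly 1, no special
    characters (approximated here as alphabetic-only), length >= 2."""
    return (len(correct) >= 2 and len(typo) >= 2
            and correct.isalpha() and typo.isalpha()
            and edit_distance(correct, typo) == 1)

def add_noise(tokens, noise_dict, r=0.2, m=4, rng=None):
    """Replace up to min(round(r * len), m) tokens with typos drawn
    from the noise dictionary, mirroring the r=0.2 / m=4 setup."""
    rng = rng or random.Random(0)
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in noise_dict]
    budget = min(max(1, round(r * len(out))), m)
    for i in rng.sample(candidates, min(budget, len(candidates))):
        out[i] = rng.choice(noise_dict[out[i]])
    return out
```

For example, `keep_pair("definitely", "definately")` holds (one substitution), while a transposition such as `"the"`/`"teh"` has Levenshtein distance 2 and would be filtered out.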

The models are fine-tuned on multilingual datasets, and their performance is assessed on both clean and noisy test sets. The datasets used include SNIPS for intent classification, XNLI for natural language inference, and WikiANN for named entity recognition. To ensure uniformity, each model is trained on a combined, shuffled dataset containing all six languages.
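The combined training setup can be sketched as below; the dict-of-lists layout and the fixed seed are assumptions for illustration, not the paper's actual data schema.

```python
import random

def build_training_set(datasets_by_lang, seed=42):
    """Concatenate per-language examples and shuffle them once, so each
    model is trained on a single uniform multilingual stream covering
    all six languages."""
    combined = [ex for _, examples in sorted(datasets_by_lang.items())
                for ex in examples]
    random.Random(seed).shuffle(combined)
    return combined
```

Sorting the language keys before concatenating keeps the result deterministic for a given seed, which makes runs across the nine models comparable.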

The experimental setup includes mBERT-base, XLM-RoBERTa-base, multiple sizes of mT5, Falcon-7B, and BLOOM-7B. mT5, like T5, employs an encoder-decoder Transformer, whereas Falcon uses a decoder-only setup based on PaLM and BLOOM likewise uses a causal decoder-only Transformer.

The fine-tuning process involves monitoring training and validation loss to prevent overfitting; most models require only two epochs, while a few are fine-tuned for six epochs to ensure convergence.

The results show that the largest mT5 model (13B) is the most robust, while the smallest (mT5-300M) is the most vulnerable. Across architectures and tokenizers, mT5 models are the least vulnerable to typos, with an average performance degradation of 2.27 percentage points, compared to Falcon (3.67) and BLOOM (4.27). The NLI task generally shows the largest performance gaps, while intent classification shows the smallest. Decoder-only models such as BLOOM and Falcon tend to exhibit larger gaps on the NER task. Furthermore, English exhibits the highest performance gap, potentially due to the quantity and types of noise inserted.
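The robustness comparison reduces to a simple calculation: the gap is the clean score minus the noisy score in absolute percentage points, and the most robust model is the one with the smallest average gap. A minimal sketch (the function names are illustrative):

```python
def average_gap(score_pairs):
    """Mean clean-minus-noisy gap in absolute percentage points,
    averaged over (clean, noisy) score pairs, e.g. one per dataset."""
    gaps = [clean - noisy for clean, noisy in score_pairs]
    return sum(gaps) / len(gaps)

def most_robust(avg_gap_by_model):
    """Model family with the smallest average degradation."""
    return min(avg_gap_by_model, key=avg_gap_by_model.get)
```

Applied to the averages reported above, `most_robust({"mT5": 2.27, "Falcon": 3.67, "BLOOM": 4.27})` returns `"mT5"`.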

Introducing noise to the training data narrows the performance gap, as demonstrated by fine-tuning the BLOOM model on noisy WikiANN data. An analysis of part-of-speech (PoS) tags within the injected noise reveals a higher proportion of noisy verb instances in English compared to other languages, contributing to the language's higher performance gap on the XNLI dataset.

The authors conclude that all models exhibit vulnerability to noisy input, and that model robustness is influenced by factors such as training data size and language coverage, architectural design, parameter count, and the specific task.

The authors identify several limitations: the evaluation did not include models larger than 13B parameters or consider more than six languages. The study also primarily used Wikipedia edits for noise generation, which may not capture all types of real-world noise, and focused on a single noise insertion ratio. Finally, the WikiTypo corpus is limited in proper noun examples, necessitating the use of different strategies for generating noisy test sets for the WikiANN dataset.
