XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond (2104.12250v2)

Published 25 Apr 2021 in cs.CL

Abstract: Language models are ubiquitous in current NLP, and their multilingual capacity has recently attracted considerable attention. However, current analyses have almost exclusively focused on (multilingual variants of) standard benchmarks, and have relied on clean pre-training and task-specific corpora as multilingual signals. In this paper, we introduce XLM-T, a model to train and evaluate multilingual language models in Twitter. This paper provides: (1) a new strong multilingual baseline consisting of an XLM-R (Conneau et al. 2020) model pre-trained on millions of tweets in over thirty languages, alongside starter code to subsequently fine-tune on a target task; and (2) a set of unified sentiment analysis Twitter datasets in eight different languages and a XLM-T model fine-tuned on them.

Authors (3)
  1. Francesco Barbieri (29 papers)
  2. Luis Espinosa Anke (9 papers)
  3. Jose Camacho-Collados (58 papers)
Citations (184)

Summary

Multilingual Language Models in Social Media: Insights from XLM-Twitter

The paper "XLM-T: Multilingual LLMs in Twitter for Sentiment Analysis and Beyond" navigates the intersection of multilingual NLP and social media platforms, specifically Twitter. It identifies a notable lacuna in the application of widely adopted multilingual LLMs to the inherently noisy and diverse data environment of Twitter, a platform characterized by lexical nonuniformity, slang, abbreviations, and multilingual expressions.

Key Contributions

The core contribution of the paper is the introduction of XLM-Twitter, a multilingual model obtained by continuing the pre-training of XLM-R on a corpus of 198 million tweets covering more than thirty languages. The authors establish two primary components within their research framework:

  1. Multilingual Pre-training Baseline: Starting from the public XLM-R checkpoint, XLM-Twitter is further pre-trained on an expansive multilingual Twitter corpus and released together with starter code for fine-tuning on a target task. This in-domain pre-training improves the model's adaptability to the broad spectrum of languages and registers used on Twitter (a minimal usage sketch follows this list).
  2. Unified Sentiment Analysis Datasets: The paper introduces a comprehensive multilingual benchmark comprising sentiment analysis data from eight diverse languages. This unified dataset allows for extensive explorations of the model's zero-shot and cross-lingual performance, demonstrating how it can provide insights beyond conventional monolingual settings.
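
The released checkpoints can be loaded directly with the Hugging Face transformers library. The sketch below is a minimal illustration, assuming the authors' publicly released checkpoint name ("cardiffnlp/twitter-xlm-roberta-base-sentiment" for the model fine-tuned on the unified sentiment datasets) and a three-way negative/neutral/positive label scheme; it is not code taken from the paper.

```python
# Minimal sketch: multilingual tweet sentiment with the released XLM-T checkpoint.
# The checkpoint name and label order are assumptions based on the public release.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

labels = ["negative", "neutral", "positive"]  # assumed 3-class scheme

tweets = [
    "Huge win for the team today!",           # English
    "Qué día tan horrible, todo salió mal.",  # Spanish
]

with torch.no_grad():
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    probs = torch.softmax(model(**batch).logits, dim=-1)

for text, p in zip(tweets, probs):
    print(f"{text} -> {labels[int(p.argmax())]} ({p.max().item():.2f})")
```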

Methodology and Evaluation

The authors adopted a rigorous methodology involving continued pre-training of the XLM-R model on Twitter-specific data, followed by fine-tuning for sentiment analysis. For fine-tuning they used the adapter technique, keeping the base language model's weights frozen while optimizing only the parameters of small task-specific adapter layers, thus improving efficiency and adaptability.
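
A generic sketch of this adapter setup in PyTorch is given below: a small bottleneck module with a residual connection is inserted after each transformer layer, the pre-trained weights are frozen, and only the adapters and the classification head receive gradients. The bottleneck size and placement here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic adapter block: down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()
        # Zero-initialize the up-projection so the adapter starts as an identity
        # function and does not disturb the frozen model's representations.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# In the full setup, one adapter per transformer layer is trained while the
# pre-trained XLM-Twitter weights stay frozen, e.g.:
#   for p in base_model.parameters():
#       p.requires_grad = False
# Here we only check that the block preserves the hidden-state shape.
adapter = BottleneckAdapter()
dummy = torch.randn(2, 16, 768)  # (batch, sequence length, hidden size)
assert adapter(dummy).shape == dummy.shape
```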

Evaluation was conducted across several paradigms:

  • Monolingual Evaluation: On the TweetEval benchmark, XLM-Twitter displayed strong performance across seven English Twitter classification tasks.
  • Multilingual and Cross-lingual Evaluation: The model was further tested in zero-shot and fully multilingual scenarios, where it outperformed the vanilla XLM-R model in most cases (an evaluation sketch follows this list). Improvements were particularly noteworthy for typologically distant and under-represented languages such as Hindi, where large-scale multilingual Twitter pre-training yielded discernible gains.
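
The cross-lingual protocol can be summarized as: fine-tune on one language's training split, then score the resulting model on every other language's test split. The helper below sketches that scoring step, assuming the per-language test sets are already available as lists of texts and integer labels; macro-averaged F1 is used here purely as an illustrative metric, and loading of the unified benchmark itself is left abstract.

```python
from typing import Dict, List, Tuple

import torch
from sklearn.metrics import f1_score
from transformers import PreTrainedModel, PreTrainedTokenizer

def zero_shot_scores(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    test_sets: Dict[str, Tuple[List[str], List[int]]],
) -> Dict[str, float]:
    """Score one fine-tuned classifier on every language's test split."""
    model.eval()
    scores = {}
    for lang, (texts, gold) in test_sets.items():
        with torch.no_grad():
            batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
            preds = model(**batch).logits.argmax(dim=-1).tolist()
        scores[lang] = f1_score(gold, preds, average="macro")
    return scores

# Usage sketch: a model fine-tuned on, e.g., Spanish tweets is evaluated
# zero-shot on the remaining languages of the unified benchmark:
#   scores = zero_shot_scores(model, tokenizer, {"hindi": (hi_texts, hi_labels), ...})
```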

Implications and Future Directions

The implications of this paper are substantial from both practical and theoretical perspectives. Practically, domain-specific multilingual models have direct applications in social media analytics, sentiment analysis, and automated content moderation across diverse linguistic communities. Theoretically, the research supports the hypothesis that multilingual models pre-trained specifically on social media data generalize better in this domain, even for typologically diverse languages.

Looking ahead, the paper suggests several avenues for further research, including extending the framework to additional languages and NLP tasks, and deepening the cross-lingual zero-shot analysis to understand performance across linguistically related groups. It also notes that studying how evolving trends on social media affect model performance could yield valuable insights into the temporal dynamics of model efficacy in fast-paced data domains like Twitter.

In conclusion, the paper establishes a valuable resource in the XLM-Twitter model and sets a benchmark for subsequent investigations into multilingual NLP in the context of social media, underscoring the significance of tailored, domain-specific pre-training in this continually evolving field.