TwHIN-BERT: Socially-Enriched Multilingual PLM

Updated 3 April 2026
  • TwHIN-BERT is a multilingual pre-trained language model that incorporates Twitter's social signals to improve representations of short, noisy tweet texts.
  • It combines masked language modeling with a novel TwHIN contrastive loss derived from heterogeneous social engagement, yielding substantial performance gains on social tasks.
  • Trained on 7 billion tweets across 100+ languages, it demonstrates enhanced results in hashtag prediction, social engagement, and tweet classification benchmarks.

TwHIN-BERT is a production-grade, multilingual, socially-enriched pre-trained language model (PLM) developed at Twitter, designed to encode the short, noisy, user-generated text found on the platform. It combines textual self-supervision with a social engagement signal derived from the Twitter Heterogeneous Information Network (TwHIN), yielding substantial improvements over prior PLMs on multilingual social recommendation and semantic understanding tasks. TwHIN-BERT is trained on 7 billion tweets spanning over 100 languages and is openly released alongside new multilingual benchmarks for hashtag prediction and social engagement (Zhang et al., 2022).

1. Motivation: Modeling Tweets with Social Signals

Standard PLMs, trained on relatively clean and well-structured corpora such as books, Wikipedia, or CommonCrawl, underperform on tweets due to the unique challenges of microblogging text. Tweets are typically very short (often under 20 tokens), highly noisy (misspellings, abbreviations, emojis, code-mixing, unconventional punctuation), and make heavy use of topical tokens (hashtags, @-mentions) that encapsulate much of the semantic content.

Social signals on Twitter—in the form of favorites, retweets, replies, follows, and quote-tweets—encode implicit "ground truth" about topical or semantic similarity: tweets co-engaged by the same users are likely related, even if textual cues are sparse (e.g., “bottom of the ninth, two outs, down by one!!”). TwHIN-BERT leverages these engagement patterns using the Twitter Heterogeneous Information Network (TwHIN): a bipartite, typed-edge graph linking users and tweets with engagement metadata. This socially-informed supervision is intended to align tweet representations according to real-world user interaction patterns, supplementing limited textual signal with rich relational context.

2. Architecture and Pre-training Objectives

2.1 Transformer Backbone and Tokenization

TwHIN-BERT adopts the same Transformer architecture as BERT and XLM-R. The base configuration comprises 12 layers, 768-dimensional hidden states, and 12 attention heads; the large configuration uses 24 layers, 1,024-dimensional hidden states, and 16 attention heads. Tokenization employs the 250,000-subword multilingual SentencePiece unigram model from XLM-R, enabling effective handling of code-mixing and non-standard linguistic forms within tweets. The maximum input sequence length is set to 128 tokens.

For socially-aware contrastive learning, a 2-layer MLP projection head produces embeddings $z_t$ from the pooled [CLS] outputs of the Transformer: [768 → 768] for base, [1024 → 512] for large.

2.2 Text-based Self-supervision: Masked Language Modeling

TwHIN-BERT incorporates the standard masked language modeling (MLM) objective. Given a token sequence $x = (w_1, \dots, w_n)$ with a masked subset $M$, the MLM loss is

$$L_{\text{text}} = \mathbb{E}_x \left[ -\sum_{i \in M} \log P(w_i \mid x_{\neg M}) \right].$$
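The MLM objective can be illustrated with a small sketch. It assumes a 15% masking rate (the common BERT default, not stated in this article) and skips the usual random-replacement and keep-unchanged variants; `mask_tokens`, `mlm_loss`, and the toy probabilities are hypothetical names and values chosen for illustration.

```python
import math
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and the list of masked positions M.
    The 15% rate is an assumption, not a value from the article.
    """
    rng = random.Random(seed)
    positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    if not positions:
        # Mask at least one token so that the loss is defined.
        positions = [rng.randrange(len(tokens))]
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions

def mlm_loss(tokens, positions, predicted_probs):
    """Sum of -log P(w_i | unmasked context) over the masked positions.

    predicted_probs[i] maps candidate tokens to probabilities at position i.
    """
    return -sum(math.log(predicted_probs[i][tokens[i]]) for i in positions)

# Toy example: two masked positions with hand-written model outputs.
tokens = ["bottom", "of", "the", "ninth"]
toy_positions = [1, 3]
toy_probs = {
    1: {"of": 0.5, "in": 0.5},
    3: {"ninth": 0.25, "game": 0.75},
}
loss = mlm_loss(tokens, toy_positions, toy_probs)
print(f"MLM loss: {loss:.4f}")  # -(log 0.5 + log 0.25) = log 8
```

In a real model the per-position distributions come from a softmax over the full subword vocabulary, and the expectation over $x$ is approximated by averaging over mini-batches.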

2.3 Social Engagement Objective: TwHIN Contrastive Loss

TwHIN is constructed as a bipartite graph $G = (U, T, E, \phi)$, where $U$ denotes users, $T$ tweets, and $E \subset U \times T$ observed engagements, with edge types $\phi(e) \in \{\text{Favorite}, \text{Retweet}, \dots\}$. User embeddings $u_j$, tweet embeddings $t_k$, and relation embeddings $r_{\phi}$ are learned via a translation-based link prediction objective:

$$f(u_j, \phi, t_k) = (u_j + r_{\phi})^{\top} t_k$$

$$L_{\text{TwHIN}} = -\sum_{e = (u_j, t_k) \in E} \left[ \log \sigma\big(f(u_j, \phi(e), t_k)\big) + \sum_{t' \in N(e)} \log \sigma\big(-f(u_j, \phi(e), t')\big) \right]$$

where $\sigma$ is the sigmoid, and $N(e)$ are negative samples.
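A minimal sketch of a translation-based link prediction loss of the kind described above. It assumes the score is the dot product of the translated user embedding (user plus relation) with the tweet embedding, and that negatives are other tweets paired with the same user and relation. The function names and toy vectors are illustrative and are not taken from the TwHIN code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(user, relation, tweet):
    """Translation-based score: (user + relation) dotted with tweet."""
    return sum((u + r) * t for u, r, t in zip(user, relation, tweet))

def link_prediction_loss(user, relation, pos_tweet, neg_tweets):
    """Negative log-likelihood for one observed edge and its sampled negatives.

    The observed edge should score high; each negative should score low.
    """
    loss = -math.log(sigmoid(score(user, relation, pos_tweet)))
    for neg in neg_tweets:
        loss -= math.log(sigmoid(-score(user, relation, neg)))
    return loss

# Toy 2-dimensional embeddings (hypothetical values).
user = [1.0, 0.0]
favorite = [0.0, 1.0]        # relation embedding for one edge type
engaged_tweet = [1.0, 1.0]   # scores +2 against user + relation
random_tweet = [-1.0, -1.0]  # scores -2, so it is an easy negative

loss = link_prediction_loss(user, favorite, engaged_tweet, [random_tweet])
print(f"link prediction loss: {loss:.4f}")
```

Because the relation vector shifts the user embedding before the dot product, the same user can sit close to different tweets under different engagement types.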

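More than one billion tweet embeddings are indexed using FAISS (IVF+PQ). For each tweet, the top-k cosine nearest neighbors in the TwHIN embedding space are mined as socially similar "positive pairs." The contrastive loss (NT-Xent) operates on LM-projected embeddings ($z_i$):

$$L_{\text{social}} = -\frac{1}{2B} \sum_{i=1}^{2B} \log \frac{\exp\big(\text{sim}(z_i, z_{p(i)})/\tau\big)}{\exp\big(\text{sim}(z_i, z_{p(i)})/\tau\big) + \sum_{k \in N_B(i)} \exp\big(\text{sim}(z_i, z_k)/\tau\big)}$$

where $z_{p(i)}$ is the positive partner of $z_i$, $\text{sim}(\cdot, \cdot)$ is cosine similarity scaled by the temperature $\tau$, and $N_B(i)$ denotes the $2(B-1)$ negatives in a batch of $B$ pairs.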
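The in-batch contrastive loss can be sketched in plain Python. This follows the usual NT-Xent convention in which every other embedding in the batch, except the anchor's own partner, acts as a negative and the positive term also appears in the denominator. That convention and the temperature default are assumptions here, and `nt_xent` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nt_xent(pairs, temperature=0.1):
    """NT-Xent loss over a batch of B positive pairs.

    pairs: list of (z_i, z_j) projected embeddings of socially similar tweets.
    Each of the 2B embeddings is used once as the anchor; its partner is the
    positive and the remaining 2(B - 1) embeddings are the negatives.
    """
    # Flatten so that indices 2k and 2k + 1 are partners.
    z = [vec for pair in pairs for vec in pair]
    n = len(z)
    total = 0.0
    for i in range(n):
        partner = i ^ 1  # flips the lowest bit: 0<->1, 2<->3, ...
        pos = math.exp(cosine(z[i], z[partner]) / temperature)
        neg = sum(
            math.exp(cosine(z[i], z[k]) / temperature)
            for k in range(n)
            if k != i and k != partner
        )
        total += -math.log(pos / (pos + neg))
    return total / n

# Toy batch with B = 2: well-aligned pairs vs. mismatched pairs.
aligned = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
mismatched = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
print(f"aligned:    {nt_xent(aligned, temperature=1.0):.4f}")
print(f"mismatched: {nt_xent(mismatched, temperature=1.0):.4f}")
```

The loss is lower when partners are close and other tweets in the batch are far, which is the behaviour the social objective is meant to encourage in the language-model embedding space.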

2.4 Joint Optimization

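TwHIN-BERT is optimized via a joint loss $L = L_{\text{text}} + \lambda L_{\text{social}}$, where $L_{\text{social}}$ is the TwHIN contrastive loss and the same weight $\lambda$ is used for both the base and large models. Training proceeds in two stages: initial MLM-only pre-training on 6B tweets (500K steps), followed by joint MLM + social training on 1B tweets with engagement logs (500K steps).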
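The two-stage schedule can be written as a small loss selector. This sketch assumes the joint loss is a weighted sum of the MLM and social terms and that the switch happens exactly at step 500,000. The weight `lam` is a hypothetical parameter whose value is not given here.

```python
STAGE_ONE_STEPS = 500_000  # MLM-only stage length, from the schedule above

def training_loss(step, l_text, l_social, lam):
    """Return the loss to optimize at a given global step.

    Stage 1 (steps 0 .. 499,999): MLM loss only.
    Stage 2 (steps 500,000 onward): MLM loss plus weighted social loss.
    """
    if step < STAGE_ONE_STEPS:
        return l_text
    return l_text + lam * l_social

# Toy loss values; lam=0.5 is an arbitrary illustrative weight.
print(training_loss(0, l_text=2.0, l_social=1.0, lam=0.5))
print(training_loss(500_000, l_text=2.0, l_social=1.0, lam=0.5))
```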

3. Training Data and Multilingual Design

TwHIN-BERT is pre-trained on a corpus of 7 billion tweets (collected January 2020–June 2022), spanning over 100 languages as determined by fastText language ID. Frequency-based resampling is employed to up-weight low-resource languages. Of these tweets, 1B have full engagement logs; the corresponding TwHIN graph covers approximately 200 million users and 100 billion engagement edges.
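One common way to up-weight low-resource languages is exponentiated frequency sampling, as used for several multilingual PLMs. The sketch below assumes that scheme. The exponent `alpha` and the toy counts are illustrative, and the exponent actually used for TwHIN-BERT is not stated here.

```python
def resample_probs(counts, alpha=0.5):
    """Turn raw per-language counts into sampling probabilities.

    Each language's share p is raised to the power alpha and renormalized.
    alpha = 1 keeps the raw distribution; smaller alpha flattens it, which
    raises the sampling rate of rare languages.
    """
    total = sum(counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Toy corpus: 90% English, 10% Swahili.
counts = {"en": 900, "sw": 100}
probs = resample_probs(counts, alpha=0.5)
for lang, p in probs.items():
    print(f"{lang}: raw {counts[lang] / 1000:.2f} -> resampled {p:.2f}")
```

With `alpha=0.5`, the low-resource share in this toy corpus rises from 0.10 to 0.25 while the high-resource share falls from 0.90 to 0.75.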

Preprocessing replicates XLM-R conventions: Unicode normalization, use of URL/mention placeholders, and language filtering via lid.176.bin. Multilingual subword tokenization supports code-mixing, a common phenomenon in tweets.

4. Evaluation Tasks and Empirical Results

4.1 Downstream Benchmarks

TwHIN-BERT is benchmarked on three task types, each in 50 languages except where noted.

  • Social engagement prediction: For a user embedding $u$ and LM-pooled tweet embedding $z_t$, a learned link predictor classifies whether user $u$ would engage with tweet $t$. HITS@10 is used (the positive is ranked among 1,000 candidates).
  • Hashtag prediction: Predicts which of the top-500 hashtags occurs in a tweet. Macro-F1 is the metric.
  • Standard Tweet classification: Includes SemEval2017 (English sentiment, avg recall), SemEval2018 (English/Spanish emoji, macro-F1), ASAD (Arabic sentiment, avg recall), COVID-JA (Japanese topic, accuracy), and SemEval2020 (Hindi/English and Spanish/English sentiment, accuracy).
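The two metrics used for the social tasks can be sketched as follows. The sketch assumes single-label hashtag prediction and a pre-ranked candidate list per query. `hits_at_k`, `macro_f1`, and the toy data are illustrative names and values.

```python
def hits_at_k(queries, k=10):
    """Fraction of queries whose positive item appears in the top k.

    queries: list of (positive_id, ranked_candidate_ids), best first.
    """
    hits = sum(1 for positive, ranked in queries if positive in ranked[:k])
    return hits / len(queries)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (single-label setting)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy engagement ranking: the second positive falls outside the top 2.
queries = [
    ("t1", ["t1", "t2", "t3"]),
    ("t9", ["t2", "t3", "t9"]),
]
print(f"HITS@2: {hits_at_k(queries, k=2):.2f}")

# Toy hashtag prediction over three classes.
y_true = ["#a", "#a", "#b", "#c"]
y_pred = ["#a", "#b", "#b", "#c"]
print(f"macro-F1: {macro_f1(y_true, y_pred):.4f}")
```

Macro averaging weights every hashtag class equally, so rare hashtags influence the score as much as frequent ones.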

4.2 Baseline Models

Evaluation compares TwHIN-BERT to:

  • BERTweet: English-only, Twitter-corpus RoBERTa.
  • mBERT: Multilingual BERT trained on Wikipedia.
  • XLM-R: Multilingual RoBERTa on CommonCrawl.
  • XLM-T: XLM-R further pre-trained on 200M tweets.
  • TwHIN-BERT-MLM: MLM-only ablation of TwHIN-BERT trained on 7B tweets.

4.3 Empirical Results

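| Task | Model | Metric | Score |
|---|---|---|---|
| Engagement Prediction | mBERT | HITS@10 | 0.0732 |
| Engagement Prediction | XLM-R | HITS@10 | 0.0849 |
| Engagement Prediction | XLM-T | HITS@10 | 0.1043 |
| Engagement Prediction | TwHIN-BERT-MLM | HITS@10 | 0.1161 |
| Engagement Prediction | TwHIN-BERT | HITS@10 | 0.1436 |
| Engagement Prediction | TwHIN-BERT-large | HITS@10 | 0.1497 |
| Hashtag Prediction | mBERT | macro-F1 | 50.05 |
| Hashtag Prediction | XLM-R | macro-F1 | 50.86 |
| Hashtag Prediction | XLM-T | macro-F1 | 51.74 |
| Hashtag Prediction | TwHIN-BERT-MLM | macro-F1 | 53.66 |
| Hashtag Prediction | TwHIN-BERT | macro-F1 | 54.62 |
| Hashtag Prediction | TwHIN-BERT-large | macro-F1 | 55.23 |
| Tweet Classification | mBERT | avg metric | 53.51 |
| Tweet Classification | XLM-R | avg metric | 57.49 |
| Tweet Classification | XLM-T | avg metric | 58.52 |
| Tweet Classification | TwHIN-BERT-MLM | avg metric | 59.00 |
| Tweet Classification | TwHIN-BERT | avg metric | 59.38 |
| Tweet Classification | TwHIN-BERT-large | avg metric | 60.06 |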

Relative to XLM-T, the base TwHIN-BERT model improves by 37.6% on social engagement prediction, 5.6% on hashtag prediction, and 1.5% on tweet classification, measured as a macro average (Zhang et al., 2022).
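The relative improvements can be checked directly from the rounded scores reported in this section. The helper name `relative_improvement` is illustrative. Computed from the rounded table values, the engagement figure comes out at about 37.7%; the small gap from the reported 37.6% is presumably due to rounding of the published scores.

```python
def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline

# (TwHIN-BERT base, XLM-T) score pairs from the results above.
scores = {
    "engagement prediction (HITS@10)": (0.1436, 0.1043),
    "hashtag prediction (macro-F1)": (54.62, 51.74),
    "tweet classification (avg metric)": (59.38, 58.52),
}

for task, (twhin_bert, xlm_t) in scores.items():
    gain = relative_improvement(twhin_bert, xlm_t)
    print(f"{task}: {gain:.1f}% over XLM-T")
```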

4.4 Ablation

Removal of the social loss (TwHIN-BERT-MLM) causes a marked drop in social task metrics (e.g., HITS@10 drops from 0.1436 to 0.1161), with smaller decreases on semantic classification. This supports the claim that social pre-training primarily strengthens social recommendation while maintaining or modestly improving semantic understanding.

5. Open-Source Resources

TwHIN-BERT and associated resources are provided for research and downstream deployment. These resources support reproduction, fine-tuning, and extension of socially-aware tweet representations.

6. Significance and Context within Social NLP

TwHIN-BERT establishes a new standard for integrating social-graph structure with PLM pre-training for microposts, directly addressing the disconnect between standard PLMs and noisy, short-form, social user-generated content. While prior techniques such as continual tweet pre-training (XLM-T) improved over static PLMs, the explicit incorporation of heterogeneous social engagement objectives distinguishes TwHIN-BERT. The approach leverages implicit, large-scale "ground truth" from real user behavior, harnessing social homophily and engagement-induced semantic affinity, yielding improved representations along both social and semantic axes. This enables more effective downstream systems for social recommendation and understanding in a multilingual, real-world setting (Zhang et al., 2022).
