TwHIN-BERT: Socially-Enriched Multilingual PLM
- TwHIN-BERT is a multilingual pre-trained language model that incorporates Twitter's social signals to improve representations of short, noisy tweet texts.
- It combines masked language modeling with a novel TwHIN contrastive loss derived from heterogeneous social engagement, yielding substantial performance gains on social tasks.
- Trained on 7 billion tweets across 100+ languages, it demonstrates enhanced results in hashtag prediction, social engagement, and tweet classification benchmarks.
TwHIN-BERT is a production-grade, multilingual, socially-enriched pre-trained language model (PLM) developed at Twitter, specifically designed to encode short, noisy, user-generated text on the Twitter platform. It leverages both textual self-supervision and a novel social engagement signal derived from the Twitter Heterogeneous Information Network (TwHIN), yielding substantial improvements over prior PLMs on multilingual social recommendation and semantic understanding tasks. TwHIN-BERT is trained on 7 billion tweets spanning over 100 languages and is openly released alongside new multilingual benchmarks for hashtag prediction and social engagement (Zhang et al., 2022).
1. Motivation: Modeling Tweets with Social Signals
Standard PLMs, trained on relatively clean and well-structured corpora such as books, Wikipedia, or CommonCrawl, underperform on tweets due to the unique challenges of microblogging text. Tweets are typically very short (often under 20 tokens), highly noisy (misspellings, abbreviations, emojis, code-mixing, unconventional punctuation), and make heavy use of topical tokens (hashtags, @-mentions) that encapsulate much of the semantic content.
Social signals on Twitter—in the form of favorites, retweets, replies, follows, and quote-tweets—encode implicit "ground truth" about topical or semantic similarity: tweets co-engaged by the same users are likely related, even if textual cues are sparse (e.g., “bottom of the ninth, two outs, down by one!!”). TwHIN-BERT leverages these engagement patterns using the Twitter Heterogeneous Information Network (TwHIN): a bipartite, typed-edge graph linking users and tweets with engagement metadata. This socially-informed supervision is intended to align tweet representations according to real-world user interaction patterns, supplementing limited textual signal with rich relational context.
2. Architecture and Pre-training Objectives
2.1 Transformer Backbone and Tokenization
TwHIN-BERT adopts the same Transformer architecture as BERT and XLM-R. The base configuration comprises 12 layers, 768-dimensional hidden states, and 12 attention heads; the large configuration uses 24 layers, 1,024-dimensional hidden states, and 16 attention heads. Tokenization employs the 250,000-subword multilingual SentencePiece unigram model from XLM-R, enabling effective handling of code-mixing and non-standard linguistic forms within tweets. The maximum input sequence length is set to 128 tokens.
For socially-aware contrastive learning, a 2-layer MLP projection head produces embeddings from the pooled [CLS] outputs of the Transformer: [768 → 768] for base, [1024 → 512] for large.
2.2 Text-based Self-supervision: Masked Language Modeling
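TwHIN-BERT incorporates the standard masked language modeling (MLM) objective. Given a token sequence $x = (x_1, \dots, x_n)$ with a masked subset $M$, the MLM loss is
$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p\big(x_i \mid x_{\setminus M}\big)$$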
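The MLM objective sums the negative log-likelihood of the original tokens at the masked positions only. The following is a minimal sketch of that computation, not the released training code: the per-position probability dictionaries stand in for the Transformer's softmax output, and the names `mlm_loss`, `token_probs`, and `masked_positions` are hypothetical.

```python
import math

def mlm_loss(token_probs, target_ids, masked_positions):
    """Negative log-likelihood summed over masked positions only.

    token_probs[i] maps vocabulary id -> predicted probability at position i
    (an illustrative stand-in for the model's softmax output).
    """
    loss = 0.0
    for i in masked_positions:
        loss -= math.log(token_probs[i][target_ids[i]])
    return loss

# Toy sequence of 4 tokens; positions 1 and 3 are masked.
target_ids = [5, 7, 2, 9]
token_probs = [
    {5: 1.0},
    {7: 0.5, 8: 0.5},
    {2: 1.0},
    {9: 0.25, 1: 0.75},
]
loss = mlm_loss(token_probs, target_ids, masked_positions=[1, 3])
print(f"MLM loss: {loss:.4f}")
```

Unmasked positions contribute nothing to the loss, which is why only indices in the masked subset are visited.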
2.3 Social Engagement Objective: TwHIN Contrastive Loss
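TwHIN is constructed as a bipartite graph $G = (U, T, E)$, where $U$ denotes users, $T$ tweets, and $E$ observed engagements, with edge types $r \in R$. User embeddings $\theta_u$, tweet embeddings $\theta_t$, and relation embeddings $\theta_r$ are learned via a translation-based link prediction objective:
$$f(u, r, t) = (\theta_u + \theta_r)^\top \theta_t$$
$$\mathcal{L}_{\text{TwHIN}} = -\sum_{(u, r, t) \in E} \Big[ \log \sigma\big(f(u, r, t)\big) + \sum_{(u', r, t') \in N(u, r, t)} \log \sigma\big(-f(u', r, t')\big) \Big]$$
where $\sigma$ is the sigmoid, and $N(u, r, t)$ are negative samples.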
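A translation-based link prediction objective of this kind scores an edge by shifting the user embedding by the relation embedding and comparing the result with the tweet embedding, then pushes observed edges toward high scores and sampled negatives toward low scores. The sketch below illustrates the loss for a single edge under stated assumptions: the dot-product score, the function names, and the choice to corrupt only the tweet side when drawing negatives are illustrative simplifications, not details confirmed by the text above.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score(theta_u, theta_r, theta_t):
    """Translation-style score: (theta_u + theta_r) . theta_t"""
    shifted = [u + r for u, r in zip(theta_u, theta_r)]
    return dot(shifted, theta_t)

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) for either sign of x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def edge_loss(theta_u, theta_r, theta_t, negative_tweets):
    """Loss for one observed (user, relation, tweet) edge.

    negative_tweets: embeddings of tweets the user did not engage with
    (assumption: negatives are drawn by corrupting the tweet side only).
    """
    pos = log_sigmoid(score(theta_u, theta_r, theta_t))
    neg = sum(log_sigmoid(-score(theta_u, theta_r, n)) for n in negative_tweets)
    return -(pos + neg)

# Toy 2-dimensional embeddings.
user, relation = [1.0, 0.0], [0.0, 1.0]
engaged_tweet = [1.0, 1.0]
negatives = [[-1.0, -1.0]]
print(edge_loss(user, relation, engaged_tweet, negatives))
```

Swapping the engaged tweet and the negative in this toy example raises the loss, which is the behavior the objective is meant to penalize.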
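One billion+ tweet embeddings are indexed using FAISS (IVF+PQ). For each tweet, the top-k cosine nearest neighbors in the TwHIN embedding space are mined as socially similar "positive pairs." The contrastive loss (NT-Xent) operates on LM-projected embeddings ($z_i$):
$$\mathcal{L}_{\text{social}}(i, j) = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \in \{j\} \cup N_B(i)} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$$
where $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity, $\tau$ is the temperature, and $N_B(i)$ denotes the $2(B-1)$ negatives in a batch of $B$ pairs.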
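NT-Xent treats each mined positive pair as a classification problem: for an anchor embedding, the positive should have the highest temperature-scaled cosine similarity among all other embeddings in the batch. The following pure-Python sketch shows the standard NT-Xent computation over a batch of pairs. The temperature value `tau=0.1` and the function names are assumptions for illustration, and real training would run this on batched tensors rather than Python lists.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nt_xent(z, pairs, tau=0.1):
    """Mean NT-Xent loss over a batch.

    z:     list of 2B projected embeddings
    pairs: list of B (i, j) index pairs that are socially similar
    tau:   temperature (hypothetical value, not taken from the source)
    """
    partner = {}
    for i, j in pairs:
        partner[i], partner[j] = j, i
    total = 0.0
    for i, j in partner.items():
        # Denominator: the positive plus the 2(B-1) in-batch negatives.
        logits = [cosine(z[i], z[k]) / tau for k in range(len(z)) if k != i]
        pos = cosine(z[i], z[j]) / tau
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos
    return total / len(partner)

# Two pairs (B = 2): embeddings 0/1 point one way, 2/3 another.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = nt_xent(z, [(0, 1), (2, 3)])
misaligned = nt_xent(z, [(0, 2), (1, 3)])
print(aligned, misaligned)
```

When the declared positives really are the nearest embeddings, the loss is close to zero; pairing dissimilar embeddings yields a much larger value.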
2.4 Joint Optimization
TwHIN-BERT is optimized via a joint loss $\mathcal{L} = \mathcal{L}_{\text{MLM}} + \lambda \mathcal{L}_{\text{social}}$, with the same weight $\lambda$ in both base and large models. Training proceeds in two stages: initial MLM-only pre-training on 6B tweets (500K steps), followed by joint MLM + social objective training on 1B tweets with engagement logs (500K steps).
3. Training Data and Multilingual Design
TwHIN-BERT is pre-trained on a corpus of 7 billion tweets (collected January 2020–June 2022), spanning over 100 languages as determined by fastText language ID. Languages are resampled in proportion to a power of their frequency (an exponent below 1), which up-weights low-resource languages. Of the 7 billion tweets, 1B have full engagement logs, involving approximately 200 million users and 100 billion engagement edges in TwHIN.
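Frequency-based resampling of this kind is commonly implemented by raising each language's share to a power below 1 and renormalizing, which flattens the distribution. The sketch below illustrates the effect; the exponent `alpha=0.5` and the language counts are hypothetical values chosen for the example, not figures from the source.

```python
def resampling_probs(counts, alpha):
    """Sampling probability per language, proportional to count ** alpha.

    alpha = 1.0 reproduces the raw distribution; alpha < 1.0 shifts
    probability mass toward low-resource languages.
    """
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical tweet counts per language.
counts = {"en": 900, "ja": 90, "sw": 10}
raw = resampling_probs(counts, alpha=1.0)
smoothed = resampling_probs(counts, alpha=0.5)  # alpha is an assumed value

for lang in counts:
    print(f"{lang}: raw={raw[lang]:.3f} smoothed={smoothed[lang]:.3f}")
```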
Preprocessing replicates XLM-R conventions: Unicode normalization, use of URL/mention placeholders, and language filtering via lid.176.bin. Multilingual subword tokenization supports code-mixing, a common phenomenon in tweets.
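The normalization and placeholder steps can be sketched with the Python standard library. This is an illustration under assumptions, not the released pipeline: the NFKC normalization form, the regular expressions, and the placeholder strings `<URL>` and `<USER>` are all hypothetical choices, and the fastText language-filtering step is omitted.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text, url_token="<URL>", mention_token="<USER>"):
    """Normalize Unicode, then replace URLs and @-mentions with placeholders.

    The placeholder strings and the NFKC form are illustrative assumptions.
    """
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(url_token, text)
    text = MENTION_RE.sub(mention_token, text)
    # Collapse runs of whitespace left over from the substitutions.
    return " ".join(text.split())

print(preprocess("@jack  check https://t.co/abc  ＃wow"))
```

Replacing URLs and mentions with fixed tokens keeps the vocabulary from being spent on identifiers that carry little transferable meaning.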
4. Evaluation Tasks and Empirical Results
4.1 Downstream Benchmarks
TwHIN-BERT is benchmarked on three task types, each in 50 languages except where noted.
- Social engagement prediction: For a user embedding $\theta_u$ and LM-pooled tweet embedding $h_t$, a learned link predictor classifies whether user $u$ would engage with tweet $t$. The metric is HITS@10, with the positive tweet ranked among 1,000 candidates.
- Hashtag prediction: Predicts which of the top-500 hashtags occurs in a tweet. Macro-F1 is the metric.
- Standard Tweet classification: Includes SemEval2017 (English sentiment, avg recall), SemEval2018 (English/Spanish emoji, macro-F1), ASAD (Arabic sentiment, avg recall), COVID-JA (Japanese topic, accuracy), and SemEval2020 (Hindi/English and Spanish/English sentiment, accuracy).
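The two main metrics named above, HITS@10 and macro-F1, can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes the candidate list passed to `hits_at_k` holds the scores of the non-positive candidates and that ties are resolved in favor of the positive; the benchmark's exact candidate construction and tie handling are not specified here.

```python
def hits_at_k(positive_score, candidate_scores, k=10):
    """Return 1 if the positive ranks within the top k, else 0.

    candidate_scores: scores of the competing (negative) candidates.
    Ties are resolved in favor of the positive (an assumption).
    """
    rank = 1 + sum(1 for s in candidate_scores if s > positive_score)
    return 1 if rank <= k else 0

def mean_hits_at_k(examples, k=10):
    return sum(hits_at_k(p, c, k) for p, c in examples) / len(examples)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

examples = [
    (0.9, [0.1] * 5 + [0.95] * 3),  # positive ranks 4th: hit
    (0.2, [0.5] * 20),              # positive ranks 21st: miss
]
print(mean_hits_at_k(examples, k=10))
print(macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"]))
```

Macro-F1 weights every class equally, so rare hashtags count as much as frequent ones.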
4.2 Baseline Models
Evaluation compares TwHIN-BERT to:
- BERTweet: English-only, Twitter-corpus RoBERTa.
- mBERT: Multilingual BERT trained on Wikipedia.
- XLM-R: Multilingual RoBERTa on CommonCrawl.
- XLM-T: XLM-R further pre-trained on 200M tweets.
- TwHIN-BERT-MLM: MLM-only ablation of TwHIN-BERT trained on 7B tweets.
4.3 Empirical Results
| Task | Model | Metric | Score |
|---|---|---|---|
| Engagement Prediction | mBERT | HITS@10 | 0.0732 |
| Engagement Prediction | XLM-R | HITS@10 | 0.0849 |
| Engagement Prediction | XLM-T | HITS@10 | 0.1043 |
| Engagement Prediction | TwHIN-BERT-MLM | HITS@10 | 0.1161 |
| Engagement Prediction | TwHIN-BERT | HITS@10 | 0.1436 |
| Engagement Prediction | TwHIN-BERT-large | HITS@10 | 0.1497 |
| Hashtag Prediction | mBERT | macro-F1 | 50.05 |
| Hashtag Prediction | XLM-R | macro-F1 | 50.86 |
| Hashtag Prediction | XLM-T | macro-F1 | 51.74 |
| Hashtag Prediction | TwHIN-BERT-MLM | macro-F1 | 53.66 |
| Hashtag Prediction | TwHIN-BERT | macro-F1 | 54.62 |
| Hashtag Prediction | TwHIN-BERT-large | macro-F1 | 55.23 |
| Tweet Classification | mBERT | avg metric | 53.51 |
| Tweet Classification | XLM-R | avg metric | 57.49 |
| Tweet Classification | XLM-T | avg metric | 58.52 |
| Tweet Classification | TwHIN-BERT-MLM | avg metric | 59.00 |
| Tweet Classification | TwHIN-BERT | avg metric | 59.38 |
| Tweet Classification | TwHIN-BERT-large | avg metric | 60.06 |
Relative to XLM-T, the base TwHIN-BERT model improves by about 37.7% on social engagement prediction (HITS@10 of 0.1436 vs. 0.1043), 5.6% on hashtag prediction, and 1.5% on classification tasks, computed from the averaged scores in the table (Zhang et al., 2022).
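The relative improvements can be recomputed directly from the table as (new - baseline) / baseline. The short sketch below does this for the base TwHIN-BERT model against XLM-T; the scores are copied from the table, while the function and dictionary names are illustrative.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# (TwHIN-BERT base, XLM-T) scores taken from the results table.
scores = {
    "engagement (HITS@10)": (0.1436, 0.1043),
    "hashtag (macro-F1)": (54.62, 51.74),
    "classification (avg)": (59.38, 58.52),
}

gains = {task: relative_improvement(new, base) for task, (new, base) in scores.items()}
for task, gain in gains.items():
    print(f"{task}: {gain:.1f}%")
```

The engagement gain is large in relative terms partly because the absolute HITS@10 values are small, so it is worth reading alongside the absolute scores.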
4.4 Ablation
Removal of the social loss (TwHIN-BERT-MLM) causes a marked drop in social task metrics (e.g., HITS@10 drops from 0.1436 to 0.1161), with smaller decreases on semantic classification. This supports the claim that social pre-training primarily strengthens social recommendation while maintaining or modestly improving semantic understanding.
5. Open-Source Resources
TwHIN-BERT and associated resources are provided for research and downstream deployment:
- Model checkpoints and code: GitHub, HuggingFace (base), HuggingFace (large).
- Benchmark datasets: Multilingual hashtag prediction and social engagement prediction (50 languages each) via public API or archived splits.
Immediate reproduction, fine-tuning, and extension of socially-aware tweet representations are supported through these resources.
6. Significance and Context within Social NLP
TwHIN-BERT establishes a new standard for integrating social-graph structure with PLM pre-training for microposts, directly addressing the disconnect between standard PLMs and noisy, short-form, social user-generated content. While prior techniques such as continual tweet pre-training (XLM-T) improved over static PLMs, the explicit incorporation of heterogeneous social engagement objectives distinguishes TwHIN-BERT. The approach leverages implicit, large-scale "ground truth" from real user behavior, harnessing social homophily and engagement-induced semantic affinity, yielding improved representations along both social and semantic axes. This enables more effective downstream systems for social recommendation and understanding in a multilingual, real-world setting (Zhang et al., 2022).