TwHIN-BERT: Socially-Enriched Multilingual PLM

Updated 3 April 2026

TwHIN-BERT is a multilingual pre-trained language model that incorporates Twitter's social signals to improve representations of short, noisy tweet texts.
It combines masked language modeling with a novel TwHIN contrastive loss derived from heterogeneous social engagement, yielding substantial performance gains on social tasks.
Trained on 7 billion tweets across 100+ languages, it demonstrates enhanced results in hashtag prediction, social engagement, and tweet classification benchmarks.

TwHIN-BERT is a production-grade, multilingual, socially-enriched pre-trained language model (PLM) developed at Twitter, specifically designed to encode short, noisy, user-generated text on the Twitter platform. It leverages both textual self-supervision and a novel social engagement signal derived from the Twitter Heterogeneous Information Network (TwHIN), yielding substantial improvements over prior PLMs on multilingual social recommendation and semantic understanding tasks. TwHIN-BERT is trained on 7 billion tweets spanning over 100 languages and is openly released alongside new multilingual benchmarks for hashtag prediction and social engagement (Zhang et al., 2022).

Standard PLMs, trained on relatively clean and well-structured corpora such as books, Wikipedia, or CommonCrawl, underperform on tweets due to the unique challenges of microblogging text. Tweets are typically very short (often under 20 tokens), highly noisy (misspellings, abbreviations, emojis, code-mixing, unconventional punctuation), and make heavy use of topical tokens (hashtags, @-mentions) that encapsulate much of the semantic content.

Social signals on Twitter—in the form of favorites, retweets, replies, follows, and quote-tweets—encode implicit "ground truth" about topical or semantic similarity: tweets co-engaged by the same users are likely related, even if textual cues are sparse (e.g., “bottom of the ninth, two outs, down by one!!”). TwHIN-BERT leverages these engagement patterns using the Twitter Heterogeneous Information Network (TwHIN): a bipartite, typed-edge graph linking users and tweets with engagement metadata. This socially-informed supervision is intended to align tweet representations according to real-world user interaction patterns, supplementing limited textual signal with rich relational context.

2. Architecture and Pre-training Objectives

2.1 Transformer Backbone and Tokenization

TwHIN-BERT adopts the same Transformer architecture as BERT and XLM-R. The base configuration comprises 12 layers, 768-dimensional hidden states, and 12 attention heads; the large configuration uses 24 layers, 1,024-dimensional hidden states, and 16 attention heads. Tokenization employs the 250,000-subword multilingual SentencePiece unigram model from XLM-R, enabling effective handling of code-mixing and non-standard linguistic forms within tweets. The maximum input sequence length is set to 128 tokens.

For socially-aware contrastive learning, a 2-layer MLP projection head produces embeddings $z_t$ from the pooled [CLS] outputs of the Transformer: [768 → 768] for base, [1024 → 512] for large.
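
The two configurations can be written down as a small record for quick sanity checks. This is a minimal sketch: the `EncoderConfig` class and its field names are hypothetical and not part of any released code; only the numeric values come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hyperparameters quoted in the text; the class itself is illustrative."""
    num_layers: int
    hidden_size: int
    num_heads: int
    projection_out: int        # output width of the contrastive projection head
    vocab_size: int = 250_000  # XLM-R SentencePiece vocabulary size
    max_seq_len: int = 128

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden state.
        assert self.hidden_size % self.num_heads == 0
        return self.hidden_size // self.num_heads

    @property
    def embedding_params(self) -> int:
        # Token-embedding matrix only: one hidden_size vector per subword.
        return self.vocab_size * self.hidden_size


BASE = EncoderConfig(num_layers=12, hidden_size=768, num_heads=12, projection_out=768)
LARGE = EncoderConfig(num_layers=24, hidden_size=1024, num_heads=16, projection_out=512)

for name, cfg in (("base", BASE), ("large", LARGE)):
    print(name, cfg.head_dim, cfg.embedding_params)
```

Both configurations use 64-dimensional attention heads, and with a 250,000-entry vocabulary the token-embedding matrix alone accounts for a large share of the parameters.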

2.2 Text-based Self-supervision: Masked Language Modeling

TwHIN-BERT incorporates the standard masked language modeling (MLM) objective. Given a token sequence $x = (w_1, \dots, w_n)$ with a masked subset of positions $M$, the MLM loss is

$L_{\text{text}} = \mathbb{E}_x \left[ -\sum_{i \in M} \log P(w_i \mid x_{\setminus M}) \right],$

where $x_{\setminus M}$ denotes the sequence with the tokens at positions in $M$ masked out.
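
As a worked illustration of the MLM loss, the sketch below sums the negative log-probabilities of the true tokens at the masked positions and averages over a batch as a Monte Carlo estimate of the expectation. The function names and the toy probabilities are assumptions made for illustration; a real model would supply the probabilities from its softmax output.

```python
import math


def mlm_loss(masked_token_probs):
    """Negative log-likelihood summed over the masked positions of one sequence.

    masked_token_probs holds P(w_i | unmasked context) for each masked
    position i, i.e. the probability the model assigns to the true token.
    """
    return -sum(math.log(p) for p in masked_token_probs)


def expected_mlm_loss(batch):
    # Batch mean as a Monte Carlo estimate of the expectation over sequences.
    return sum(mlm_loss(probs) for probs in batch) / len(batch)


# Toy batch: two sequences, with two and one masked positions respectively.
batch = [[0.5, 0.25], [0.125]]
loss = expected_mlm_loss(batch)
print(round(loss, 4))
```

Each toy sequence contributes 3 ln 2 here, so the batch estimate is also 3 ln 2.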

2.3 Social Objective: TwHIN Embeddings and Contrastive Loss

TwHIN is constructed as a bipartite graph $G=(U, T, E, \phi)$, where $U$ denotes users, $T$ tweets, $E\subset U\times T$ observed engagements, and $\phi(e)\in\{\text{Favorite, Retweet, ...}\}$ the edge type. User embeddings $u_j$, tweet embeddings $t_i$, and relation embeddings $r_{\phi}$ are learned via a translation-based link prediction objective:

$f(u_j, \phi, t_i) = (u_j + r_{\phi})^\top t_i$

$L_{\text{TwHIN}} = -\sum_{e \in E} \left[ \log \sigma(f(e)) + \sum_{e' \in N(e)} \log \sigma(-f(e')) \right]$

where $\sigma$ is the sigmoid, and $N(e)$ are negative samples.
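
A translation-based link prediction objective of this kind can be sketched in a few lines. The sketch assumes the score is the dot product of the relation-shifted user vector with the tweet vector, and that negatives are alternative tweets for the same user and relation; both choices, and all function names, are illustrative assumptions rather than the production implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(u, r, t):
    # Translation-style score: shift the user by the relation, compare to the tweet.
    return dot([ui + ri for ui, ri in zip(u, r)], t)


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def edge_loss(u, r, t_pos, t_negs):
    """Negative log-likelihood for one observed edge and its sampled negatives."""
    loss = -log_sigmoid(score(u, r, t_pos))
    for t_neg in t_negs:
        # Negatives are pushed toward low scores.
        loss -= log_sigmoid(-score(u, r, t_neg))
    return loss


u, r = [1.0, 0.0], [0.0, 1.0]
good = edge_loss(u, r, t_pos=[1.0, 1.0], t_negs=[[-1.0, -1.0]])
bad = edge_loss(u, r, t_pos=[-1.0, -1.0], t_negs=[[1.0, 1.0]])
print(good < bad)
```

The loss is small when the engaged tweet lies near the translated user vector and the sampled negatives lie far from it.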

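More than one billion tweet embeddings are indexed using FAISS (IVF+PQ). For each tweet, the top-k cosine nearest neighbors in the TwHIN embedding space are mined as socially similar "positive pairs." The contrastive loss (NT-Xent) operates on LM-projected embeddings ($z_i$):

$L_{\text{social}} = -\frac{1}{|B|} \sum_{(i,j) \in B} \log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\exp(\mathrm{sim}(z_i, z_j)/\tau) + \sum_{k \in N_B(i)} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$

where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is the temperature, and $N_B(i)$ denotes the $2(n-1)$ negatives in a batch $B$ of $n$ pairs.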
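
The two steps, mining neighbors and scoring pairs contrastively, can be sketched in plain Python. Several assumptions apply: the exhaustive search stands in for an approximate FAISS index, the loss uses the standard NT-Xent convention of including the positive in the denominator, each pair contributes one anchor, and the temperature default of 0.1 is a placeholder rather than a reported setting.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def top_k_neighbors(embeddings, query_idx, k):
    # Exhaustive cosine search; an ANN index replaces this at billion scale.
    q = embeddings[query_idx]
    scored = [(cosine(q, e), j) for j, e in enumerate(embeddings) if j != query_idx]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


def nt_xent(pairs, tau=0.1):
    """In-batch contrastive loss over n positive pairs (z_i, z_j).

    For each pair, z_i is the anchor, z_j the positive, and the 2(n-1)
    embeddings from the other pairs serve as negatives.
    """
    total = 0.0
    for p, (z_i, z_j) in enumerate(pairs):
        pos_logit = cosine(z_i, z_j) / tau
        neg_logits = [cosine(z_i, z) / tau
                      for q, other in enumerate(pairs) if q != p
                      for z in other]
        logits = [pos_logit] + neg_logits
        m = max(logits)  # log-sum-exp shift for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / len(pairs)


social = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_neighbors(social, query_idx=0, k=2))

pairs = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
print(nt_xent(pairs, tau=1.0))
```

In the toy batch each anchor matches its positive exactly and is orthogonal to both negatives, so the loss reduces to log(1 + 2/e).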

2.4 Joint Optimization

TwHIN-BERT is optimized via a joint loss $L = L_{\text{text}} + \lambda L_{\text{social}}$, with the same weight $\lambda$ used for both the base and large models. Training proceeds in two stages: initial MLM-only pre-training on 6B tweets (500K steps), followed by joint MLM + social objective on 1B tweets with engagement logs (500K steps).
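
The two-stage schedule amounts to switching the social term on after the first stage. The sketch below assumes the joint loss is the MLM loss plus a weighted social loss; the function names and the weight passed in the example are hypothetical and do not reflect the value used in training.

```python
def joint_loss(l_text, l_social, lam):
    # Weighted sum of the text (MLM) and social (contrastive) terms.
    return l_text + lam * l_social


def loss_for_step(step, l_text, l_social, lam, stage1_steps=500_000):
    # Stage 1: MLM only. Stage 2: MLM plus the weighted social loss.
    if step < stage1_steps:
        return l_text
    return joint_loss(l_text, l_social, lam)


# lam=0.5 is a made-up weight chosen only to make the arithmetic visible.
print(loss_for_step(0, l_text=2.0, l_social=4.0, lam=0.5))
print(loss_for_step(500_000, l_text=2.0, l_social=4.0, lam=0.5))
```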

3. Training Data and Multilingual Design

TwHIN-BERT is pre-trained on a corpus of 7 billion tweets (collected January 2020–June 2022), spanning over 100 languages as determined by fastText language ID. Resampling by language frequency is employed to up-weight low-resource languages. Of these tweets, 1B have full engagement logs, covering approximately 200 million users and 100 billion engagement edges in TwHIN.
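
One common way to up-weight low-resource languages is to sample each language in proportion to its corpus frequency raised to a power below one. The sketch below illustrates that scheme; the exponent, the function name, and the language counts are all assumptions for illustration, not the settings used for TwHIN-BERT.

```python
def resampling_probs(counts, alpha):
    """Sampling probability per language, proportional to frequency ** alpha.

    alpha = 1 reproduces the raw corpus distribution; alpha < 1 flattens it,
    shifting probability mass toward rare languages.
    """
    total = sum(counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}


counts = {"en": 900, "ja": 90, "sw": 10}  # made-up tweet counts
probs = resampling_probs(counts, alpha=0.5)
for lang, p in probs.items():
    print(lang, round(p, 3))
```

With these toy counts, the rarest language rises from 1% of the corpus to roughly 7% of the samples, while the dominant language drops from 90% to about 70%.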

Preprocessing replicates XLM-R conventions: Unicode normalization, use of URL/mention placeholders, and language filtering via lid.176.bin. Multilingual subword tokenization supports code-mixing, a common phenomenon in tweets.
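
The normalization and placeholder steps can be sketched with the standard library. The NFKC normalization form, the regular expressions, and the placeholder strings `<URL>` and `<USER>` are assumptions chosen for illustration; the text does not specify them, and the language-ID filter is omitted because it requires the fastText model file.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def preprocess(text, url_token="<URL>", mention_token="<USER>"):
    # Fold compatibility characters (e.g. full-width letters) to canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace URLs first so that any '@' inside a URL is not treated as a mention.
    text = URL_RE.sub(url_token, text)
    text = MENTION_RE.sub(mention_token, text)
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


print(preprocess("hey @bob  check https://t.co/abc  \uff2f\uff2b"))
```

Replacing URLs and handles with fixed tokens keeps the vocabulary from being spent on strings that carry little reusable meaning.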

4. Evaluation Tasks and Empirical Results

4.1 Downstream Benchmarks

TwHIN-BERT is benchmarked on three task types, each in 50 languages except where noted.

Social engagement prediction: Given a user embedding $u_j$ and an LM-pooled tweet embedding $h_i$, a learned link predictor classifies whether user $j$ would engage with tweet $i$. The metric is HITS@10, with the positive ranked among 1,000 candidates.
Hashtag prediction: Predicts which of the top-500 hashtags occurs in a tweet. Macro-F1 is the metric.
Standard Tweet classification: Includes SemEval2017 (English sentiment, avg recall), SemEval2018 (English/Spanish emoji, macro-F1), ASAD (Arabic sentiment, avg recall), COVID-JA (Japanese topic, accuracy), and SemEval2020 (Hindi/English and Spanish/English sentiment, accuracy).
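
The two headline metrics can be sketched as follows. The sketch treats hashtag prediction as single-label classification and breaks score ties in favor of the positive; both are simplifying assumptions, and the function names are illustrative.

```python
def hits_at_k(positive_score, candidate_scores, k=10):
    """Return 1 if the positive ranks within the top k among the candidates."""
    # Rank is 1 plus the number of candidates scored strictly higher.
    rank = 1 + sum(1 for s in candidate_scores if s > positive_score)
    return 1 if rank <= k else 0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Nine candidates outrank the positive, so it sits at rank 10 and counts as a hit.
print(hits_at_k(0.9, [0.95] * 9 + [0.1] * 990))
print(macro_f1(["a", "a", "b"], ["a", "b", "b"]))
```

Averaging `hits_at_k` over all test edges gives the reported HITS@10; macro-F1 weights rare and frequent hashtags equally.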

4.2 Baseline Models

Evaluation compares TwHIN-BERT to:

BERTweet: English-only, Twitter-corpus RoBERTa.
mBERT: Multilingual BERT trained on Wikipedia.
XLM-R: Multilingual RoBERTa on CommonCrawl.
XLM-T: XLM-R further pre-trained on 200M tweets.
TwHIN-BERT-MLM: MLM-only ablation of TwHIN-BERT trained on 7B tweets.

4.3 Empirical Results

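Task	Model	Metric	Score
Engagement Prediction	mBERT	HITS@10	0.0732
Engagement Prediction	XLM-R	HITS@10	0.0849
Engagement Prediction	XLM-T	HITS@10	0.1043
Engagement Prediction	TwHIN-BERT-MLM	HITS@10	0.1161
Engagement Prediction	TwHIN-BERT	HITS@10	0.1436
Engagement Prediction	TwHIN-BERT-large	HITS@10	0.1497
Hashtag Prediction	mBERT	macro-F1	50.05
Hashtag Prediction	XLM-R	macro-F1	50.86
Hashtag Prediction	XLM-T	macro-F1	51.74
Hashtag Prediction	TwHIN-BERT-MLM	macro-F1	53.66
Hashtag Prediction	TwHIN-BERT	macro-F1	54.62
Hashtag Prediction	TwHIN-BERT-large	macro-F1	55.23
Tweet Classification	mBERT	avg metric	53.51
Tweet Classification	XLM-R	avg metric	57.49
Tweet Classification	XLM-T	avg metric	58.52
Tweet Classification	TwHIN-BERT-MLM	avg metric	59.00
Tweet Classification	TwHIN-BERT	avg metric	59.38
Tweet Classification	TwHIN-BERT-large	avg metric	60.06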

Relative to XLM-T, the base TwHIN-BERT model improves by about 37.7% on social engagement prediction (HITS@10 of 0.1436 versus 0.1043), 5.6% on hashtag prediction, and 1.5% on tweet classification, computed from the macro-averaged scores in the table above (Zhang et al., 2022).
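
The relative gains over XLM-T can be recomputed directly from the table scores for the base model. The helper name is illustrative; the engagement figure works out to 37.68%, which rounds to 37.7% and truncates to 37.6%.

```python
def relative_improvement(new, baseline):
    # Percentage gain of `new` over `baseline`.
    return 100.0 * (new - baseline) / baseline


# (TwHIN-BERT base, XLM-T) score pairs taken from the results table.
scores = {
    "engagement": (0.1436, 0.1043),
    "hashtag": (54.62, 51.74),
    "classification": (59.38, 58.52),
}

for task, (twhin, xlmt) in scores.items():
    print(f"{task}: {relative_improvement(twhin, xlmt):.1f}%")
```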

4.4 Ablation

Removal of the social loss (TwHIN-BERT-MLM) causes a marked drop in social task metrics (e.g., HITS@10 drops from 0.1436 to 0.1161), with smaller decreases on semantic classification. This supports the claim that social pre-training primarily strengthens social recommendation while maintaining or modestly improving semantic understanding.

5. Open-Source Resources

TwHIN-BERT and associated resources are provided for research and downstream deployment:

Model checkpoints and code: GitHub, HuggingFace (base), HuggingFace (large).
Benchmark datasets: Multilingual hashtag prediction and social engagement prediction (50 languages each) via public API or archived splits.

Immediate reproduction, fine-tuning, and extension of socially-aware tweet representations are supported through these resources.

TwHIN-BERT establishes a new standard for integrating social-graph structure with PLM pre-training for microposts, directly addressing the disconnect between standard PLMs and noisy, short-form, social user-generated content. While prior techniques such as continual tweet pre-training (XLM-T) improved over static PLMs, the explicit incorporation of heterogeneous social engagement objectives distinguishes TwHIN-BERT. The approach leverages implicit, large-scale "ground truth" from real user behavior, harnessing social homophily and engagement-induced semantic affinity, yielding improved representations along both social and semantic axes. This enables more effective downstream systems for social recommendation and understanding in a multilingual, real-world setting (Zhang et al., 2022).


References

Zhang et al. (2022). TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter.
