Taxonomy of T-PTLMs

Updated 20 May 2026

T-PTLMs are transformer-based pretrained language models organized by pretraining corpus, architecture, SSL objectives, and model extensions.
This taxonomy clearly defines categories such as general-domain, social-media, language-based, and domain-specific models with distinct operational contexts.
It details methodological aspects including generative, contrastive, adversarial, and hybrid SSL objectives alongside unique embedding and adaptation strategies.

Transformer-based Pretrained Language Models (T-PTLMs) form the backbone of modern natural language processing, leveraging the transformer architecture and large-scale self-supervised pretraining to yield universal language representations. A comprehensive taxonomy of T-PTLMs organizes this landscape along four principal axes: pretraining corpus, architecture, self-supervised learning (SSL) objective, and model extensions and variants. Distinct subfamilies are further determined by embedding schemes, position encodings, and downstream adaptation strategies. This layered taxonomy enables precise differentiation among T-PTLM classes, supports systematic evaluation, and facilitates principled extension to new domains and tasks (Kalyan et al., 2021).

1. Taxonomy Overview and Primary Dimensions

T-PTLMs are systematically categorized along four interlocking dimensions:

Pretraining-Corpus Based: Specifies source and domain characteristics of the corpora (general-domain, social-media, language-based—monolingual/multilingual, domain-specific).
Architecture: Distinguishes Transformer variants as encoder-only, decoder-only, or encoder–decoder (Seq2Seq) models.
Self-Supervised Learning Type: Divides models by main SSL objectives (generative, contrastive, adversarial, hybrid).
Extensions and Variants: Details compactness, character-based input, adaptation efficiency, sentence representation, tokenization strategies, model scale, knowledge augmentation, long-sequence handling, and architectural efficiency (Kalyan et al., 2021).

These axes, with associated representative models, objectives, and adaptation methods, define the major branches and subcategories of T-PTLMs.
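
As a rough illustration, the four axes can be treated as fields of a record, so that any model is located by one value per axis plus optional extension tags. The sketch below is only one possible encoding: the class names, the `CATALOG` list, and the `select` helper are hypothetical, and the placements simply mirror the examples given in this article.

```python
from dataclasses import dataclass
from enum import Enum


class Corpus(Enum):
    GENERAL = "general-domain"
    SOCIAL_MEDIA = "social-media"
    LANGUAGE_BASED = "language-based"
    DOMAIN_SPECIFIC = "domain-specific"


class Architecture(Enum):
    ENCODER = "encoder-only"
    DECODER = "decoder-only"
    ENCODER_DECODER = "encoder-decoder"


class SSL(Enum):
    GENERATIVE = "generative"
    CONTRASTIVE = "contrastive"
    ADVERSARIAL = "adversarial"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class ModelEntry:
    name: str
    corpus: Corpus
    architecture: Architecture
    ssl: SSL
    extensions: tuple = ()  # free-form tags such as "large-scale"


# Hypothetical mini-catalog; placements follow the article's own examples.
CATALOG = [
    ModelEntry("BERT", Corpus.GENERAL, Architecture.ENCODER, SSL.HYBRID),
    ModelEntry("ELECTRA", Corpus.GENERAL, Architecture.ENCODER, SSL.ADVERSARIAL),
    ModelEntry("GPT-3", Corpus.GENERAL, Architecture.DECODER, SSL.GENERATIVE,
               ("large-scale",)),
    ModelEntry("T5", Corpus.GENERAL, Architecture.ENCODER_DECODER, SSL.GENERATIVE),
    ModelEntry("mT5", Corpus.LANGUAGE_BASED, Architecture.ENCODER_DECODER,
               SSL.GENERATIVE),
]


def select(catalog, **criteria):
    """Return names of entries whose fields equal every given criterion."""
    return [m.name for m in catalog
            if all(getattr(m, key) == value for key, value in criteria.items())]


encoders = select(CATALOG, architecture=Architecture.ENCODER)
seq2seq_generative = select(CATALOG, architecture=Architecture.ENCODER_DECODER,
                            ssl=SSL.GENERATIVE)
```

Keeping the axes as independent fields reflects the point that the dimensions interlock: the same model appears in one branch per axis rather than in a single tree position.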

2. Pretraining-Corpus-Based Categories

General-Domain T-PTLMs

General-domain models are pretrained on heterogeneous text (e.g., Wikipedia, BookCorpus, Common Crawl), offering broad linguistic coverage without specialization in domain vocabulary. Iconic examples include BERT (encoder, MLM+NSP), RoBERTa (encoder, dynamic MLM), XLNet (encoder, PLM), ELECTRA (encoder, RTD), GPT-2/3 (decoder, CLM), T5 (encoder–decoder, Seq2SeqLM), and BART (encoder–decoder, DAE).

Representative objectives include:

Causal LM: $\mathcal{L}_{\mathrm{CLM}}(x) = -\frac{1}{|x|}\sum_{i=1}^{|x|}\log P(x_i \mid x_{<i})$
Masked LM: $\mathcal{L}_{\mathrm{MLM}}(x) = -\frac{1}{|M|}\sum_{i\in M}\log P(x_i \mid x_{\setminus M})$
Next Sentence Prediction: $\mathcal{L}_{\mathrm{NSP}}(x, y) = -\log P(d \mid x, y),\ d \in \{0,1\}$
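
The three objectives above differ mainly in which positions contribute to the average. A minimal sketch, assuming the model's probabilities for the correct tokens are already available as plain Python floats (the function names and input layout are illustrative, not taken from any library):

```python
import math


def clm_loss(token_probs):
    """Causal LM: mean negative log-likelihood over every position.

    token_probs[i] is the model's P(x_i | x_<i) for the true token x_i.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def mlm_loss(token_probs, masked_positions):
    """Masked LM: mean negative log-likelihood over masked positions only.

    token_probs[i] is the model's P(x_i | unmasked context) for the true token.
    """
    total = sum(math.log(token_probs[i]) for i in masked_positions)
    return -total / len(masked_positions)


def nsp_loss(p_is_next, label):
    """NSP: binary cross-entropy for one sentence pair, label d in {0, 1}."""
    return -math.log(p_is_next if label == 1 else 1.0 - p_is_next)


# Unmasked positions (indices 0 and 2) do not affect the MLM loss.
example_mlm = mlm_loss([0.9, 0.25, 0.9], masked_positions=[1])
example_clm = clm_loss([0.5, 0.5, 0.5])
```

In this toy case the MLM loss depends only on the probability at index 1, whereas the CLM loss averages over all three positions.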

Social-Media

Models in this category address the linguistic noise and colloquialism in texts from platforms like Twitter and Reddit. Notable models include BERTweet (850M tweets), RoBERTa-Twitter (60M tweets), HateBERT (Reddit), and CT-BERT (COVID-19 tweets).

Language-Based

Monolingual: Trained on a single non-English language (e.g., AraBERT for Arabic, PhoBERT for Vietnamese, CamemBERT for French).
Multilingual: Joint training across numerous languages with shared subword vocabulary and a focus on cross-lingual transfer (e.g., mBERT, XLM-R, mT5, XLM).

Domain-Specific

Specialized for domains such as biomedical (BioBERT, PubMedBERT), finance (FinBERT), legal (LegalBERT), programming (CodeBERT), or scientific text (SciBERT). These models feature domain-centric vocabularies and often exploit continual pretraining (CPT) or from-scratch pretraining on targeted corpora.

3. Architectural Variants

Encoder-Only

Consists of stacks of bidirectional transformer encoders; primarily supports NLU tasks such as classification or extraction. Examples: BERT, RoBERTa, XLNet, ALBERT, ELECTRA, XLM-R.

Decoder-Only

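Employs stacks of masked self-attention transformer decoders for autoregressive NLG and language modeling. Examples include GPT-1, GPT-2, and GPT-3. XLNet's permutation language modeling objective is also autoregressive, although this taxonomy lists XLNet with the encoder-only models.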

Encoder–Decoder (Seq2Seq)

Combines encoder stacks feeding into decoder stacks, enabling generic text-to-text mapping required in translation and summarization tasks. Models in this group include MASS, BART, T5, mBART, mT5, PEGASUS.
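
One way to see the difference between the three architectures is through their attention masks. The sketch below builds the masks as nested lists of booleans, where `mask[i][j]` is True when position i may attend to position j; the function names are illustrative only.

```python
def bidirectional_mask(n):
    """Encoder-style self-attention: every position sees every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style self-attention: position i sees positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def seq2seq_masks(src_len, tgt_len):
    """Encoder-decoder: bidirectional encoder, causal decoder, full cross-attention."""
    return {
        "encoder_self": bidirectional_mask(src_len),
        "decoder_self": causal_mask(tgt_len),
        # Each target position may attend to the whole encoded source.
        "cross": [[True] * src_len for _ in range(tgt_len)],
    }


masks = seq2seq_masks(src_len=2, tgt_len=3)
```

The causal mask is what makes left-to-right generation possible, while the bidirectional mask gives each token access to both left and right context for understanding tasks.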

4. Self-Supervised Learning Types

Generative SSL

Encompasses Causal LM, Masked LM, Translation LM (TLM), Seq2SeqLM, and Denoising Autoencoder (DAE) losses. These objectives train the model to predict or reconstruct tokens and underlie most T-PTLM pretraining.

Contrastive SSL

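Sentence-level classification tasks such as Next Sentence Prediction (NSP) and Sentence Order Prediction (SOP) promote coherence and semantic discrimination. The SOP loss is $\mathcal{L}_{\mathrm{SOP}}(x,y) = -\log P(d\mid x,y)$, where $d \in \{0,1\}$ indicates whether the segments $x$ and $y$ appear in their original order.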
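
A minimal sketch of how SOP training pairs could be built from a list of consecutive text segments: positives keep the original order, negatives swap the same two segments. The function name and the 50/50 sampling rate are assumptions made for illustration.

```python
import random


def sop_examples(segments, rng):
    """Build (first, second, label) triples from consecutive segments.

    label 1 means the pair is in its original order, label 0 means swapped.
    """
    examples = []
    for a, b in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append((a, b, 1))   # original order
        else:
            examples.append((b, a, 0))   # same segments, swapped
    return examples


segments = ["s0", "s1", "s2", "s3"]
pairs = sop_examples(segments, random.Random(0))
```

Because negatives reuse the same two segments, the classifier cannot rely on topic differences and has to model ordering and coherence instead.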

Adversarial SSL

Token-level discriminative objectives: Replaced Token Detection (RTD), as in ELECTRA, together with Random Token Substitution (RTS) and Shuffled Token Detection (STD). For RTD the loss takes the form $\mathcal{L}_{\mathrm{RTD}}(\hat x) = -\frac{1}{|\hat x|}\sum_{i}\log P(d_i\mid \hat x_i)$, where $d_i$ indicates whether token $\hat x_i$ of the corrupted sequence $\hat x$ was replaced.
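
The RTD objective can be sketched in two steps: derive a binary label per position by comparing the corrupted sequence with the original, then average the binary cross-entropy over all positions. The helper names below are illustrative, and the discriminator's outputs are assumed to be given as probabilities that each token was replaced.

```python
import math


def rtd_labels(original, corrupted):
    """Label 1 where the corrupted token differs from the original, else 0."""
    return [int(o != c) for o, c in zip(original, corrupted)]


def rtd_loss(p_replaced, labels):
    """Mean binary cross-entropy over every position of the corrupted input."""
    total = 0.0
    for p, d in zip(p_replaced, labels):
        total -= math.log(p if d == 1 else 1.0 - p)
    return total / len(labels)


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = rtd_labels(original, corrupted)
loss = rtd_loss([0.5] * len(labels), labels)  # an uninformed discriminator
```

Unlike MLM, every position contributes to the loss, not only the corrupted ones.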

Hybrid SSL

Combines different SSL paradigms. BERT utilizes MLM (generative) plus NSP (contrastive). InfoXLM incorporates MLM, TLM, and a cross-lingual contrastive loss.

5. Extensions and Model Variants

A diverse array of functional extensions and structural adaptations further differentiates T-PTLMs:

Compact Models: Utilize distillation, pruning, quantization for small-footprint deployment (DistilBERT, TinyBERT, MobileBERT, MiniLM).
Character-Based Inputs: Ingest character sequences for improved OOV robustness (CharacterBERT, CharBERT, AlphaBERT).
Green (Lightweight) Models: Enable rapid domain adaptation with minimal pretraining; expand/align vocabulary (GreenBioBERT, exBERT, E-BERT).
Sentence Embedding Models: Produce fixed-size sentence vectors suitable for clustering and retrieval; typically trained with NLI/STS supervision or with unsupervised contrastive or denoising objectives (SBERT, SimCSE, TSDAE).
Tokenization-Free: Operate at the byte or character level, eliminating fixed subword vocabularies (CANINE, ByT5, Charformer).
Large-Scale Models: Incorporate hundreds of billions to trillions of parameters (GPT-3, GShard, Switch-Transformer), allowing for emergent in-context learning.
Knowledge-Enriched Models: Integrate structured KBs via specialized layers/objectives (KnowBERT, ERNIE, SenseBERT, SapBERT, UmlsBERT).
Long-Sequence Models: Employ sparse or linearized attention for thousands of tokens (Longformer, BigBird, ETC, Reformer, Performer).
Efficiency-Oriented Architectures: Innovate attention/core blocks to improve sample and compute efficiency (DeBERTa, ConvBERT).
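
To make the long-sequence entry concrete, the sketch below counts how many query-key pairs are scored under full attention versus a local sliding-window pattern, one of the sparse patterns used by long-sequence models. The function names, the window size, and the sequence length are illustrative assumptions; real models typically combine local windows with other patterns such as global tokens.

```python
def full_attention_pairs(n):
    """Full self-attention scores every (query, key) pair: n * n."""
    return n * n


def sliding_window_pairs(n, w):
    """Each token attends to itself and up to w neighbours on each side."""
    return sum(min(n - 1, i + w) - max(0, i - w) + 1 for i in range(n))


n, w = 4096, 256
full = full_attention_pairs(n)
local = sliding_window_pairs(n, w)
ratio = full / local  # how many times more pairs full attention scores
```

The local count grows linearly with sequence length for a fixed window, whereas the full count grows quadratically, which is why such patterns scale to thousands of tokens.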

6. Embedding Schemes, Position Encodings, and Adaptation Methods

Embedding Types

Main: Subword (WordPiece, BPE, SentencePiece), character (CharCNN, BiGRU), and code embeddings for structured codes (e.g., medical codes in electronic health records).
Auxiliary: Absolute vs. relative positional encoding, segment embeddings (sentence pairs), language and entity-type embeddings.
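
In BERT-style models, the main and auxiliary embeddings are combined by element-wise addition before the first transformer layer. A minimal sketch with tiny hand-written lookup tables (all table values and names are made up for illustration; learned absolute positions are assumed):

```python
def embed(token_ids, segment_ids, tok_table, pos_table, seg_table):
    """Sum token, absolute-position, and segment embeddings per position."""
    output = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vector = [t + p + s for t, p, s in
                  zip(tok_table[tok], pos_table[pos], seg_table[seg])]
        output.append(vector)
    return output


# Toy 2-dimensional tables: 3 tokens, 3 positions, 2 segments.
tok_table = [[1, 0], [0, 1], [2, 2]]
pos_table = [[10, 10], [20, 20], [30, 30]]
seg_table = [[0, 0], [100, 100]]

vectors = embed([2, 0, 1], [0, 0, 1], tok_table, pos_table, seg_table)
```

Relative position encodings differ in that position information enters the attention computation as a function of the offset between two tokens rather than being added to the input vectors.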

Adaptation Methods

Continual Pretraining (CPT): Further pretraining of pretrained weights on new domain corpora.
Simultaneous Pretraining (SPT): Joint training on diverse corpora.
Task-Adaptive Pretraining (TAPT): Further pretraining on unlabeled data specific to the target task.
Knowledge Inherited Pretraining (KIPT): Combines SSL with knowledge distillation.
Fine-tuning: Ranges from classic to parameter-efficient (adapters, pruning).
Prompt-based Tuning: Applies both discrete and continuous prompt formulations.
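
As an illustration of parameter-efficient fine-tuning, the sketch below implements a bottleneck adapter on plain Python lists: down-project the hidden vector, apply a ReLU, up-project, and add the residual. Biases are omitted for brevity, and the function names and weights are illustrative assumptions rather than any library's API.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def adapter(h, w_down, w_up):
    """Bottleneck adapter: h + W_up * relu(W_down * h), biases omitted."""
    z = [max(0.0, v) for v in matvec(w_down, h)]
    return [hi + ui for hi, ui in zip(h, matvec(w_up, z))]


def adapter_params(hidden, bottleneck):
    """Trainable weights in one adapter (two projections, no biases)."""
    return 2 * hidden * bottleneck


h = [3.0, 1.0]
w_down = [[1.0, -1.0]]        # hidden size 2 -> bottleneck size 1
w_up = [[0.5], [1.0]]         # bottleneck size 1 -> hidden size 2
out = adapter(h, w_down, w_up)

# With a zero up-projection the adapter is the identity, a common starting point.
identity_out = adapter(h, w_down, [[0.0], [0.0]])
```

Only the adapter weights are trained while the pretrained weights stay frozen; with a hidden size of 768 and a bottleneck of 64, for example, one adapter adds 98,304 weights per insertion point.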

7. Comparative Properties and Use Cases

The table below summarizes the principal features and typical model size of the major T-PTLM subfamilies (Kalyan et al., 2021):

Category	Main Features	Typical Size
General-Domain	broad-domain corpora; MLM/CLM/DAE/PLM	110M–340M (base/large)
Social-Media	adaptation to noisy/slang text	≈100M
Monolingual	single-language corpora; domain vocab	30M–150M
Multilingual	shared vocab; cross-lingual transfer	250M–3B
Domain-Specific	domain vocab; CPT or scratch	100M–300M
Encoder-Only	bidirectional; strong NLU	100M–340M
Decoder-Only	autoregressive; strong NLG	100M–175B
Encoder–Decoder	text→text; seq2seq	220M–13B
Compact	distilled, pruned, quantized	10M–100M
Character-Based	char or char+subword input	100M+
Sentence Embedding	fine-tuned on NLI/STS (contrastive)	110M
Tokenization-Free	byte- or character-level input; gradient-learned units	100M–400M+
Large-Scale	emergent, in-context learning	175B–1.6T
Knowledge-Enriched	KB-injection layers/objectives	~200M
Long-Sequence	sparse/linear attention	125M–1B
Efficient Architecture	structural optimizations	100M–300M

Primary use cases range from general NLU and NLG, through domain- and language-specific understanding, to document-scale comprehension, low-latency deployment, and tokenization-robust processing.
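
The size column can be sanity-checked with simple arithmetic. The sketch below counts the weights of a BERT-style encoder from its dimensions; the configuration values (vocabulary 30,522, 512 positions, 2 segment types, hidden sizes 768 and 1024, feed-forward sizes 3072 and 4096) are the published BERT-base and BERT-large settings and are not taken from this article, and the function name is illustrative.

```python
def encoder_params(vocab, hidden, layers, ffn, max_pos, type_vocab=2, pooler=True):
    """Count weights and biases of a BERT-style encoder-only model."""
    layer_norm = 2 * hidden                                  # scale and shift
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    total = embeddings + layers * (attention + feed_forward)
    if pooler:
        total += hidden * hidden + hidden                    # dense layer on [CLS]
    return total


base = encoder_params(vocab=30522, hidden=768, layers=12, ffn=3072, max_pos=512)
large = encoder_params(vocab=30522, hidden=1024, layers=24, ffn=4096, max_pos=512)
print(f"base:  {base / 1e6:.1f}M parameters")
print(f"large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the counts come out at roughly 109.5M and 335.1M, consistent with the 110M–340M range listed for general-domain and encoder-only models.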

This taxonomy, with explicit hierarchical categories, mathematical objectives, representative references, and model properties, provides a rigorous framework for understanding and extending T-PTLMs (Kalyan et al., 2021).


References

Kalyan et al. (2021). AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing.
