
Taxonomy of T-PTLMs

Updated 20 May 2026
  • T-PTLMs are transformer-based pretrained language models organized by pretraining corpus, architecture, SSL objectives, and model extensions.
  • This taxonomy clearly defines categories such as general-domain, social-media, language-based, and domain-specific models with distinct operational contexts.
  • It details methodological aspects including generative, contrastive, adversarial, and hybrid SSL objectives alongside unique embedding and adaptation strategies.

Transformer-based Pretrained Language Models (T-PTLMs) form the backbone of modern natural language processing. They combine the transformer architecture with large-scale self-supervised pretraining to learn universal language representations. A comprehensive taxonomy of T-PTLMs organizes this landscape along four principal axes: pretraining corpus, architecture, self-supervised learning (SSL) objective, and model extensions and variants. Subfamilies are further distinguished by embedding schemes, position encodings, and downstream adaptation strategies. This layered taxonomy differentiates T-PTLM classes precisely, supports systematic evaluation, and facilitates principled extension to new domains and tasks (Kalyan et al., 2021).

1. Taxonomy Overview and Primary Dimensions

T-PTLMs are systematically categorized along four interlocking dimensions:

  1. Pretraining-Corpus Based: Specifies source and domain characteristics of the corpora (general-domain, social-media, language-based—monolingual/multilingual, domain-specific).
  2. Architecture: Distinguishes Transformer variants as encoder-only, decoder-only, or encoder–decoder (Seq2Seq) models.
  3. Self-Supervised Learning Type: Divides models by main SSL objectives (generative, contrastive, adversarial, hybrid).
  4. Extensions and Variants: Details compactness, character-based input, adaptation efficiency, sentence representation, tokenization strategies, model scale, knowledge augmentation, long-sequence handling, and architectural efficiency (Kalyan et al., 2021).

These axes, with associated representative models, objectives, and adaptation methods, define the major branches and subcategories of T-PTLMs.
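As a rough illustration, the four axes can be modelled as a small record type, so that each model becomes one point in the taxonomy. This is a minimal sketch: the class names, the `CATALOG` list, and the `select` helper are hypothetical and exist only for this example. The entries follow the classifications given in this article, except XLM-R's objective (MLM, hence generative), which is filled in from general knowledge.

```python
from dataclasses import dataclass
from enum import Enum


class Corpus(Enum):
    GENERAL = "general-domain"
    SOCIAL_MEDIA = "social-media"
    LANGUAGE_BASED = "language-based"
    DOMAIN_SPECIFIC = "domain-specific"


class Architecture(Enum):
    ENCODER_ONLY = "encoder-only"
    DECODER_ONLY = "decoder-only"
    ENCODER_DECODER = "encoder-decoder"


class SSLType(Enum):
    GENERATIVE = "generative"
    CONTRASTIVE = "contrastive"
    ADVERSARIAL = "adversarial"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class ModelEntry:
    """One model placed along the four taxonomy axes (hypothetical type)."""
    name: str
    corpus: Corpus
    architecture: Architecture
    ssl_type: SSLType
    extensions: tuple = ()  # fourth axis, e.g. ("large-scale",)


# Illustrative entries only; not an exhaustive catalogue.
CATALOG = [
    ModelEntry("BERT", Corpus.GENERAL, Architecture.ENCODER_ONLY, SSLType.HYBRID),
    ModelEntry("ELECTRA", Corpus.GENERAL, Architecture.ENCODER_ONLY, SSLType.ADVERSARIAL),
    ModelEntry("GPT-3", Corpus.GENERAL, Architecture.DECODER_ONLY, SSLType.GENERATIVE,
               ("large-scale",)),
    ModelEntry("T5", Corpus.GENERAL, Architecture.ENCODER_DECODER, SSLType.GENERATIVE),
    ModelEntry("XLM-R", Corpus.LANGUAGE_BASED, Architecture.ENCODER_ONLY, SSLType.GENERATIVE),
]


def select(catalog, **criteria):
    """Return the names of entries whose fields equal every given criterion."""
    return [m.name for m in catalog
            if all(getattr(m, key) == value for key, value in criteria.items())]


encoder_only = select(CATALOG, architecture=Architecture.ENCODER_ONLY)
```

Because the axes are independent fields, a query can combine them, for example `select(CATALOG, corpus=Corpus.GENERAL, ssl_type=SSLType.GENERATIVE)`.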

2. Pretraining-Corpus-Based Categories

General-Domain T-PTLMs

General-domain models are pretrained on heterogeneous text (e.g., Wikipedia, BookCorpus, Common Crawl), offering broad linguistic coverage without specialization in domain vocabulary. Iconic examples include BERT (encoder, MLM+NSP), RoBERTa (encoder, dynamic MLM), XLNet (encoder, PLM), ELECTRA (encoder, RTD), GPT-2/3 (decoder, CLM), T5 (encoder–decoder, Seq2SeqLM), and BART (encoder–decoder, DAE).

Representative objectives include:

  • Causal LM: $\mathcal{L}_{\mathrm{CLM}}(x) = -\frac{1}{|x|}\sum_{i=1}^{|x|}\log P(x_i \mid x_{<i})$
  • Masked LM: $\mathcal{L}_{\mathrm{MLM}}(x) = -\frac{1}{|M|}\sum_{i\in M}\log P(x_i \mid x_{\setminus M})$
  • Next Sentence Prediction: $\mathcal{L}_{\mathrm{NSP}}(x, y) = -\log P(d \mid x, y),\ d \in \{0,1\}$
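The three objectives above can be evaluated directly once a model supplies probabilities for the true targets. The following sketch assumes those probabilities are already available as plain Python floats; the function names are hypothetical and no real model is involved.

```python
import math


def clm_loss(token_probs):
    """Causal LM loss: mean negative log-probability over all positions.

    token_probs[i] stands for P(x_i | x_<i) assigned to the true token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def mlm_loss(token_probs, masked_positions):
    """Masked LM loss: the same average, restricted to the masked set M.

    token_probs[i] stands for P(x_i | x_{\\M}) assigned to the true token.
    """
    total = sum(math.log(token_probs[i]) for i in masked_positions)
    return -total / len(masked_positions)


def nsp_loss(prob_is_next, is_next):
    """NSP loss: negative log-probability of the true binary label d."""
    return -math.log(prob_is_next if is_next else 1.0 - prob_is_next)


# Toy values: a 4-token sequence, with positions 1 and 3 masked for MLM.
probs = [0.5, 0.25, 0.5, 0.25]
clm = clm_loss(probs)          # averages over all 4 positions
mlm = mlm_loss(probs, [1, 3])  # averages over the 2 masked positions only
nsp = nsp_loss(0.8, True)
```

The only structural difference between the CLM and MLM functions is the index set of the sum, which mirrors the difference between the two formulas.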

Social-Media

Models in this category address the linguistic noise and colloquialism of text from platforms such as Twitter and Reddit. Notable models include BERTweet (850M tweets), RoBERTa-Twitter (60M tweets), HateBERT (Reddit), and CT-BERT (COVID-19 tweets).

Language-Based

  • Monolingual: Trained on a single non-English language (e.g., AraBERT for Arabic, PhoBERT for Vietnamese, CamemBERT for French).
  • Multilingual: Joint training across numerous languages with shared subword vocabulary and a focus on cross-lingual transfer (e.g., mBERT, XLM-R, mT5, XLM).

Domain-Specific

Specialized for domains such as biomedical (BioBERT, PubMedBERT), finance (FinBERT), legal (LegalBERT), programming (CodeBERT), or scientific text (SciBERT). These models feature domain-centric vocabularies and often exploit continual pretraining (CPT) or from-scratch pretraining on targeted corpora.

3. Architectural Variants

Encoder-Only

Consists of stacks of bidirectional transformer encoders; primarily supports NLU tasks such as classification or extraction. Examples: BERT, RoBERTa, XLNet, ALBERT, ELECTRA, XLM-R.

Decoder-Only

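Employs stacks of masked self-attention transformer decoders for autoregressive NLG and language modeling. Examples include GPT-1/2/3. XLNet is also trained autoregressively (through permutation language modeling), although this taxonomy lists it among the encoder-only models.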

Encoder–Decoder (Seq2Seq)

Combines encoder stacks feeding into decoder stacks, enabling generic text-to-text mapping required in translation and summarization tasks. Models in this group include MASS, BART, T5, mBART, mT5, PEGASUS.
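One concrete way to see the difference between the three variants is the self-attention mask each one applies. The sketch below builds these masks as nested lists, with 1 meaning "may attend" and 0 meaning "blocked". The function names are hypothetical, and real implementations use tensors and also handle padding.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def cross_attention_mask(target_len, source_len):
    """Encoder-decoder cross-attention: each target position sees all source positions."""
    return [[1] * source_len for _ in range(target_len)]


for row in causal_mask(4):
    print(row)
```

An encoder-only model uses the bidirectional mask, a decoder-only model uses the causal mask, and an encoder-decoder model uses the bidirectional mask in the encoder and the causal mask plus cross-attention in the decoder.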

4. Self-Supervised Learning Types

Generative SSL

Encompasses Causal LM, Masked LM, Translation LM (TLM), Seq2SeqLM, and Denoising Autoencoder (DAE) losses. These objectives train the model to predict or reconstruct tokens from their context, and they underlie most T-PTLM pretraining.

Contrastive SSL

Sentence-level classification tasks such as NSP and SOP, formalized as $\mathcal{L}_{\mathrm{SOP}}(x,y) = -\log P(d\mid x,y)$ with $d \in \{0,1\}$, promote coherence and semantic discrimination.
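To make the label $d$ concrete, the sketch below builds sentence order prediction (SOP) examples from consecutive text segments: a pair in its original order is a positive, and the same pair swapped is a negative. The function name is hypothetical. NSP differs in how negatives are made: there the second segment is typically drawn from a different document instead of being swapped.

```python
def sop_examples(segments):
    """Build (first, second, label) triples from consecutive text segments.

    label 1: the two segments appear in their original order.
    label 0: the same two segments with their order swapped.
    """
    examples = []
    for a, b in zip(segments, segments[1:]):
        examples.append((a, b, 1))
        examples.append((b, a, 0))
    return examples


doc = ["She opened the door.", "The room was empty.", "She left a note."]
pairs = sop_examples(doc)
```

Each triple feeds the loss above directly: the model scores the pair `(x, y)` and is penalized by the negative log-probability it assigns to the true label.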

Adversarial SSL

Token-level discriminative objectives, commonly Replaced Token Detection (RTD) as in ELECTRA, Random Token Substitution (RTS), and Shuffled Token Detection (STD). The general loss takes the form $\mathcal{L}_{\mathrm{RTD}}(\hat x) = -\frac{1}{|\hat x|}\sum_{i=1}^{|\hat x|}\log P(d_i\mid \hat x_i)$.
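The RTD loss can be sketched in two steps: derive the per-token labels $d_i$ by comparing the original and corrupted sequences, then average the binary log-loss over all positions. This is an illustration under simplifying assumptions: the function names are hypothetical, and the discriminator's outputs are supplied by hand as probabilities that each token was replaced.

```python
import math


def rtd_labels(original, corrupted):
    """d_i = 1 where the token was replaced, 0 where it is unchanged."""
    return [int(o != c) for o, c in zip(original, corrupted)]


def rtd_loss(prob_replaced, labels):
    """Mean negative log-probability of the true label at every position."""
    total = 0.0
    for p, d in zip(prob_replaced, labels):
        total += math.log(p if d == 1 else 1.0 - p)
    return -total / len(labels)


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = rtd_labels(original, corrupted)

# Hand-written discriminator outputs: confident about every position.
loss = rtd_loss([0.1, 0.1, 0.9, 0.1, 0.1], labels)
```

Unlike MLM, the sum runs over every position of the corrupted input, not only over a masked subset.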

Hybrid SSL

Combines different SSL paradigms. BERT utilizes MLM (generative) plus NSP (contrastive). InfoXLM incorporates MLM, TLM, and a cross-lingual contrastive loss.

5. Extensions and Model Variants

A diverse array of functional extensions and structural adaptations further differentiates T-PTLMs:

  • Compact Models: Use distillation, pruning, and quantization for small-footprint deployment (DistilBERT, TinyBERT, MobileBERT, MiniLM).
  • Character-Based Inputs: Ingest character sequences for improved OOV robustness (CharacterBERT, CharBERT, AlphaBERT).
  • Green (Lightweight) Models: Enable rapid domain adaptation with minimal pretraining; expand/align vocabulary (GreenBioBERT, exBERT, E-BERT).
  • Sentence Embedding Models: Produce fixed-size sentence vectors suitable for clustering and retrieval; trained with supervised NLI/STS, contrastive, or denoising objectives (SBERT, SimCSE, TSDAE).
  • Tokenization-Free: Operate at the byte or character level, eliminating fixed subword vocabularies (CANINE, ByT5, Charformer).
  • Large-Scale Models: Incorporate hundreds of billions to trillions of parameters (GPT-3, GShard, Switch-Transformer), allowing for emergent in-context learning.
  • Knowledge-Enriched Models: Integrate structured KBs via specialized layers/objectives (KnowBERT, ERNIE, SenseBERT, SapBERT, UmlsBERT).
  • Long-Sequence Models: Employ sparse or linearized attention for thousands of tokens (Longformer, BigBird, ETC, Reformer, Performer).
  • Efficiency-Oriented Architectures: Innovate attention/core blocks to improve sample and compute efficiency (DeBERTa, ConvBERT).
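For the long-sequence entry in the list above, a small sketch shows why sparse attention helps. It builds a local sliding-window mask, one of the patterns used (alongside global or random connections) by models such as Longformer and BigBird, and counts the attended query-key pairs. The function names and the window parameter are hypothetical.

```python
def sliding_window_mask(n, window):
    """Position i may attend to positions j with |i - j| <= window."""
    return [[1 if abs(i - j) <= window else 0 for j in range(n)]
            for i in range(n)]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


n = 8
full_pairs = n * n                                      # dense attention
local_pairs = attended_pairs(sliding_window_mask(n, 1))
```

Dense attention grows as n², whereas the windowed count is n(2w + 1) − w(w + 1) for window w, which is linear in n when w is fixed.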

6. Embedding Schemes, Position Encodings, and Adaptation Methods

Embedding Types

  • Main: Subword (WordPiece, BPE, SentencePiece), Character (CharCNN, BiGRU), code embeddings for structured codes.
  • Auxiliary: Absolute vs. relative positional encoding, segment embeddings (sentence pairs), language and entity-type embeddings.

Adaptation Methods

  • Continual Pretraining (CPT): Further pretraining of pretrained weights on new domain corpora.
  • Simultaneous Pretraining (SPT): Joint training on diverse corpora.
  • Task-Adaptive Pretraining (TAPT): Further pretraining on unlabeled data specific to the target task.
  • Knowledge Inherited Pretraining (KIPT): Combines SSL with knowledge distillation.
  • Fine-tuning: Ranges from classic to parameter-efficient (adapters, pruning).
  • Prompt-based Tuning: Applies both discrete and continuous prompt formulations.
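Of the adaptation methods listed above, KIPT lends itself to a compact numerical sketch, since it combines an SSL loss with a knowledge-distillation term. The version below is an assumption-laden illustration and not the formulation of any specific paper: the weighting scheme with `alpha`, the use of KL divergence as the distillation term, and all function names are hypothetical.

```python
import math


def cross_entropy(student_probs, target_index):
    """SSL term: negative log-probability of the true token."""
    return -math.log(student_probs[target_index])


def kl_divergence(teacher_probs, student_probs):
    """Distillation term: KL(teacher || student) over the vocabulary."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)


def kipt_loss(student_probs, teacher_probs, target_index, alpha=0.5):
    """Weighted sum (1 - alpha) * SSL loss + alpha * distillation loss.

    alpha is an assumed mixing weight, not a value from the source.
    """
    ssl_term = cross_entropy(student_probs, target_index)
    kd_term = kl_divergence(teacher_probs, student_probs)
    return (1 - alpha) * ssl_term + alpha * kd_term


student = [0.6, 0.3, 0.1]
teacher = [0.7, 0.2, 0.1]
loss = kipt_loss(student, teacher, target_index=0, alpha=0.5)
```

Setting `alpha=0` recovers plain self-supervised training, while larger values push the student toward the teacher's output distribution.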

7. Comparative Properties and Use Cases

The table below summarizes the principal features and typical model sizes of the major T-PTLM subfamilies (Kalyan et al., 2021):

| Category | Main Features | Typical Size |
|---|---|---|
| General-Domain | broad-domain corpora; MLM/CLM/DAE/PLM | 110M–340M (base/large) |
| Social-Media | adaptation to noisy/slang text | ≈100M |
| Monolingual | single-language corpora; domain vocab | 30M–150M |
| Multilingual | shared vocab; cross-lingual transfer | 250M–3B |
| Domain-Specific | domain vocab; CPT or from-scratch pretraining | 100M–300M |
| Encoder-Only | bidirectional; strong NLU | 100M–340M |
| Decoder-Only | autoregressive; strong NLG | 100M–175B |
| Encoder–Decoder | text→text; seq2seq | 220M–13B |
| Compact | distilled, pruned, quantized | 10M–100M |
| Character-Based | char or char+subword input | 100M+ |
| Sentence Embedding | fine-tuned on NLI/STS (contrastive) | 110M |
| Tokenization-Free | byte-level or gradient-learned units | 100M–400M+ |
| Large-Scale | emergent, in-context learning | 175B–1.6T |
| Knowledge-Enriched | KB-injection layers/objectives | ~200M |
| Long-Sequence | sparse/linear attention | 125M–1B |
| Efficient Architecture | structural optimizations | 100M–300M |
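The base/large sizes in the table can be sanity-checked with a common back-of-the-envelope estimate for a transformer encoder: about 12·L·H² parameters in the blocks (4H² for attention projections and 8H² for a feed-forward layer of width 4H) plus V·H for token embeddings. The sketch below applies it to the published BERT-base and BERT-large configurations. These configuration values are not taken from this article, and the function name is hypothetical.

```python
def approx_encoder_params(num_layers, hidden_size, vocab_size):
    """Rough parameter count: 12 * L * H^2 (blocks) + V * H (token embeddings).

    Ignores biases, layer norms, position and segment embeddings, and any
    task head, and assumes a feed-forward width of 4 * H.
    """
    return 12 * num_layers * hidden_size ** 2 + vocab_size * hidden_size


# Published BERT configurations (an assumption external to this article).
bert_base = approx_encoder_params(num_layers=12, hidden_size=768, vocab_size=30522)
bert_large = approx_encoder_params(num_layers=24, hidden_size=1024, vocab_size=30522)

print(f"BERT-base  ~ {bert_base / 1e6:.0f}M parameters")
print(f"BERT-large ~ {bert_large / 1e6:.0f}M parameters")
```

The estimates land slightly below the commonly quoted 110M and 340M because the omitted terms (biases, layer norms, position embeddings, pooler) add a few million parameters.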

Primary use-cases range from NLU, NLG, and language understanding in specific domains/languages, to document-scale comprehension, low-latency deployment, and robust tokenization.


This taxonomy, with explicit hierarchical categories, mathematical objectives, representative references, and model properties, provides a rigorous framework for understanding and extending T-PTLMs (Kalyan et al., 2021).

References (1)
