Taxonomy of T-PTLMs
- T-PTLMs are transformer-based pretrained language models organized by pretraining corpus, architecture, SSL objectives, and model extensions.
- This taxonomy clearly defines categories such as general-domain, social-media, language-based, and domain-specific models with distinct operational contexts.
- It details methodological aspects including generative, contrastive, adversarial, and hybrid SSL objectives alongside unique embedding and adaptation strategies.
Transformer-based pretrained language models (T-PTLMs) form the backbone of modern natural language processing, combining the transformer architecture with large-scale self-supervised pretraining to learn universal language representations. A comprehensive taxonomy of T-PTLMs organizes this landscape along four principal axes: pretraining corpus, architecture, self-supervised learning (SSL) objective, and model extensions and variants. Subfamilies are further distinguished by embedding schemes, position encodings, and downstream adaptation strategies. This layered taxonomy differentiates T-PTLM classes precisely, supports systematic evaluation, and gives a principled basis for extension to new domains and tasks (Kalyan et al., 2021).
1. Taxonomy Overview and Primary Dimensions
T-PTLMs are systematically categorized along four interlocking dimensions:
- Pretraining-Corpus Based: Specifies source and domain characteristics of the corpora (general-domain, social-media, language-based—monolingual/multilingual, domain-specific).
- Architecture: Distinguishes Transformer variants as encoder-only, decoder-only, or encoder–decoder (Seq2Seq) models.
- Self-Supervised Learning Type: Divides models by main SSL objectives (generative, contrastive, adversarial, hybrid).
- Extensions and Variants: Details compactness, character-based input, adaptation efficiency, sentence representation, tokenization strategies, model scale, knowledge augmentation, long-sequence handling, and architectural efficiency (Kalyan et al., 2021).
These axes, with associated representative models, objectives, and adaptation methods, define the major branches and subcategories of T-PTLMs.
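As an illustration only, the four axes can be treated as fields of a small record, so that any model is placed by one value per axis plus optional extension tags. The class and function names below (`ModelEntry`, `select`, `CATALOG`) are hypothetical and not part of the survey; the model placements follow the examples given in this article.

```python
from dataclasses import dataclass
from enum import Enum


class Corpus(Enum):
    GENERAL = "general-domain"
    SOCIAL_MEDIA = "social-media"
    LANGUAGE = "language-based"
    DOMAIN = "domain-specific"


class Architecture(Enum):
    ENCODER = "encoder-only"
    DECODER = "decoder-only"
    ENCODER_DECODER = "encoder-decoder"


class SSL(Enum):
    GENERATIVE = "generative"
    CONTRASTIVE = "contrastive"
    ADVERSARIAL = "adversarial"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class ModelEntry:
    """One model placed on the four taxonomy axes (hypothetical record type)."""
    name: str
    corpus: Corpus
    architecture: Architecture
    ssl: SSL
    extensions: tuple = ()  # free-form tags such as "compact" or "long-sequence"


# A tiny illustrative catalog; placements mirror the examples in this article.
CATALOG = [
    ModelEntry("BERT", Corpus.GENERAL, Architecture.ENCODER, SSL.HYBRID),
    ModelEntry("GPT-2", Corpus.GENERAL, Architecture.DECODER, SSL.GENERATIVE),
    ModelEntry("ELECTRA", Corpus.GENERAL, Architecture.ENCODER, SSL.ADVERSARIAL),
    ModelEntry("BART", Corpus.GENERAL, Architecture.ENCODER_DECODER, SSL.GENERATIVE),
    ModelEntry("InfoXLM", Corpus.LANGUAGE, Architecture.ENCODER, SSL.HYBRID),
]


def select(catalog, **criteria):
    """Return names of models whose axis values match every given criterion."""
    return [
        m.name
        for m in catalog
        if all(getattr(m, axis) == value for axis, value in criteria.items())
    ]


print(select(CATALOG, architecture=Architecture.ENCODER))
print(select(CATALOG, corpus=Corpus.GENERAL, ssl=SSL.HYBRID))
```

Keeping the axes independent means a model such as InfoXLM can be language-based, encoder-only, and hybrid at the same time, which is what "interlocking" implies here.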
2. Pretraining-Corpus-Based Categories
General-Domain T-PTLMs
General-domain models are pretrained on heterogeneous text (e.g., Wikipedia, BookCorpus, Common Crawl), offering broad linguistic coverage without specialization in domain vocabulary. Iconic examples include BERT (encoder, MLM+NSP), RoBERTa (encoder, dynamic MLM), XLNet (encoder, PLM), ELECTRA (encoder, RTD), GPT-2/3 (decoder, CLM), T5 (encoder–decoder, Seq2SeqLM), and BART (encoder–decoder, DAE).
Representative objectives include:
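- Causal LM: $\mathcal{L}_{\text{CLM}} = -\frac{1}{|x|}\sum_{i=1}^{|x|} \log P(x_i \mid x_{<i})$, where $x$ is the input token sequence.
- Masked LM: $\mathcal{L}_{\text{MLM}} = -\frac{1}{|M_x|}\sum_{i \in M_x} \log P(x_i \mid x_{\setminus M_x})$, where $M_x$ is the set of masked positions in $x$.
- Next Sentence Prediction: $\mathcal{L}_{\text{NSP}} = -\log P(c \mid x, y)$, where $c = 1$ if sentence $y$ follows sentence $x$ in the source text and $c = 0$ otherwise.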
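The three objectives listed above differ mainly in which positions contribute to the loss. The following minimal sketch assumes their standard definitions as mean negative log-likelihoods and takes already-computed model probabilities as input; the function names and the toy probabilities are illustrative assumptions, not values from any model.

```python
import math


def causal_lm_loss(step_probs):
    """Mean NLL over all positions; step_probs[i] = P(x_i | x_<i)."""
    return -sum(math.log(p) for p in step_probs) / len(step_probs)


def masked_lm_loss(token_probs, masked_positions):
    """Mean NLL over masked positions only.

    token_probs[i] = P(x_i | unmasked context); unmasked positions are ignored.
    """
    return -sum(math.log(token_probs[i]) for i in masked_positions) / len(masked_positions)


def nsp_loss(prob_is_next, is_next):
    """Binary cross-entropy for one sentence pair."""
    p = prob_is_next if is_next else 1.0 - prob_is_next
    return -math.log(p)


# Toy probabilities (assumed, for illustration only).
probs = [0.9, 0.25, 0.8, 0.25]
print(causal_lm_loss(probs))            # every position counts
print(masked_lm_loss(probs, [1, 3]))    # only positions 1 and 3 count
print(nsp_loss(0.5, is_next=True))
```

With the same per-token probabilities, the causal loss averages over all four positions while the masked loss averages only over the two masked ones, which is why MLM extracts less training signal per sequence.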
Social-Media
Models in this category address the linguistic noise and colloquialism of text from platforms such as Twitter and Reddit. Notable models include BERTweet (850M tweets), RoBERTa-Twitter (60M tweets), HateBERT (Reddit), and CT-BERT (COVID-19 tweets).
Language-Based
- Monolingual: Trained on a single non-English language (e.g., AraBERT for Arabic, PhoBERT for Vietnamese, CamemBERT for French).
- Multilingual: Joint training across numerous languages with shared subword vocabulary and a focus on cross-lingual transfer (e.g., mBERT, XLM-R, mT5, XLM).
Domain-Specific
Specialized for domains such as biomedical (BioBERT, PubMedBERT), finance (FinBERT), legal (LegalBERT), programming (CodeBERT), or scientific text (SciBERT). These models feature domain-centric vocabularies and often exploit continual pretraining (CPT) or from-scratch pretraining on targeted corpora.
3. Architectural Variants
Encoder-Only
Consists of stacks of bidirectional transformer encoders; primarily supports NLU tasks such as classification or extraction. Examples: BERT, RoBERTa, XLNet, ALBERT, ELECTRA, XLM-R.
Decoder-Only
Employs a stack of transformer decoder layers with masked (causal) self-attention for autoregressive language modeling and NLG. Examples include GPT-1, GPT-2, and GPT-3.
Encoder–Decoder (Seq2Seq)
Combines encoder stacks feeding into decoder stacks, enabling generic text-to-text mapping required in translation and summarization tasks. Models in this group include MASS, BART, T5, mBART, mT5, PEGASUS.
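One way to see the difference between the three architectural families is through their attention masks. The sketch below is a simplified illustration using plain Python lists (1 = may attend, 0 = blocked); the function names are hypothetical and padding masks are ignored.

```python
def full_mask(n):
    """Encoder self-attention: every token sees every token (bidirectional)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder self-attention: token i sees only tokens 0..i (autoregressive)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def cross_mask(n_target, n_source):
    """Encoder-decoder cross-attention: each target token sees all source tokens."""
    return [[1] * n_source for _ in range(n_target)]


# Encoder-only model: one full mask.
for row in full_mask(3):
    print(row)

# Decoder-only model: one causal (lower-triangular) mask.
for row in causal_mask(3):
    print(row)

# Encoder-decoder model: full mask on the source, causal mask on the target,
# plus a cross-attention mask linking target positions to source positions.
seq2seq = {
    "encoder_self": full_mask(4),
    "decoder_self": causal_mask(3),
    "cross": cross_mask(3, 4),
}
```

The causal mask is what makes a decoder usable for left-to-right generation, while the full mask gives an encoder access to both left and right context for understanding tasks.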
4. Self-Supervised Learning Types
Generative SSL
Encompasses Causal LM, Masked LM, Translation LM (TLM), Seq2SeqLM, and Denoising Autoencoder (DAE) losses. These comprise the backbone of representation learning.
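To make the Seq2SeqLM objective concrete, the sketch below builds a source/target pair by replacing chosen spans with sentinel tokens, in the style popularized by T5. The sentinel naming (`<extra_id_k>`) and the fixed span positions are assumptions for illustration; real pipelines sample spans randomly and operate on subword ids.

```python
def span_corrupt(tokens, spans):
    """Build (source, target) for span-corruption pretraining.

    spans: sorted, non-overlapping (start, end) half-open index pairs to mask.
    Each masked span is replaced by one sentinel in the source; the target
    lists each sentinel followed by the tokens it replaced, then a final
    closing sentinel.
    """
    source, target = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        source.extend(tokens[cursor:start])
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    source.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")
    return source, target


tokens = "Thank you for inviting me to your party last week".split()
src, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(src))  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The encoder reads the corrupted source and the decoder is trained with a causal LM loss on the target, so the objective remains generative even though the input is masked.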
Contrastive SSL
Sentence-level classification tasks such as NSP and SOP, formalized as $\mathcal{L} = -\log P(c \mid x, y)$ for a sentence pair $(x, y)$ with binary label $c$, promote coherence and semantic discrimination.
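NSP and SOP differ only in how negative pairs are built: NSP draws the second sentence from another document, while SOP swaps two consecutive sentences from the same document. The following sketch assumes a 50/50 positive/negative split and documents given as lists of sentences; the helper names are hypothetical.

```python
import random


def make_nsp_example(doc, other_doc, idx, rng):
    """Return (sentence_a, sentence_b, label); label 1 means b truly follows a."""
    first = doc[idx]
    if rng.random() < 0.5:
        return first, doc[idx + 1], 1
    # Negative: a sentence sampled from a different document.
    return first, rng.choice(other_doc), 0


def make_sop_example(doc, idx, rng):
    """Return (sentence_a, sentence_b, label); label 1 means original order."""
    a, b = doc[idx], doc[idx + 1]
    if rng.random() < 0.5:
        return a, b, 1
    # Negative: the same two sentences in swapped order.
    return b, a, 0


rng = random.Random(0)
doc = ["The model is pretrained.", "It is then fine-tuned.", "Finally it is evaluated."]
other = ["Stocks fell on Monday.", "The recipe needs two eggs."]
print(make_nsp_example(doc, other, 0, rng))
print(make_sop_example(doc, 1, rng))
```

Because SOP negatives share topic with positives, the classifier cannot rely on topical cues and has to model discourse order instead.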
Adversarial SSL
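Token-level discriminative objectives, commonly Replaced Token Detection (RTD) as in ELECTRA, Random Token Substitution (RTS), and Shuffled Token Detection (STD). The general loss takes the form $\mathcal{L} = -\frac{1}{|\hat{x}|}\sum_{i=1}^{|\hat{x}|} \log P(d_i \mid \hat{x}, i)$, where $\hat{x}$ is the corrupted sequence and $d_i \in \{0, 1\}$ indicates whether the token at position $i$ was altered.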
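A small sketch of Replaced Token Detection may help: labels are derived by comparing the corrupted sequence with the original, and the discriminator is trained with binary cross-entropy over every position. The corrupted sentence and the discriminator probabilities below are made-up inputs; in ELECTRA the replacements come from a small generator network, which is not modeled here.

```python
import math


def rtd_labels(original, corrupted):
    """1 where the token was replaced, 0 where it is unchanged."""
    return [int(o != c) for o, c in zip(original, corrupted)]


def rtd_loss(replaced_probs, labels):
    """Mean binary cross-entropy; replaced_probs[i] = P(token i was replaced)."""
    total = 0.0
    for p, d in zip(replaced_probs, labels):
        total += -math.log(p if d == 1 else 1.0 - p)
    return total / len(labels)


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = rtd_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]

# Hypothetical discriminator outputs for the corrupted sequence.
probs = [0.1, 0.2, 0.7, 0.1, 0.3]
print(rtd_loss(probs, labels))
```

Unlike MLM, every position carries a label, so the loss uses the whole sequence rather than only the masked subset.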
Hybrid SSL
Combines different SSL paradigms. BERT utilizes MLM (generative) plus NSP (contrastive). InfoXLM incorporates MLM, TLM, and a cross-lingual contrastive loss.
5. Extensions and Model Variants
A diverse array of functional extensions and structural adaptations further differentiates T-PTLMs:
- Compact Models: Utilize distillation, pruning, quantization for small-footprint deployment (DistilBERT, TinyBERT, MobileBERT, MiniLM).
- Character-Based Inputs: Ingest character sequences for improved OOV robustness (CharacterBERT, CharBERT, AlphaBERT).
- Green (Lightweight) Models: Enable rapid domain adaptation with minimal pretraining; expand/align vocabulary (GreenBioBERT, exBERT, E-BERT).
- Sentence Embedding Models: Produce fixed-size sentence vectors suitable for clustering and retrieval; typically trained with siamese NLI/STS fine-tuning, contrastive, or denoising objectives (SBERT, SimCSE, TSDAE).
- Tokenization-Free: Operate at the byte or character level, eliminating fixed subword vocabularies (CANINE, ByT5, Charformer).
- Large-Scale Models: Incorporate hundreds of billions to trillions of parameters (GPT-3, GShard, Switch-Transformer), allowing for emergent in-context learning.
- Knowledge-Enriched Models: Integrate structured KBs via specialized layers/objectives (KnowBERT, ERNIE, SenseBERT, SapBERT, UmlsBERT).
- Long-Sequence Models: Employ sparse or linearized attention for thousands of tokens (Longformer, BigBird, ETC, Reformer, Performer).
- Efficiency-Oriented Architectures: Innovate attention/core blocks to improve sample and compute efficiency (DeBERTa, ConvBERT).
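For the long-sequence entry in the list above, the gain from sparse attention can be illustrated with a sliding-window pattern, one of the building blocks used by models such as Longformer. The sketch below is a simplification: `window` here means the number of neighbors on each side (an assumed parameterization), and global or random attention tokens are left out.

```python
def sliding_window_mask(n, window):
    """Token i may attend to token j only if |i - j| <= window."""
    return [[1 if abs(i - j) <= window else 0 for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(map(sum, mask))


def local_pair_count(n, window):
    """Closed form for the sliding-window pair count (valid for window < n)."""
    return n * (2 * window + 1) - window * (window + 1)


for row in sliding_window_mask(6, 1):
    print(row)

print(attended_pairs(sliding_window_mask(6, 1)), "of", 6 * 6)  # 16 of 36

# Cost comparison for a longer sequence, without materializing the mask.
n, w = 4096, 256
print(local_pair_count(n, w), "local pairs vs", n * n, "full pairs")
```

The local count grows linearly in sequence length for a fixed window, whereas full self-attention grows quadratically, which is what makes inputs of thousands of tokens tractable.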
6. Embedding Schemes, Position Encodings, and Adaptation Methods
Embedding Types
- Main: Subword (WordPiece, BPE, SentencePiece), character (CharCNN, BiGRU), and code embeddings for structured identifiers such as clinical codes.
- Auxiliary: Absolute vs. relative positional encoding, segment embeddings (sentence pairs), language and entity-type embeddings.
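The distinction between absolute and relative position information can be sketched as follows. The absolute variant shown is the fixed sinusoidal encoding from the original Transformer (many T-PTLMs, BERT included, learn absolute position embeddings instead); the relative variant shows clipped pairwise offsets that would index a learned embedding table. Function names and the clipping distance are illustrative assumptions.

```python
import math


def sinusoidal_position(pos, dim):
    """Fixed absolute encoding for one position: sin on even, cos on odd indices."""
    vec = []
    for k in range(dim):
        angle = pos / (10000 ** (2 * (k // 2) / dim))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


def relative_position_ids(n, max_distance):
    """Clipped signed offsets (j - i), shifted to non-negative bucket ids.

    Each id would index a learned relative-position embedding; ids range
    from 0 to 2 * max_distance.
    """
    return [
        [max(-max_distance, min(max_distance, j - i)) + max_distance for j in range(n)]
        for i in range(n)
    ]


print(sinusoidal_position(0, 4))  # [0.0, 1.0, 0.0, 1.0]
for row in relative_position_ids(3, 1):
    print(row)
```

Absolute encodings are added to each token once at the input, while relative ids depend on the pair of positions and are usually consumed inside the attention computation.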
Adaptation Methods
- Continual Pretraining (CPT): Further pretraining of pretrained weights on new domain corpora.
- Simultaneous Pretraining (SPT): Joint training on diverse corpora.
- Task-Adaptive Pretraining (TAPT): Further pretraining on unlabeled data specific to the target task.
- Knowledge Inherited Pretraining (KIPT): Combines SSL with knowledge distillation.
- Fine-tuning: Ranges from classic to parameter-efficient (adapters, pruning).
- Prompt-based Tuning: Applies both discrete and continuous prompt formulations.
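To give a sense of why adapters count as parameter-efficient, the arithmetic below estimates how many trainable parameters a bottleneck adapter adds. It assumes a down-projection and an up-projection with biases, two adapters per transformer layer, and BERT-base-like dimensions; the bottleneck size of 64 is an arbitrary illustrative choice, and layer-norm parameters are ignored.

```python
def adapter_params(hidden, bottleneck):
    """Parameters of one bottleneck adapter (weights plus biases)."""
    down = hidden * bottleneck + bottleneck  # hidden -> bottleneck
    up = bottleneck * hidden + hidden        # bottleneck -> hidden
    return down + up


def total_adapter_params(layers, hidden, bottleneck, adapters_per_layer=2):
    """Trainable parameters when only the adapters are updated."""
    return layers * adapters_per_layer * adapter_params(hidden, bottleneck)


# Assumed BERT-base-like shape: 12 layers, hidden size 768.
trainable = total_adapter_params(layers=12, hidden=768, bottleneck=64)
backbone = 110_000_000  # approximate size of the frozen backbone
print(trainable)                       # 2379264
print(f"{trainable / backbone:.1%}")   # share of the backbone size
```

Under these assumptions only a few percent of the backbone's parameter count is trained per task, while the pretrained weights stay frozen and can be shared across tasks.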
7. Comparative Properties and Use Cases
The following table summarizes the principal features and typical model size of the major T-PTLM subfamilies (Kalyan et al., 2021):
| Category | Main Features | Typical Size |
|---|---|---|
| General-Domain | broad-domain corpora; MLM/CLM/DAE/PLM | 110M–340M (base/large) |
| Social-Media | adaptation to noisy/slang text | ≈100M |
| Monolingual | single-language corpora; language-specific vocab | 30M–150M |
| Multilingual | shared vocab; cross-lingual transfer | 250M–3B |
| Domain-Specific | domain vocab; CPT or scratch | 100M–300M |
| Encoder-Only | bidirectional; strong NLU | 100M–340M |
| Decoder-Only | autoregressive; strong NLG | 100M–175B |
| Encoder–Decoder | text→text; seq2seq | 220M–13B |
| Compact | distilled, pruned, quantized | 10M–100M |
| Character-Based | char or char+subword input | 100M+ |
| Sentence Embedding | fine-tuned on NLI/STS (contrastive) | 110M |
| Token-Free | byte-level or gradient-learned units | 100M–400M+ |
| Large-Scale | emergent, in-context learning | 175B–1.6T |
| Knowledge-Enriched | KB-injection layers/objectives | ~200M |
| Long-Sequence | sparse/linear attention | 125M–1B |
| Efficient Architecture | structural optimizations | 100M–300M |
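The size ranges in the table can be sanity-checked with a rough parameter estimate. The sketch below uses the common approximation of about 12·d² weights per transformer layer (4·d² for attention projections, 8·d² for a feed-forward block of width 4d) plus the token embedding matrix; biases, layer norms, and position embeddings are ignored, and the BERT-like shapes and vocabulary size are assumed inputs.

```python
def encoder_params(layers, hidden, vocab):
    """Approximate parameter count of a transformer encoder."""
    per_layer = 12 * hidden * hidden  # 4*d^2 attention + 8*d^2 feed-forward
    embeddings = vocab * hidden       # token embedding matrix
    return layers * per_layer + embeddings


# Assumed BERT-like configurations with a 30,522-entry vocabulary.
base = encoder_params(layers=12, hidden=768, vocab=30522)
large = encoder_params(layers=24, hidden=1024, vocab=30522)
print(f"base:  {base / 1e6:.1f}M")   # base:  108.4M
print(f"large: {large / 1e6:.1f}M")  # large: 333.2M
```

Both estimates land close to the 110M–340M range listed for general-domain and encoder-only models, which suggests the table's sizes correspond to base and large configurations of this shape.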
Primary use cases range from general NLU and NLG, through understanding of text in specific domains and languages, to document-scale comprehension, low-latency deployment, and tokenization that is robust to noisy input.
This taxonomy, with explicit hierarchical categories, mathematical objectives, representative references, and model properties, provides a rigorous framework for understanding and extending T-PTLMs (Kalyan et al., 2021).