Pre-trained Models in NLP
- Pre-trained Models (PTMs) are machine learning models trained on large-scale, unlabeled datasets using self-supervised objectives to learn transferable language representations.
- They leverage advanced architectures like Transformers and bidirectional LSTMs to produce context-sensitive embeddings that improve performance in downstream tasks.
- PTMs drive a range of NLP applications including question answering, sentiment analysis, and machine translation while addressing challenges in efficiency, interpretability, and scalability.
Pre-trained models (PTMs) are machine learning models that have been trained on large, generic datasets with self-supervised or unsupervised objectives to obtain transferable representations, typically as feature extractors or end-to-end architectures for downstream tasks. In NLP, PTMs have catalyzed a paradigm shift by enabling general-purpose distributed representation learning, facilitating transfer learning, and consistently setting new performance benchmarks across a broad range of language understanding and generation applications.
1. Foundations of Language Representation Learning
PTMs serve as a foundation for language representation learning by embedding discrete symbols (words, subwords, or characters) in low-dimensional, continuous vector spaces. The central goal is to encode lexical, syntactic, semantic, and even factual knowledge into these vectors, enabling downstream tasks to leverage rich, general features learned from massive unlabeled corpora. Early approaches, such as Skip-Gram and GloVe, produced static, context-invariant embeddings. In contrast, modern PTMs employ deep neural encoders—such as bidirectional LSTMs, Transformer encoders, or decoder architectures—to learn context-sensitive representations.
For an input sequence $x_{1:T} = [x_1, x_2, \dots, x_T]$, a contextual PTM encoder computes

$[\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_T] = f_{\mathrm{enc}}(x_1, x_2, \dots, x_T),$

where $\mathbf{h}_t$ is the context-dependent representation of token $x_t$ and $f_{\mathrm{enc}}$ denotes the (potentially multi-layered) neural encoder. Non-contextual embeddings are simply lookups from a fixed vocabulary table.
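To make the distinction concrete, the following sketch extracts contextual token representations from a Transformer encoder and contrasts them with a static embedding lookup. It assumes the Hugging Face `transformers` library and the public `bert-base-uncased` checkpoint; any contextual encoder would serve equally well.

```python
# A minimal sketch contrasting contextual and static token representations,
# assuming the Hugging Face `transformers` library and the public
# "bert-base-uncased" checkpoint (any Transformer encoder would do).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised interest rates.",
             "They sat on the bank of the river."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**batch)

# Contextual representations: one vector per token per sentence, so "bank"
# receives a different embedding in each context.
h = outputs.last_hidden_state            # shape: (2, seq_len, hidden_size)

# Static (non-contextual) lookup: the same vector for "bank" everywhere.
bank_id = tokenizer.convert_tokens_to_ids("bank")
static_bank = encoder.get_input_embeddings()(torch.tensor([bank_id]))
```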
2. Taxonomy of Pre-trained Models
The survey introduces a four-dimensional taxonomy for PTMs, summarizing the landscape as follows:
Perspective | Description/Examples | Key Models |
---|---|---|
Representation Type | Non-contextual (static) vs. contextual (dynamic, contextualized) | Word2Vec, GloVe, BERT, GPT, ELMo |
Architecture | LSTM/BiLM, Transformer Encoder, Transformer Decoder, Full Transformer | ELMo, CoVe, BERT, XLNet, GPT, BART, T5 |
Pre-training Task | LM, MLM, PLM, DAE, Contrastive/CTL | GPT (LM), BERT (MLM), XLNet (PLM), BART/MASS (DAE) |
Extensions/Scenario | Knowledge-incorporation, multilinguality, modality, domain, efficiency | ERNIE, KnowBERT, mBERT, XLM-R, BioBERT, DistilBERT |
- Representation Type: Non-contextual word embeddings are assigned statically (no context adaptation). Contextual PTMs compute representations dynamically, conditioning on both previous and future tokens (encoders) or only on preceding tokens (autoregressive decoders).
- Architecture: LSTM-based models were the initial standard for context adaptation; Transformers—particularly encoder (BERT, RoBERTa, XLNet), decoder (GPT series), and encoder-decoder (MASS, BART, T5)—now dominate pre-training.
- Pre-training Task Types:
- Language Modeling (LM): Maximizes the left-to-right factorization $p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$ (e.g., GPT).
- Masked Language Modeling (MLM): Randomly masks tokens in the input sequence and predicts them from the surrounding context (e.g., BERT); see the masking sketch after this list.
- Permuted LM (PLM): Predicts tokens in a randomly permuted order (e.g., XLNet).
- Denoising Autoencoder (DAE): Reconstructs the original sequence from a corrupted input (e.g., token masking, deletion, or sentence permutation), as in BART and MASS.
- Contrastive Learning (CTL): Discriminates between positive pairs and negative samples.
- Extensions: Many PTMs add external knowledge (ERNIE, KnowBERT), handle multiple modalities, specialize to domains (BioBERT, SciBERT), or optimize efficiency (distillation, quantization, pruning, adapters).
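As an illustration of the MLM corruption referenced above, the sketch below applies BERT-style masking (roughly 15% of positions; of those, 80% become [MASK], 10% a random token, and 10% stay unchanged). The function name and hyperparameters are illustrative assumptions, not a reference implementation.

```python
# A minimal sketch of BERT-style MLM input corruption: select ~15% of
# positions; replace 80% of them with [MASK], 10% with a random token, and
# leave 10% unchanged. Illustrative only; library data collators differ in
# details such as special-token handling.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose the positions the model must predict.
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # ignore unmasked positions in the loss

    # 80% of chosen positions -> [MASK]
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id

    # 10% of chosen positions -> a random vocabulary token
    randomized = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & masked & ~replaced)
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    # Remaining 10% keep the original token.
    return input_ids, labels
```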
3. Adaptation and Transfer to Downstream Tasks
A distinguishing feature of PTMs is their versatility for transfer learning. The dominant paradigm is two-stage: (1) pre-training on large-scale, generic data (unlabeled), (2) fine-tuning (or feature extraction) on a task-specific, labeled dataset.
- Layer Selection: Most semantic features are concentrated in the higher layers; combining layers with task-specific learned weights ($\alpha_\ell$ in a softmax mixture) can optimally aggregate syntactic and semantic information (see the layer-mixing sketch at the end of this section).
- Transfer Strategies:
- Feature Extraction: Freezes all PTM parameters, using the model as a task-agnostic encoder.
- Full Fine-tuning: Unfreezes and updates all parameters for the target task.
- Intermediate/Multitask Fine-Tuning: Transfers via a related intermediate task or adapts on a multi-task objective.
- Adapters: Inserts and tunes small extra modules (often bottleneck MLPs) while keeping the PTM backbone frozen, for parameter efficiency (a minimal sketch follows this list).
- Prompt-Based Tuning: Conditions a PTM on (discrete or continuous) prompts to reframe target tasks, often without updating PTM weights.
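As a concrete illustration of the adapter strategy above, the sketch below defines a bottleneck module with a residual connection, in the spirit of Houlsby-style adapters; the hidden and bottleneck sizes are illustrative assumptions. Only adapter (and task-head) parameters would be updated, with the PTM backbone frozen.

```python
# A minimal sketch of a bottleneck adapter: a small down-/up-projection with
# a residual connection, inserted into an otherwise frozen Transformer layer.
# Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, h):
        # The residual connection preserves the frozen backbone's features
        # when the adapter is initialized near zero.
        return h + self.up(self.act(self.down(h)))

# During adaptation only adapter (and task-head) parameters are trained:
#   for p in backbone.parameters():
#       p.requires_grad = False
```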
Task-specific loss functions (e.g., cross-entropy, distillation, contrastive loss) are used during these adaptation steps.
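The layer-selection idea mentioned earlier can likewise be written as a softmax-normalized mixture over the encoder's layer outputs, in the spirit of ELMo's scalar mixing. The module and parameter names below are illustrative and learned together with the downstream task.

```python
# A minimal sketch of softmax-weighted layer aggregation (ELMo-style scalar
# mixing); alpha scores and gamma are illustrative learned parameters.
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_layers))  # unnormalized alpha_l
        self.gamma = nn.Parameter(torch.ones(1))              # global scale

    def forward(self, layer_outputs):
        # layer_outputs: list of L tensors, each (batch, seq_len, hidden)
        alphas = torch.softmax(self.scores, dim=0)
        stacked = torch.stack(layer_outputs, dim=0)           # (L, batch, seq, hidden)
        return self.gamma * (alphas.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```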
4. Directions for Future Research
Key challenges and areas for further development include:
- Model Scaling: While scaling model depth and width yields improvements (e.g., Megatron-LM, Turing-NLG), more efficient architectures and parallel, distributed, and mixed-precision training techniques are crucial for tractability.
- Efficient Architectures for Long Sequences: Self-attention scales quadratically with sequence length, limiting practical input length; efficient attention variants and neural architecture search seek to expand the usable context window.
- Task-Oriented Pre-training & Model Compression: Pre-training objectives should be tailored to the actual downstream use case, and compression (distillation, pruning, quantization) remains necessary for deployment in resource-constrained environments (see the distillation sketch after this list).
- Parameter-Efficient Knowledge Transfer: Moving beyond full fine-tuning (which yields a separate model per task), light-weight adaptation (e.g., adapters, external memories) aims to support many tasks/domains from one backbone.
- Interpretability and Robustness: Explaining PTM predictions and defending against adversarial manipulation are essential as models are increasingly deployed in sensitive applications.
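As an example of the compression techniques referenced in the list above, a standard knowledge-distillation objective combines a temperature-softened KL term against the teacher's logits with the usual hard-label cross-entropy. The temperature and mixing weight below are illustrative hyperparameters, not values prescribed by any particular model.

```python
# A minimal sketch of a knowledge-distillation objective for PTM compression:
# a temperature-softened KL term against the teacher's logits combined with
# the usual hard-label cross-entropy. T and lam are illustrative values.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return lam * kd + (1.0 - lam) * ce
```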
5. Practical Applications Across NLP
PTMs have been successfully exploited in the following applications:
Task | PTM Approach | Notes/Extensions |
---|---|---|
Question Answering | BERT-style span prediction, multi-hop QA | Specialized pre/post-processing for extractive/abstractive QA |
Sentiment Analysis | BERT, domain-adapted PTMs, SentiLR | Aspect-based sentiment; domain tuning |
Named Entity Recognition | ELMo, BERT, domain-specific PTMs | General and biomedical/scientific entity extraction |
Machine Translation | Seq2seq PTMs (MASS, mBART), encoder-decoder | PTM-based fine-tuning for both supervised and unsupervised translation |
Summarization | BERTSUM, PEGASUS (gap sentence pre-training) | Extractive and abstractive summarization |
Adversarial Attack/Defense | BERT-Attack, adversarial training | PTMs as targets and defenses for input perturbations |
In all cases, the rich language representations learned during pre-training provide a transferable foundation, often yielding new state-of-the-art results after downstream adaptation. Specialized models or objectives (e.g., for aspect-based sentiment analysis or biomedical NER) further enhance application-specific performance.
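For instance, extractive question answering with a BERT-style span-prediction model can be invoked in a few lines via the Hugging Face pipeline API; the checkpoint below is an assumption, and any SQuAD-fine-tuned model would work.

```python
# A minimal sketch of extractive QA with a BERT-style span-prediction model;
# the checkpoint name is an assumption.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")
result = qa(question="What do pre-trained models learn?",
            context="Pre-trained models learn transferable language "
                    "representations from large unlabeled corpora.")
print(result["answer"], result["score"])
```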
6. Summary and Significance
Pre-trained models have transformed NLP by establishing a paradigm in which language representations acquired through unsupervised/self-supervised objectives are subsequently reused in downstream, supervised settings. The field is characterized by:
- A robust taxonomy highlighting architecture, task objective, and specializations.
- Transfer learning frameworks that decouple knowledge acquisition from task adaptation.
- Empirical evidence that PTMs achieve consistently superior results across NLP tasks—especially when the adaptation strategies judiciously select layers, use parameter-efficient tuning, and balance task-specific supervision.
- Ongoing challenges in architectural efficiency, knowledge transfer, interpretability, robustness, and scalability—driving vibrant research in model design, compression, and explainability.
The result is an ecosystem where model reuse, systematic transfer, and domain adaptation are foundational, with PTMs serving as the de facto backbone for modern NLP research and applications (Qiu et al., 2020).