
Pre-trained Language Models

Updated 10 September 2025
  • Pre-trained Language Models are self-supervised systems built on Transformer architectures that generate contextual representations from large unlabeled text corpora.
  • They employ techniques such as Masked Language Modeling and Autoregressive Modeling, enabling fine-tuning or prompt-based adaptation for various downstream tasks.
  • Advanced PLMs scale to billions of parameters, adapt to domain-specific data, and integrate external knowledge to enhance reasoning, interpretability, and performance.

Pre-trained Language Models (PLMs) are self-supervised learning systems built on large-scale neural architectures, most commonly Transformers, trained by predicting masked or next tokens over very large unlabeled text corpora. Such models, including BERT, GPT, RoBERTa, T5, and their numerous domain- and language-specific variants, have become the backbone of modern NLP, offering contextual representations that can be leveraged for diverse downstream tasks ranging from classification and summarization to structured prediction, code synthesis, domain adaptation, and multimodal or multilingual processing.

1. Core Methodologies and Learning Paradigms

PLMs are primarily characterized by their pre-training objectives and neural architectures. While encoder-only (e.g., BERT), decoder-only (e.g., GPT), and encoder-decoder (e.g., T5, BART) variants exist, the core methodologies share common Self-Supervised Learning (SSL) objectives, such as Masked Language Modeling (MLM), Causal/Autoregressive Modeling (AM), Next Sentence Prediction (NSP), and Replaced Token Detection (RTD) (Hu et al., 2022, Wang et al., 2021).

Typical loss functions include:

  • Masked Language Modeling:

$$\mathcal{L}_{\text{MLM}} = - \sum_{\hat{x}\in M(\mathbf{X})} \log p\left(\hat{x}\mid\mathbf{X}_{\setminus M(\mathbf{X})}\right)$$

  • Autoregressive Modeling:

$$\mathcal{L}_{\text{AM}} = - \sum_{t=1}^{T} \log p\left(x_t \mid \mathbf{X}_{<t}\right)$$
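
Both objectives reduce to cross-entropy over predicted token distributions. The sketch below is a minimal PyTorch illustration with toy tensor shapes (not tied to any particular model): the MLM loss is computed only at masked positions, while the autoregressive loss shifts the targets by one position.

```python
import torch
import torch.nn.functional as F

# Toy dimensions (illustrative only): batch of 2 sequences, length 8, vocab of 100.
batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab)          # model outputs over the vocabulary
targets = torch.randint(0, vocab, (batch, seq_len))  # gold token ids

# Masked Language Modeling: loss is taken only over masked positions.
# Here positions 2 and 5 are masked; in practice ~15% of tokens are chosen at random.
mask = torch.zeros(batch, seq_len, dtype=torch.bool)
mask[:, 2] = True
mask[:, 5] = True
mlm_loss = F.cross_entropy(
    logits[mask],    # predictions at masked positions, shape (num_masked, vocab)
    targets[mask],   # original tokens at those positions
)

# Autoregressive Modeling: predict token t from tokens < t,
# i.e. align logits at position t with the target at position t+1.
am_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    targets[:, 1:].reshape(-1),
)
```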

PLMs also employ tokenization schemes such as WordPiece or SentencePiece, with subword vocabularies that enable open-vocabulary processing, typically in the range of 25,000 to 50,000 tokens (e.g., SentencePiece for Lao (Lin et al., 2021)). The pre-training phase uses large batches (sometimes at a 128-GPU scale (Wu et al., 2021)), long sequences, and corpora of hundreds of millions to billions of tokens.
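
As an illustration, a subword vocabulary in this size range can be built with the SentencePiece library; the corpus path, vocabulary size, and model type below are illustrative choices rather than settings from the cited work.

```python
import sentencepiece as spm

# Train a unigram subword model with a 32k vocabulary (a size within the
# 25,000-50,000 range mentioned above; 'corpus.txt' is a placeholder path).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="subword",
    vocab_size=32000,
    model_type="unigram",
)

# Load the trained model and segment a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("Pre-trained language models learn contextual representations.", out_type=str))
```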

Downstream adaptation typically occurs via fine-tuning, prompt-based learning, or transfer via adapters. Prompting, with either discrete or continuous prompts, has emerged as a method to "cast" various tasks as language modeling problems (see (Liu et al., 2023) for recommender systems). Fine-tuning strategies range from updating all parameters to prompt-tuning or partial updates.
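
The snippet below sketches discrete prompting with an off-the-shelf masked language model: a classification task is cast as filling a [MASK] slot, and class labels are mapped to "verbalizer" tokens. The prompt template and verbalizer words are illustrative assumptions, not a prescribed recipe.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The plot was predictable and the acting was flat."
prompt = f"{text} Overall, it was a [MASK] movie."          # illustrative template
verbalizers = {"positive": "great", "negative": "terrible"}  # illustrative label words

inputs = tokenizer(prompt, return_tensors="pt")
# Position of the [MASK] token in the input sequence.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Compare the language model's scores for the verbalizer tokens.
scores = {
    label: logits[tokenizer.convert_tokens_to_ids(word)].item()
    for label, word in verbalizers.items()
}
print(max(scores, key=scores.get))  # predicted label
```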

2. Major Advancements and Architectures

PLMs have evolved along several axes:

  • Scale: Growth from millions to hundreds of billions of parameters (GPT-3, PaLM), driving emergent capabilities such as few-shot reasoning and in-context learning (Zhu et al., 2022).
  • Domain and Language Specialization: Biomedical (BioBERT, ClinicalBERT (Wang et al., 2021)), legal, Arabic (JABER, SABER (Ghaddar et al., 2022)), and low-resource languages (BERT/ELECTRA for Lao (Lin et al., 2021)) adapt the PLM paradigm with in-domain corpora and vocabulary.
  • Knowledge Integration: Knowledge-enhanced PLMs (KE-PLMs (Hu et al., 2022)) inject external knowledge via entity-linked input, knowledge graphs, or retrieval modules. Strategies include attention matrix modification, additional pre-training objectives, or auxiliary knowledge adapters.
  • Multimodality: Extensions to biological sequences (protein/DNA (Wang et al., 2021)), image-text pairs (vision-language models), and code (CodeBERT, GraphCodeBERT (Chen et al., 2022)) enable cross-modal generalization.

Research has also produced multilingual and cross-lingual PLMs (mBART, InfoXLM, Unicoder (Wu et al., 2021, Wang et al., 2023))—sometimes integrating formal semantic representations (DRSs) to bridge languages and increase transferability.

3. Applications Across Domains

PLMs are leveraged in diverse domains with customized architectures and adaptation protocols:

  • Recommendation Systems: Replacing shallow text encoders with PLMs (BERT, RoBERTa, UniLM, InfoXLM), combined with attention pooling for news recommendation, yields substantial gains in engagement metrics (e.g., +8.53% CTR and +10.68% clicks post-deployment on Microsoft News (Wu et al., 2021)). Pooling strategies with learnable attention weights outperform fixed pooling; a minimal sketch of such attention pooling follows this list.
  • Biomedical NLP: Pre-training on biomedical text, EHRs, and biological sequences yields models used for NER, relation extraction, question answering, and summarization, often achieving state-of-the-art in information extraction, inferencing, and archival retrieval (Wang et al., 2021).
  • Machine Translation and Speech: PLMs, both "small" (<1B parameters) and extra-large, are used for translation and ASR, but cost-benefit analyses of domain-specific adaptation often favor fine-tuned medium-sized models over massive architectures after sufficient adaptation (Han et al., 2022, Krishnan et al., 2023). N-gram approximations derived from PLMs improve language modeling efficiency for ASR.
  • Programming Languages: For code summarization/search in low-resource languages, monolingual PLMs (sometimes trained with selectively similar high-resource code corpora) can outperform multilingual ones in terms of performance-to-time ratio (Chen et al., 2022).
  • Structured Reasoning and QA: PLMs serve as the backbone of simple KGQA frameworks, with knowledge-enhanced and distilled models (e.g., LUKE, KEPLER, DistilBERT) achieving the best balance of accuracy and efficiency. Zero-shot capabilities of prompt-based PLMs (e.g., ChatGPT) are competitive on some tasks but can be limited by entity linking and precise fact retrieval (Hu et al., 2023).
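
Following up on the pooling remark above, a learnable attention pooling layer can replace fixed [CLS] or mean pooling on top of a PLM encoder. The module below is an illustrative PyTorch sketch, not the exact architecture of the cited deployment.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learnable attention pooling over token representations (illustrative sketch)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # A small scorer assigns one attention weight per token.
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len)
        scores = self.scorer(hidden_states).squeeze(-1)         # (batch, seq_len)
        scores = scores.masked_fill(attention_mask == 0, -1e9)  # ignore padding tokens
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # (batch, seq_len, 1)
        return (weights * hidden_states).sum(dim=1)             # (batch, hidden_dim)

# Example: pool PLM token outputs into a single article embedding.
pool = AttentionPooling(hidden_dim=768)
tokens = torch.randn(4, 32, 768)                 # e.g. BERT outputs for 4 news titles
mask = torch.ones(4, 32, dtype=torch.long)
article_vec = pool(tokens, mask)                 # (4, 768)
```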

4. Training Efficiency, Knowledge Transfer, and Scaling Challenges

Training large PLMs entails substantial computational burdens, yet recent approaches improve both efficiency and adaptability. Knowledge Inheritance (KI) (Qin et al., 2021), for example, trains a student model $M_S$ under the guidance of an already trained teacher $M_L$, combining a self-supervised term with a distillation term:

$$\mathcal{L}(\mathcal{D}_L; M_S) = \sum_{(x, y)\in\mathcal{D}_L} \Big[(1-\alpha_t)\,\mathcal{L}_{\text{SELF}}(x, y) + \alpha_t \tau^2\, \mathrm{KL}\big(\mathbb{P}_{M_S}(x;\tau)\,\Vert\,\mathbb{P}_{M_L}(x;\tau)\big)\Big]$$

where the inheritance rate $\alpha_t$ decays during training, guiding the student with the teacher's "dark knowledge." This reduces computational cost (up to 44% fewer training steps) and increases data/sample efficiency, supporting sequential and multi-domain inheritance.
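
A minimal sketch of this combined objective, assuming per-example logits from a student and a (detached) teacher, is shown below; variable names and the decay schedule are illustrative.

```python
import torch
import torch.nn.functional as F

def inheritance_loss(student_logits, teacher_logits, targets, alpha_t, tau=2.0):
    """Combined self-learning + knowledge-inheritance loss (illustrative sketch).

    student_logits, teacher_logits: (batch, vocab) predictions on the same inputs
    targets: (batch,) gold token ids; alpha_t: inheritance rate, decayed over training.
    """
    # Self-supervised term: ordinary cross-entropy on the student's own predictions.
    self_loss = F.cross_entropy(student_logits, targets)

    # Inheritance term: temperature-scaled KL(P_student || P_teacher), as in the
    # formula above; the teacher is detached so no gradient flows into it.
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach() / tau, dim=-1)
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1).mean()

    return (1 - alpha_t) * self_loss + alpha_t * (tau ** 2) * kl

# Example of a linearly decaying inheritance rate over T training steps.
T = 10_000
alpha = lambda step: max(0.0, 1.0 - step / T)
```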

  • Scaling Laws and the Impossible Triangle: Attaining moderate size, state-of-the-art few-shot learning, and fine-tuning capability simultaneously remains elusive (Zhu et al., 2022). Knowledge distillation, prompt learning, and data augmentation are active strategies to mitigate this, but trade-offs (size vs. adaptivity) persist.
  • Optimization of Layer Usage: Fine-tuning only a subset of layers (typically the last two) of a large PLM is often sufficient for strong downstream performance, minimizing resource consumption (Wu et al., 2021); a freezing sketch follows this list. Model designs that allocate the parameter budget toward encoder depth rather than decoder depth yield better sequence-to-sequence models for tasks like keyphrase generation (Wu et al., 2022).
  • Domain Adaptation: KI and related methods enable efficient transfer and continual learning, particularly beneficial under limited data or for privacy constraints (Qin et al., 2021).
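
The freezing sketch referenced above uses Hugging Face Transformers with a 12-layer BERT-style encoder; the layer indices and model choice are illustrative assumptions.

```python
from transformers import AutoModelForSequenceClassification

# Freeze everything except the last two encoder layers and the task head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

for name, param in model.named_parameters():
    trainable = (
        name.startswith("classifier")      # task-specific head
        or "encoder.layer.10." in name     # second-to-last encoder layer
        or "encoder.layer.11." in name     # last encoder layer
    )
    param.requires_grad = trainable

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_trainable:,}")
```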

5. Explainability, Interpretability, and Rationalization

PLMs function as black boxes, spurring a major research thrust toward explainable AI:

  • Selective Rationalization: PLMR (Yuan et al., 3 Jan 2025) addresses the degeneration and failure of previous rationalization frameworks in PLMs due to token representation homogeneity. By extracting rationales from earlier transformer layers (which maintain heterogeneity), augmenting with pruning layers (“Dim-Reduction”), and regularizing against full-text predictions, PLMR enhances both the informativeness and accuracy of rationales.

$$R = M \odot X, \qquad M = g(X) = \mathrm{MLP}\big(\mathrm{Transformer}_{0:l}(X)\big)$$

The generator–predictor split, together with matching regularization (alignment between rationale-based and full-text predictions), achieves upwards of 17% F1 improvement in rationale selection versus previous PLM-based methods.
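
A schematic PyTorch sketch of the generator step in the displayed equation is given below; the layer index l, the MLP shape, and the hard thresholding are illustrative simplifications, not the exact PLMR implementation (in training a differentiable relaxation of the mask would be used).

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Generator sketch: score tokens with an MLP over an *early* transformer layer,
# then mask the token embeddings that are passed to the predictor.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
mlp = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))

inputs = tokenizer("The food was superb but the service was slow.", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).hidden_states[4]   # layer l = 4 (early layer, illustrative)
    X = encoder.embeddings(inputs["input_ids"])   # token embeddings

scores = torch.sigmoid(mlp(hidden)).squeeze(-1)   # per-token selection probability
M = (scores > 0.5).float().unsqueeze(-1)          # hard 0/1 rationale mask
R = M * X                                         # R = M ⊙ X, fed to the predictor
```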

  • Layer-wise Locality of Contextualization: Contextualization—transformation of word meaning by context—is localized within self-attention sub-layers of mid-to-upper transformer layers, with residual connections in output sub-layers attenuating this effect (Vijayakumar et al., 2023). This layer analysis informs interpretability and model optimization for downstream extraction tasks.

6. Knowledge Utilization and Enhancement

PLMs can encode vast latent knowledge, but exploitation of this knowledge remains incomplete:

  • Knowledge Rumination (Yao et al., 2023) demonstrates that the latent knowledge in PLM parameters can be "activated" through strategically constructed prompts ("As far as I know..."), extracting representations that are concatenated or projected via the feed-forward network for knowledge consolidation. This procedure yields notable improvements on commonsense reasoning tasks and GLUE benchmarks, outperforming vanilla fine-tuning and even some external knowledge-injection baselines; a schematic sketch follows this list.
  • Knowledge-Enhanced PLMs: Taxonomies distinguish KE-PLMs by type of external information (linguistic, text, KG, rules) and injection methodology (e.g., attention modification, retrieval augmentation), enabling reasoning and fact-based performance beyond what conventional PLMs achieve (Hu et al., 2022). For NLG, approaches are categorized into KG-based and retrieval-based, with pipelines encompassing retrieve–rerank–rewrite and path reasoning.
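
The schematic sketch referenced above illustrates the general idea of rumination prompting; the prompt wording, mean pooling, and fusion layer are illustrative assumptions rather than the exact procedure of Yao et al. (2023).

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

question = "Where would you keep a frozen pizza before cooking it?"
rumination = f"{question} As far as I know, "   # rumination prompt (illustrative wording)

# Encode the task input and the rumination prompt; mean-pool token states (illustrative).
task_h = model(**tokenizer(question, return_tensors="pt")).last_hidden_state.mean(dim=1)
know_h = model(**tokenizer(rumination, return_tensors="pt")).last_hidden_state.mean(dim=1)

# Consolidate the "ruminated" knowledge with the task representation via a
# simple fusion layer; the result would feed a downstream task head.
fuse = nn.Linear(2 * 768, 768)
fused = fuse(torch.cat([task_h, know_h], dim=-1))
```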

7. Limitations, Evaluation, and Future Directions

Despite their impact, PLMs present persistent challenges:

  • Resource requirements (compute, memory) are a barrier to adoption, particularly for extra-large models and in low-resource or domain-specific contexts (Zhu et al., 2022, Han et al., 2022).
  • Reasoning and Knowledge Gaps: Lack of causal reasoning, susceptibility to spurious correlations, and inability to easily update world knowledge limit reliability—especially in knowledge-intensive and high-stakes domains (Hu et al., 2022, Wang et al., 2021).
  • Explainability and Trust: Black-box decision making, homogeneity of token representations, and insufficiently informative rationales curtail deployment in sensitive applications (Yuan et al., 3 Jan 2025).
  • Ethics, Privacy, and Bias: PLMs can propagate biases present in training corpora, raising concerns for fairness, accountability, and privacy, e.g., in urban research or clinical settings (Fu et al., 2023, Wang et al., 2021).

Standardization efforts—in datasets (e.g., BLURB, ALUE, BLUE), task protocols, and vocabulary—have accelerated progress and comparability (Wang et al., 2021, Ghaddar et al., 2022). Future directions include efficient lifelong and continual learning, multi-modal and cross-lingual expansion, principled integration of heterogeneous or dynamic knowledge sources, and improved, robust interpretability.


This overview synthesizes advances, limitations, and methodologies for PLMs, integrating key findings on architecture, adaptation, knowledge handling, interpretability, and evaluation, with references to recent empirical and survey research from the arXiv corpus.
