Large Language Models
- Large language models are transformer-based neural networks that generate human-like text and perform a wide range of tasks without explicit task-specific supervision.
- They evolved from statistical and early neural methods to models with billions of parameters using unsupervised pretraining and scaling laws.
- Applications include natural language generation, code analysis, ASR, and scientific computing, although challenges in bias, resource use, and reliability remain.
A large language model (LLM) is a transformer-based neural network trained on massive corpora of text, typically containing billions or even trillions of parameters. LLMs are capable of generating human-like text, following complex instructions, and performing a broad array of tasks—including language understanding, translation, summarization, code generation, retrieval, and reasoning—without requiring explicit task-specific supervision. The rapid evolution of LLMs has transformed NLP, enabling powerful new methodologies, revealing emergent abilities, and raising novel theoretical and practical challenges.
1. Historical Context and Model Evolution
The development of LLMs is rooted in a series of paradigm shifts within NLP. Early approaches relied on statistical n-gram models and hidden Markov models, which were overtaken by distributed word embeddings and neural architectures such as RNNs and LSTMs. The introduction of the transformer architecture—a deep stack of multi-head self-attention layers with feed-forward subnets, residual connections, and layer normalization—enabled efficient parallelism and long-context modeling, fundamentally altering the scalability and expressivity of LLMs (Douglas, 2023).
Model evolution proceeded through the GPT (Generative Pre-trained Transformer), BERT, and encoder-decoder lines (T5, PaLM). Exponential increases in parameter count (from millions in early GPT to hundreds of billions in GPT-3/PaLM-e) and dataset size (trillions of tokens) supported the emergence of transfer and in-context learning. This large-scale unsupervised pretraining, followed by fine-tuning or prompting, distinguishes LLMs from earlier models designed for single tasks.
The survey in (Ali et al., 27 Aug 2024) further classifies LLMs by architectural innovation, including dense transformer (GPT, LLaMA) and mixture-of-experts (MoE) models (e.g., CPM-2, GLaM), each enabling scaling along different hardware or data axes. The current generation of LLMs includes multi-lingual and domain-specific variants, such as BLOOM, CT-LLM, and MD-LLM-1, challenging the previous English-centric paradigm (Du et al., 5 Apr 2024, Ali et al., 27 Aug 2024, Murtada et al., 21 Jul 2025).
2. Core Principles and Model Architecture
LLMs are based on the transformer model, which alternates multi-head self-attention and feed-forward layers (Douglas, 2023). Input tokens are mapped to high-dimensional embeddings, sometimes augmented with position encodings (including rotary or absolute schemes). The attention mechanism computes query-key-weighted sums over representations:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad Q = XW_Q,\; K = XW_K,\; V = XW_V,$$
where $X$ is the matrix of input tokens' representations, each $W_{\{Q,K,V\}}$ is a learnable parameter matrix, and the concatenated head outputs are combined by a linear projection.
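For concreteness, single-head scaled dot-product attention can be sketched in a few lines of NumPy; the dimensions and random weights below are illustrative rather than taken from any particular model.

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head attention over token representations X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # learned linear projections
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # query-key similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # attention-weighted sum of values

# Toy example: 4 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = [rng.normal(size=(8, 4)) for _ in range(3)]
out = scaled_dot_product_attention(X, W_q, W_k, W_v)  # shape (4, 4)
```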
Transformers scale efficiently to long contexts and large numbers of layers due to parallelizable computations and residual connections. Various extensions augment basic transformers with grouped-query or multi-query attention, activation variants (SwiGLU), advanced pretraining objectives, or memory architectures (Douglas, 2023, Du et al., 5 Apr 2024, Gokden, 22 Oct 2024).
LLMs are trained to minimize the autoregressive next-token prediction loss:
$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),$$
which empirically results in the learning of structured linguistic, semantic, and world knowledge representations. Scaling laws observed in LLMs indicate per-token loss improves with model size and training data as a power law, controlled by critical data and parameter thresholds.
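As an illustration, the average next-token negative log-likelihood can be computed directly from a model's output logits; the shapes and values below are placeholders.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average negative log-likelihood of the observed next tokens.

    logits: (T, vocab_size) scores for each position's next-token prediction.
    targets: (T,) integer ids of the tokens that actually followed.
    """
    z = logits - logits.max(axis=-1, keepdims=True)                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))   # log-softmax
    token_log_probs = log_probs[np.arange(len(targets)), targets]
    return -token_log_probs.mean()

# Toy example: 5 positions over a 10-token vocabulary.
rng = np.random.default_rng(1)
loss = next_token_loss(rng.normal(size=(5, 10)), rng.integers(0, 10, size=5))
```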
3. Methodologies, Programming, and Modular Algorithms
Contemporary research explores embedding frozen LLMs within external algorithms for orchestrated, compositional reasoning. The “LLM Programs” framework (Schlag et al., 2023) decomposes complex tasks (e.g., evidence-based QA) into modular subproblems governed by classical control flow (loops, recursion), each step invoking the LLM with a focused prompt and context.
Key steps in this methodology:
- Filtering phase: Given a question $q$ and candidate evidence paragraphs $p_1, \dots, p_N$, filter for relevance using the average per-token negative log-likelihood (NLL), $\overline{\mathrm{NLL}}(x \mid c) = -\frac{1}{|x|}\sum_{t}\log p_\theta(x_t \mid c, x_{<t})$, scored for each paragraph in the context of the question. Select the top-$k$ supportive paragraphs (a minimal code sketch of this step follows the list).
- Tree search: Reasoning chains are constructed iteratively, with each substep conditioned on the filtered evidence and ranked using NLL-based delta metrics (e.g., the change in average NLL, $\Delta\overline{\mathrm{NLL}}$, between candidate substeps).
- Intermediate output ranking: Modularization allows systematic evaluation and ranking of sub-chain hypotheses.
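A minimal sketch of the filtering step, assuming a hypothetical `avg_nll(context, text)` helper that returns the model's average per-token negative log-likelihood of `text` given `context` (the helper and its signature are illustrative, not the interface of Schlag et al., 2023):

```python
def filter_paragraphs(question, paragraphs, avg_nll, k=3):
    """Keep the k paragraphs under which the question is most predictable.

    avg_nll(context, text) is a hypothetical scoring helper; lower average NLL
    of the question given a paragraph is treated as stronger evidence of relevance.
    """
    ranked = sorted(paragraphs, key=lambda p: avg_nll(context=p, text=question))
    return ranked[:k]
```

The same scoring primitive can then be reused to rank candidate substeps in the tree-search and intermediate-output-ranking phases.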
In evidence-supported question answering on the StrategyQA dataset, this programmatic decomposition yielded an absolute improvement in accuracy over a standard chain-of-thought approach, and access to "golden facts" raised accuracy above 81% (Schlag et al., 2023).
Other research applies similar decomposition to tasks in ASR (Chen et al., 2023, Cohen et al., 4 Aug 2025), molecular therapeutics (Chaves et al., 10 Jun 2024), and molecular dynamics (Murtada et al., 21 Jul 2025), using LLMs as flexible, context-sensitive sub-modules within broader algorithmic frameworks.
4. Applications Across Domains
LLMs have enabled significant advances across a spectrum of NLP and scientific domains:
- Natural Language Understanding and Generation: LLMs power dialogue agents, summarization, retrieval-augmented QA, and affective computing (Douglas, 2023, Chaves et al., 10 Jun 2024).
- Code Analysis and Security: Models such as CodeT5 and GPT-4 mini are fine-tuned for malware detection and code vulnerability analysis, outperforming classical static analysis by modeling both syntax and semantic anomalies (Jelodar et al., 7 Apr 2025).
- Automatic Speech Recognition (ASR): LLMs are used to rescore or guide decoders, reducing word error rates (WER) and salient term error rates (STER) in long-form speech (Chen et al., 2023, Cohen et al., 4 Aug 2025), and are extended to multi-talker and attribute-conditioned transcription (Meng et al., 13 Sep 2024).
- Embeddings and Retrieval: Universal embedder LLMs offer a unified embedding space for multilingual and cross-domain retrieval and classification (Zhang et al., 2023). The contrastive training and token extraction strategies demonstrate strong transfer to languages and domains lacking annotated data (a retrieval sketch over such a shared space appears after this list).
- Specialized Sciences: MD-LLM-1 (Murtada et al., 21 Jul 2025) leverages LLM architectures to model protein conformational landscapes, indicating the ability of these models to discover rare or unseen molecular states. In drug discovery, the Tx-LLM model shows positive transfer across molecular modalities in property prediction and retrosynthesis (Chaves et al., 10 Jun 2024).
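As referenced in the embeddings-and-retrieval item above, a shared embedding space reduces retrieval to nearest-neighbor search. A minimal sketch, assuming a hypothetical `embed(texts)` function that maps a list of texts to fixed-length vectors:

```python
import numpy as np

def retrieve(query, documents, embed, top_k=5):
    """Rank documents by cosine similarity to the query in a shared embedding space."""
    vectors = np.asarray(embed([query] + documents), dtype=float)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit-normalize
    sims = vectors[1:] @ vectors[0]                             # cosine similarity to the query
    order = np.argsort(-sims)[:top_k]
    return [(documents[i], float(sims[i])) for i in order]
```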
LLMs are increasingly positioned as sophisticated interfaces for data pipelines, integrating with explainability (XAI), AutoML, and knowledge graph systems to drive big-data analytics and cross-disciplinary workflows (Junior et al., 6 Jun 2024).
5. Internal Representations, Memory, and Semantic Alignment
Significant current research focuses on the internal mechanisms of LLMs:
- Memory: LLMs demonstrate functional memory effects analogous to human cognition, including primacy/recency effects, interference patterns, elaborative rehearsal, and learning dynamics, despite not possessing explicit memory modules (Janik, 2023). This behavior is learned from statistical regularities in training corpora, and model architectural decisions (e.g., positional encoding) interact with data-induced memory phenomena.
- Semantic Alignment and Multilingualism: Multilingual LLMs (e.g., BLOOM, LLaMA2) develop a “Lingua Franca” latent space where semantically identical texts in different languages evoke similar activation patterns (Zeng et al., 15 Oct 2024). With continued training and scaling, key linguistic processing neurons become increasingly concentrated in the early and final layers; semantic signal dominates mid-layer activations. Metrics such as the Semantic Alignment Development Score (SADS) quantify this evolution, facilitating robust cross-lingual transfer and reasoning (an illustrative proxy computation appears after this list).
- Deductive and Inductive Outputs: PLDR-LLMs introduce explicit, interpretable internal outputs (e.g., learned metric tensors, power law coefficients, energy-curvature tensors) that can be regularized through specialized losses (e.g., DAG loss) for better interpretability and, in some cases, improved generalization in zero- and few-shot settings (Gokden, 22 Oct 2024).
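One way to probe the cross-lingual alignment described above is to compare pooled hidden states of translation pairs. The sketch below assumes a hypothetical `hidden_state(text, layer)` accessor and is only an illustrative proxy, not the SADS metric of Zeng et al. (15 Oct 2024):

```python
import numpy as np

def alignment_score(sentence_pairs, hidden_state, layer):
    """Mean cosine similarity between pooled hidden states of parallel sentences."""
    sims = []
    for src, tgt in sentence_pairs:
        a, b = hidden_state(src, layer), hidden_state(tgt, layer)
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))
```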
6. Bias, Fairness, and Limitations
LLMs, while powerful, inherit and can amplify various forms of bias from their training data. The LLM Bias Index (LLMBI) (Oketunji et al., 2023) provides a quantitative framework for multi-dimensional bias evaluation, combining weighted annotation scores, dataset diversity penalties, and sentiment corrections:
$$\mathrm{LLMBI} = \sum_{i=1}^{n} w_i\, b_i \;+\; P \;+\; S,$$
where each $b_i$ is a detected bias along dimension $i$, $w_i$ is its weighting, $P$ is the diversity penalty, and $S$ is the sentiment bias score.
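A minimal sketch of this additive combination with placeholder component scores; the exact coefficients and normalization of the published LLMBI may differ.

```python
def llm_bias_index(bias_scores, weights, diversity_penalty, sentiment_score):
    """Additively combine weighted per-dimension bias scores with the correction terms."""
    weighted = sum(w * b for w, b in zip(weights, bias_scores))
    return weighted + diversity_penalty + sentiment_score

# Example: three bias dimensions (e.g., gender, age, locale) with equal weights.
score = llm_bias_index([0.2, 0.1, 0.4], [1/3, 1/3, 1/3], 0.05, 0.02)
```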
While LLMs have demonstrated emergent reasoning, retrieval, and multilingual transfer abilities, challenges remain:
- Hallucination and Reliability: Models frequently generate plausible but false content, especially outside their training manifold.
- Data and Resource Efficiency: Training and inference at the current scale require immense computational resources and energy, posing sustainability concerns.
- Tokenization and Large Input Limitations: Long or complex input representations (e.g., in code, molecular structures, or multi-modal contexts) can be bottlenecked by token limits and lossy tokenizations.
- Adversarial and Security Risks: LLMs are susceptible to prompt engineering/jailbreak attacks that may induce them to output harmful or unsafe content, including, potentially, malware (Jelodar et al., 7 Apr 2025).
7. Future Directions and Open Challenges
Ongoing research aims to:
- Advance architectural efficiency (memory, inference, multi-context scaling)
- Promote cross-lingual parity through improved pretraining data and language-specific architectural tuning (Ali et al., 27 Aug 2024, Du et al., 5 Apr 2024)
- Integrate explicit symbolic and neural reasoning modules (neurosymbolic models)
- Develop more interpretable, robust, and trustworthy LLMs by probing internal representations and refining metrics for alignment, calibration, and bias (Oketunji et al., 2023, Gokden, 22 Oct 2024)
- Deploy domain-specific and multi-modal LLMs for specialized domains such as robotics, scientific computing, and molecular sciences (Kim et al., 6 Jan 2024, Murtada et al., 21 Jul 2025)
- Improve detection and management of AI-generated content using hybrid or ensemble models combining LSTM, Transformer, and CNN layers (Mo et al., 6 Apr 2024)
There is a broad consensus that while LLMs have shifted the boundaries of what is possible in natural language and multi-modal processing, substantial theoretical and engineering challenges remain in scaling, aligning, interpreting, and safely deploying these powerful models.