Foundations of Large Language Models
Abstract: This is a book about large language models (LLMs). As indicated by the title, it primarily focuses on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. The book is structured into five main chapters, each exploring a key area: pre-training, generative models, prompting, alignment, and inference. It is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in LLMs.
Explain it Like I'm 14
Overview
This paper is a friendly, note-style book about LLMs, like GPT and BERT. It explains how these models learn from huge amounts of text, how they’re built, and how we make them useful for many different tasks. The big idea is that by practicing on simple language puzzles at an enormous scale, a single “foundation model” can pick up a lot of general knowledge and then be adapted to do many things—like answering questions, translating, or classifying text.
Objectives and Questions
The book aims to answer, in simple terms:
- How do LLMs learn so much from text?
- What kinds of model designs (architectures) do we use?
- What training tricks help them become powerful general tools?
- Once trained, how do we make them do specific jobs (like sentiment analysis or translation)?
- How do we guide them to follow human instructions and values?
It focuses on four of the book's main areas:
- Pre-training: how models learn from massive text without needing labels.
- Generative models: how models produce text and handle long inputs.
- Prompting: how to “ask” models to do tasks using instructions or examples.
- Alignment: how we fine-tune models to be helpful, safe, and follow instructions.
Methods and Key Ideas
Here are the main techniques, explained with everyday analogies:
- Tokens: Think of text split into small pieces (like words or word parts). Models read sequences of tokens.
- Encoder vs. Decoder:
- Encoder-only models are like a super reader: they turn a sentence into a detailed “memory” or representation.
- Decoder-only models are like a skilled storyteller: given some text, they predict the next word over and over to write fluent sentences.
- Encoder-decoder models combine both: read with an encoder, write with a decoder (great for translation).
- Pre-training (learning before specific jobs):
- Supervised: Learn from labeled examples (like homework with answers). Powerful, but gathering labels is expensive.
- Unsupervised: Learn patterns without labels (like noticing trends while reading).
- Self-supervised: Make puzzles from raw text. For example, hide a word and ask the model to guess it. This is the most successful for LLMs because you can use unlimited text from books, websites, etc.
- Key self-supervised tasks (a toy example appears after this list):
- Causal language modeling: practice “predict the next word.” Like continuing a story one word at a time.
- Masked language modeling (used in BERT): hide some words and ask the model to fill the blanks using both left and right context. This helps the model understand sentences deeply.
- Permuted language modeling (used in XLNet): predict words in different orders to avoid some masking issues and capture more word-to-word relationships.
- Training mechanics (lightly explained; a worked example follows this list):
- Softmax: turns raw scores into probabilities, like converting test scores into percent chances.
- Loss (cross-entropy): a penalty for wrong guesses; the model tries to minimize this, which means getting better at predicting.
- Maximum likelihood: a fancy way to say “make the correct text as likely as possible.”
- Adapting models to tasks:
- Fine-tuning: like coaching a trained athlete for a specific sport. We update some parameters using a small labeled dataset so the model specializes (e.g., sentiment classification); see the fine-tuning sketch after this list.
- Prompting: like giving clear instructions and examples. Often, with strong LLMs, we can make them do a new task by just describing it in the input:
- Zero-shot: give only instructions, no examples.
- Few-shot/in-context learning: give a few examples in the prompt so the model learns the pattern on the fly (see the prompt sketch after this list).
- Alignment:
- Instruction tuning and learning from human feedback help models follow instructions more reliably and behave in line with human values.
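To make the self-supervised "puzzles" concrete, here is a minimal Python sketch. The sentence, the whitespace tokenizer, and the number of masked positions are all invented for illustration (real systems use subword tokenizers and far larger corpora); it only shows how training examples are built for causal and masked language modeling.
```python
import random

# Toy "tokenization": split on whitespace. Real systems use subword
# tokenizers (e.g., BPE or WordPiece), but the idea is the same:
# text becomes a sequence of small units called tokens.
sentence = "the cat sat on the mat"
tokens = sentence.split()

# Causal language modeling: predict the next token from the left context.
# Each training example pairs (context so far) with (the next token).
causal_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in causal_examples:
    print(f"causal: {context} -> {target}")

# Masked language modeling (BERT-style): hide some tokens and predict them
# using both the left and right context.
random.seed(0)
mask_positions = random.sample(range(len(tokens)), k=2)
masked = [tok if i not in mask_positions else "[MASK]"
          for i, tok in enumerate(tokens)]
targets = {i: tokens[i] for i in mask_positions}
print("masked input :", masked)
print("mask targets :", targets)
```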
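The training mechanics can also be written out directly. The NumPy sketch below uses a made-up four-word vocabulary and made-up scores; it shows softmax turning raw scores into probabilities, cross-entropy penalizing a low probability on the correct token, and the link to maximum likelihood.
```python
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]       # hypothetical 4-word vocabulary
logits = np.array([2.0, 0.5, -1.0, 0.1])   # made-up raw scores for the next token

# Softmax: exponentiate and normalize so the scores sum to 1 (probabilities).
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(dict(zip(vocab, probs.round(3))))

# Cross-entropy loss: the penalty is -log(probability of the correct token).
correct = vocab.index("cat")
loss = -np.log(probs[correct])
print("cross-entropy loss:", round(float(loss), 3))

# Maximum likelihood: summed over a training corpus, minimizing this loss is
# the same as maximizing  sum_{x in D} log Pr_theta(x).
```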
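Fine-tuning can be sketched in a few lines as well. The PyTorch code below is a hypothetical stand-in, not the book's recipe: a tiny randomly built "pre-trained" backbone is paired with a small classification head and updated on a dummy labeled batch, which is the essential shape of task-specific fine-tuning.
```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained model: in practice this would be a
# large Transformer loaded from a checkpoint, not a tiny randomly built net.
pretrained_backbone = nn.Sequential(
    nn.Embedding(1000, 64),     # token ids -> vectors
    nn.Flatten(),               # flatten the sequence of vectors
    nn.Linear(64 * 8, 64),      # compress into one 64-dim representation
)

# Fine-tuning = attach a small task head (2-way sentiment) and keep training
# on a small labeled dataset so the model specializes.
model = nn.Sequential(pretrained_backbone, nn.Linear(64, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Dummy labeled batch: 4 sequences of 8 token ids with made-up labels.
token_ids = torch.randint(0, 1000, (4, 8))
labels = torch.tensor([0, 1, 1, 0])

logits = model(token_ids)       # shape (4, 2)
loss = loss_fn(logits, labels)  # penalty for wrong predictions
loss.backward()                 # compute gradients
optimizer.step()                # update the parameters
```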
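Finally, prompting is just careful text construction. The strings below are invented examples showing the difference between a zero-shot instruction and a few-shot prompt that packs demonstrations into a single input.
```python
# Zero-shot: only an instruction, no examples.
zero_shot = (
    "Classify the sentiment of the sentence as positive or negative.\n"
    "Sentence: I loved this movie.\n"
    "Sentiment:"
)

# Few-shot / in-context learning: a few demonstrations are placed in the
# prompt so the model can pick up the pattern without any parameter updates.
demonstrations = [
    ("The food was terrible.", "negative"),
    ("What a wonderful day!", "positive"),
]
few_shot = "Classify the sentiment of the sentence as positive or negative.\n"
for text, label in demonstrations:
    few_shot += f"Sentence: {text}\nSentiment: {label}\n"
few_shot += "Sentence: I loved this movie.\nSentiment:"

print(zero_shot)
print("---")
print(few_shot)
```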
Main Takeaways and Why They Matter
- The central insight: by repeatedly solving simple language puzzles (like predicting the next word or filling blanks) on a giant amount of text, a model can learn broad language and world knowledge.
- Foundation models: once pre-trained, the same model can be adapted quickly to many tasks, reducing the need for huge labeled datasets for each task.
- Masked language modeling (BERT) builds strong “reading” skills; causal modeling (GPT) builds strong “writing” skills. Both are useful, and you choose based on the job.
- Prompting lets us turn many problems into text-generation tasks. This means you can often “program” an LLM with plain language, without extra training.
- Alignment is essential. Even powerful models need guidance to be helpful, safe, and trustworthy.
These ideas have transformed NLP: instead of building separate systems for each job, we start with one strong general model and adapt it efficiently.
Implications and Impact
- Easier, faster development: Teams can solve new language tasks by fine-tuning or prompting a pre-trained model, saving time and money on data labeling.
- Broader reach: The same LLM can help in education, customer support, coding assistance, scientific research, and more.
- Better performance: Foundation models often outperform older task-specific systems.
- Responsibility: As models get stronger, careful alignment and human feedback become even more important to ensure they’re safe, fair, and follow instructions.
In short, LLMs change the way we build AI: we teach one model a lot about language and the world, then guide it—with small tweaks or clear instructions—to do almost any language task.
Glossary
- alignment methods: Techniques to adjust a model’s behavior to follow instructions and human values. "Chapter 4 introduces alignment methods for LLMs."
- argmax: The operation that returns the argument (input) that maximizes a given function. "$\argmax_{\theta} \sum_{\mathbf{x} \in \mathcal{D}} \log \mathrm{Pr}_{\theta}(\mathbf{x})$"
- auto-regressive decoding: Generating tokens sequentially, each conditioned on previously generated tokens. "More precisely, in auto-regressive decoding of machine translation, each target-language token is generated based on both its preceding tokens and source-language sequence."
- autoencoders: Neural networks trained to reconstruct their inputs, often used for unsupervised pre-training. "autoencoders, and others"
- backbone models: Core neural network architectures used as feature extractors for transfer learning. "the backbone models were trained on relatively large labeled datasets such as ImageNet"
- BERT: A bidirectional Transformer-based model trained via masked language modeling. "BERT is then used as an example to illustrate how a sequence model is trained via a self-supervised task, called masked language modeling."
- bidirectional model: A model that uses both left and right context to make predictions. "leading to a bidirectional model that makes predictions based on both left and right-contexts."
- causal language modeling: Predicting the next token using only the preceding tokens (left context). "which is sometimes called causal language modeling"
- chain-of-thought reasoning: Prompting strategy that elicits step-by-step reasoning from models. "such as chain-of-thought reasoning and automatic prompt design."
- cross-attention sub-layers: Attention components where the decoder attends to encoder outputs in seq2seq models. "by simply removing cross-attention sub-layers from it."
- decoder-only architecture: A Transformer setup with only decoder blocks, commonly used for LLMs. "The decoder-only architecture has been widely used in developing LLMs"
- demonstrations: Example input-output pairs included in a prompt to teach the model a task. "These samples, known as demonstrations, are used to teach LLMs how to perform the task."
- downstream tasks: Target tasks to which a pre-trained model is adapted or applied. "applied to different downstream tasks"
- encoder-decoder architectures: Seq2seq models with separate encoder and decoder components. "including decoder-only, encoder-only, and encoder-decoder architectures."
- few-shot learning: Adapting to a task with only a small number of examples provided. "Another method for enabling new capabilities in a neural network is few-shot learning."
- fine-tuning: Further training a pre-trained model on labeled data for a specific task. "A typical way is to fine-tune the model by giving explicit labeling in downstream tasks."
- foundation models: Large pre-trained models that can be adapted to many different tasks. "These pre-trained models serve as foundation models that can be easily adapted to different tasks via fine-tuning or prompting."
- generative models: Models that produce sequences (e.g., text) conditioned on context; LLMs are a prime example. "Chapter 2 introduces generative models, which are the LLMs we commonly refer to today."
- GPT: The Generative Pre-trained Transformer family of LLMs. "covers several well-known examples like BERT and GPT"
- hidden state: The internal representation at a given time step in a sequence model. "hidden state at time step $t$ in sequential models"
- ImageNet: A large-scale labeled image dataset used for pre-training in computer vision. "such as ImageNet"
- in-context learning (ICT): Learning from examples included in the prompt without parameter updates. "in-context learning (ICT)."
- instruction fine-tuning: Supervised fine-tuning on datasets built from human-written instructions and responses. "This chapter focuses on instruction fine-tuning and alignment based on human feedback."
- KL divergence: A measure of how one probability distribution diverges from another. "KL divergence between two distributions"
- language modeling: Predicting the next token in a sequence of text. "large-scale language modeling tasks"
- LLMs: Large language models; very large neural language models trained on massive text corpora. "LLMs originated from natural language processing"
- log-scale cross-entropy: Cross-entropy loss computed in log scale, commonly used for training classifiers and LLMs. "In NLP, the log-scale cross-entropy loss is typically used."
- masked language modeling: Predicting masked tokens in a sequence using surrounding context. "called masked language modeling."
- maximum likelihood estimation: Estimating model parameters by maximizing the likelihood of observed data. "Note that this objective is mathematically equivalent to maximum likelihood estimation"
- one-hot representation: A vector encoding where a single position is 1 and all others are 0, indicating a categorical label. "we can think of $\mathbf{p}_{i+1}^{\mathrm{gold}}$ as a one-hot representation of the correct predicted word."
- permuted language modeling: Training to predict tokens in a randomly permuted order to capture bidirectional dependencies. "using the permuted language modeling approach to pre-training"
- prompting: Crafting inputs (prompts) that instruct a model to perform a desired task without further training. "Prompting and in-context learning play important roles in the recent rise of LLMs."
- query, key, and value matrices: Matrices used in attention mechanisms to compute weighted combinations of representations. "query, key, and value matrices in attention mechanisms"
- self-supervised learning: Training using supervisory signals derived from the data itself rather than labels. "A third approach to pre-training is self-supervised learning."
- self-training: Semi-supervised method where a model learns from its own pseudo labels to improve iteratively. "a related concept is self-training where a model is iteratively improved by learning from the pseudo labels assigned to a dataset."
- sequence encoding models: Models that convert input sequences into vector representations. "Sequence Encoding Models"
- sequence generation models: Models that generate token sequences given some context or input. "Sequence Generation Models"
- Softmax: A function that converts scores into a probability distribution. "Softmax function that normalizes the input vector or matrix"
- tokenization: The process of splitting text into tokens for modeling. "tokens are basic units of text that are separated through tokenization."
- Transformer: A neural architecture based on self-attention for sequence modeling. "such as Transformers \cite{Vaswani-etal:2017Transformer}"
- unsupervised learning: Learning patterns from data without labeled outputs. "many early attempts to achieve pre-training were focused on unsupervised learning."
- zero-shot learning: Performing a task without having seen any labeled examples of that task during training. "This example also demonstrates the zero-shot learning capability of LLMs"