Pen and Paper Exercises in Machine Learning (2206.13446v1)
Abstract: This is a collection of (mostly) pen-and-paper exercises in machine learning. The exercises are on the following topics: linear algebra, optimisation, directed graphical models, undirected graphical models, expressive power of graphical models, factor graphs and message passing, inference for hidden Markov models, model-based learning (including ICA and unnormalised models), sampling and Monte-Carlo integration, and variational inference.
Summary
- The paper presents a rigorous set of exercises that enhance ML understanding through detailed mathematical derivations instead of coding.
- It systematically covers key areas including linear algebra, optimization, and graphical models with practical, step-by-step examples.
- The work demonstrates the practical implications of algorithms like the power method, gradient descent, and message passing in AI research.
This document, "Pen and Paper Exercises in Machine Learning" (2206.13446), offers a collection of exercises and detailed solutions designed to deepen understanding of fundamental machine learning concepts through mathematical derivation and analysis, rather than solely relying on coding. Authored by Michael U. Gutmann, the book emphasizes unsupervised methods, inference, and learning, providing a strong mathematical foundation crucial for applying AI research.
The content is structured into several chapters, each covering a core area of machine learning:
Linear Algebra: This chapter revisits foundational linear algebra techniques essential for many ML algorithms. Exercises cover:
- Gram-Schmidt Orthogonalisation: Exercises construct an orthogonal basis from a set of linearly independent vectors, a step that underlies algorithms such as PCA and certain approaches to solving linear systems. The procedure also provides a check for linear independence: if applying the algorithm to a vector yields the zero vector, that vector is linearly dependent on the preceding ones.
- Linear Transforms: Exercises demonstrate how linear transformations affect geometric properties, specifically the area of a parallelogram spanned by vectors. The determinant of the transformation matrix quantifies this area scaling by $|\det(\mathbf{A})|$. This intuition translates directly to the change-of-variables formula in probability theory, explaining why the Jacobian determinant is needed when transforming random variables $\mathbf{y} = \mathbf{A}\mathbf{x}$, where $\int f(\mathbf{y})\,\mathrm{d}\mathbf{y} = \int f(\mathbf{A}\mathbf{x})\,|\det \mathbf{A}|\,\mathrm{d}\mathbf{x}$.
- Eigenvalue Decomposition: This decomposition ($\mathbf{A}\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda}$) is fundamental for understanding linear transformations and data variance (e.g., in PCA). Exercises show how it relates to matrix inverses ($\mathbf{A}^{-1} = \mathbf{U}\boldsymbol{\Lambda}^{-1}\mathbf{U}^{-1}$) and spectral properties.
- Trace, Determinants, and Eigenvalues: Key results linking the trace of a matrix to the sum of its eigenvalues ($\operatorname{tr}(\mathbf{A}) = \sum_i \lambda_i$) and the determinant to the product of its eigenvalues ($\det(\mathbf{A}) = \prod_i \lambda_i$) are derived using eigenvalue decomposition. These properties are vital for understanding the effects of linear operations in multivariate statistics and optimization.
- Eigenvalue Decomposition for Symmetric Matrices: A practically important result is that eigenvectors of symmetric matrices corresponding to distinct eigenvalues are orthogonal. Positive definite symmetric matrices, common in covariance matrices and optimization (Hessians), are shown to have strictly positive eigenvalues, implying invertibility.
- Power Method: This iterative algorithm provides a practical way to find the dominant eigenvector (the one with the largest eigenvalue) of a symmetric positive definite matrix. The convergence analysis reveals that repeated multiplication by the matrix amplifies the component along the dominant eigenvector's direction. This is directly applicable in algorithms like PCA to find the first principal component without a full eigenvalue decomposition, which can be computationally cheaper for large matrices; a minimal implementation is sketched after this list.
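As a concrete illustration, here is a minimal NumPy sketch of the power method described above; the example matrix and the iteration and tolerance settings are hypothetical choices, not taken from the book.

```python
import numpy as np

def power_method(A, num_iters=1000, tol=1e-10):
    """Estimate the dominant eigenpair of a symmetric positive definite matrix.

    Repeated multiplication by A amplifies the component of the iterate along
    the eigenvector with the largest eigenvalue; normalising at each step
    keeps the iterate bounded.
    """
    rng = np.random.default_rng(0)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        x_new = A @ x
        x_new /= np.linalg.norm(x_new)
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x @ A @ x, x  # Rayleigh quotient gives the eigenvalue estimate

# Hypothetical symmetric positive definite matrix.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
lam, v = power_method(A)
print(lam, v)  # matches the largest eigenpair from np.linalg.eigh(A)
```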
Optimisation: This chapter explores the calculus of vector and matrix functions, providing the tools for gradient-based optimization in ML. Exercises cover:
- Gradient of Vector-valued Functions: Calculating gradients $\nabla J(\mathbf{w})$ for common functions such as $\mathbf{a}^\top \mathbf{w}$, $\mathbf{w}^\top \mathbf{A} \mathbf{w}$, $\|\mathbf{w}\|^2$, and composite functions. Defining $\nabla J(\mathbf{w})^\top \mathbf{h}$ as the first-order term in the Taylor expansion of $J(\mathbf{w} + \epsilon \mathbf{h})$ provides a powerful way to compute gradients; a finite-difference check based on this definition is sketched after this list.
- Newton's Method: This optimization method approximates a function locally as a quadratic and jumps to that quadratic's minimum. Exercises derive the update rule $\mathbf{w}_* = \mathbf{w}_0 - \mathbf{H}^{-1} \nabla J(\mathbf{w}_0)$, where $\mathbf{H}$ is the Hessian matrix. Understanding this forms the basis for many second-order optimization techniques used in complex models.
- Gradient of Matrix-valued Functions: Extends gradient computation to functions of matrices, using the trace-based definition $J(\mathbf{W} + \epsilon \mathbf{H}) = J(\mathbf{W}) + \epsilon \operatorname{tr}(\nabla J(\mathbf{W})^\top \mathbf{H}) + O(\epsilon^2)$. Examples include gradients for $\mathbf{u}^\top \mathbf{W} \mathbf{v}$ and $\mathbf{u}^\top \mathbf{W}^{-1} \mathbf{v}$.
- Gradient of the Log-Determinant: A crucial result for models involving Gaussian distributions (e.g., precision matrices) is the gradient of $\log|\det(\mathbf{W})|$, derived as $(\mathbf{W}^{-1})^\top$.
- Descent Directions for Matrix-valued Functions: Proves that the negative gradient $-\nabla J(\mathbf{W})$ provides a descent direction, meaning a small step in this direction decreases the function value. This generalizes the concept from vector optimization to matrix optimization.
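The Taylor-expansion definition of the gradient suggests a simple numerical sanity check. The sketch below, with hypothetical test matrices and step size, verifies the known result $\nabla(\mathbf{w}^\top \mathbf{A}\mathbf{w}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{w}$ by comparing the analytical directional derivative against a finite difference.

```python
import numpy as np

def J(w, A):
    """Quadratic form J(w) = w^T A w."""
    return w @ A @ w

def grad_J(w, A):
    """Analytical gradient of the quadratic form: (A + A^T) w."""
    return (A + A.T) @ w

# Check J(w + eps*h) - J(w) ~ eps * grad_J(w)^T h, per the Taylor-expansion
# definition of the gradient. All values below are hypothetical test data.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
w, h = rng.standard_normal(4), rng.standard_normal(4)
eps = 1e-6
finite_diff = (J(w + eps * h, A) - J(w, A)) / eps
analytical = grad_J(w, A) @ h
print(abs(finite_diff - analytical))  # O(eps), i.e. tiny
```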
Directed Graphical Models: This section focuses on representing conditional independencies using Directed Acyclic Graphs (DAGs). Exercises cover:
- Directed Graph Concepts: Basic graph terminology, such as trails, paths, and topological orderings, is defined and applied to an example graph.
- Canonical Connections: Analysis of serial ($X \to Z \to Y$), diverging ($X \leftarrow Z \to Y$), and converging ($X \to Z \leftarrow Y$) connections demonstrates the core d-separation rules. Serial and diverging connections leave $X$ and $Y$ dependent unless $Z$ is observed; converging connections leave $X$ and $Y$ independent unless $Z$ or a descendant of $Z$ is observed. This is fundamental for understanding how information flows and is blocked in a DAG; a brute-force check of the serial case is sketched after this list.
- Markov Properties and d-separation: The ordered, local, and global (d-separation) Markov properties are explored. Exercises show how to read independencies from a graph using d-separation. The Markov blanket of a node (parents, children, and co-parents) is identified as the minimal set rendering the node independent of all others. The parameter count for a binary DAG is related to the structure (number of parents).
- Chest Clinic Example: A classic Bayesian network example is used to practice applying d-separation rules to determine conditional independencies in a more complex, real-world-inspired scenario. This demonstrates how graph structure encodes independence assumptions relevant for simplifying probabilistic queries.
- Hidden Markov Models (HMMs): The DAG structure of an HMM is analyzed. Its factorization property, $p(\mathbf{h}, \mathbf{v}) = p(h_1)\,p(v_1 \mid h_1) \prod_{i=2}^{n} p(h_i \mid h_{i-1})\,p(v_i \mid h_i)$, is derived. d-separation is applied to understand independencies within the hidden state chain and between hidden and observed states.
- Characterisation of Independencies: Different equivalent definitions of conditional independence ($p(X, Y \mid Z) = p(X \mid Z)\,p(Y \mid Z)$, or $p(X, Y, Z) = a(X, Z)\,b(Y, Z)$ for some functions $a$ and $b$) are proven. The latter factorization property is particularly important for understanding undirected models.
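To make the serial-connection rule tangible, the sketch below builds the joint $p(x, z, y) = p(x)\,p(z \mid x)\,p(y \mid z)$ for binary variables with hypothetical CPT values and checks numerically that $X$ and $Y$ are marginally dependent but independent given $Z$.

```python
from itertools import product

# Hypothetical CPTs for the serial connection X -> Z -> Y (binary variables).
p_x = {0: 0.3, 1: 0.7}
p_z_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p(z | x)
p_y_z = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # p(y | z)

# Joint from the DAG factorisation p(x, z, y) = p(x) p(z|x) p(y|z).
joint = {(x, z, y): p_x[x] * p_z_x[x][z] * p_y_z[z][y]
         for x, z, y in product([0, 1], repeat=3)}

def marg(keep):
    """Marginalise the joint onto the index set `keep` (0=x, 1=z, 2=y)."""
    out = {}
    for assign, p in joint.items():
        key = tuple(assign[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_xy, p_xz, p_zy = marg((0, 2)), marg((0, 1)), marg((1, 2))
p_xm, p_zm, p_ym = marg((0,)), marg((1,)), marg((2,))

# Marginally dependent: p(x, y) != p(x) p(y).
print(max(abs(p_xy[x, y] - p_xm[(x,)] * p_ym[(y,)])
          for x, y in product([0, 1], repeat=2)))  # clearly nonzero

# Conditionally independent: p(x, y | z) = p(x | z) p(y | z).
print(max(abs(joint[x, z, y] / p_zm[(z,)]
              - (p_xz[x, z] / p_zm[(z,)]) * (p_zy[z, y] / p_zm[(z,)]))
          for x, z, y in product([0, 1], repeat=3)))  # ~0 up to rounding
```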
Undirected Graphical Models: This chapter focuses on representing conditional independencies using undirected graphs (undirected graphical models, UGMs). Exercises cover:
- Gibbs Distributions and Graphs: Drawing a UGM from a given factorization into potential functions, and identifying neighbours and maximal cliques. The local Markov property in UGMs states that a variable is independent of all others given its neighbours (its Markov blanket); this is checked using graph separation.
- Factorisation and Independencies: UGMs factorize according to their maximal cliques. Independencies are determined by graph separation: $X \perp Y \mid Z$ if all paths between $X$ and $Y$ are blocked by $Z$.
- Factorisation from Markov Blankets: Constructing a UGM given the Markov blanket for each variable. For positive distributions, the local Markov property (defined by neighbours/Markov blanket) is equivalent to factorization over the graph.
- Pairwise Potentials: UGMs with pairwise potential functions (like Ising models or Gaussian Markov random fields) are explored. Independence between variables $X_i$ and $X_j$ is linked to the absence of an edge between them, which corresponds to their joint potential $\phi_{ij}$ being constant or factorizing. For Gaussian MRFs, this relates to zero entries in the precision matrix.
- Restricted Boltzmann Machine (RBM): This bipartite UGM is analyzed. Graph separation confirms the key conditional independencies: visible variables are independent given the hiddens, and hiddens are independent given the visibles. The conditional probabilities $p(h_i \mid \mathbf{v})$ and $p(v_i \mid \mathbf{h})$ are derived, showing they are logistic (sigmoid) functions of linear combinations of the connected variables. This is crucial for RBM inference and learning (e.g., Gibbs sampling, contrastive divergence); a sketch of these conditionals follows this list.
- HMMs and Change of Measure (UGM view): The UGM representation of an HMM, obtained by moralizing the DAG, is examined. Conditioning on observed variables simplifies the graph structure. The concept of "change of measure" is introduced, where multiplying a base distribution by a weight function (like the likelihood or observational potentials) results in a new distribution. This is a foundational concept for sequential Monte Carlo methods.
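The RBM conditionals mentioned above have a simple closed form. For the standard binary RBM with energy $E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^\top \mathbf{v} - \mathbf{b}^\top \mathbf{h} - \mathbf{v}^\top \mathbf{W} \mathbf{h}$, each conditional is a sigmoid of a linear combination of the other layer; the sketch below uses hypothetical weights and layer sizes and shows one step of block Gibbs sampling.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """Vector of p(h_i = 1 | v); hiddens are conditionally independent."""
    return sigmoid(b + W.T @ v)

def p_v_given_h(h, W, a):
    """Vector of p(v_j = 1 | h); visibles are conditionally independent."""
    return sigmoid(a + W @ h)

# One step of block Gibbs sampling on a hypothetical 4-visible,
# 3-hidden RBM (weights drawn at random for illustration).
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 3))
a, b = np.zeros(4), np.zeros(3)
v = rng.integers(0, 2, size=4).astype(float)

h = (rng.random(3) < p_h_given_v(v, W, b)).astype(float)  # all hiddens | v
v = (rng.random(4) < p_v_given_h(h, W, a)).astype(float)  # all visibles | h
```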
Factor Graphs and Message Passing: This section introduces factor graphs as bipartite graphs explicitly showing factors and variables, and the message passing algorithms on tree structures. Exercises cover:
- Conversion to Factor Graphs: Demonstrating how to convert DAGs and UGMs into factor graphs. Each conditional probability in a DAG or potential function in a UGM becomes a factor node connected to its arguments (variable nodes). The factor graph's structure is compared to that of the moral graph; the two can differ in their loops.
- Sum-Product Message Passing: Detailed application of the sum-product algorithm on a factor tree for marginalization. Messages are computed iteratively between variable nodes and factor nodes: variable nodes multiply incoming messages, while factor nodes multiply incoming messages by the factor function and then sum (marginalize) over all variables except the one the message is sent to. Numerical calculation for binary variables is shown step by step, and conditioning is handled by fixing variable values and modifying the connected factors; a worked sketch on a small chain follows this list.
- Max-Sum Message Passing: Application of the max-sum algorithm, which is sum-product in the log-domain with summation replaced by maximization. This algorithm finds the configuration with the maximum probability (MAP state). Backtracking from the root allows recovering the maximizing variable assignments. Detailed steps for numerical calculation on a factor tree are provided, illustrating the log-domain operations and backtracking functions.
- Choice of Elimination Order: Demonstrating variable elimination on loopy factor graphs. The order of elimination significantly impacts computational complexity (size of intermediate factors). Choosing an efficient order, typically by eliminating variables with fewer connections first, is crucial for performance.
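To ground the sum-product mechanics, here is a minimal sketch on a three-variable binary chain with hypothetical factor tables; the message schedule follows the variable-node and factor-node rules described above, and the result is checked by brute-force enumeration.

```python
import numpy as np

# Chain factor graph: f1(x1) - x1 - f2(x1,x2) - x2 - f3(x2,x3) - x3.
# All variables binary; the factor tables are hypothetical.
f1 = np.array([0.6, 0.4])                 # f1(x1)
f2 = np.array([[0.9, 0.1], [0.2, 0.8]])   # f2(x1, x2)
f3 = np.array([[0.7, 0.3], [0.4, 0.6]])   # f3(x2, x3)

# Sum-product messages flowing towards x3.
m_f1_x1 = f1              # leaf factor -> variable
m_x1_f2 = m_f1_x1         # variable node forwards the product of other messages
m_f2_x2 = m_x1_f2 @ f2    # factor node: multiply, then sum over x1
m_x2_f3 = m_f2_x2
m_f3_x3 = m_x2_f3 @ f3    # sum over x2

p_x3 = m_f3_x3 / m_f3_x3.sum()  # normalised marginal at x3

# Sanity check against brute-force enumeration of the joint.
joint = f1[:, None, None] * f2[:, :, None] * f3[None, :, :]
print(p_x3, joint.sum(axis=(0, 1)) / joint.sum())  # identical up to rounding
```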
Inference for Hidden Markov Models: Focuses on standard algorithms for inference in HMMs. Exercises cover:
- Predictive Distributions: Using message passing on the HMM factor graph to derive recursive formulas for computing $p(h_t \mid v_{1:u})$ (predicting a hidden state given past observations) and $p(v_t \mid v_{1:u})$ (predicting a future observation). This relates to the 'predict' step in filtering/prediction algorithms.
- Viterbi Algorithm: Deriving the Viterbi algorithm using max-sum message passing on the HMM chain. This algorithm efficiently finds the single most likely sequence of hidden states given the observations, and is widely used in speech recognition and bioinformatics. The recursive steps compute the maximal "path" probability of reaching each state at each time step; a compact implementation is sketched after this list.
- Forward Filtering Backward Sampling (FFBS): This algorithm samples complete trajectories of hidden states from the posterior $p(h_{1:n} \mid v_{1:n})$. It combines a forward pass (filtering, computing alpha messages $\alpha(h_t) \propto p(h_t \mid v_{1:t})$) with a backward pass in which states $h_{t-1}$ are sampled sequentially from $p(h_{t-1} \mid h_t, v_{1:n})$, derived using the alpha messages. This is essential for generating realistic sequences under posterior uncertainty.
- Kalman Filtering: A specialized inference algorithm for HMMs with linear Gaussian transitions and emissions. The exercises derive the recursive update equations for the mean and covariance of the posterior over the hidden state, $p(h_t \mid v_{1:t})$, which remains Gaussian. Properties of Gaussian distributions under linear operations and products are used to derive the prediction and correction steps involving the Kalman gain. This highlights how analytical solutions are possible for specific model structures and distributions.
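As a companion to the max-sum discussion, here is a compact Viterbi sketch in log space; the toy transition, emission, and initial distributions are hypothetical, not from the book.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden-state path of an HMM via max-sum message passing.

    log_pi[i]   : log p(h_1 = i)
    log_A[i, j] : log p(h_t = j | h_{t-1} = i)
    log_B[i, k] : log p(v_t = k | h_t = i)
    obs         : observed symbol indices v_1, ..., v_n
    """
    n, S = len(obs), len(log_pi)
    delta = np.empty((n, S))            # best log-prob of paths ending in each state
    back = np.zeros((n, S), dtype=int)  # backtracking pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # best final state
    for t in range(n - 1, 0, -1):                # backtrack to recover the path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical 2-state, 2-symbol HMM.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(np.log(pi), np.log(A), np.log(B), [0, 1, 1, 0]))
```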
Model-Based Learning: Explores parameter estimation techniques, primarily maximum likelihood and Bayesian inference. Exercises cover:
- Maximum Likelihood Estimation (MLE): Deriving the MLE for the parameters of a Gaussian distribution (mean and variance) from iid data by maximizing the log-likelihood. The estimates are shown to be the sample mean and variance.
- Bayesian Inference for Gaussian Mean: Deriving the posterior distribution for the mean of a Gaussian with a Gaussian prior and known variance. Demonstrates conjugacy, where the posterior is also Gaussian. Analysis shows how the posterior mean shrinks towards the prior mean and how the posterior variance decreases with more data, concentrating mass around the MLE.
- MLE for Fully Observed Binary DAGs: Shows that for a fully observed DAG with binary variables and conditional probability tables (CPTs), the log-likelihood function decomposes into independent terms, one for each entry in each CPT. This implies that the MLE for each CPT entry $p(x_i = 1 \mid \mathrm{pa}_i = s)$ is simply the conditional sample proportion: the number of times $x_i = 1$ when $\mathrm{pa}_i = s$, divided by the total number of times $\mathrm{pa}_i = s$. The cancer-asbestos-smoking example illustrates potential issues with zero counts leading to undefined or extreme (0 or 1) MLEs.
- Bayesian Inference for Fully Observed Binary DAGs: Extends Bayesian inference for a single Bernoulli parameter (using a Beta prior) to the parameters of the CPTs in a binary DAG (using independent Beta priors for each entry). Shows that the posterior also factorizes, and that each parameter's posterior is a Beta distribution whose hyperparameters are updated by adding the counts from the data (pseudo-counts). The posterior predictive probability $p(x_i = 1 \mid \mathrm{pa}_i, y_o)$ is derived as the posterior mean of the corresponding CPT parameter. This demonstrates how Bayesian inference, particularly with conjugate priors, provides smoothed estimates compared to MLE, addressing issues with zero counts and offering a principled way to incorporate prior knowledge; the cancer example shows how posterior predictive probabilities avoid the extreme values that MLE produces for low data counts. A minimal Beta-Bernoulli update is sketched after this list.
- Factor Analysis (FA) and Independent Component Analysis (ICA): Explores these latent variable models. An exercise on FA shows that correlating latent variables does not increase model expressiveness compared to standard FA with independent latents; the correlation can be absorbed into the factor loading matrix. An exercise on ICA introduces whitening as a preprocessing step to decorrelate data, showing how it simplifies the problem by making the mixing matrix orthonormal, reducing the degrees of freedom that need to be learned.
- Score Matching: Introduces score matching as an alternative parameter estimation method for unnormalised models that does not require computing the partition function. It minimizes the squared difference between the model's score function (gradient of log density) and the data's score function (estimated empirically). For continuous exponential family models, the score matching objective is shown to be a quadratic form, enabling efficient optimization. For the simple Gaussian variance example, score matching is shown to recover the MLE, highlighting its benefit of not needing the partition function calculation, which is often intractable for unnormalised models with discrete variables (like Ising models or RBMs). The limitation of score matching for binary data is noted.
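The Beta-Bernoulli update underlying the Bayesian DAG exercises fits in a few lines. The sketch below uses hypothetical counts to contrast the MLE with the smoothed posterior predictive probability.

```python
# Conjugate Beta-Bernoulli update, the building block applied to each CPT
# entry of a fully observed binary DAG. The counts below are hypothetical.
alpha0, beta0 = 1.0, 1.0      # Beta(1, 1), i.e. a uniform prior on theta
n1, n0 = 0, 7                 # times x=1 and x=0 for one parent configuration

# Posterior is Beta(alpha0 + n1, beta0 + n0); its mean is the posterior
# predictive probability p(x = 1 | data).
alpha_n, beta_n = alpha0 + n1, beta0 + n0
posterior_predictive = alpha_n / (alpha_n + beta_n)

mle = n1 / (n1 + n0)          # extreme (exactly 0) because of the zero count
print(mle, posterior_predictive)  # 0.0 vs ~0.111: the Bayesian estimate is smoothed
```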
Sampling and Monte Carlo Integration: Covers techniques for approximating integrals and generating samples from distributions. Exercises cover:
- Importance Sampling: Demonstrates importance sampling for estimating expectations, particularly for rare events (like tail probabilities of a Gaussian). Highlights the inefficiency of standard Monte Carlo for such problems. Shows how choosing an appropriate importance distribution (e.g., an exponential distribution shifted to the tail) can make the estimation much more efficient. The formula for the importance sampling estimator is derived.
- Monte Carlo Integration Issues: Uses the example of a heavy-tailed Cauchy distribution and a light-tailed Gaussian importance distribution to illustrate when importance sampling estimates can be unreliable due to infinite variance of the importance weights, even if the estimator is unbiased. This emphasizes the need for the importance distribution to have heavier or equally heavy tails as the target distribution.
- Inverse Transform Sampling: Proves the fundamental result that applying the CDF of a continuous random variable to the variable itself yields a uniform distribution on $[0, 1]$. This is the basis for inverse transform sampling: sample from the uniform distribution, then apply the inverse CDF of the target distribution to obtain samples from the target. Applications to exponential and Laplace distributions are shown by deriving their inverse CDFs; see the sketch after this list.
- Rejection Sampling: Describes the rejection sampling algorithm (sample from a proposal, accept with a probability). The acceptance probability depends on the ratio of the target and proposal densities, scaled by the constant $M = \max_x p(x)/q(x)$. The optimal $M$ (minimizing rejections) is derived for sampling a Gaussian using a Laplace proposal. The exponential decay of the acceptance probability with dimension for independent distributions is shown, highlighting the "curse of dimensionality" in naive rejection sampling.
- Gibbs Sampling for RBMs: Explains how to sample from the marginal distribution of the visible units $p(\mathbf{v})$ in an RBM by Gibbs sampling on the joint $p(\mathbf{v}, \mathbf{h})$, leveraging the conditional independencies $p(\mathbf{h} \mid \mathbf{v}) = \prod_i p(h_i \mid \mathbf{v})$ and $p(\mathbf{v} \mid \mathbf{h}) = \prod_i p(v_i \mid \mathbf{h})$. Block Gibbs sampling, which alternately samples all hidden units given the visibles and all visible units given the hiddens, is described as a practical implementation strategy.
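Inverse transform sampling is easy to demonstrate end to end. The sketch below generates samples for the exponential and Laplace cases via their inverse CDFs; the rate and scale parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(100_000)        # U ~ Uniform(0, 1)

# Exponential(rate lam): F(x) = 1 - exp(-lam x), so F^{-1}(u) = -log(1-u)/lam.
lam = 2.0
x_exp = -np.log(1.0 - u) / lam

# Laplace(0, b): F^{-1}(u) = -b * sign(u - 1/2) * log(1 - 2|u - 1/2|).
b = 1.0
x_lap = -b * np.sign(u - 0.5) * np.log(1.0 - 2.0 * np.abs(u - 0.5))

print(x_exp.mean(), 1.0 / lam)   # sample mean approaches 1/lam
print(x_lap.var(), 2.0 * b**2)   # Laplace variance is 2 b^2
```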
Basic Markov Chain Monte Carlo Inference: Introduces fundamental MCMC concepts and the Metropolis-Hastings (MH) algorithm. Exercises cover:
- MCMC Fundamentals: Explains that MCMC generates dependent samples that asymptotically approximate a target distribution, useful for approximating expectations in high dimensions where direct integration is intractable.
- Metropolis-Hastings Algorithm: Introduces the MH algorithm (specifically random-walk MH with Gaussian proposals) as a method to generate samples from a target distribution proportional to a known unnormalised function $p^*(x)$. The acceptance criterion based on the ratio $p^*(x^*)/p^*(x)$ is key; a minimal random-walk MH sketch follows this list.
- MH Implementation and Diagnostics: Guides implementation of MH and demonstrates its behaviour with trace plots. Illustrates the "burn-in" period, where initial samples are not representative and should be discarded. Discusses the impact of the proposal distribution's variance on the mixing of the chain (too small = slow exploration, too large = high rejection).
- Mixing and Convergence: Explains that well-mixed chains explore the state space efficiently, yielding less correlated samples. Introduces the effective sample size (ESS) as a measure of the number of independent samples equivalent to the drawn correlated samples. Derives how sample correlation reduces the ESS, illustrating that high correlation ($\rho \to 1$) drastically reduces the ESS towards 1, making the approximation less reliable.
- Bayesian Poisson Regression: Applies MH to sample from the posterior of parameters in a Bayesian Poisson regression model. Derives the unnormalised posterior as the product of likelihoods and priors. This provides a practical example of using MH for posterior inference in a non-Gaussian model where conjugate priors might not exist or be convenient.
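Tying the chapter together, here is a minimal random-walk Metropolis-Hastings sketch for a one-dimensional unnormalised target; the target density, step size, and burn-in length are hypothetical illustration choices.

```python
import numpy as np

def log_p_star(x):
    """Log of a hypothetical unnormalised target, exp(-(x^2 - 4)^2 / 4),
    which is bimodal with modes near x = -2 and x = 2."""
    return -((x**2 - 4.0) ** 2) / 4.0

def random_walk_mh(n_samples, step_size, x0=0.0, seed=0):
    """Random-walk MH with a Gaussian proposal. The proposal is symmetric,
    so the acceptance probability reduces to min(1, p*(x*)/p*(x))."""
    rng = np.random.default_rng(seed)
    x, samples = x0, np.empty(n_samples)
    for i in range(n_samples):
        x_prop = x + step_size * rng.standard_normal()
        log_alpha = log_p_star(x_prop) - log_p_star(x)
        if rng.random() < np.exp(min(0.0, log_alpha)):
            x = x_prop         # accept the proposal
        samples[i] = x         # on rejection, the current state repeats
    return samples

samples = random_walk_mh(50_000, step_size=1.0)
kept = samples[5_000:]          # discard the burn-in period
print(kept.mean(), kept.std())  # symmetric target, so the mean is near 0
```

A too-small `step_size` makes the chain crawl within one mode (slow mixing), while a too-large one makes most proposals land in low-density regions and get rejected, which is exactly the trade-off the diagnostics exercises explore.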
Overall, the document serves as an excellent resource for solidifying the mathematical and probabilistic underpinnings of core machine learning techniques. The "pen and paper" approach encourages a deeper understanding of algorithm derivations, model properties, and the reasons behind practical implementation choices, which is invaluable for developers and practitioners looking to apply AI research effectively.
Continue Learning
- How does the understanding of linear algebra concepts, such as eigenvalue decomposition and the power method, directly facilitate computational efficiency in large-scale machine learning problems like PCA?
- In what ways do d-separation rules in graphical models provide practical advantages for designing efficient probabilistic inference algorithms?
- How do message passing algorithms differ when applied to factor graphs derived from DAGs versus UGMs, both in terms of complexity and inference capabilities?
- What are the primary considerations and challenges when choosing between maximum likelihood estimation, Bayesian inference, and score matching for training probabilistic models, particularly with incomplete or unnormalised data?