
How GPT learns layer by layer (2501.07108v1)

Published 13 Jan 2025 in cs.AI

Abstract: LLMs excel at tasks like language processing, strategy games, and reasoning but struggle to build generalizable internal representations essential for adaptive decision-making in agents. For agents to effectively navigate complex environments, they must construct reliable world models. While LLMs perform well on specific benchmarks, they often fail to generalize, leading to brittle representations that limit their real-world effectiveness. Understanding how LLMs build internal world models is key to developing agents capable of consistent, adaptive behavior across tasks. We analyze OthelloGPT, a GPT-based model trained on Othello gameplay, as a controlled testbed for studying representation learning. Despite being trained solely on next-token prediction with random valid moves, OthelloGPT shows meaningful layer-wise progression in understanding board state and gameplay. Early layers capture static attributes like board edges, while deeper layers reflect dynamic tile changes. To interpret these representations, we compare Sparse Autoencoders (SAEs) with linear probes, finding that SAEs offer more robust, disentangled insights into compositional features, whereas linear probes mainly detect features useful for classification. We use SAEs to decode features related to tile color and tile stability, a previously unexamined feature that reflects complex gameplay concepts like board control and long-term planning. We study the progression of linear probe accuracy and tile color using both SAEs and linear probes to compare their effectiveness at capturing what the model is learning. Although we begin with a smaller LLM, OthelloGPT, this study establishes a framework for understanding the internal representations learned by GPT models, transformers, and LLMs more broadly. Our code is publicly available: https://github.com/ALT-JS/OthelloSAE.

Summary

  • The paper demonstrates that layer-wise analysis using sparse autoencoders reveals disentangled compositional features for board structure and game strategy.
  • It compares linear probing with SAE methods, showing that deeper layers improve linear separability while SAEs uncover nuanced, interpretable dynamics.
  • The study identifies intermediate layers as key for encoding dynamic attributes like tile stability, validated by AUROC scores exceeding 0.7.

The paper provides a comprehensive mechanistic investigation of how a GPT-based model, referred to as OthelloGPT, develops its internal representations on a layer-by-layer basis. It employs a controlled experimental environment in the domain of Othello gameplay to study the emergence of world models in transformer models and LLMs. The paper uses two complementary interpretability techniques, Sparse Autoencoders (SAEs) and linear probes, to disentangle and quantify the features learned at various depths of a decoder-only transformer.

The investigation is broadly structured around the following technical contributions:

  • Layer-wise Feature Disentanglement: The work systematically extracts the hidden representations (residual-stream embeddings) from each of the eight transformer layers of OthelloGPT. By applying SAEs trained to minimize $L(x, \hat{x}) = \lVert x - \hat{x} \rVert^2 + \lambda \lVert h \rVert_1$, where $x$ is the original input, $\hat{x}$ is the reconstructed input, $h$ is the latent representation, and $\lambda$ is the sparsity regularization parameter, the authors obtain a sparse, interpretable basis for features such as board structure and gameplay dynamics. These SAE-derived features permit a more granular view of the compositional attributes encoded across layers than linear probes do (a minimal sketch of this objective appears after this list).
  • Comparison with Linear Probing: Linear probes are employed to evaluate the predictive power of the learned representations by fitting classifiers (i.e., $g_{\phi}(\mathbf{h}) = \mathbf{W}^{\top}\mathbf{h}$) on the frozen hidden states extracted from each layer (see the probe-fitting sketch after this list). While probe accuracy is observed to increase with depth, indicating improved linear separability of board-state information, the method tends to capture only strong correlates of the classification task rather than revealing the underlying compositional structure. The paper contrasts these findings with the results from the SAE analysis, emphasizing that SAEs consistently highlight more robust and disentangled features.
  • Tile Color Analysis: The paper details two complementary methods to decode the learning of tile color representations. First, a linear probe coupled with cosine similarity analysis is used to assess the contribution of individual MLP neurons across layers for specific board positions. Neurons exceeding a cosine similarity threshold of 0.2, computed via $\mathrm{similarity}(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$, where $\mathbf{a}$ and $\mathbf{b}$ are the parameter vectors, show dispersed activations across tiles. In contrast, SAE extraction yields activation maps that distinctly highlight positional patterns, particularly strong activations at board edges and corners in the early layers and more concentrated dynamics in central regions at intermediate layers. These observations are aggregated over 10 random initialization seeds, with AUROC values exceeding 0.7 for the top 50 features, thereby validating the robustness and consistency of the learned tile color representations.
  • Tile Stability Features: A novel aspect of the work is the analysis of tile stability, a property indicating whether a tile can still be flipped after it has been placed. The paper computes binary stability maps over a large number of board states (104,000 states computed from 2,000 games) and correlates these maps with feature activations across layers (see the stability-scoring sketch after this list). The results indicate that intermediate layers (Layers 2 through 4) are particularly adept at encoding stability. Detailed tables identify specific features (e.g., features 349 and 108) with AUROC scores as high as 0.8814 and 0.8731 in certain layers. This trend suggests that while early and later layers may focus on static attributes or other aspects of the game, the middle stages of the network specialize in capturing the dynamic nuances that underlie long-term strategic features.
  • Architectural and Training Details: OthelloGPT is an 8-layer, decoder-only transformer with 8 attention heads and a residual-stream dimension of 512. The model is trained on next-token prediction using a synthetic dataset of sequential Othello gameplay transcripts. More formally, given a token sequence $\mathbf{x} = (x_1, x_2, \dots, x_T)$, the training objective is to minimize the autoregressive loss $\mathcal{L}_{\text{token}} = -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})$, where $p(x_t \mid x_{<t}) = \mathrm{softmax}(\mathbf{h}_t \mathbf{W}_{\text{output}})$ and $\mathbf{h}_t$ is the hidden state at timestep $t$ (see the loss sketch after this list). This formulation underscores that the world model is constructed implicitly, without explicit task-specific supervision for strategic planning or board analysis.
  • Implications for Model Interpretability and Agent Behavior The paper discusses how the interpretability insights derived from analysis of OthelloGPT may translate to a broader understanding of agents based on LLMs. In particular, it highlights the importance of coherent internal representations for adaptive decision-making. While linear probes provide evidence of improved classification from deep representations, the SAE approach reveals that disentangled and compositional features—key for reliable long-term planning and stable strategic behavior—are concentrated in intermediate layers.
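
To make the SAE objective from the Layer-wise Feature Disentanglement bullet concrete, the following is a minimal PyTorch sketch of a sparse autoencoder trained on residual-stream activations. The dictionary size, ReLU encoder, and value of λ are illustrative assumptions rather than details taken from the paper; only the reconstruction-plus-L1 loss mirrors the stated objective.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs residual-stream activations through a sparse latent code h."""
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        h = torch.relu(self.encoder(x))  # sparse latent representation h
        x_hat = self.decoder(h)          # reconstruction x_hat of the input
        return x_hat, h

def sae_loss(x, x_hat, h, lam: float = 1e-3):
    # L(x, x_hat) = ||x - x_hat||^2 + lambda * ||h||_1
    reconstruction = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = h.abs().sum(dim=-1).mean()
    return reconstruction + lam * sparsity

# Usage on a batch of layer-l activations of shape [batch, 512]:
# sae = SparseAutoencoder()
# x_hat, h = sae(activations)
# loss = sae_loss(activations, x_hat, h)
```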
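
The linear-probe comparison in the Comparison with Linear Probing bullet can be sketched as follows. This assumes per-layer activations and board-square labels have already been collected as NumPy arrays; scikit-learn's LogisticRegression is a stand-in for whatever linear classifier the authors actually fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def fit_layer_probe(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe g_phi(h) = W^T h on frozen hidden states from one layer.

    hidden_states: [n_examples, 512] residual-stream vectors (frozen, no gradients).
    labels:        [n_examples] class per board square, e.g. empty / black / white.
    """
    split = int(0.8 * len(labels))
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[:split], labels[:split])
    return accuracy_score(labels[split:], probe.predict(hidden_states[split:]))

# Probe accuracy is expected to rise with depth as board-state information
# becomes more linearly separable:
# for layer, acts in enumerate(per_layer_activations):
#     print(layer, fit_layer_probe(acts, tile_labels))
```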
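
The tile-stability analysis in the Tile Stability Features bullet reduces to scoring each SAE feature against binary stability labels. The sketch below is an assumed implementation of that AUROC ranking; the array shapes and the top-k cutoff of 50 follow the numbers quoted in the summary, and everything else is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rank_stability_features(feature_acts: np.ndarray,
                            stability: np.ndarray,
                            top_k: int = 50):
    """Rank SAE features by how well their activations predict tile stability.

    feature_acts: [n_states, n_features] SAE activations for one layer.
    stability:    [n_states] binary label, 1 if the tile can no longer be flipped.
    """
    scores = []
    for j in range(feature_acts.shape[1]):
        col = feature_acts[:, j]
        if col.std() == 0:  # skip dead (never-activating) features
            continue
        scores.append((roc_auc_score(stability, col), j))
    return sorted(scores, reverse=True)[:top_k]  # (AUROC, feature index) pairs
```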
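
Finally, the autoregressive objective in the Architectural and Training Details bullet corresponds to a standard shifted cross-entropy loss. A minimal sketch, assuming logits of shape [batch, T, vocab] over Othello move tokens:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Average autoregressive loss -(1/T) * sum_t log p(x_t | x_{<t}).

    logits: [batch, T, vocab] model outputs; logits[:, t] predicts tokens[:, t + 1].
    tokens: [batch, T] integer move indices from an Othello game transcript.
    """
    pred = logits[:, :-1].reshape(-1, logits.size(-1))  # position t predicts t + 1
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```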

In summary, the paper establishes a robust framework for layer-wise analysis of internal representations in transformer-based models. By comparing and contrasting SAEs with linear probing techniques, it provides a detailed account of how features related to board structure, tile color, and tile stability emerge as a function of network depth. The methodological and experimental findings suggest that while deep transformer models naturally improve the linear separability of game-state features with depth, a fundamental understanding of feature composition and dynamic state encoding requires interpretability tools, such as SAEs, that go beyond task-specific classification metrics.
