Chinchilla 70B: Optimal Scaling & Interpretability

Updated 4 June 2026

Chinchilla 70B is a 70B-parameter language model designed to balance model size against the number of training tokens according to compute-optimal scaling laws.
It uses a transformer architecture with 80 layers and 64 attention heads per layer, trained at roughly 20 tokens per parameter.
Its interpretability has been studied with circuit analysis techniques such as logit attribution and activation patching, which reveal that key features occupy low-rank subspaces.

The Chinchilla 70B model is a ~70 billion parameter LLM designed to maximize predictive performance within a fixed compute budget by allocating that budget between parameter count and training data according to empirically derived scaling laws. This regime was first established by Hoffmann et al. and subsequently replicated and re-analyzed by Besiroglu et al., who clarified the compute-optimal configuration; Lieberum et al. examined how well circuit-level interpretability methods work on a model of this scale. Chinchilla 70B therefore serves as a canonical example in both scaling-law research and transformer interpretability. It has an 80-layer architecture with 64 attention heads per layer and was trained at a tokens-per-parameter ratio that departs sharply from practice before the Chinchilla scaling results (Besiroglu et al., 2024; Lieberum et al., 2023).

1. Compute-Optimal Scaling Law and Functional Formulation

The Chinchilla 70B regime is defined by explicitly optimizing the trade-off between model parameters ($N$) and dataset size ($D$, in tokens) while holding training compute ($C \approx 6ND$ FLOP) constant. The final cross-entropy loss is empirically modeled as:

$L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$

where $E$ denotes the irreducible loss floor, and $A$, $B$, $\alpha$, and $\beta$ are fitted constants. For numerical stability, Hoffmann et al. fit the law in log space using a log-sum-exp (LSE) parameterization:

$\mathrm{LSE}(x, y, e) = \log(\exp(x) + \exp(y) + \exp(e))$

applied as:

$\log \hat{L}(N, D) = \mathrm{LSE}(a - \alpha \log N,\; b - \beta \log D,\; e)$

with $A = \exp(a)$, $B = \exp(b)$, and $E = \exp(e)$, so that exponentiating recovers the parametric form above. Parameter fitting minimizes a Huber loss on the log-loss residuals using L-BFGS-B, initialized from a grid of starting points over the fitted parameters. This parametric law underpins identification of the compute-optimal regime (Besiroglu et al., 2024).
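As a rough illustration of the fitting objective described above, the sketch below evaluates the parametric loss in log space with a three-term log-sum-exp and scores it with a Huber loss. It uses only the Python standard library. The function names, the synthetic data points, and the Huber threshold `delta=1e-3` are assumptions made for illustration, and the L-BFGS-B optimizer itself is not included; the constants are the fitted values quoted in Section 2.

```python
import math

def log_loss_lse(N, D, a, b, e, alpha, beta):
    """log L(N, D) = log(exp(a - alpha*log N) + exp(b - beta*log D) + exp(e))."""
    terms = [a - alpha * math.log(N), b - beta * math.log(D), e]
    m = max(terms)  # subtract the max before exponentiating, for stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def huber(r, delta=1e-3):
    """Huber loss: quadratic for |r| <= delta, linear beyond (delta is assumed)."""
    r = abs(r)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def objective(params, runs, delta=1e-3):
    """Sum of Huber losses between predicted and observed log-loss."""
    a, b, e, alpha, beta = params
    return sum(
        huber(log_loss_lse(N, D, a, b, e, alpha, beta) - math.log(L), delta)
        for N, D, L in runs
    )

# Fitted values quoted in Section 2, mapped to log space: a = log A, etc.
A, B, E, ALPHA, BETA = 482.01, 2085.43, 1.8172, 0.3478, 0.3658
params = (math.log(A), math.log(B), math.log(E), ALPHA, BETA)

def direct_loss(N, D):
    """The parametric law evaluated directly: E + A*N^-alpha + B*D^-beta."""
    return E + A * N ** -ALPHA + B * D ** -BETA

# Synthetic (N, D, L) triples generated from the law itself (illustrative only).
runs = [(N, D, direct_loss(N, D))
        for N in (1e8, 1e9, 1e10)
        for D in (1e10, 1e11, 1e12)]

print(objective(params, runs))  # essentially zero at the generating parameters
print(math.exp(log_loss_lse(70e9, 1.4e12, *params)))  # predicted loss at 70B / 1.4T
```

In an actual fit, `objective` would be handed to an L-BFGS-B routine and restarted from each point of the initialization grid, keeping the best result.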

2. Empirical Fit, Confidence Intervals, and Replication

Using data digitized from Hoffmann et al.'s contour plots, Besiroglu et al. reconstruct a dataset of $(N, D, L)$ triples and refit the parametric loss law, reporting the following best-fit parameters with bootstrapped standard errors (SE):

Parameter	Estimate	SE
$A$	482.01	124.58
$B$	2085.43	1293.23
$E$	1.8172	0.030
$\alpha$	0.3478	0.020
$\beta$	0.3658	0.020

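The implied compute-allocation exponents are $a = \beta/(\alpha+\beta) \approx 0.51$ (SE = 0.02) and $b = \alpha/(\alpha+\beta) \approx 0.49$ (SE = 0.02), where $N_{\mathrm{opt}} \propto C^{a}$ and $D_{\mathrm{opt}} \propto C^{b}$. Applying these to Chinchilla's training budget yields $N_{\mathrm{opt}} \approx 7 \times 10^{10}$ (∼70B parameters) and $D_{\mathrm{opt}} \approx 1.4 \times 10^{12}$ (∼1.4T tokens), or $\approx$20 tokens per parameter, in line with the actual Chinchilla 70B regime (Besiroglu et al., 2024).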
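The allocation above follows from minimizing the parametric loss under a fixed budget. A minimal sketch of that closed form is given below, assuming the standard approximation $C \approx 6ND$ and a budget of 5.76e23 FLOP (the figure commonly quoted for Chinchilla); the function name `compute_optimal` and its `flops_per_param_token` argument are hypothetical.

```python
def compute_optimal(C, A, B, alpha, beta, flops_per_param_token=6.0):
    """Minimize A*N^-alpha + B*D^-beta subject to C = 6*N*D (assumed cost model).

    Setting the derivative with respect to N to zero gives
        N_opt = G * (C/6)^a,   D_opt = (C/6)^b / G,
    with a = beta/(alpha+beta), b = alpha/(alpha+beta),
    and  G = (alpha*A / (beta*B))^(1/(alpha+beta)).
    """
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    budget = C / flops_per_param_token
    n_opt = G * budget ** a
    d_opt = budget ** b / G
    return n_opt, d_opt, a, b

# Fitted constants from the table above; the budget is an assumed input.
n_opt, d_opt, a, b = compute_optimal(
    5.76e23, A=482.01, B=2085.43, alpha=0.3478, beta=0.3658
)
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"N_opt ~ {n_opt:.3g} params, D_opt ~ {d_opt:.3g} tokens")
print(f"tokens per parameter ~ {d_opt / n_opt:.1f}")
```

With the rounded constants from the table, the result lands close to the ~70B parameters, ~1.4T tokens, and ~20 tokens per parameter quoted above; the exact output depends on the rounding of the fitted values and on the assumed budget.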

The parameters originally reported by Hoffmann et al. ($A = 406.4$, $B = 410.7$, $E = 1.69$, $\alpha = 0.34$, $\beta = 0.28$) produce systematically biased residuals and imply an implausible ~70 tokens per parameter. Besiroglu et al. further find that the original confidence intervals are too narrow to be compatible with the number of experiments: intervals that tight would require on the order of 600,000 runs, whereas fewer than 500 were likely performed. Their corrected fit with bootstrapped uncertainty resolves both issues (Besiroglu et al., 2024).

3. Model Architecture and Training Regimen

The Chinchilla 70B model uses 80 transformer layers, each containing 64 attention heads and an MLP block, for a total of approximately 70 billion parameters. Training follows a compute allocation at the empirically optimal ratio of roughly 20 tokens per parameter, in sharp contrast to earlier practice, which trained large models on substantially fewer tokens per parameter (Lieberum et al., 2023).

Chinchilla 70B was trained with a total compute budget of $5.76 \times 10^{23}$ FLOP, consistent with the regime derived from the corrected scaling law. The training corpus contains approximately 1.4 trillion tokens, in keeping with the derived optimal allocation (Besiroglu et al., 2024).
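The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes the common approximation of 6 FLOP per parameter per training token and compares the result with the commonly quoted Chinchilla budget of 5.76e23 FLOP; both are assumptions rather than values derived here.

```python
N = 70e9      # parameters (approximate)
D = 1.4e12    # training tokens (approximate)

tokens_per_param = D / N
approx_flop = 6 * N * D          # assumed cost model: C ~ 6 * N * D
quoted_flop = 5.76e23            # commonly quoted budget (assumption)
rel_gap = (approx_flop - quoted_flop) / quoted_flop

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"6*N*D estimate:       {approx_flop:.3g} FLOP")
print(f"gap vs quoted budget: {rel_gap:.1%}")
```

The small gap between the two numbers is expected, since 70B and 1.4T are themselves rounded.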

4. Interpretability and Circuit Analysis at Scale

A key investigation of the Chinchilla 70B model involves circuit analysis interpretability, assessing the scalability of attribution and intervention techniques. The primary methodologies are:

Logit Attribution: Decomposition of the logits into direct node-wise contributions via the unembedding matrix, enabling attribution of logit changes to individual attention heads and MLPs.
Attention Pattern Visualization: Clustering attention heads by their value-weighted attention distributions to elucidate functional groupings.
Activation Patching (Causal Interventions): Manipulation of intermediate activations to quantify total causal effects on final logits.
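Two of the methods listed above, direct logit attribution and activation patching, can be illustrated on a toy model. The sketch below is a hypothetical two-node network with a 2-dimensional residual stream, not Chinchilla itself; the node names, weights, and inputs are invented for illustration, and the final layer normalization that a real transformer applies before unembedding is ignored.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical unembedding column for the correct-label token.
W_U_CORRECT = [1.0, 0.5]

def run(x, patch=None):
    """Toy model: a 'head' writes to the residual stream and an 'mlp'
    reads the head's output. `patch` optionally overrides a node's activation."""
    patch = patch or {}
    acts = {"embed": list(x)}
    acts["head"] = patch.get("head", [2.0 * x[0], 0.0])
    acts["mlp"] = patch.get("mlp", [0.0, max(0.0, acts["head"][0] + x[1])])
    # The residual stream is the sum of every node's output.
    resid = [sum(parts) for parts in zip(*acts.values())]
    return dot(resid, W_U_CORRECT), acts

clean, corrupted = [1.0, 1.0], [0.0, 1.0]

# Logit attribution: project each node's output onto the unembedding column.
clean_logit, clean_acts = run(clean)
direct = {name: dot(act, W_U_CORRECT) for name, act in clean_acts.items()}

# Activation patching: splice the head's corrupted-run activation into the
# clean run and measure how far the correct-label logit drops.
_, corrupted_acts = run(corrupted)
patched_logit, _ = run(clean, patch={"head": corrupted_acts["head"]})
total_effect_head = clean_logit - patched_logit

print("direct effects:", direct)
print("total effect of head:", total_effect_head)
```

In this toy, the head's total effect exceeds its direct effect because the MLP reads from the head, which is the kind of indirect contribution that patching captures and direct attribution does not.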

All three methods scale to 80-layer, 70B-parameter models when backed by large-batch matrix multiplications and distributed caching of residual-stream activations. Applied to multiple-choice tasks, 45 nodes (32 heads, 13 MLPs) explain 80% of the positive direct effect on correct-label logits. These nodes fall into the following functional categories:
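One way to arrive at a node set of this kind is to rank nodes by direct effect and keep the largest ones until a target share of the positive total is covered. The sketch below shows that selection rule under stated assumptions: the function name, node names, and effect values are hypothetical, and the source does not specify the exact selection procedure.

```python
def nodes_covering(effects, fraction=0.8):
    """Return the smallest set of top-ranked nodes whose summed positive
    direct effect reaches `fraction` of the total positive direct effect."""
    positive = {name: v for name, v in effects.items() if v > 0}
    total = sum(positive.values())
    if total == 0:
        return [], 0.0
    chosen, running = [], 0.0
    ranked = sorted(positive.items(), key=lambda kv: kv[1], reverse=True)
    for name, value in ranked:
        if running >= fraction * total:
            break
        chosen.append(name)
        running += value
    return chosen, running / total

# Hypothetical direct effects on the correct-label logit.
effects = {
    "head_a": 5.0, "head_b": 3.0, "mlp_a": 1.0,
    "head_c": 0.5, "mlp_b": 0.5, "head_d": -2.0,
}
chosen, coverage = nodes_covering(effects, fraction=0.8)
print(chosen, coverage)
```

Nodes with negative direct effect are excluded from both the ranking and the total, matching the "positive direct effect" framing in the text.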

Category	Description
Correct-letter heads	Attend from the final token strictly to the correct label (A/B/C/D), boosting the corresponding logit.
Uniform heads	Attend uniformly to all label tokens, hypothesized to encode "is multiple-choice".
Single-letter heads	Attend persistently to a specific letter (e.g., always A), functioning in backup or superposition.
Amplification heads	Mediate and propagate content from earlier output heads to final token positions.

Additionally, "content gatherers" attend to answer content and interface with correct-letter heads, conferring indirect effects (Lieberum et al., 2023).

5. Subspace Compression and Feature Semantics

Circuit analysis demonstrates that correct-letter heads operate predominantly within a low-dimensional subspace in their query (Q), key (K), and value (V) projections. Empirically, singular value decomposition (SVD) reveals that three principal components capture 65–90% of the relevant variance for these projections. Substituting full-rank Q, K, V with their rank-3 approximations at label and final positions preserves logit attribution and patching performance: loss under activation patching remains virtually unchanged (average NLL ≈ 1.22–1.23 bits; accuracy ≈ 64.6–64.8%), indicating that the "copy the correct label" operation is inherently low rank (Lieberum et al., 2023).
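The rank-3 substitution described above can be sketched with a truncated SVD. The example below uses NumPy on a synthetic matrix standing in for per-prompt query vectors of a single head; the matrix shape, noise level, and variable names are assumptions, so the explained-variance figure it prints will not match the 65–90% range reported for the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, d_head, rank = 200, 128, 3

# Synthetic activations: mostly rank-3 structure plus a little noise.
signal = rng.normal(size=(n_prompts, rank)) @ rng.normal(size=(rank, d_head))
acts = signal + 0.05 * rng.normal(size=(n_prompts, d_head))

# Singular values come back sorted in descending order.
U, S, Vt = np.linalg.svd(acts, full_matrices=False)

# Share of squared singular-value mass captured by the top-3 components.
explained = (S[:rank] ** 2).sum() / (S ** 2).sum()

# Rank-3 approximation: keep only the top-3 singular triplets.
low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank]
rel_err = np.linalg.norm(acts - low_rank) / np.linalg.norm(acts)

print(f"explained variance (top {rank}): {explained:.3f}")
print(f"relative reconstruction error:  {rel_err:.3f}")
```

In the experiment the text describes, `low_rank` would replace the full-rank Q, K, or V activations at the label and final positions, and the downstream loss and accuracy would then be compared with the unmodified run.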

Analysis in this compressed subspace uncovers an emergent "Nth item in an enumeration" feature: the key deltas for A, B, C, and D form a tetrahedral configuration, and the query delta of the final token clusters with the corner of the correct answer. This points to a composite representation that combines an enumeration feature with token-identity features for frequently seen labels. Generalization is only partial, however: replacing the labels with other alphabetically consecutive letters (e.g., MNOP) yields partial recovery, while numeric labels or fully randomized letters substantially degrade both attribution and model performance, indicating specialization (Lieberum et al., 2023).
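The tetrahedral picture can be made concrete with an idealized example. The sketch below places four hypothetical key deltas at the vertices of a regular tetrahedron in three dimensions and shows that a query aligned with one vertex puts most of its softmax attention on that label. The coordinates are invented for illustration and are not taken from the model.

```python
import math
from itertools import combinations

# Hypothetical key deltas at the vertices of a regular tetrahedron.
KEYS = {
    "A": (1.0, 1.0, 1.0),
    "B": (1.0, -1.0, -1.0),
    "C": (-1.0, 1.0, -1.0),
    "D": (-1.0, -1.0, 1.0),
}

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

# Every pair of vertices of a regular tetrahedron has cosine -1/3.
pairwise = {(p, q): cosine(KEYS[p], KEYS[q]) for p, q in combinations(KEYS, 2)}

def attention(query):
    """Softmax over query-key dot products (no scaling, for simplicity)."""
    scores = {label: dot(query, key) for label, key in KEYS.items()}
    m = max(scores.values())
    weights = {label: math.exp(s - m) for label, s in scores.items()}
    z = sum(weights.values())
    return {label: w / z for label, w in weights.items()}

# A query delta aligned with the corner of the correct answer, here "C".
attn = attention(KEYS["C"])
predicted = max(attn, key=attn.get)
print(pairwise)
print(attn, predicted)
```

Because the four corners are maximally separated, a query near one corner scores low against the other three, which is one reason such a configuration suits a "pick the Nth item" lookup.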

6. Methodological Limitations and Open Problems

While logit attribution, attention-pattern clustering, and activation patching scale to models of Chinchilla 70B’s size, semantic analysis of features and full circuit disentanglement remain open challenges. The identified low-rank features encode both positional (list-index) and token-identity signals, but these representations generalize only partially beyond the label scheme studied (e.g., performance with numeric label tokens collapses to chance). Many MLPs and intermediate connections involved in content copying and logit formation defy simple mechanistic interpretation. This suggests that, although circuit analysis primitives transfer reliably to large-scale models, comprehensive and generalizable accounts of internal mechanisms have not yet been extracted. Automated subspace search and parameterization techniques, as well as further mapping of intermediate circuits, are proposed directions for future work (Lieberum et al., 2023).

References

Besiroglu et al. (2024). Chinchilla Scaling: A replication attempt.

Lieberum et al. (2023). Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.
