
Depth-Aware Capacity Allocation

Updated 25 June 2026
  • The paper shows that reallocating MLP widths to early layers improves language modeling perplexity and downstream performance.
  • Depth-aware capacity allocation defines a systematic, monotonic reduction in layer dimensions while strictly preserving the overall compute and parameter budget.
  • Empirical studies across architectures and scales demonstrate up to 1.0 point gains on commonsense tasks and stable long-context retrieval performance.

The Tapered LLM (TLM) principle introduces a depth-aware architectural paradigm for LLMs in which parameter capacity is monotonically reduced across network depth. Unlike the conventional practice of assigning a uniform parameter count to each layer, the TLM approach requires that a selected per-layer dimension (most naturally the hidden width of each feed-forward, or MLP, block) decrease from early to late layers, while the global parameter and compute budget is strictly maintained. This strategy is backed by empirical evidence that early layers contribute more novel information to the residual stream, and that non-uniform allocation of capacity improves language modeling perplexity and downstream benchmark performance without additional resources or changes to the training regime (Bayat et al., 22 Jun 2026).

1. Architectural Principle and Formalism

The TLM principle asserts two formal requirements for any chosen per-layer dimension $d_C(l)$ (commonly, $C = \text{MLP width}$): (i) Monotonicity: $d_C(l+1) \leq d_C(l)$ for all layer indices $l$, and (ii) Budget preservation: $\frac{1}{L}\sum_{l=0}^{L-1} d_C(l) = d_C^{\text{baseline}}$, so the average width across all $L$ layers matches that of a uniform-width baseline.
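The two requirements can be checked mechanically for any candidate list of per-layer widths. The following is a minimal sketch; the function names and the tolerance argument are illustrative assumptions, not part of the cited work.

```python
from typing import Sequence


def is_monotone_nonincreasing(widths: Sequence[float]) -> bool:
    """Requirement (i): d_C(l+1) <= d_C(l) for every adjacent pair of layers."""
    return all(later <= earlier for earlier, later in zip(widths, widths[1:]))


def preserves_budget(widths: Sequence[float], baseline: float, tol: float = 0.0) -> bool:
    """Requirement (ii): the mean width equals the uniform baseline (within tol)."""
    mean_width = sum(widths) / len(widths)
    return abs(mean_width - baseline) <= tol


def satisfies_tlm(widths: Sequence[float], baseline: float, tol: float = 0.0) -> bool:
    """Both requirements together."""
    return is_monotone_nonincreasing(widths) and preserves_budget(widths, baseline, tol)
```

A nonzero `tol` is useful once widths are rounded to hardware-friendly multiples, since rounding can shift the mean slightly.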

When applied to hidden MLP width, each layer ll computes

$$z_l = h_l + \mathcal{M}_l(h_l), \quad h_{l+1} = z_l + \mathcal{F}_l(z_l)$$

where $h_l \in \mathbb{R}^{N \times d}$ is the residual stream, $\mathcal{M}_l$ is a token-mixing module (e.g., self-attention), and $\mathcal{F}_l$ is a feed-forward network parameterized by its hidden width $d_C(l)$. The conventional "uniform" approach sets $d_C(l) = d_C^{\text{baseline}}$ for all $l$. In contrast, TLM schedules $d_C(l)$ so that it decreases with depth while its average equals $d_C^{\text{baseline}}$, keeping the total parameter and compute count invariant.
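The reason an equal average width implies an equal parameter budget is that MLP parameter count is linear in the hidden width. The sketch below assumes a bias-free MLP made of `n_matrices` weight matrices of shape `d_model × width` (two for a plain up/down projection, three for a gated variant); the concrete widths are hypothetical and chosen only so that the sums match.

```python
def mlp_params(d_model: int, width: int, n_matrices: int = 2) -> int:
    """Parameters of one bias-free MLP block: each matrix is d_model x width."""
    return n_matrices * d_model * width


def total_mlp_params(d_model: int, widths: list, n_matrices: int = 2) -> int:
    """Total MLP parameters over all layers; linear in sum(widths)."""
    return sum(mlp_params(d_model, w, n_matrices) for w in widths)


# Hypothetical 4-layer example: both width lists average 4096.
uniform = [4096, 4096, 4096, 4096]
tapered = [6144, 4864, 3328, 2048]

uniform_total = total_mlp_params(1024, uniform)
tapered_total = total_mlp_params(1024, tapered)
```

Because the per-token FLOPs of these matrix multiplies are likewise proportional to `d_model * width`, the same argument covers the compute budget under this assumption.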

Three smooth, budget-preserving decay schedules were explored:

  • Linear decay
  • Cosine decay
  • Sigmoid transition, controlled by a steepness parameter

The start and end widths of each schedule are chosen to satisfy the global budget constraint. Intermediate widths are rounded to the nearest multiple of 16 for matrix-multiply efficiency.
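To make the three schedules concrete, here is a hedged sketch. The exact functional forms are not given in this text, so the formulas below are assumed parameterizations: each interpolates a width multiplier from `start` to `end` over normalized depth and is symmetric about mid-depth, so the mean multiplier is `(start + end) / 2` before rounding. All names (`taper_multiplier`, `tapered_widths`, `steepness`) are illustrative.

```python
import math


def taper_multiplier(t, start, end, kind="cosine", steepness=10.0):
    """Width multiplier at normalized depth t in [0, 1] (assumed forms)."""
    if kind == "linear":
        return start + (end - start) * t
    if kind == "cosine":
        return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * t))
    if kind == "sigmoid":
        # Smooth step centered at mid-depth; endpoints only approach start/end.
        return end + (start - end) / (1.0 + math.exp(steepness * (t - 0.5)))
    raise ValueError(f"unknown schedule: {kind}")


def tapered_widths(baseline, n_layers, start=1.5, end=0.5,
                   kind="cosine", steepness=10.0, multiple=16):
    """Per-layer MLP widths, rounded to the nearest multiple of `multiple`."""
    if n_layers < 2:
        raise ValueError("need at least two layers to taper")
    widths = []
    for layer in range(n_layers):
        t = layer / (n_layers - 1)
        raw = baseline * taper_multiplier(t, start, end, kind, steepness)
        widths.append(multiple * round(raw / multiple))
    return widths


widths = tapered_widths(baseline=4096, n_layers=24, start=1.5, end=0.5)
```

Rounding can move the mean width away from the baseline by a few units; this sketch does not correct for that, so a final adjustment step may be needed if the budget must match exactly.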

2. Rationale for Depth-Aware Capacity Allocation

Uniform parameter allocation across depth, inherited from the original Transformer architecture, is empirically misaligned with the functional contributions of each layer. When a fixed total budget is partitioned unequally, allocating greater MLP width to early layers and less to later layers actually improves perplexity, while the reverse allocation degrades performance. For instance, in a 440M-parameter Transformer divided into thirds, front-loading capacity (early layers wider) yields lower validation perplexity (15.96) than uniform (16.28), while back-loading causes substantial degradation (17.29) (Bayat et al., 22 Jun 2026).

This pattern indicates that the first layers are information-generating—making orthogonal, novel contributions to the residual stream—while deeper layers are predominantly refining or reinforcing existing content. The gain in efficiency arises from reallocating "wasted" capacity in late layers to areas where it has higher marginal utility.

3. Experimental Evidence across Architectures and Scales

Extensive experimentation validated the TLM principle over multiple configurations.

  • Architectures: Transformer, Gated Attention (softmax plus gating), Hope-attention (self-modifying nested memory), Titans (attention with neural long-term memory)
  • Scales: 440M, 760M, and 1.3B parameters
  • Benchmarks: In-distribution validation (pretrain split), out-of-distribution perplexity (WikiText-103, LAMBADA), commonsense reasoning tasks (LAMBADA, PIQA, HellaSwag, WinoGrande, ARC-easy/challenge, SocialIQA, BoolQ), long-context retrieval (Needle-in-a-Haystack at 4K/8K/16K)

Empirical findings include:

  • Cosine decay outperformed linear and sigmoid schedules at every tested width ratio.
  • The optimal taper for cosine decay (start/end=1.5/0.5) reduced 440M Transformer in-distribution perplexity from 16.28 (uniform) to 14.44.
  • At 760M and 1.3B scales, applying cosine taper yielded lower out-of-distribution perplexity in 15/16 architecture-scale combinations and higher average accuracy (+0.3–1.0 points) on eight commonsense reasoning tasks.
  • Long-context retrieval performance showed no degradation, with slight improvements on more difficult tasks.

A summary table of key results for 760M parameter "Transformer++":

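| Model Type         | WikiText Perplexity | LAMBADA Perplexity | Avg. Accuracy (8 tasks) |
|--------------------|---------------------|--------------------|-------------------------|
| Uniform (baseline) | 21.86               | 22.29              | 52.25                   |
| Tapered (cosine)   | 21.42               | 21.25              | 52.84                   |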

4. Mechanistic Analysis of Information Flow and Redundancy

Pretrained GPT-2 models were analyzed to measure the degree of novelty introduced by each layer, using two layerwise cosine similarity metrics. A similarity near 0 implies an orthogonal, novel contribution, while a similarity near 1 implies an update aligned with the residual (i.e., redundant). Empirical measurements showed a monotonic rise of both metrics with depth, indicating that early layers introduce more novelty while late layers reinforce existing information. The implication is that parameter capacity in late layers is underutilized, motivating tapering as a means to reallocate dimensions to where they yield greater functional impact (Bayat et al., 22 Jun 2026).
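As a rough illustration of this kind of measurement, the sketch below computes the cosine similarity between a residual-stream vector and a layer's update to it. The precise metrics used in the cited analysis are not reproduced in this text, so this is a generic stand-in, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy residual vector and two hypothetical layer updates.
residual = [1.0, 2.0, 0.0]
novel_update = [0.0, 0.0, 3.0]      # orthogonal to the residual
redundant_update = [2.0, 4.0, 0.0]  # parallel to the residual

novel_score = cosine_similarity(residual, novel_update)
redundant_score = cosine_similarity(residual, redundant_update)
```

In a real analysis the vectors would be hidden states taken from a model at each layer and averaged over tokens; that extraction step is omitted here.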

5. Implementation Guidelines for Tapered LLMs

For practical adoption, the following protocol is recommended:

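  • Choose a monotonically decreasing schedule for $d_C(l)$ that starts above and ends below the baseline width $d_C^{\text{baseline}}$, with the average held at the baseline.
  • A cosine schedule serves as a robust, architecture-agnostic default; the best-performing setting reported for the 440M Transformer was start/end = 1.5/0.5.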
  • Round per-layer widths to multiples of 16 for compute efficiency.
  • Only taper the MLP width; keep the main residual dimension, head count, and key/value dimensions unchanged.
  • Preserve all other training hyperparameters and data budgets to ensure that observed effects map directly to depth-wise capacity redistribution.
  • For novel model depths, sweep over start/end width ratios to verify the empirical U-shaped dependence of perplexity on taper strength, and select the optimal setting.
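The sweep in the last step can be organized as below. This is a sketch under stated assumptions: the schedule is taken to be symmetric about mid-depth, so pairing a start multiplier `s` with an end multiplier `2 - s` keeps the mean at 1, and `evaluate` is a hypothetical user-supplied callable that trains a model with the given taper and returns its validation perplexity.

```python
def budget_preserving_ratios(starts):
    """Pair each start multiplier with the end multiplier that keeps the mean at 1.

    Assumes a schedule symmetric about mid-depth, whose mean multiplier
    is (start + end) / 2. Starts outside [1, 2) are skipped.
    """
    return [(s, 2.0 - s) for s in starts if 1.0 <= s < 2.0]


def select_best_taper(candidates, evaluate):
    """Score every (start, end) pair and return the one with lowest perplexity.

    `evaluate(start, end)` is a placeholder for a full train-and-validate run.
    """
    scored = [(evaluate(start, end), (start, end)) for start, end in candidates]
    best_score, best_ratio = min(scored)
    return best_ratio, scored


candidates = budget_preserving_ratios([1.0, 1.25, 1.5, 1.75])
```

The candidate `(1.0, 1.0)` is the uniform baseline, which is worth keeping in the sweep so that every tapered setting is compared against it under identical training conditions.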

6. Comparative Perspective: TLM Principle versus Retrieval-Based TLM (Historical Usage)

"TLM" has appeared in earlier literature in a distinct sense. In "NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework" (Yao et al., 2021), TLM stood for task-driven language modeling and referred to a task-specific training regime: BM25 retrieval is used to build a tailored training set from a large corpus, followed by joint optimization of language modeling and supervised objectives. In that framework, the savings were in training data footprint and FLOPs, not in per-layer architecture; the architectural structure (e.g., a BERT encoder) was left unchanged, and efficiency gains stemmed from data selection and objective mixing. The contemporary Tapered LLM principle, as defined in (Bayat et al., 22 Jun 2026), is exclusively architectural, addressing the per-layer dimension schedule across depth under a fixed compute and parameter budget.

7. Applications, Implications, and Limitations

The TLM architectural principle is directly applicable to decoder-only LLMs and is compatible with a wide range of architectures (Transformer, Gated Attention, Hope-attention, Titans) and scales (hundreds of millions to over a billion parameters). Because tapering preserves the total parameter and compute budget, it can be adopted without additional resources or downstream incompatibility.

A key implication is the existence of a simple, previously overlooked degree of freedom—depth-aware allocation of MLP width—that can be leveraged for improved perplexity and generalization. The empirical U-shaped dependence on taper ratio suggests the need for moderate, schedule-driven tuning. A plausible implication is that even more refined, data-driven allocation schemes (not considered in the cited work) could further enhance efficiency.

No significant degradation has been observed in standard reasoning or long-context retrieval benchmarks. A potential limitation is that the benefits of tapering are closely tied to the alignment between layer function and information addition, which may be architecture- or task-dependent and is best confirmed via empirical sweep in new regimes (Bayat et al., 22 Jun 2026).
