
Hierarchical Token Prepending in Transformers

Updated 20 November 2025
  • HTP is a method that augments decoder-only transformers with both local and global summary tokens to enhance information flow and reduce information bottlenecks.
  • By segmenting inputs into blocks and dynamically updating summary tokens at each layer, HTP delivers more robust, scalable, and interpretable embeddings than traditional token prepending.
  • Empirical results demonstrate improved retrieval and classification performance across long-context scenarios with only moderate computational overhead.

Hierarchical Token Prepending (HTP) is a methodology for augmenting transformer architectures—principally in decoder-only LLMs—to enhance information flow through explicit introduction of multiple summary tokens at different hierarchical levels of the input. HTP addresses the inherent information bottlenecks imposed by causal attention, improving representation quality, especially for long-context embeddings used in tasks such as retrieval or classification. It generalizes and supersedes single-token prepending by coupling blockwise summary generation with hierarchical token placement and mean-pooling, resulting in more robust, scalable, and interpretable embeddings for both zero-shot and finetuned settings (Ding et al., 18 Nov 2025). Related approaches in the vision domain leverage hierarchical prompts for morphological discrimination in hierarchical image classification (Wang et al., 2023). The following sections detail core mechanisms, theoretical motivations, architectural workflows, practical implications, and empirical results.

1. Theoretical Motivation and Bottlenecks

Decoder-only transformer models with causal masking restrict every token to attend only to preceding tokens. The mask

$$\alpha_{ij}^{(\ell)} = 0 \quad \forall\, j > i$$

prevents any token $i$ from accessing future context, so early-token representations $h_i^{(L)}$ are deprived of information from later in the input. This architectural constraint degrades embedding quality as sequence lengths increase. A standard solution is Token Prepending (TP): inserting a single summary token, typically fed the last-layer hidden state of the final token, at the front of the sequence. However, this approach creates a compression bottleneck, as information from an entire long input must be condensed into a single vector, causing “over-squashing” of semantic content and limiting downstream performance (Ding et al., 18 Nov 2025).
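The constraint and the TP bottleneck can be made concrete with a small sketch (PyTorch; the helper names and the two-pass construction of the summary vector are illustrative assumptions, not the paper's reference code):

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    # alpha_{ij}^{(l)} = 0 for all j > i: token i attends only to positions j <= i.
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def token_prepend(embeddings: torch.Tensor, last_hidden: torch.Tensor) -> torch.Tensor:
    """Single-token TP baseline: summarize the whole input with one vector
    (the last-layer hidden state of the final token, taken from a previous
    forward pass) and prepend it so every position can attend to it under the
    causal mask. This single row is the compression bottleneck described above."""
    summary = last_hidden[-1:].clone()              # (1, d) summary of the full input
    return torch.cat([summary, embeddings], dim=0)  # (n + 1, d) augmented sequence

print(causal_mask(4).int())  # position 0 never sees positions 1..3
```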

2. Hierarchical Token Construction and Placement

HTP improves upon TP by segmenting the input sequence

$$x = (x_1, x_2, \dots, x_L)$$

into $N$ contiguous blocks $B_i$ of approximately equal size $K$:

$$B_i = (x_{(i-1)K+1}, x_{(i-1)K+2}, \dots, x_{iK}), \quad i = 1, \dots, N$$

For each block $B_i$, HTP introduces a “placeholder” summary token $\langle\mathrm{PST}\rangle_i$ directly into the token sequence. This summary token is dynamically populated at each transformer layer (for layers $\ell > 1$) by copying the final hidden state of its corresponding block:

$$s_i^{(\ell)} = h^{(\ell)}_{\mathrm{end}(B_i)}, \qquad \tilde h^{(\ell)}_{\mathrm{pos}(\langle\mathrm{PST}\rangle_i)} = s_i^{(\ell)}$$

Additionally, a higher level of $N$ “global” summary tokens, $\langle\mathrm{B\!-\!PST}\rangle_1, \ldots, \langle\mathrm{B\!-\!PST}\rangle_N$, is prepended to the front of the sequence. These global tokens are themselves populated by copying information from the local blockwise summaries, ensuring that all block-level summaries are accessible from the very beginning of the sequence to every subsequent token, including those in later blocks. This architecture introduces $2N$ backward “edges” per sequence, rather than a single prepended summary, thereby offering much richer pathways for backward information flow (Ding et al., 18 Nov 2025).
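A minimal construction of this hierarchical sequence, together with the layer-wise summary refresh, is sketched below (PyTorch; the sentinel ids, helper names, and exact copy order are illustrative assumptions rather than the paper's reference implementation):

```python
import torch

# Sentinel ids for the local <PST> and global <B-PST> summary placeholders (illustrative).
PST_ID, BPST_ID = -1, -2

def build_htp_sequence(token_ids, K):
    """Prepend N global <B-PST> tokens and insert one local <PST> token in
    front of each block of (at most) K input tokens."""
    blocks = [token_ids[i:i + K] for i in range(0, len(token_ids), K)]
    N = len(blocks)
    seq = [BPST_ID] * N                        # global summaries at the front
    pst_pos, block_end_pos = [], []
    for blk in blocks:
        pst_pos.append(len(seq))
        seq.append(PST_ID)                     # local placeholder before its block
        seq.extend(blk)
        block_end_pos.append(len(seq) - 1)     # position of the block's last token
    return seq, pst_pos, block_end_pos, N

def update_summaries(h, pst_pos, block_end_pos, N):
    """Layer-wise refresh for layers 1 < l <= L': copy each block's final hidden
    state into its <PST> slot (s_i = h_end(B_i)), then copy that summary into the
    corresponding prepended <B-PST> slot (which occupies position i)."""
    h = h.clone()                              # avoid in-place edits of layer outputs
    for i in range(N):
        h[pst_pos[i]] = h[block_end_pos[i]]
        h[i] = h[pst_pos[i]]
    return h

seq, pst_pos, end_pos, N = build_htp_sequence(list(range(100, 112)), K=4)
print(N, pst_pos, end_pos)                     # 3 blocks -> 3 local + 3 global summaries
```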

3. Attention Mask Modification and Information Flow

The original causal mask is a strict lower-triangular matrix. HTP modifies this mask to enable every downstream token to attend to both local and global summary tokens regardless of their chronological position. Specifically, for each local $\langle\mathrm{PST}\rangle_i$ and global $\langle\mathrm{B\!-\!PST}\rangle_i$,

$$M_{j,\;\mathrm{pos}(\langle\mathrm{PST}\rangle_i)} = 1, \qquad M_{j,\;\mathrm{pos}(\langle\mathrm{B\!-\!PST}\rangle_i)} = 1 \qquad \forall\, j > \mathrm{pos}(\langle\mathrm{PST}\rangle_i)$$

This reconfiguration ensures that representations can propagate not only along the default causal path but also through multiple explicit hierarchical shortcuts, mitigating vanishing transport and the loss of global information in deep transformers. This suggests improved long-range dependency modeling with limited additional computation (≈1.1–1.3× speed/memory overhead) (Ding et al., 18 Nov 2025).
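A hedged sketch of this mask rewiring, reusing the position bookkeeping from the snippet above (the function name is an illustrative assumption):

```python
import torch

def htp_attention_mask(seq_len, pst_pos, bpst_pos):
    """Start from the strict causal mask, then set M[j, pos(<PST>_i)] = 1 and
    M[j, pos(<B-PST>_i)] = 1 for all j > pos(<PST>_i), so every later token can
    reach both the local and the global summary of block i."""
    M = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    for i, p in enumerate(pst_pos):
        rows = torch.arange(p + 1, seq_len)    # all positions j > pos(<PST>_i)
        M[rows, p] = True                      # open the local summary column
        M[rows, bpst_pos[i]] = True            # open the global summary column
    return M

# With the sequence built in Section 2, <B-PST>_i sits at position i.
M = htp_attention_mask(len(seq), pst_pos, list(range(N)))
print(M.int()[:6, :6])
```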

4. Readout Strategy and Robustness

Standard last-token pooling, which extracts the embedding from the final token's hidden state $h_n^{(L)}$, is highly sensitive to the attention mask and vulnerable to depth-induced attenuation. HTP replaces this step with mean-pooling across all (rewired) output tokens at a chosen exit layer $L'$:

$$e = \frac{1}{T} \sum_{t=1}^{T} h^{(L')}_t$$

where $T$ denotes the number of tokens in the rewired sequence.

Theoretical analysis demonstrates that, whereas last-token sensitivity decays rapidly with network depth (due to the lower-triangular propagation matrix $A$), mean-pooling sustains high sensitivity, as gradients sum over all paths:

$$\left\|\frac{\partial \bar y}{\partial v_i^{(0)}}\right\| \;\leq\; \frac{K_L}{T} \sum_{j=1}^{T} A_{j,i}$$

where $\bar y$ denotes the mean-pooled readout and $v_i^{(0)}$ the input embedding of token $i$.

This pooling enhancement is essential for robust representation learning in long-context scenarios (Ding et al., 18 Nov 2025).
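A minimal mean-pooling readout consistent with this description is sketched below (the padding-mask handling for batched inputs is an added assumption):

```python
import torch

def mean_pool_readout(hidden, attention_mask):
    """Mean-pool the hidden states of all (rewired) tokens at the chosen exit layer L'.
    `hidden` is (batch, seq_len, d) taken at layer L'; `attention_mask` is
    (batch, seq_len) with 1 for real tokens and 0 for padding."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    count = mask.sum(dim=1).clamp(min=1.0)     # avoid division by zero on empty rows
    return summed / count                      # e = (1/T) * sum_t h_t^{(L')}

h = torch.randn(2, 18, 64)                     # e.g. layer-L' states for two sequences
print(mean_pool_readout(h, torch.ones(2, 18)).shape)  # torch.Size([2, 64])
```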

5. End-to-End Algorithmic Workflow

The HTP workflow is summarized as follows:

  • Partition input tokens into $N$ blocks of size $K$
  • Build the input sequence by interleaving global and local summary tokens:

$$[\, \mathrm{B\!-\!PST}_1, \ldots, \mathrm{B\!-\!PST}_N,\ \mathrm{PST}_1,\ B_1,\ \mathrm{PST}_2,\ B_2,\ \ldots,\ \mathrm{PST}_N,\ B_N \,]$$

  • Randomly initialize all summary token embeddings
  • For each transformer layer:
    • Apply attention and MLP as usual
    • For layers $1 < \ell \leq L'$ (where $L'$ is the “early-exit” layer):
      • Update each $\langle\mathrm{PST}\rangle_i$ with the final hidden state of its block $B_i$
      • Update each $\langle\mathrm{B\!-\!PST}\rangle_i$ with its corresponding $\langle\mathrm{PST}\rangle_i$
  • At the exit layer $L'$, mean-pool the hidden states of all tokens to produce the output embedding

The process remains architecture-agnostic and is compatible with both zero-shot and finetuned transformer models (Ding et al., 18 Nov 2025).
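Putting the pieces together, the following schematic encoder reuses the helper sketches from Sections 2–4; the toy `nn.TransformerEncoderLayer` stack and all hyperparameters are stand-ins for illustration, not the paper's model:

```python
import torch
import torch.nn as nn

class ToyHTPEncoder(nn.Module):
    """Schematic HTP pass: build the hierarchical sequence, apply the rewired
    attention mask, refresh summary tokens after each layer 1 < l <= L', and
    mean-pool all tokens at the early-exit layer L'."""
    def __init__(self, vocab, d, n_layers, exit_layer):
        super().__init__()
        self.vocab = vocab
        self.embed = nn.Embedding(vocab, d)    # summary rows are randomly initialized
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.exit_layer = exit_layer           # L'

    def forward(self, token_ids, K):
        seq, pst_pos, end_pos, N = build_htp_sequence(token_ids, K)
        # Map sentinel summary ids onto reserved embedding rows (illustrative choice).
        ids = torch.tensor([t if t >= 0 else self.vocab + t for t in seq]).unsqueeze(0)
        M = htp_attention_mask(len(seq), pst_pos, list(range(N)))
        h = self.embed(ids)
        for l, layer in enumerate(self.layers[: self.exit_layer], start=1):
            h = layer(h, src_mask=~M)          # True in ~M marks disallowed attention
            if l > 1:                          # summary updates for 1 < l <= L'
                h = update_summaries(h.squeeze(0), pst_pos, end_pos, N).unsqueeze(0)
        return h.mean(dim=1)                   # mean-pool all tokens at layer L'

model = ToyHTPEncoder(vocab=1000, d=32, n_layers=4, exit_layer=3).eval()
with torch.no_grad():
    emb = model(list(range(100, 120)), K=4)
print(emb.shape)                               # torch.Size([1, 32])
```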

6. Empirical Performance and Ablation Results

HTP demonstrates significant performance gains across diverse retrieval tasks and embedding benchmarks. In BEIR retrieval (context length ≤ 512), HTP improves NDCG@10 scores to 30.4 (Mistral-7B) vs. 27.8 (vanilla mean) and 17.7 (TP), and to 27.4 (Gemma2-9B) vs. 25.5 (vanilla mean). On LongEmbed datasets with input lengths up to 8192, HTP maintains or surpasses vanilla performance (e.g., 49.29 on Gemma2-9B vs. 44.06 vanilla mean), while TP collapses in the long-context setting (18.15) (Ding et al., 18 Nov 2025).

Ablation studies on block size $K$ indicate:

| Input Length | Optimal $K$ | Notes                                     |
|--------------|-------------|-------------------------------------------|
| ≤ 512        | 1           | Dense shortcutting preferred              |
| Up to 16k    | 2 or 4      | Larger blocks reduce summary OOD effects  |

Additionally, memory and run-time overheads remain moderate (1.1–1.3×), and robustness to the choice of block partitioning indicates that HTP adapts flexibly to different context lengths (Ding et al., 18 Nov 2025).

7. Extensions to Vision Transformers and Other Domains

A related instantiation of hierarchical token mechanisms in vision is presented in "TransHP: Image Classification with Hierarchical Prompting," which leverages hierarchical prompt embedding and injection for hierarchical image classification. There, a prompt-pool for coarse class labels is predicted and injected at a specified transformer block, conditioning fine-class discrimination on explicit coarse-class hints. Empirical gains in classification accuracy (e.g., +2.83% for ViT-B/16 on ImageNet) demonstrate the effectiveness of hierarchical token strategies across modalities (Wang et al., 2023). A plausible implication is that explicit hierarchical token conditioning generalizes across language and vision contexts to facilitate fine-grained discrimination and robust representation learning.

