
Hierarchical Layer-Grouped Prompt Tuning

Updated 19 November 2025
  • The paper introduces a method for continual learning in Vision Transformers that mitigates catastrophic forgetting through hierarchical prompt grouping.
  • Hierarchical Layer-Grouped Prompt Tuning is a continual learning paradigm coordinating a root prompt with group-specific sub-prompts to adapt layer features efficiently.
  • It enables strong task adaptation and improved performance on benchmarks like CIFAR-100 and ImageNet-R with minimal tunable parameters and reduced cross-task interference.

Hierarchical Layer-Grouped Prompt Tuning is a continual learning paradigm for Transformer-based architectures, particularly in vision models (e.g., ViT), that introduces a hierarchical coordination between prompts injected at different layers. Rather than learning fully independent prompts per layer or globally shared task-specific prompts, this method structures prompt adaptation through the generation of layer-grouped sub-prompts from a single root prompt per task. This design enables strong task adaptation, mitigates catastrophic forgetting, and achieves high continual learning performance with limited tunable parameters and minimal risk of cross-task interference (Jiang et al., 15 Nov 2025).

1. Formal Structure and Root Prompt Definition

A central construct is the “task-specific root prompt,” denoted

$$P_{\mathrm{root}}^{(t)} = \{\,k_t,\ v_t\,\}$$

where $k_t, v_t \in \mathbb{R}^{L \times D}$ are the key and value prompts, with $L$ the prompt length and $D$ the hidden dimension (e.g., $L=10$, $D=768$ in ViT-B configurations). This root prompt is initialized with i.i.d. Gaussian entries for the first task. For subsequent tasks ($t > 1$), both the root prompt and the prompt-generator weights are copied from the previous task, forming a continual adaptation trajectory. The root prompt serves as the conditioning vector for the generation of all internal sub-prompts across the architecture.
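As a concrete illustration, a minimal PyTorch sketch of the root prompt is given below, assuming the shapes stated above ($L=10$, $D=768$); the class name and the copy helper are illustrative choices, not the authors' implementation.

```python
import torch

# Minimal sketch of the task-specific root prompt P_root^(t) = {k_t, v_t}.
# Shapes follow the text (L = 10 prompt tokens, D = 768 for ViT-B); the
# class name and copy_from helper are illustrative assumptions.
class RootPrompt(torch.nn.Module):
    def __init__(self, prompt_len: int = 10, hidden_dim: int = 768):
        super().__init__()
        # i.i.d. Gaussian initialization, as used for the first task
        self.k = torch.nn.Parameter(torch.randn(prompt_len, hidden_dim))
        self.v = torch.nn.Parameter(torch.randn(prompt_len, hidden_dim))

    @torch.no_grad()
    def copy_from(self, previous: "RootPrompt") -> None:
        # For task t > 1: warm-start from the previous task's root prompt,
        # continuing the adaptation trajectory described above.
        self.k.copy_(previous.k)
        self.v.copy_(previous.v)
```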

2. Generator Network and Sub-Prompt Construction

The ViT’s $m$ Transformer layers are partitioned into $G$ disjoint "groups" such that all layers in a group share a single group-specific sub-prompt. For group $g$, a learnable position encoding $E^g_{\mathrm{pos}} \in \mathbb{R}^{d_{\mathrm{pos}}}$ is concatenated with the vectorized $k_t$ or $v_t$ before passing through a two-layer MLP:

$$
\begin{aligned}
z^g &= \bigl[\mathrm{vec}(k_t);\ E^g_{\mathrm{pos}}\bigr] \\
h^g_k &= \mathrm{ReLU}\bigl(W^g_{1,k}\, z^g + b^g_{1,k}\bigr) \in \mathbb{R}^{r} \\
\theta_t^g &= W^g_{2,k}\, h^g_k + b^g_{2,k} \in \mathbb{R}^{L D}
\end{aligned}
$$

(with analogous operations yielding the value sub-prompts). The bottleneck rank is set to $r=16$. These group-level vectors are then broadcast across their group and, for each layer $i$ in group $g$, a position incentive embedding $\beta_t^i$ is added, $\phi_t^i = \mathrm{unvec}(\theta_t^{g(i)}) + \beta_t^i$, which yields the per-layer prompt tokens inserted into the attention blocks for the task.
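To make the generation step concrete, the following hedged PyTorch sketch implements one group's key-prompt generator; the module name, the position-encoding dimension `pos_dim`, and the number of layers per group are assumptions for the example, while the bottleneck rank $r=16$ and the $(L, D)$ prompt shape follow the text.

```python
import torch
import torch.nn as nn

# Sketch of one group's key-prompt generator: vec(k_t) is concatenated with a
# learnable group position encoding, passed through a rank-16 two-layer MLP,
# reshaped to (L, D), and offset by a per-layer position incentive embedding.
class GroupPromptGenerator(nn.Module):
    def __init__(self, prompt_len=10, hidden_dim=768, pos_dim=64, rank=16,
                 layers_in_group=3):
        super().__init__()
        in_dim = prompt_len * hidden_dim + pos_dim
        self.pos_enc = nn.Parameter(torch.randn(pos_dim))        # E_pos^g
        self.fc1 = nn.Linear(in_dim, rank)                       # W_1, b_1
        self.fc2 = nn.Linear(rank, prompt_len * hidden_dim)      # W_2, b_2
        # one additive embedding beta_t^i per layer i in the group
        self.layer_beta = nn.Parameter(
            torch.zeros(layers_in_group, prompt_len, hidden_dim))
        self.prompt_len, self.hidden_dim = prompt_len, hidden_dim

    def forward(self, k_root: torch.Tensor) -> torch.Tensor:
        z = torch.cat([k_root.flatten(), self.pos_enc])          # z^g
        theta = self.fc2(torch.relu(self.fc1(z)))                # theta_t^g
        base = theta.view(self.prompt_len, self.hidden_dim)      # unvec(theta_t^g)
        # phi_t^i = unvec(theta_t^g) + beta_t^i for each layer i in group g
        return base.unsqueeze(0) + self.layer_beta               # (layers, L, D)
```

An analogous generator with its own weights would produce the value sub-prompts from $v_t$.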

This hierarchical generation ensures that all prompts maintain shared context via $P_{\mathrm{root}}^{(t)}$, but with localized adaptation by group and layer to preserve feature-propagation pathways and minimize redundant or destabilizing updates (Jiang et al., 15 Nov 2025).

3. Inference, Soft Task Matching, and Prompt Aggregation

At test time, “soft task matching” is employed. For an input sample $x$, each prior task’s prompts yield a classifier score
$$s_j(x) = \sum_{c \in \mathcal{Y}_j} p\bigl(c \mid x;\ P_{\mathrm{agg}}^{(j)}\bigr).$$
These scores are used to compute normalized weights for each task,
$$\alpha_j = \frac{\exp(s_j(x))}{\sum_{k=1}^{t} \exp(s_k(x))},$$
and for every layer $i$ the final injected prompt is a convex combination,
$$P_{\mathrm{agg}}^{(i)} = \sum_{j=1}^{t} \alpha_j P_j^{i},$$
yielding an input-driven, data-adaptive aggregation of task-specialized knowledge that suppresses hard overwriting of older tasks.
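A minimal sketch of this test-time aggregation, assuming each task's class-probability vector, label set, and per-layer prompts are already available, is given below; the function and argument names are illustrative.

```python
import torch

# task_probs[j]: class-probability vector obtained with task j's prompts,
# task_classes[j]: label set Y_j of task j,
# task_prompts[j]: list of per-layer prompt tensors P_j^i for task j.
def aggregate_prompts(task_probs, task_classes, task_prompts):
    # s_j(x): probability mass the model assigns to task j's own classes
    scores = torch.stack([probs[list(classes)].sum()
                          for probs, classes in zip(task_probs, task_classes)])
    alphas = torch.softmax(scores, dim=0)          # normalized weights alpha_j
    num_layers = len(task_prompts[0])
    # P_agg^(i) = sum_j alpha_j * P_j^i : convex combination per layer
    return [sum(a * prompts[i] for a, prompts in zip(alphas, task_prompts))
            for i in range(num_layers)]
```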

4. Training Objective and Continual Learning Protocol

Hierarchical layer-grouped prompt tuning employs a “frozen backbone” regime: the core ViT parameters are immutable, and only the root prompt, the group-adapter MLPs, and the classifier for the current task are trained. For each new task, training proceeds by minimizing the cross-entropy loss
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N_t}\sum_{(x,y)\in\mathcal{D}_t} \log p\bigl(y \mid x;\ P^1_t,\ldots,P^m_t,\ W_t\bigr).$$
Earlier tasks’ root prompts and generator parameters are fixed throughout; only parameters for the current task are updated, which prevents catastrophic forgetting via parameter isolation. No Fisher-based or explicit orthogonality regularization is introduced; the hierarchy and sharing suffice to stabilize knowledge (Jiang et al., 15 Nov 2025).
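A hedged sketch of the training loop for one task is shown below; the prompt-injection interface of the ViT (the `prompts=` keyword), the optimizer choice, and the module names are assumptions, while the frozen backbone and the restriction of updates to prompt-related parameters and the current task head follow the protocol above.

```python
import torch

def train_task(vit, root_prompt, generators, classifier_t, loader, epochs=5):
    vit.requires_grad_(False)                           # backbone stays frozen
    params = (list(root_prompt.parameters())
              + [p for g in generators for p in g.parameters()]
              + list(classifier_t.parameters()))        # only current-task params
    opt = torch.optim.AdamW(params, lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            prompts = [g(root_prompt.k) for g in generators]  # group sub-prompts
            feats = vit(x, prompts=prompts)             # assumed injection API
            loss = torch.nn.functional.cross_entropy(classifier_t(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```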

5. Algorithmic Workflow and Pseudocode

The learning and inference process is encapsulated in the following structure:

  • Initialization: For task $t>1$, copy $P_{\mathrm{root}}^{(t-1)}$ and $\Theta_{\mathrm{gen}}^{(t-1)}$ from the previous task.
  • Mini-batch Loop: For each batch in $\mathcal{D}_t$,
    • Generate group sub-prompts using corresponding MLPs.
    • Construct per-layer prompts by adding layerwise position incentives.
    • Forward pass through a ViT with all prompts injected.
    • Compute classification loss; backpropagate and update only prompt-related parameters.
  • Inference: Compute task scores, softmax aggregation weights, and prompt fusion as above to adapt all sub-prompts for the current test sample.
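Tying these steps together, the following sketch reuses the illustrative helpers from the earlier examples (`RootPrompt`, `GroupPromptGenerator`, `train_task`); `num_classes_of` is a hypothetical convenience for reading each task's label count, and the 768-dimensional feature size matches the ViT-B configuration assumed throughout.

```python
import torch

def continual_learning(vit, task_loaders, num_groups):
    roots, gens, heads = [], [], []
    for t, loader in enumerate(task_loaders):
        root = RootPrompt()
        generators = [GroupPromptGenerator() for _ in range(num_groups)]
        head = torch.nn.Linear(768, num_classes_of(loader))  # hypothetical helper
        if t > 0:
            # copy root prompt and generator weights from task t-1
            root.copy_from(roots[-1])
            for g_new, g_old in zip(generators, gens[-1]):
                g_new.load_state_dict(g_old.state_dict())
        train_task(vit, root, generators, head, loader)
        roots.append(root)
        gens.append(generators)
        heads.append(head)
    return roots, gens, heads
```

At inference, the stored prompts and heads would then feed the soft task matching and prompt aggregation described in Section 3.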

6. Empirical Results and Comparative Performance

Hierarchical layer-grouped prompt tuning surpasses independent layerwise prompt tuning and adapter-based baselines on standard vision continual learning benchmarks. For example:

Method            CIFAR-100 FAA (%)    ImageNet-R FAA (%)
CA-Prompt         95.35                80.22
SEMA (Adapters)   87.84
Ours              97.59                84.10

The improvements originate from the stability and efficiency conferred by prompt grouping and coordinated generation. Sharing sub-prompts by group and generating from a single root prompt imposes an implicit structural regularization, reducing overwriting of features important to previously learned tasks (Jiang et al., 15 Nov 2025).

7. Significance, Context, and Extensions

This hierarchical organization addresses a key trade-off in prompt-based continual learning: fully independent per-layer prompts offer too much flexibility, increasing the risk of feature erasure, while a single global prompt constrains adaptability. Layer-grouped generation mediated by a root prompt enables strong local adaptation with global coordination.

A plausible implication is the potential extension to multi-modal architectures, or integration with mixture-of-expert prompt frameworks, leveraging root-prompts for more robust and parameter-efficient lifelong learning scenarios.

No controversy is reported in the literature regarding the benefit of hierarchical prompt sharing for stability. The mechanism directly addresses prior empirical findings that “adding prompts independently at each layer… increases the risk of catastrophic forgetting” due to uncontrolled updates (Jiang et al., 15 Nov 2025).

References:

  • Teaching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual Learning (Jiang et al., 15 Nov 2025)
