
TreeLoRA: Hierarchical Gradient Similarity Trees

Updated 5 November 2025
  • The paper introduces TreeLoRA, which integrates hierarchical gradient-based task grouping with layer-wise low-rank adapters to enable efficient continual learning with large models.
  • It employs a K-D tree structure and bandit-driven similarity estimation to group tasks and minimize catastrophic forgetting while ensuring rapid adaptation.
  • Empirical benchmarks on vision transformers and large language models show significant speedups and constant GPU memory usage, validating its scalable design.

Hierarchical Gradient Similarity Trees, known as TreeLoRA, constitute a continual learning framework that integrates hierarchical gradient-based task grouping with efficient, layer-wise parameter adaptation for large pre-trained models (LPMs), such as vision transformers (ViTs) and LLMs. TreeLoRA addresses challenges inherent to streaming environments—most notably the dual imperative to update models quickly for new tasks while minimizing catastrophic forgetting—by combining a K-D tree-based adapter organization, sparse and selective updating, and a bandit-driven strategy for scalable task similarity estimation.

1. Hierarchical Task Structuring via Gradient Similarity

TreeLoRA constructs a hierarchical tree, specifically a K-D tree, to represent relationships between tasks encountered in a continual learning scenario. This tree is incrementally expanded as new tasks are introduced.

  • Nodes: The root encapsulates all tasks. Intermediate nodes represent groups of similar tasks, and leaves correspond to individual tasks.
  • Task Assignment: Task grouping is determined by the L1-norm similarity of expected gradient directions. For task gradients $\mathbf{g}_k = \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}_k}\left[\nabla \ell(\mathbf{w}_k; \mathbf{x}, y)\right]$, tasks $i$ and $i'$ are grouped if $\|\mathbf{g}_i - \mathbf{g}_{i'}\|_1 \leq \delta$, where $\delta$ is an automatically determined per-split threshold.
  • Knowledge Aggregation: Internal nodes facilitate the sharing of knowledge across similar tasks, with higher-tier nodes typically associated with more generic representations.

This hierarchical structure enables efficient task look-up and permits adaptation strategies that respect both global and task-specific learning needs.
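The grouping rule can be sketched in plain Python (our own illustration, not the authors' code): tasks whose expected gradient vectors lie within an L1 distance $\delta$ of a group's representative are merged. A flat greedy pass stands in here for the full incremental K-D tree construction.

```python
# Minimal sketch (not the paper's implementation): group tasks whose
# expected gradient vectors are within L1 distance delta of each other.
# Gradients are plain lists of floats; in practice they would be
# flattened model (or adapter) gradients.

def l1_distance(g_a, g_b):
    """L1 norm of the difference between two gradient vectors."""
    return sum(abs(a - b) for a, b in zip(g_a, g_b))

def group_tasks(gradients, delta):
    """Greedily assign each task to the first group whose representative
    gradient is within delta in L1 norm; otherwise open a new group."""
    groups = []  # each group: {"rep": gradient, "tasks": [task ids]}
    for task_id, g in enumerate(gradients):
        for group in groups:
            if l1_distance(g, group["rep"]) <= delta:
                group["tasks"].append(task_id)
                break
        else:
            groups.append({"rep": g, "tasks": [task_id]})
    return groups

grads = [[0.1, 0.2], [0.12, 0.19], [0.9, -0.5]]
print(group_tasks(grads, delta=0.1))  # tasks 0 and 1 share a group
```

In the actual method this grouping is refined hierarchically, so near-duplicate tasks sit under the same internal node while dissimilar ones split early.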

2. Layer-wise Low-Rank Adapter Guidance

TreeLoRA leverages the hierarchical tree to direct LoRA-based parameter adaptation at each transformer layer.

  • Adapter Sharing: Each node at a given layer is associated with a LoRA module. These modules serve as parameter-efficient adapters and are shared across all descendant subtasks.
  • Selective Updating: When a novel task is incorporated, only those adapters along the root-to-leaf path corresponding to the most similar task grouping are candidates for update. This mechanism enforces a logarithmic lookup path instead of a linear search, constraining both GPU memory and storage requirements.
  • Knowledge Retention: By localizing updates within relevant adapters, TreeLoRA minimizes negative interference with prior tasks, thereby mitigating catastrophic forgetting.

The result is a significant reduction in update overhead while permitting rapid adaptation to new tasks.
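The root-to-leaf update path can be illustrated as follows; the tree shape, node names, and `path_to_leaf` helper are hypothetical, showing only how restricting candidate adapters to one path keeps the number of touched modules logarithmic in a balanced tree.

```python
# Illustrative sketch (names are ours, not from the paper): given a tree
# of adapter nodes, collect the root-to-leaf path of adapters that are
# candidates for update when a new task matches a particular leaf.

class AdapterNode:
    def __init__(self, name, children=None):
        self.name = name                 # identifies the LoRA module
        self.children = children or []   # subgroups / individual tasks

def path_to_leaf(node, leaf_name, path=None):
    """Return the list of node names from the root to the named leaf,
    or None if the leaf is not in this subtree."""
    path = (path or []) + [node.name]
    if node.name == leaf_name:
        return path
    for child in node.children:
        found = path_to_leaf(child, leaf_name, path)
        if found:
            return found
    return None

tree = AdapterNode("root", [
    AdapterNode("vision", [AdapterNode("cifar"), AdapterNode("cub")]),
    AdapterNode("language", [AdapterNode("trace")]),
])
print(path_to_leaf(tree, "cub"))  # only these adapters are update candidates
```

Adapters off this path stay frozen, which is what localizes interference away from unrelated prior tasks.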

3. Bandit-Based Task Similarity Estimation

A bottleneck in scaling continual learning systems is the cost of computing pairwise similarity between ever-growing sets of tasks. TreeLoRA circumvents this using a Multi-Armed Bandit (MAB) abstraction.

  • Arms as Task Groups: Each previously encountered task or group (i.e., tree branch) is regarded as a bandit arm.
  • Lower Confidence Bound (LCB): The adaptation pathway is chosen by computing an LCB for each arm:

$\operatorname{LCB}_k = \begin{cases} \mu_k - 2\sqrt{\frac{\log t}{n_k}}, & \text{if } k \text{ is a leaf} \\ \max\big\{ \min_{j \in \mathcal{C}} \mu_j - 2\sqrt{\frac{\log t}{n_j}} - \delta \big\}, & \text{otherwise} \end{cases}$

where $\mu_k$ is the current similarity estimate, $n_k$ denotes the selection count, and $\mathcal{C}$ is the set of children of node $k$.

  • Exploration and Exploitation: Only the most promising branch is probed at each step, reducing similarity computation from linear to logarithmic in the number of tasks.
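Under these definitions, a simplified LCB selection might look like the following. All names are ours, and for the internal-node case we use the min-over-children bound relaxed by the margin $\delta$ as one plausible reading of the rule above; this is a sketch, not the paper's exact procedure.

```python
import math

# Simplified bandit selection: each leaf arm keeps a running similarity
# estimate mu and a pull count n; internal nodes take the tightest child
# bound relaxed by the smoothness margin delta.

def lcb_leaf(mu, n, t):
    """Lower confidence bound for a leaf arm at global step t."""
    return mu - 2.0 * math.sqrt(math.log(t) / n)

def lcb_internal(children_stats, t, delta):
    """Bound for an internal node from its children's (mu, n) statistics."""
    return min(lcb_leaf(mu, n, t) for mu, n in children_stats) - delta

def select_branch(branches, t):
    """Probe only the arm with the smallest LCB (most promising)."""
    return min(branches, key=lambda b: lcb_leaf(b["mu"], b["n"], t))

arms = [{"name": "math", "mu": 0.2, "n": 5},
        {"name": "qa",   "mu": 0.5, "n": 9}]
print(select_branch(arms, t=10)["name"])
```

Because only one branch is descended per step, the number of similarity evaluations tracks the tree depth rather than the total task count.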

Theoretical analysis demonstrates that the regret bound for this approach improves from $\mathcal{O}(\sqrt{N})$ (linear search) to $\mathcal{O}(\sqrt{\log N})$, where $N$ is the total count of tasks.

4. Sparse Gradient Updates for Large Pre-trained Models

TreeLoRA employs sparse, low-rank updating to optimize adapter parameters efficiently.

  • Update Formula: For new task $n$, the update is:

$\mathbf{w}_n^{t+1} = \mathbf{w}_n^t - \alpha \cdot S\left\{\nabla \ell(\mathbf{w}_n^t; \mathbf{x}_t, y_t) - \lambda \cdot \nabla \ell_{\mathrm{reg}}^t\right\}$

with $S\{\cdot\}$ denoting the low-rank sparsification function (i.e., only the most relevant parameters are updated), and $\ell_{\mathrm{reg}}^t = \|\xi_t^k\|_1$ representing the L1 norm of the difference between current and reference gradients.

  • Practical Implementation: Standard LoRA modules are used to parameterize each layer. Sparse gradient selection ensures that only a minimal subset of adapter parameters changes for any given new task.
  • Resource Impact: Constant GPU memory and nearly constant per-task storage are observed during adaptation, rendering the approach scalable to long task sequences and very large models.
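A hedged sketch of such a sparse step, substituting a simple top-k magnitude sparsifier for the paper's low-rank selection (all names and the choice of k are our own assumptions):

```python
# Sketch (our simplification) of the sparse update rule: keep only the
# largest-magnitude entries of the combined gradient, zero the rest,
# then take a plain SGD step.

def sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad, zero out the others."""
    keep = set(sorted(range(len(grad)),
                      key=lambda i: abs(grad[i]), reverse=True)[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]

def sparse_update(weights, grad, reg_grad, alpha=0.1, lam=0.01, k=2):
    """w <- w - alpha * S{grad - lam * reg_grad}."""
    combined = [g - lam * r for g, r in zip(grad, reg_grad)]
    step = sparsify(combined, k)
    return [w - alpha * s for w, s in zip(weights, step)]

w = [1.0, 1.0, 1.0, 1.0]
g = [0.5, -0.01, 0.02, -0.8]
r = [0.0, 0.0, 0.0, 0.0]
print(sparse_update(w, g, r))  # only two coordinates move
```

Because untouched coordinates keep their old values exactly, memory for per-task deltas stays small regardless of how many tasks have been seen.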

5. Theoretical Analysis and Efficiency Guarantees

TreeLoRA is theoretically justified through formal regret analysis under tree-based task grouping.

  • Pseudo-Regret Definition:

$\mathrm{Reg}(T) = \mathbb{E} \left[ \sum_{t=1}^T \xi_t^k - \min_{k^\star \in [N]} \sum_{t=1}^T \xi_t^{k^\star} \right]$

  • Smooth-Tree Assumption: If intra-branch tasks are sufficiently similar (quantified by the smoothness parameter $\delta_d$), the regret bound tightens.
  • Main Theoretical Result:

$\mathrm{Reg}(T) \leq \mathcal{O}\left( \sqrt{T |J_\eta| \log \frac{N T}{|J_\eta|}} + \frac{\delta^c}{\eta^{2+c}} \log \frac{N T}{\eta^2} \right)$

where $|J_\eta|$ is the count of nearly-optimal leaves (those with suboptimality $\leq \eta$).

  • Interpretation: For well-structured trees (small $|J_\eta|$), adaptation cost scales logarithmically in the number of tasks $N$.

This formalism underpins both the scalability and the empirical performance of the system.

6. Empirical Performance and Practical Outcomes

TreeLoRA has been benchmarked on both vision and language domain tasks.

  • Vision Transformers (ViTs):
    • Benchmarks: Split CIFAR-100, Split ImageNet-R, Split CUB-200.
    • Baselines: SeqLoRA, GEM, EWC, L2P, DualPrompt, HiDePrompt, HiDeLoRA.
    • Outcomes: Highest accuracy and efficiency among the compared methods, achieving up to a $3.2\times$ speedup over the leading baseline. Adaptation is rapid: the system reaches 98.8% of its final accuracy in just 2–3 epochs.
  • LLMs:
    • Datasets: TRACE, Math-LLM.
    • Models: Mistral-7B, LLaMA-2-7B, Gemma-2B, LLaMA-3.2-1B, LLaMA-13B.
    • Outcomes: Consistently achieves top or on-par performance, records lowest forgetting, and attains a $2.4\times$ speedup relative to HiDeLoRA and O-LoRA.
    • Ablations: Both the hierarchical regularization and LCB-driven bandit strategy are essential for optimal results.
  • Visualization: Learned trees display natural grouping of similar tasks, e.g., clustering ScienceQA and NumGLUE math tasks.

The empirical results substantiate the theoretical claims and demonstrate cross-domain applicability.

7. Implementation Workflow

The TreeLoRA approach closely follows the structure below:

for each task n:
    receive data S_n
    for each training step t:
        # Bandit-based Tree Navigation
        select most promising tree branch via LCB
        # Sparse LoRA update
        perform sparse gradient update to task-relevant LoRA adapters
        update estimated similarity for the branch
    # End of task
    record updated adapter(s) and update tree structure accordingly

This procedure realizes both the computational efficiency and the adaptation selectivity central to the framework.


| Aspect | TreeLoRA Approach | Benefit |
| --- | --- | --- |
| Task grouping | K-D tree via gradient similarity | Efficient grouping, scalable search |
| Adapter type | Layer-wise, bandit-guided LoRA adapters | Shared & task-specific parameters |
| Similarity estimation | Bandit search (LCB) | Logarithmic search time, efficiency |
| Parameter updates | Sparse gradient, low-rank | Constant memory, efficient updates |
| Theory | Regret bound $\mathcal{O}(\sqrt{\log N})$ | Provable efficiency |
| Experiments | Top performance/speedup on ViTs, LLMs | Practical, scalable CL |

TreeLoRA advances the field of continual learning for large pre-trained models by synthesizing hierarchical task clustering, efficient parameter adaptation, and scalable task selection underpinned by regret analysis. The method achieves superior empirical results in both accuracy and computational efficiency, particularly in scenarios that require long task sequences and adaptation without catastrophic forgetting (Qian et al., 12 Jun 2025).
