TreeLoRA: Hierarchical Gradient Similarity Trees
- The paper introduces TreeLoRA, which integrates hierarchical gradient-based task grouping with layer-wise low-rank adapters to enable efficient continual learning with large models.
- It employs a K-D tree structure and bandit-driven similarity estimation to group tasks and minimize catastrophic forgetting while ensuring rapid adaptation.
- Empirical benchmarks on vision transformers and large language models show significant speedups and constant GPU memory usage, validating its scalable design.
Hierarchical Gradient Similarity Trees, known as TreeLoRA, constitute a continual learning framework that integrates hierarchical gradient-based task grouping with efficient, layer-wise parameter adaptation for large pre-trained models (LPMs), such as vision transformers (ViTs) and LLMs. TreeLoRA addresses challenges inherent to streaming environments—most notably the dual imperative to update models quickly for new tasks while minimizing catastrophic forgetting—by combining a K-D tree-based adapter organization, sparse and selective updating, and a bandit-driven strategy for scalable task similarity estimation.
1. Hierarchical Task Structuring via Gradient Similarity
TreeLoRA constructs a hierarchical tree, specifically a K-D tree, to represent relationships between tasks encountered in a continual learning scenario. This tree is incrementally expanded as new tasks are introduced.
- Nodes: The root encapsulates all tasks. Intermediate nodes represent groups of similar tasks, and leaves correspond to individual tasks.
- Task Assignment: Task grouping is determined by the L1-norm similarity of expected gradient directions. For task gradients $g_i$ and $g_j$, tasks $i$ and $j$ are grouped if $\|g_i - g_j\|_1 \le \epsilon$, where $\epsilon$ is an automatically determined per-split threshold.
- Knowledge Aggregation: Internal nodes facilitate the sharing of knowledge across similar tasks, with higher-tier nodes typically associated with more generic representations.
This hierarchical structure enables efficient task look-up and permits adaptation strategies that respect both global and task-specific learning needs.
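To make the grouping rule concrete, here is a minimal Python sketch of the L1-similarity assignment step. This is a flat grouping rather than the full K-D tree, and `assign_task` together with its centroid heuristic is an illustrative assumption, not the paper's implementation:

```python
def l1_dist(g1, g2):
    """L1 distance between two gradient vectors."""
    return sum(abs(a - b) for a, b in zip(g1, g2))

def assign_task(groups, grad, eps):
    """Assign a new task's gradient to the nearest existing group if its
    L1 distance to the group centroid is within eps; otherwise open a new
    group. Returns the index of the chosen group."""
    best, best_d = None, float("inf")
    for i, members in enumerate(groups):
        centroid = [sum(col) / len(members) for col in zip(*members)]
        d = l1_dist(grad, centroid)
        if d < best_d:
            best, best_d = i, d
    if best is not None and best_d <= eps:
        groups[best].append(grad)
        return best
    groups.append([grad])
    return len(groups) - 1

groups = []
assign_task(groups, [1.0, 0.0], eps=0.5)   # first task opens group 0
assign_task(groups, [1.1, 0.05], eps=0.5)  # similar gradient joins group 0
assign_task(groups, [-1.0, 2.0], eps=0.5)  # dissimilar gradient opens group 1
```

In the full method this assignment is applied recursively down the tree, so a new task descends from coarse groups toward the most similar leaf.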
2. Layer-wise Low-Rank Adapter Guidance
TreeLoRA leverages the hierarchical tree to direct LoRA-based parameter adaptation at each transformer layer.
- Adapter Sharing: Each node at a given layer is associated with a LoRA module. These modules serve as parameter-efficient adapters and are shared across all descendant subtasks.
- Selective Updating: When a novel task is incorporated, only those adapters along the root-to-leaf path corresponding to the most similar task grouping are candidates for update. This mechanism enforces a logarithmic lookup path instead of a linear search, constraining both GPU memory and storage requirements.
- Knowledge Retention: By localizing updates within relevant adapters, TreeLoRA minimizes negative interference with prior tasks, thereby mitigating catastrophic forgetting.
The result is a significant reduction in update overhead while permitting rapid adaptation to new tasks.
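The root-to-leaf adapter composition can be illustrated with a small sketch. The `effective_weight` helper and the toy rank-1 adapters below are assumptions for illustration, not the paper's code:

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def effective_weight(W0, path_adapters):
    """Frozen base weight W0 plus the low-rank deltas B @ A of every
    adapter on the root-to-leaf path for the current task."""
    W = [row[:] for row in W0]
    for B, A in path_adapters:  # each adapter is rank-r: B is d x r, A is r x d
        delta = matmul(B, A)
        for i in range(len(W)):
            for j in range(len(W[0])):
                W[i][j] += delta[i][j]
    return W

W0 = [[1.0, 0.0], [0.0, 1.0]]                   # 2x2 frozen base weight
root_adapter = ([[1.0], [0.0]], [[0.5, 0.5]])   # shared across the branch
leaf_adapter = ([[0.0], [1.0]], [[0.25, 0.0]])  # task-specific
W = effective_weight(W0, [root_adapter, leaf_adapter])
```

Only the adapters on the selected path receive gradient updates for a new task; the base weight and all off-path adapters stay frozen.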
3. Bandit-Based Task Similarity Estimation
A bottleneck in scaling continual learning systems is the cost of computing pairwise similarity between ever-growing sets of tasks. TreeLoRA circumvents this using a Multi-Armed Bandit (MAB) abstraction.
- Arms as Task Groups: Each previously encountered task or group (i.e., tree branch) is regarded as a bandit arm.
- Lower Confidence Bound (LCB): The adaptation pathway is chosen by computing an LCB for each arm:
$\operatorname{LCB}_k = \begin{cases} \mu_k - 2\sqrt{\frac{\log t}{n_k}}, & \text{if } k \text{ is a leaf} \\ \min_{j \in \mathcal{C}_k} \left\{ \mu_j - 2\sqrt{\frac{\log t}{n_j}} \right\} - \delta, & \text{otherwise} \end{cases}$
where $\mu_k$ is the current similarity estimate for arm $k$, $n_k$ denotes its selection count, $t$ is the current round, and $\mathcal{C}_k$ are the children of node $k$.
- Exploration and Exploitation: Only the most promising branch is probed at each step, reducing similarity computation from linear to logarithmic in the number of tasks.
Theoretical analysis demonstrates that the search cost for this approach improves from linear in the number of tasks (exhaustive pairwise comparison) to logarithmic, matching the tree depth.
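A minimal sketch of the LCB selection rule, assuming similarity is measured as a distance (lower is better, so the arm with the smallest LCB is probed). The function names and toy arm statistics are illustrative, not from the paper:

```python
import math

def lcb_leaf(mu, n, t):
    """Lower confidence bound for a leaf arm: distance estimate mu minus
    an exploration bonus that shrinks as the arm is probed more often."""
    return mu - 2.0 * math.sqrt(math.log(t) / n)

def lcb_internal(children, t, delta):
    """Internal node: tightest child LCB, relaxed by smoothness slack delta."""
    return min(lcb_leaf(mu, n, t) for mu, n in children) - delta

# Three leaf arms at round t = 100: (distance estimate, times probed).
# Arm 1 has a mediocre estimate but is under-explored.
arms = [(0.2, 50), (0.3, 5), (0.9, 50)]
scores = [lcb_leaf(mu, n, t=100) for mu, n in arms]
best = min(range(len(arms)), key=lambda k: scores[k])
```

Here the under-explored arm 1 wins the round despite its weaker estimate, which is exactly the exploration behavior the bonus term is designed to produce.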
4. Sparse Gradient Updates for Large Pre-trained Models
TreeLoRA employs sparse, low-rank updating to optimize adapter parameters efficiently.
- Update Rule: For a new task $n$, the adapter parameters receive a sparse low-rank update: a sparsification function selects only the most relevant parameters for modification, guided by the L1-norm of the difference between the current and reference gradients.
- Practical Implementation: Standard LoRA modules are used to parameterize each layer. Sparse gradient selection ensures that only a minimal subset of adapter parameters changes for any given new task.
- Resource Impact: Constant GPU memory and nearly constant per-task storage are observed during adaptation, rendering the approach scalable to long task sequences and very large models.
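A hedged sketch of the sparse update step, using simple top-k magnitude selection as a stand-in for the paper's sparsification function; `sparsify_topk`, `sparse_update`, and the learning-rate choice are assumptions for illustration:

```python
def sparsify_topk(grad, k):
    """Keep only the k largest-magnitude entries of a flat gradient,
    zeroing the rest (a simple stand-in for the sparsification step)."""
    if k >= len(grad):
        return list(grad)
    threshold = sorted((abs(g) for g in grad), reverse=True)[k - 1]
    kept, out = 0, []
    for g in grad:
        if abs(g) >= threshold and kept < k:
            out.append(g)
            kept += 1
        else:
            out.append(0.0)
    return out

def sparse_update(params, grad, lr, k):
    """Apply a sparse SGD step: only the top-k gradient coordinates move."""
    sg = sparsify_topk(grad, k)
    return [p - lr * g for p, g in zip(params, sg)]

params = [1.0, 1.0, 1.0, 1.0]
grad = [0.9, -0.05, 0.02, -0.8]
new = sparse_update(params, grad, lr=0.1, k=2)  # only indices 0 and 3 change
```

Because the untouched coordinates require no optimizer state, this is one way the per-task memory footprint can stay near constant as the task sequence grows.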
5. Theoretical Analysis and Efficiency Guarantees
TreeLoRA is theoretically justified through formal regret analysis under tree-based task grouping.
- Pseudo-Regret: the cumulative gap, accumulated over adaptation rounds, between the similarity of the branch selected by the bandit and that of the best available branch.
- Smooth-Tree Assumption: If intra-branch tasks are sufficiently similar (quantified by smoothness ), the regret bound tightens.
- Main Theoretical Result: the regret bound scales with the number $K$ of nearly-optimal leaves (those within suboptimality $\delta$ of the best leaf), rather than with the total number of tasks.
- Interpretation: For well-structured trees (small $K$), adaptation cost scales logarithmically in the number of tasks $N$.
This formalism underpins both the scalability and the empirical performance of the system.
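For concreteness, a standard pseudo-regret definition consistent with the bandit view above can be written as follows; the notation ($\mu^*$ for the best branch, $k_t$ for the branch chosen at round $t$) is assumed here, since the paper's exact statement is not reproduced in this summary:

```latex
% Assumed notation: \mu_k is the (distance-based) similarity of branch k,
% \mu^{*} = \min_k \mu_k is the best branch, and k_t is the branch
% selected at round t. Lower \mu means a more similar branch.
\bar{R}_T \;=\; \sum_{t=1}^{T} \left( \mu_{k_t} - \mu^{*} \right)
```

Bounding this quantity in terms of the near-optimal leaf count $K$ is what yields the logarithmic adaptation cost claimed above.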
6. Empirical Performance and Practical Outcomes
TreeLoRA has been benchmarked on both vision and language domain tasks.
- Vision Transformers (ViTs):
- Benchmarks: Split CIFAR-100, Split ImageNet-R, Split CUB-200.
- Baselines: SeqLoRA, GEM, EWC, L2P, DualPrompt, HiDePrompt, HiDeLoRA.
- Outcomes: Highest accuracy and efficiency, with a significant speedup over the leading baseline. Rapid adaptation: the system reaches 98.8% of final accuracy in just 2–3 epochs.
- LLMs:
- Datasets: TRACE, Math-LLM.
- Models: Mistral-7B, LLaMA-2-7B, Gemma-2B, LLaMA-3.2-1B, LLaMA-13B.
- Outcomes: Consistently achieves top or on-par performance, records the lowest forgetting, and attains a notable speedup relative to HiDeLoRA and O-LoRA.
- Ablations: Both the hierarchical regularization and LCB-driven bandit strategy are essential for optimal results.
- Visualization: Learned trees display natural grouping of similar tasks, e.g., clustering ScienceQA and NumGLUE math tasks.
The empirical results substantiate the theoretical claims and demonstrate cross-domain applicability.
7. Implementation Workflow
The TreeLoRA approach closely follows the structure below:
```python
for n, task in enumerate(task_stream):
    S_n = receive_data(task)
    for t, batch in enumerate(S_n):
        # Bandit-based tree navigation:
        # select the most promising branch via the LCB rule
        branch = select_branch_lcb(tree, t)
        # Sparse LoRA update on the task-relevant adapters
        sparse_lora_update(branch.adapters, batch)
        # Refine the similarity estimate for the probed branch
        update_similarity(branch)
    # End of task: record updated adapter(s) and grow the tree
    tree.insert_leaf(n, branch)
```
This procedure realizes both the computational efficiency and the adaptation selectivity central to the framework.
| Aspect | TreeLoRA Approach | Benefit |
|---|---|---|
| Task grouping | K-D tree via gradient similarity | Efficient grouping, scalable search |
| Adapter type | Layer-wise, bandit-guided LoRA adapters | Shared & task-specific parameters |
| Similarity Est. | Bandit search (LCB) | Logarithmic search time, efficiency |
| Parameter Upd. | Sparse gradient, low-rank | Constant memory, efficient updates |
| Theory | Regret bounds | Provable efficiency |
| Experiment | Top performance/speedup on ViTs, LLMs | Practical, scalable CL |
TreeLoRA advances the field of continual learning for large pre-trained models by synthesizing hierarchical task clustering, efficient parameter adaptation, and scalable task selection underpinned by regret analysis. The method achieves superior empirical results in both accuracy and computational efficiency, particularly in scenarios that require long task sequences and adaptation without catastrophic forgetting (Qian et al., 12 Jun 2025).