
Hi-Vec: Hierarchical Adaptive Networks with Task Vectors

Updated 18 August 2025
  • Hi-Vec is a modular, hierarchical adaptive network framework that uses task-specific vectors to guide layer-wise adaptation in multi-task scenarios.
  • It dynamically selects optimal layers by minimizing adaptation loss gradients and merges weights based on task vector similarity to achieve robust test-time performance.
  • The architecture ensures parameter efficiency and scalability through low-rank compression, weight merging, and consensus mechanisms, proving effective across diverse benchmarks.

Hierarchical Adaptive Networks with Task Vectors (Hi-Vec) form a modular framework for dynamic adaptation of neural networks, particularly targeting scenarios that involve incremental multi-task adaptation, robust test-time generalization, and scalable, parameter-efficient model fusion. Hi-Vec leverages hierarchical architectural structures (typically ensembles of layer-wise or feature-wise modules) that are adaptively modulated by compact, task-specific vectors. These vectors encode explicit or implicit task information to guide model behavior across domains, tasks, or data distributions.

1. Hierarchical Organization and Layer-Wise Structures

Hi-Vec architectures instantiate hierarchy by attaching multiple linear or nonlinear layers that operate at different granularities over the encoder's latent representations (Ambekar et al., 11 Aug 2025). For a feature vector $z \in \mathbb{R}^d$ computed by a backbone, the framework decomposes this representation into ordered subsets:

$$\{z_{1:k} \in \mathbb{R}^k \mid k \in \mathcal{M}\}, \quad m_1 < m_2 < \ldots < m_K = d$$

Each subset feeds into corresponding classifier or adapter layers, enabling coarse-to-fine mappings from latent space to output predictions. Task vectors annotate each hierarchical layer, defining either explicit geometric signatures (e.g., singular vectors (Gargiulo et al., 26 Nov 2024), cluster centroids (Chennupati et al., 2020)) or implicit modulation signals (e.g., hypernetwork-generated parameters (Jin et al., 2023), query-specific vectors (Kang et al., 3 Jun 2025), binary masks (Mancini et al., 2018)). This design supports dynamic routing, selective adaptation, and differentiated learning across scales of abstraction.
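As a concrete illustration of this layer-wise structure, the sketch below attaches one linear head per nested feature subset of sizes $m_1 < \ldots < m_K = d$. The module and names (`HierarchicalHeads`, `subset_dims`) are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class HierarchicalHeads(nn.Module):
    """One linear head per nested feature-subset size m_k (illustrative sketch)."""

    def __init__(self, subset_dims, num_classes):
        super().__init__()
        assert subset_dims == sorted(subset_dims), "subset sizes must be ordered"
        self.subset_dims = subset_dims
        self.heads = nn.ModuleList([nn.Linear(m_k, num_classes) for m_k in subset_dims])

    def forward(self, z):
        # z: (batch, d) backbone features; the k-th head sees only the first m_k dims.
        return [head(z[:, :m_k]) for head, m_k in zip(self.heads, self.subset_dims)]

# Usage: coarse-to-fine predictions from a d = 512 feature vector.
heads = HierarchicalHeads(subset_dims=[64, 128, 256, 512], num_classes=10)
logits_per_scale = heads(torch.randn(8, 512))
```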

2. Dynamic Layer Selection and Test-Time Adaptation

Standard adaptation methods often fine-tune a single linear layer per test batch. Hi-Vec introduces dynamic selection: for each incoming batch, the optimal hierarchical layer $\phi^*$ is identified by minimizing the gradient norm of an unsupervised adaptation loss (such as entropy minimization):

$$\phi^* = \underset{\phi \in \Phi}{\arg\min}\ \|\nabla_{W_\phi} \mathcal{L}_t\|$$

This enables the selection of the most relevant “scale” (i.e., feature subset) for adaptation against domain shifts of varying complexity (Ambekar et al., 11 Aug 2025). Successive layers, paired with their task vectors, may capture different aspects of the distribution, from global context to fine-grained variations, improving coverage for diverse target scenarios.
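A minimal sketch of this selection rule, reusing the `HierarchicalHeads` module above, is given below; the entropy loss and gradient-norm criterion follow the description, while the exact loss used by Hi-Vec may differ.

```python
import torch
import torch.nn.functional as F

def select_layer_by_grad_norm(heads, z):
    """Pick the head whose entropy loss has the smallest gradient norm
    w.r.t. that head's own weights (illustrative stand-in for phi*)."""
    best_idx, best_norm = 0, float("inf")
    for i, (head, m_k) in enumerate(zip(heads.heads, heads.subset_dims)):
        logits = head(z[:, :m_k])
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=-1).mean()
        (grad,) = torch.autograd.grad(entropy, head.weight)
        norm = grad.norm().item()
        if norm < best_norm:
            best_idx, best_norm = i, norm
    return best_idx

# Usage: choose the adaptation scale for one unlabeled test batch.
phi_star = select_layer_by_grad_norm(heads, torch.randn(8, 512))
```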

3. Weight Merging, Task Vector Propagation, and Agreement Mechanisms

After dynamically adapting a selected layer, Hi-Vec propagates the modified weights or representations to other layers displaying high task vector similarity. This "weight merging" ensures that target-specific information permeates through the network, maintaining cross-layer coherence:

$$W_\phi \leftarrow W_\phi + \alpha \cdot \text{proj}_{W_\phi}(W_{\phi^*})$$

where similarity is measured, for example, by cosine similarity of associated task vectors. To avoid erroneous adaptation on noisy or adversarial batches, Hi-Vec implements "linear layer agreement": the mutual information $I(p^*, p_\phi)$ between selected and other layer outputs serves as a gating variable. Batches with low average agreement (below a threshold $\tau_{OOD}$) are skipped, preventing error accumulation and instability (Ambekar et al., 11 Aug 2025).
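The snippet below sketches both mechanisms under simplifying assumptions: task vectors are taken to share a common dimensionality, the "projection" is reduced to cropping to the receiving layer's shape, and prediction overlap stands in for the mutual-information agreement score; $\alpha$, the similarity threshold, and $\tau_{OOD}$ are illustrative values.

```python
import torch
import torch.nn.functional as F

def merge_similar_layers(heads, task_vectors, phi_star, alpha=0.5, sim_threshold=0.8):
    """Propagate the adapted weights of layer phi* into layers whose task
    vectors are cosine-similar (crop-to-shape stands in for proj)."""
    w_star = heads.heads[phi_star].weight.detach()
    v_star = task_vectors[phi_star]
    for i, head in enumerate(heads.heads):
        if i == phi_star:
            continue
        if F.cosine_similarity(task_vectors[i], v_star, dim=0) >= sim_threshold:
            cols = min(head.weight.shape[1], w_star.shape[1])
            with torch.no_grad():
                head.weight[:, :cols] += alpha * w_star[:, :cols]

def passes_agreement_gate(probs_star, probs_others, tau_ood=0.2):
    """Skip adaptation when the selected layer disagrees with the rest:
    mean prediction overlap is a crude proxy for I(p*, p_phi)."""
    overlaps = [(probs_star.argmax(-1) == p.argmax(-1)).float().mean()
                for p in probs_others]
    return float(torch.stack(overlaps).mean()) >= tau_ood
```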

4. Task Vector Principles and Low-Rank Compression

Task vectors in Hi-Vec serve as modulating signals for adaptation, classification, retrieval, and fusion. They are constructed by diverse mechanisms:

  • Affine binary-masked transformations $\tilde{W} = k_0 W + k_1 \mathbf{1} + k_2 M$ (where $M$ is a learned binary mask) provide flexible, parameter-efficient adaptation by controlling weight utilization and bias (Mancini et al., 2018); a minimal sketch of this transformation follows the list.
  • Singular Value Decomposition (SVD): Task vectors are realized as principal singular vectors (TSVs) from per-layer weight difference matrices, capturing dominant directions of task adaptation. Low-rank compression (TSV-Compress) retains only the top $k$ components, reducing storage by up to 90% with negligible accuracy loss (Gargiulo et al., 26 Nov 2024).
  • Latent embeddings: In hierarchical meta-learning, task vectors are latent data or trajectory representations that allow selectors or experts to specialize efficiently (Hihn et al., 2019).
  • Attention-based task summaries: Hierarchical vision-language representations fuse features across layers, extracting task vectors as intermediate summary embeddings for each objective (Nguyen et al., 2018).
  • Hypernetwork-generated parameters: Task vectors (possibly constructed from IDs or language) condition hypernetworks to synthesize parameters for downstream modules (Jin et al., 2023).
  • Adaptive query-conditioned vectors: A small LLM generates a vector per input query, which is expanded and injected into LLM layers for dynamic modulation (Kang et al., 3 Jun 2025).
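As referenced in the first bullet, a minimal sketch of the affine binary-masked transformation follows; here the scalars $k_0, k_1, k_2$ and the mask are fixed for clarity, whereas in the cited method they are learned.

```python
import torch

def affine_binary_mask_transform(w, mask, k0=1.0, k1=0.0, k2=1.0):
    """W~ = k0*W + k1*1 + k2*M: an affine reparameterization of base weights
    controlled by a binary mask M (values fixed here for illustration)."""
    return k0 * w + k1 * torch.ones_like(w) + k2 * mask.float()

# Usage: a random 0/1 mask selectively shifts a subset of the base weights.
w = torch.randn(64, 128)
mask = torch.rand_like(w) > 0.5
w_task = affine_binary_mask_transform(w, mask, k0=1.0, k1=0.1, k2=0.5)
```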

Task vectors generally provide fine-grained, interpretable control over adaptation, enable modular layer-wise routing, and facilitate robust multi-task integration. The explicit construction (e.g., using SVD) also enables measurement of interference between tasks and subsequent decorrelation, e.g., with whitening transforms (Gargiulo et al., 26 Nov 2024).
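A sketch of the SVD-based construction and its low-rank compression is given below; the rank $k$, variable names, and rank-$k$ reconstruction are illustrative choices rather than the TSV-Compress reference code.

```python
import torch

def task_singular_vectors(w_finetuned, w_base, k=8):
    """Factor the per-layer weight difference, keep the top-k singular triplets
    as the task's signature, and return the rank-k reconstruction as a
    compressed task-specific update (illustrative sketch)."""
    delta = w_finetuned - w_base                      # task-specific update
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]       # dominant directions
    delta_lowrank = U_k @ torch.diag(S_k) @ Vh_k      # compressed update
    return (U_k, S_k, Vh_k), delta_lowrank

# Usage: compress a 512x512 layer update to rank 8 (~3% of the original entries).
w_base, w_tuned = torch.randn(512, 512), torch.randn(512, 512)
tsv, delta_k = task_singular_vectors(w_tuned, w_base, k=8)
```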

5. Scalability, Parameter Efficiency, and Practical Deployment

Hi-Vec is explicitly designed to minimize per-task parameter overhead, benefiting incremental learning and multi-task scaling. Binary mask strategies add only \approx1 bit per parameter per task (Mancini et al., 2018). Hierarchical adapters with shared recurrent controllers and small task-specific heads further reduce parameter growth (Munkhdalai et al., 25 Mar 2024). Model compression via task vector low-rank approximation enables merging many tasks while preserving up to 99% of individual accuracies, with only minor losses for large ensembles (Gargiulo et al., 26 Nov 2024).
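To make the per-task overhead concrete, a back-of-the-envelope comparison (the parameter count is an assumed, illustrative figure):

```python
# A dense float32 copy costs 32 bits/parameter; a binary mask costs ~1 bit/parameter.
n_params = 25_000_000                      # assumed backbone size (illustrative)
dense_mb = n_params * 32 / 8 / 1e6         # ~100 MB per additional task
mask_mb = n_params * 1 / 8 / 1e6           # ~3.1 MB per additional task
print(f"dense copy: {dense_mb:.1f} MB/task, binary mask: {mask_mb:.1f} MB/task")
```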

Cluster-informed modular decompositions enable scalable training and deployment in tasks with tens of thousands of classes; distributed classifier selection and parallel training are facilitated by cluster-level task vectors (Chennupati et al., 2020). In hierarchical expert networks, information-theoretic control over partitioning and specialization guarantees that only meaningful distinctions are encoded, preventing wasted capacity (Hihn et al., 2019).

6. Applications and Benchmarks

Hi-Vec frameworks have demonstrated empirical gains in:

  • Test-time adaptation: Enhanced robustness and accuracy under outliers and distribution shifts on image and tabular benchmarks (e.g., CIFAR-10-C, WaterBirds, ColoredMNIST) (Ambekar et al., 11 Aug 2025).
  • Incremental learning: Avoidance of catastrophic forgetting and low-cost scalability on multi-domain recognition challenges (e.g., Visual Decathlon) (Mancini et al., 2018).
  • Multi-task vision-language learning: Improved performance and cross-task generalization through hierarchical fusion and attention map visualization (Nguyen et al., 2018).
  • Meta-learning: Faster adaptation and lower generalization error in image classification, regression, and sequential reinforcement learning (Hihn et al., 2019).
  • Speech recognition: Parameter-efficient adaptation to hundreds of speakers, outperforming full fine-tuning and adapter baselines (Munkhdalai et al., 25 Mar 2024).
  • Model-merging: TSV-based layer-wise merging reaches up to ~97% normalized accuracy when aggregating up to 20 tasks, outperforming vector arithmetic or consensus methods by ~15 percentage points (Gargiulo et al., 26 Nov 2024).
  • Dynamic LLM steering: Query-adaptive task vector injection achieves higher accuracy and generalization than in-context learning, LoRA, or prefix-tuning (Kang et al., 3 Jun 2025).
  • Reinforcement learning: Hypernetwork-directed task-adaptive retrieval modules augment policy learning speed and sample efficiency (Jin et al., 2023).

7. Outlook and Future Directions

Current work suggests several forward-looking opportunities:

  • Hierarchical, multi-level adaptive frameworks may improve granularity of contextual adaptation, not only in traditional vision and language tasks but also in multi-modal, continual, and open-world learning.
  • Modular architectures combining fixed, compressed, and dynamically generated task vectors are poised to efficiently address scaling and transfer challenges.
  • Task vector-based measures of interference and adaptive compression (using rank-selection and whitening) can enhance robustness in large model fusion and lifelong learning.
  • Plug-and-play extension of Hi-Vec concepts to parameter-free and non-gradient adaptation strategies may further broaden real-world applicability.

A plausible implication is that further refinement of hierarchical adaptation, dynamic routing, and task-specific modulation through compact task vectors will substantially improve neural network portability, generalization, and efficiency in diverse domains.