- The paper introduces HeadRouter, which dynamically assigns head weights to adaptively prune audio tokens, ensuring minimal performance loss.
- Its novel selectivity-driven routing mechanism leverages per-sample statistics to differentiate between semantic and acoustic tasks.
- Experimental results show improved F1-scores and reduced memory footprint, outperforming existing audio token compression methods.
HeadRouter: Task-Adaptive Dynamic Head-Weight Routing for Audio Token Pruning in LALMs
Motivation and Challenges in Audio Token Compression
Large audio-language models (LALMs) couple audio encoders directly with pretrained LLMs, avoiding the limitations of cascaded systems such as ASR-to-text pipelines. However, the long contexts and high frame rates of audio produce extremely long token sequences, which increase inference latency and KV-cache memory demands. This substantially limits the practicality of deploying LALMs in scenarios that require long-context reasoning.
Existing token compression approaches—similarity-based, temporal-based, and attention-based—are not optimized for the audio modality's inherent heterogeneity. These methods generally assume uniform importance distribution across attention heads, often averaging head-wise scores, and fail to consider the distinct behaviors of heads depending on whether an audio task is semantic (e.g., ASR, speech entity recognition) or acoustic (e.g., event detection, speaker identification).
Figure 1: Task categorization in AudioMarathon: acoustic vs. semantic domains.
Such task-level heterogeneity, summarized in Figure 1, means that no single pruning profile can robustly serve the full spectrum of audio tasks. Prior techniques also exhibit systematic biases: similarity-based approaches discard meaningful redundancy, temporal methods retain noisy content, and existing attention-based schemes inherit a strong positional bias, typically retaining tokens at the tail of the sequence regardless of content.
Empirical Analysis of Head-Behavior Heterogeneity
Comprehensive analysis on benchmarks like AudioMarathon identifies pronounced head-behavior heterogeneity:
- Cluster Divergence: t-SNE visualizations of head selectivity vectors reveal clear separation between semantic and acoustic tasks, with mixed-nature tasks occupying intermediate regions.
Figure 2: t-SNE projection of per-sample 16D head selectivity vectors; clear task clustering demonstrates head-behavior divergence.
- Selectivity Patterns: Acoustic tasks activate a sparse subset of highly selective heads, while semantic tasks yield distributed, homogeneous selectivity across all heads.
Figure 3: Heatmap of head selectivity over AudioMarathon tasks, contrasting concentrated head usage in acoustic tasks with distributed semantic task patterns.
- Token Selection Visualizations: Compared against an oracle (energy-based) pruning strategy, existing methods select tokens that are poorly aligned with the salient ones, which reinforces the need for task- and head-adaptive pruning.
Figure 4: Example token selection across methods; HeadRouter aligns more closely with oracle energy patterns.
These findings clarify the inadequacy of one-size-fits-all pruning and motivate leveraging statistically robust, input-adaptive head-weight assignment.
The HeadRouter Mechanism
HeadRouter is a training-free dynamic head-weight routing method that employs per-sample selectivity statistics to maximize critical token retention for downstream tasks.
Figure 5: Architecture of HeadRouter—QK probing at layer M computes the selectivity spread; RBF mixing over three calibrated weight profiles adapts token scoring dynamically.
HeadRouter Workflow
- Position-Agnostic QK Probing: Audio and text token embeddings at layer M−1 are projected into query and key spaces (sans RoPE). The attention affinity is computed per head across all audio tokens, and a marginal distribution over audio tokens is aggregated across text queries.
- Selectivity Computation: For each head, normalized entropy of the marginal attention distribution yields a selectivity score, indicating how focused the head is on specific tokens.
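- Routing Signal: The standard deviation of selectivity across all heads (the “spread”) is computed, functioning as a differentiable proxy for task type. Semantic inputs, whose selectivity is distributed evenly across heads, produce a low spread; acoustic inputs, which concentrate selectivity in a sparse subset of heads, produce a high spread.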
- Profile Mixing: Three offline-calibrated head-weight profiles (semantic, acoustic, uniform) are softly mixed using RBF kernels centered at the typical spread of each task type.
- Token Importance Scoring: The final mixed head weights are applied to score audio tokens’ relevance, and a frame-based pre-filtering step may further enhance robustness under extreme pruning ratios.
- Token Pruning: The k most important tokens, as determined by the mixed-weighted score, are retained for downstream LLM processing.
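The workflow above can be sketched in a few NumPy functions. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tensor shapes, the convention that selectivity equals one minus normalized entropy, the shared RBF bandwidth, and all numeric values (spread centers, profiles) are hypothetical. The frame-based pre-filtering step is omitted.

```python
import numpy as np


def head_selectivity(q_text, k_audio):
    """Per-head selectivity from position-agnostic QK affinities.

    q_text:  (H, T, d) text queries, no rotary embedding applied
    k_audio: (H, N, d) audio keys, no rotary embedding applied
    Returns selectivity (H,), approximately in [0, 1], and the
    marginal attention over audio tokens (H, N).
    """
    d = q_text.shape[-1]
    logits = q_text @ k_audio.transpose(0, 2, 1) / np.sqrt(d)  # (H, T, N)
    logits = logits - logits.max(axis=-1, keepdims=True)       # stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    marginal = attn.mean(axis=1)  # aggregate over text queries -> (H, N)
    entropy = -(marginal * np.log(marginal + 1e-12)).sum(axis=-1)
    # ASSUMPTION: selectivity = 1 - normalized entropy (1 means fully focused).
    selectivity = 1.0 - entropy / np.log(marginal.shape[-1])
    return selectivity, marginal


def mix_profiles(spread, centers, profiles, bandwidth):
    """Softly mix head-weight profiles with normalized Gaussian (RBF) kernels.

    centers:  (P,) typical selectivity spread for each profile
    profiles: (P, H) offline-calibrated head weights
    """
    log_k = -((spread - centers) ** 2) / (2.0 * bandwidth ** 2)
    k = np.exp(log_k - log_k.max())  # subtract max to avoid underflow
    coeffs = k / k.sum()
    return coeffs @ profiles, coeffs


def prune_audio_tokens(q_text, k_audio, centers, profiles, bandwidth, keep):
    """Return indices of the `keep` highest-scoring audio tokens, in order."""
    selectivity, marginal = head_selectivity(q_text, k_audio)
    spread = selectivity.std()                    # routing signal
    head_weights, coeffs = mix_profiles(spread, centers, profiles, bandwidth)
    scores = head_weights @ marginal              # (N,) token importance
    kept = np.sort(np.argsort(scores)[-keep:])    # top-k, temporal order kept
    return kept, coeffs


# Hypothetical demo: 16 heads, 4 text queries, 50 audio tokens, head dim 8.
rng = np.random.default_rng(0)
H, T, N, d = 16, 4, 50, 8
q = rng.normal(size=(H, T, d))
k = rng.normal(size=(H, N, d))
profiles = np.stack([
    rng.dirichlet(np.ones(H)),   # stand-in for the semantic profile
    rng.dirichlet(np.ones(H)),   # stand-in for the acoustic profile
    np.full(H, 1.0 / H),         # uniform profile
])
centers = np.array([0.02, 0.15, 0.08])  # made-up spread centers
kept, coeffs = prune_audio_tokens(q, k, centers, profiles, bandwidth=0.05, keep=35)
```

Keeping 35 of 50 tokens corresponds to a 30% pruning ratio. Sorting the retained indices preserves the temporal order of the audio tokens before they are passed on to the LLM, which is a design choice of this sketch rather than something the summary specifies.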
Soft Routing Benefits
Unlike hard assignment, the Gaussian soft profile mixing ensures smooth interpolation for mixed-nature inputs and minimizes performance volatility near task boundaries.
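One way to write the soft mixing formally is given below. The notation is introduced here for illustration and is an assumption: $\sigma$ is the selectivity spread of the current input, $\mu_j$ is the calibrated spread center of profile $j$, $\mathbf{w}^{(j)}$ is that profile's head-weight vector, and $\tau$ is a kernel bandwidth assumed to be shared across profiles. Normalizing the kernels so the coefficients sum to one is likewise an assumption.

```latex
\alpha_j(\sigma) =
  \frac{\exp\!\left(-\dfrac{(\sigma-\mu_j)^2}{2\tau^2}\right)}
       {\sum_{k \in \{\mathrm{sem},\,\mathrm{aco},\,\mathrm{uni}\}}
        \exp\!\left(-\dfrac{(\sigma-\mu_k)^2}{2\tau^2}\right)},
\qquad
\mathbf{w}(\sigma) = \sum_{j} \alpha_j(\sigma)\,\mathbf{w}^{(j)}
```

Under this form, letting $\tau \to 0$ recovers hard assignment to the profile whose center is nearest to $\sigma$, while a larger $\tau$ makes $\mathbf{w}(\sigma)$ vary smoothly as $\sigma$ moves between centers.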
Figure 6: Average mixing coefficients per task; distribution aligns closely with audio task typology, confirming effective routing.
Experimental Results
HeadRouter is evaluated principally on AudioMarathon and MMAU-Pro, with Qwen2.5-Omni-3B/7B and Phi-4-Multimodal serving as underlying backbone models.
- Compression Efficacy: At a 30% token pruning ratio, HeadRouter not only avoids performance loss but exceeds baseline performance (e.g., 101.8% and 103.0% of vanilla on Qwen2.5-Omni-3B and 7B, respectively). Competing methods universally degrade under similar compression.
- Balanced Performance: Unlike DART, which may outperform on narrow tasks (e.g., ASR at severe compression) but sacrifices overall balance, HeadRouter consistently yields the best macro-average due to its input-adaptive strategy.
Figure 7: Comparative task performance (semantic vs acoustic) using head-weight profiles; mismatched profiles impair performance, supporting the need for adaptive routing.
- Generalization: Robustness is demonstrated across backbone architectures and evaluation benchmarks, indicating that the selectivity-driven routing signal generalizes well beyond the initial calibration set.
- Efficiency: Pareto-optimal tradeoffs are observed on F1-score versus peak GPU memory/latency, with HeadRouter providing steep reductions in memory and inference time while sustaining or improving performance.
Figure 8: Memory-performance Pareto front for four AudioMarathon tasks; HeadRouter dominates at all tested retention ratios.
- Ablation: Removal of the routing module or frame-based pre-filtering yields quantifiable drops in performance, especially under high pruning ratios, underlining the necessity of both the adaptive routing and hybrid pruning pipeline.
Figure 9: Ablation analysis—F1 recovery degrades if any component of HeadRouter pipeline is removed.
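The Pareto-optimality claim in the efficiency result above can be checked mechanically once (memory, F1) pairs are available. The sketch below is illustrative only: the function name is hypothetical and the data points are made up, not results from the paper. A configuration is on the front if no other configuration uses no more memory, achieves at least the same F1, and is strictly better on one of the two.

```python
def pareto_front(points):
    """Return the non-dominated (memory, f1) pairs, sorted by memory.

    Lower memory is better; higher F1 is better.
    """
    front = []
    for i, (mem_i, f1_i) in enumerate(points):
        dominated = any(
            mem_j <= mem_i and f1_j >= f1_i and (mem_j < mem_i or f1_j > f1_i)
            for j, (mem_j, f1_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((mem_i, f1_i))
    return sorted(front)


# Made-up (peak memory in GB, F1) pairs for four hypothetical configurations.
configs = [(10.0, 0.80), (8.0, 0.82), (6.0, 0.78), (9.0, 0.75)]
front = pareto_front(configs)
```

A method "dominates at all tested retention ratios" in this sense when, at each ratio, its point lies on the front computed over all methods at that ratio.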
Implications and Future Prospects
The empirical and analytical findings of this work underscore three key implications:
- Importance Averaging is Suboptimal: Uniform head averaging obscures valuable head-behavior heterogeneity, particularly significant in audio domains; input-adaptive, task-aligned assignment is critical.
- Dynamic Routing is Essential for Heterogeneous Modalities: Task-adaptive, content-aware routing using lightweight statistical proxies (e.g., selectivity spread) can be efficiently applied even under tight resource constraints.
- Pruning as Feature Denoising: HeadRouter’s gains at moderate pruning ratios suggest residual tokens may include harmful or noisy content post-encoding; judicious removal—not just memory reduction—can enhance model output.
In practical deployment, HeadRouter's training-free, plug-in design allows immediate porting to broader LALM frameworks, supporting efficient long-context applications such as meetings, podcast analysis, and multimodal assistants. Conceptually, the selectivity-driven soft routing paradigm can generalize to other modalities and future token-efficient architectures.
Conclusion
HeadRouter establishes a new state-of-the-art for audio token pruning in LALMs by exploiting per-task head-behavior heterogeneity and routing adaptively between calibrated head-weight profiles. It reliably delivers lossless or improved performance under aggressive pruning, reduces resource consumption, and generalizes across models and benchmarks. This framework highlights the necessity of input-adaptive approaches for token compression in multimodal models and provides an extensible foundation for efficient and robust long-context audio understanding (2604.23717).