FastLongSpeech: Iterative Fusion for LSLMs
- The paper introduces an iterative fusion extractor that fuses redundant speech frames using content density and cosine similarity for efficient LLM inference.
- FastLongSpeech achieves up to 16× compute reduction by condensing long speech sequences while maintaining semantic integrity and performance on benchmarks.
- Dynamic compression-ratio training transfers LSLM capabilities from short-speech to long-speech tasks, enabling robust handling of extensive audio inputs.
FastLongSpeech introduces an iterative fusion strategy for efficiently compressing long-form speech representations to fit within the inference window of Large Speech-LLMs (LSLMs). Conventional LSLMs face substantial computational and memory bottlenecks with long speech signals, whose frame-wise representation can vastly exceed model limitations. The iterative fusion extractor, positioned between the audio encoder and the LLM, performs content-aware dynamic sequence condensation. It achieves significant reductions in inference complexity by merging redundant spans, leveraging per-frame “content density” signals and pairwise similarity metrics. The approach obviates the need for long-speech-specific training data by transferring LSLM capabilities from short-speech domains via dynamic compression-ratio training, enabling high-fidelity long-speech understanding and generation at a fraction of the compute.
1. Motivation and Conceptual Framework
Long speech signals produce very long frame sequences: at a frame rate of 25 Hz, 5 minutes of audio yields $T = 25 \times 300 = 7{,}500$ frames. Sequences of this length incur prohibitive cost for transformer attention, severely straining GPU memory and runtime in the speech-LLM adaptor. Analysis reveals that adjacent frames often encode redundant information due to low phonetic or semantic variability, a phenomenon characterized by low “content density” and high mutual similarity.
To mitigate this redundancy while retaining essential semantic content, FastLongSpeech introduces an iterative fusion extractor. At each iteration, it (1) computes per-frame content density using CTC model outputs, (2) evaluates cosine similarity between adjacent frames to quantify redundancy, (3) selects the most redundant contiguous spans, and (4) fuses these spans into single representative frames using density-weighted pooling. This process repeats until the sequence, originally $T$ frames long, reaches a target length $L$ that fits the model’s speech window, reducing the attention cost from $O(T^2)$ to $O(L^2)$ with minimal information loss (Guo et al., 20 Jul 2025).
2. Formal Algorithmic Specification and Mathematical Definitions
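Let $H^{(0)} = (h_1, \dots, h_T)$, with $h_j \in \mathbb{R}^d$, represent the encoder outputs for the raw speech sequence. The iterative process is defined as follows for iteration $i$, which maps the current sequence $H^{(i)}$ to a shorter sequence $H^{(i+1)}$: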
2.1 Content Density Computation
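For frame $j$, define the “content density” as:
$$d_j = \sum_{v \neq \phi} p_{\mathrm{CTC}}(v \mid h_j) = 1 - p_{\mathrm{CTC}}(\phi \mid h_j),$$
where $\phi$ denotes the CTC blank. This quantity serves as a saliency measure, prioritizing frames with substantive (non-blank) CTC token probability mass.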
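The density computation can be sketched in a few lines. This is a minimal illustration, assuming the CTC head emits per-frame log-probabilities and that the blank symbol sits at index 0; both the function name `content_density` and the `blank_id` default are hypothetical choices, not details taken from the paper.

```python
import math

def content_density(log_probs, blank_id=0):
    """Per-frame content density: the non-blank probability mass, 1 - P(blank).

    log_probs: one list per frame holding log-probabilities over the CTC
               vocabulary (assumed to be already log-softmax normalized).
    blank_id:  index of the CTC blank symbol (assumed to be 0 here).
    """
    return [1.0 - math.exp(frame[blank_id]) for frame in log_probs]

# Frame 0 is mostly blank (low density); frame 1 is mostly non-blank.
frames = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
]
densities = content_density(frames)
```

Frames dominated by the blank symbol receive a density near 0 and therefore contribute little when they are later pooled with their neighbors.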
2.2 Adjacency-Based Similarity
Redundancy between adjacent frames is assessed by cosine similarity:
$$s_j = \cos(h_j, h_{j+1}) = \frac{h_j^\top h_{j+1}}{\lVert h_j \rVert \, \lVert h_{j+1} \rVert}.$$
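As a small illustration, adjacent-frame cosine similarity can be computed as below. The function name `adjacent_cosine` and the `eps` guard against zero-norm frames are assumptions added for this sketch.

```python
import math

def adjacent_cosine(frames, eps=1e-8):
    """Cosine similarity between each frame and its right-hand neighbor.

    Returns len(frames) - 1 values; entry j compares frame j with frame j + 1.
    """
    sims = []
    for a, b in zip(frames, frames[1:]):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        # eps avoids division by zero for all-zero frames.
        sims.append(dot / max(norm_a * norm_b, eps))
    return sims

# The first two frames point in the same direction; the third is orthogonal.
sims = adjacent_cosine([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
print(sims)  # [1.0, 0.0]
```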
2.3 Iterative Schedule and Span Fusion
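Let $T^{(i)}$ denote the current sequence length and $L$ the target length. Each iteration moves to a shorter length $T^{(i+1)}$ with $L \le T^{(i+1)} < T^{(i)}$. The number of frames to merge is $r^{(i)} = T^{(i)} - T^{(i+1)}$.
The top $r^{(i)}$ most similar adjacent pairs are selected; pairs that overlap chain together automatically into (possibly larger) contiguous spans $[a, b]$. Each such span is fused via a density-weighted sum:
$$\tilde{h}_{[a,b]} = \frac{\sum_{j=a}^{b} d_j \, h_j}{\sum_{j=a}^{b} d_j}.$$
Non-merged frames remain unaltered. The result $H^{(i+1)}$ is the concatenation (in original order) of fused spans and untouched frames, and the process repeats until $T^{(i)} = L$.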
Algorithmic Summary
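Given encoder outputs and a target length, repeat until the sequence reaches the target length:
(1) Compute the content density of each frame from the CTC outputs.
(2) Compute the cosine similarity of each adjacent frame pair.
(3) Select the most similar adjacent pairs, as many as the current iteration must remove, and chain overlapping pairs into contiguous spans.
(4) Replace each span with its density-weighted fusion, leaving all other frames unchanged.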
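The procedure can be sketched end to end in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the halving schedule (each pass shrinks the sequence to at most half its length, never below the target), the normalization of the weighted sum, and the choice to carry the summed density forward for fused frames (rather than re-running the CTC head) are all assumptions, and the names `fuse_once` and `iterative_fusion` are hypothetical.

```python
import math

def _cosine(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / max(na * nb, eps)

def fuse_once(frames, density, n_merge):
    """Merge the n_merge most similar adjacent pairs; output has len - n_merge frames."""
    T = len(frames)
    sims = [_cosine(frames[j], frames[j + 1]) for j in range(T - 1)]
    # Index j in `merged` means frames j and j + 1 belong to the same span.
    merged = set(sorted(range(T - 1), key=lambda j: sims[j], reverse=True)[:n_merge])
    out_frames, out_density, start = [], [], 0
    for j in range(T):
        if j in merged:
            continue  # span continues into frame j + 1
        span = range(start, j + 1)
        weights = [density[k] for k in span]
        total = sum(weights)
        if total <= 0.0:  # all-blank span: fall back to a plain average
            weights, total = [1.0] * len(weights), float(len(weights))
        dim = len(frames[0])
        out_frames.append([
            sum(w * frames[k][c] for w, k in zip(weights, span)) / total
            for c in range(dim)
        ])
        # Assumption: a fused frame carries the summed density of its span.
        out_density.append(total)
        start = j + 1
    return out_frames, out_density

def iterative_fusion(frames, density, target_len):
    """Repeatedly fuse until the sequence has target_len frames."""
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    frames, density = [list(f) for f in frames], list(density)
    while len(frames) > target_len:
        T = len(frames)
        next_len = max(target_len, math.ceil(T / 2))  # assumed halving schedule
        frames, density = fuse_once(frames, density, T - next_len)
    return frames

# Two runs of identical frames collapse into one representative frame each.
demo = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
fused = iterative_fusion(demo, [1.0] * 6, target_len=2)
print(fused)  # [[1.0, 0.0], [0.0, 1.0]]
```

Carrying the summed density forward makes repeated fusion equivalent to a single density-weighted average over the original frames of each final span, which is one reason it was chosen for this sketch.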
3. Dynamic Compression Ratio Training
To ensure the LLM is robust across a spectrum of compression settings, FastLongSpeech employs dynamic compression-ratio training. For each fine-tuning batch, the target length $L$ is randomly sampled, and the instruction-following loss is computed over the fused sequence:
$$\mathcal{L}(\theta) = -\log p_\theta\big(y \mid x, \mathcal{F}(H, L)\big),$$
where $\mathcal{F}$ denotes the iterative fusion operation, $H$ the encoder outputs, $x$ the instruction, and $y$ the target response. Compression levels range from mild to highly aggressive, exposing the model to input sequences of varying condensation. This mechanism enables transfer of LSLM capabilities from short to long speech tasks, allowing models to extract reasoning cues even under heavy fusion (Guo et al., 20 Jul 2025).
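A rough sketch of the sampling step follows. The candidate lengths in `CANDIDATE_LENGTHS` are hypothetical placeholders rather than values from the paper, and `fuse` and `loss_fn` stand in for the fusion operation and the model's instruction-following loss, which the caller must supply.

```python
import random

# Hypothetical candidate target lengths, from mild to aggressive compression.
CANDIDATE_LENGTHS = [750, 400, 200, 100, 50]

def sample_target_length(num_frames, candidates=CANDIDATE_LENGTHS, rng=random):
    """Pick a random target length no longer than the input itself."""
    feasible = [length for length in candidates if length <= num_frames]
    return rng.choice(feasible) if feasible else num_frames

def training_step(batch, fuse, loss_fn, rng=random):
    """Average loss over one batch using a single sampled target length.

    batch:   list of (frames, instruction, response) triples.
    fuse:    callable (frames, target_len) -> fused frames (caller-supplied).
    loss_fn: callable (fused, instruction, response) -> float (caller-supplied).
    """
    shortest = min(len(frames) for frames, _, _ in batch)
    target_len = sample_target_length(shortest, rng=rng)
    losses = [loss_fn(fuse(frames, target_len), x, y) for frames, x, y in batch]
    return sum(losses) / len(losses)
```

Sampling one length per batch keeps every example in the batch at the same fused length, which simplifies padding; sampling per example would be an equally plausible variant.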
4. Complexity Analysis and Efficiency Gains
Attention computation over the unfused frame sequence scales as $O(T^2 d)$ FLOPs and $O(T^2)$ memory, which is frequently intractable for large $T$. By reducing the sequence length to $L$, the attention cost diminishes to $O(L^2 d)$ FLOPs and $O(L^2)$ memory. For example, setting $L = T/4$ realizes a $16\times$ reduction in attention compute.
Each fusion iteration costs roughly $O(Td)$ for the content-density and similarity computations plus a lower-order bookkeeping term, which is negligible compared to self-attention overhead. Empirical results on the Qwen2-Audio base model show that, for short speech, compressing to a shorter target length reduces transformer TFLOPs from 9.79 to 4.17 ($\sim 2.3\times$ speedup) with only a small drop in quality. For long speech, fusing inputs of over 4000 frames down to the target length cuts runtime from 4.80 s to 1.47 s ($\sim 70\%$ less time), reduces TFLOPs from 61.2 to 26.4 ($\sim 2.3\times$), and even improves the LongSpeech-Eval QA score from 3.44 to 3.55 (Guo et al., 20 Jul 2025).
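The quadratic scaling and the reported ratios can be checked with simple arithmetic. The helper `attention_cost_ratio` is a hypothetical name, and the 7,500-frame input corresponds to 5 minutes of audio at 25 Hz; the remaining numbers are the figures quoted above.

```python
def attention_cost_ratio(orig_len, fused_len):
    """Ratio of quadratic attention cost before and after fusion."""
    return (orig_len / fused_len) ** 2

# A fourfold length reduction gives a sixteenfold attention saving.
print(attention_cost_ratio(7500, 1875))   # 16.0

# Ratios implied by the reported measurements.
print(round(9.79 / 4.17, 2))              # 2.35  (short-speech TFLOPs)
print(round(61.2 / 26.4, 2))              # 2.32  (long-speech TFLOPs)
print(round(1 - 1.47 / 4.80, 3))          # 0.694 (fraction of runtime saved)
```

The measured end-to-end savings are smaller than the attention-only bound because attention is only one part of the total model cost.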
5. Experimental Outcomes on Speech Understanding and Generation
Benchmarking on diverse short- and long-form evaluation sets demonstrates the efficacy of iterative fusion:
- Short-speech QA (IEMOCAP, LibriTTS, LibriSQA): Across the compression ratios compared, iterative fusion outperforms AvgPool and MostSim by 0.2–0.5 score points.
- LongSpeech-Eval long-speech QA: FastLongSpeech (fusion + dynamic training) scores 3.55, ahead of NTK-RoPE (3.44) and AvgPool (3.10).
- ASR WER (LibriSpeech clean): Baseline 3.85%; with fusion at two mild compression settings, 4.04% and 4.08%, indicating negligible degradation at mild fusion.
- Cost-quality tradeoff across tasks: Iterative fusion cuts inference cost by roughly 50% at equal or better quality, validated on dialogue QA and emotion recognition.
Through iterative content-aware, similarity-guided merging, FastLongSpeech enables LSLMs to handle long-form speech inputs at scale and cost comparable to short-speech tasks without degrading semantic content or reasoning capacity (Guo et al., 20 Jul 2025).
6. Context, Implications, and Future Perspective
Iterative fusion, as implemented in FastLongSpeech, provides an explicit, content-driven mechanism for reducing temporal redundancy in speech signals presented to LSLMs. The key insight is that dynamic, information-preserving condensation exploits the natural structure of human speech: long stretches are acoustically similar and therefore redundant, while local context boundaries are reliably captured by CTC-derived content density measures and pairwise similarity.
A plausible implication is that this approach may generalize to other sequence modeling domains where redundancy is high and operational context windows are limited, such as long-form video or clinical time series. Its effect is to close the gap between the efficient resource use in short-speech or text-only LLMs and the high bandwidth of real-world, long-form audio. The methodology enables high-fidelity speech-language modeling without extensive long-form supervision, expanding the practical domain of LSLMs for audio-intensive tasks (Guo et al., 20 Jul 2025).