SpiralFormer: Recursive & Multi-Res Transformers

Updated 23 March 2026
  • SpiralFormer is a family of neural networks that use looped Transformers with cyclic layer skipping and multi-resolution recursion to enhance efficiency and reduce latency.
  • It employs a low-latency streaming encoder for ASR that integrates early exiting and multi-exit CTC loss to lower computation while maintaining performance.
  • A multi-resolution looped Transformer variant refines representations hierarchically, yielding improved perplexity and lower FLOPs in large-scale language modeling experiments.

SpiralFormer denotes a family of neural network architectures that leverage looped or recursive Transformers, frequently integrating hierarchical multi-resolution scheduling and/or blockwise circular layer skipping, with the objective of improving computational efficiency and latency while preserving or enhancing empirical performance. The name has been introduced independently for two architectures: a low-latency encoder for streaming speech recognition via cyclic layer skipping and early exiting (Tsunoo et al., 1 Oct 2025), and a multi-resolution looped Transformer for improved hierarchical dependency learning and compute efficiency (Yu et al., 12 Feb 2026). Both approaches share the principle of gradually or cyclically refining representations while reducing redundant computation, but they target distinct application domains and design axes.

1. Low-Latency Streaming Encoder: SpiralFormer for ASR

The SpiralFormer encoder for streaming automatic speech recognition (ASR) addresses the bottleneck of encoding latency in blockwise Transformer encoders. Standard streaming ASR pipelines process input frames in overlapping blocks, each containing $N_l$ left-context, $N_c$ central, and $N_r$ right-context frames, emitting only the $N_c$ central frames after $I$ layers of transformation. By minimizing $N_c$, latency can be reduced, but naively doing so increases computational cost due to a higher chunk emission frequency (Tsunoo et al., 1 Oct 2025).
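The blockwise layout described above can be illustrated with a small sketch. The function name `block_spans` and the frame counts are hypothetical and chosen only for illustration. The sketch assumes blocks advance by exactly $N_c$ frames and that context is clipped at the utterance boundaries.

```python
def block_spans(num_frames, n_l, n_c, n_r):
    """Return (context_start, central_start, central_end, context_end) per block.

    All ranges are half-open frame indices, clipped to [0, num_frames).
    Assumption: consecutive blocks advance by exactly n_c frames.
    """
    spans = []
    for start in range(0, num_frames, n_c):
        central_end = min(start + n_c, num_frames)
        spans.append((
            max(0, start - n_l),                   # left context begins here
            start,                                 # first emitted frame
            central_end,                           # one past the last emitted frame
            min(central_end + n_r, num_frames),    # right context ends here
        ))
    return spans


# Hypothetical sizes: shrinking the central part from 8 to 2 frames
# quadruples the number of blocks the encoder must process.
wide = block_spans(num_frames=64, n_l=16, n_c=8, n_r=4)
narrow = block_spans(num_frames=64, n_l=16, n_c=2, n_r=4)
print(len(wide), len(narrow))
```

Each block still carries its full left and right context, so a smaller central part means more blocks at nearly the same per-block cost. This is the cost increase that circular layer skipping is meant to offset.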

SpiralFormer employs two primary mechanisms to decouple emission frequency from computation:

  • Circular Layer Skipping: In each block, only a subset $C_s$ of encoder layers is evaluated, shifting cyclically with block index. Given skip pitch $p$ and block index $b$, $C_s = \{i \mid i = 1 + s + kp,\ k = 0, \ldots, \lfloor (I-1-s)/p \rfloor\}$ with $s = (b-1) \bmod p$. Thus, each block computes roughly $I/p$ layers, reducing per-block computation by approximately a factor of $p$.
  • Spiral Shifting with Cache: To maintain deep context, block outputs recursively compose newly computed activations with cached results from prior blocks. After each block, all $Z^{(i)}_{b-1}$ for $i \in C_{s'}$ are cached. For $i \leq p$, $Z^{(i)}_b = \mathrm{enc}^{(i)}(X^{t\in b} + \tilde Z^{(i-1)}_{b-1})$; for $i > p$, $Z^{(i)}_b = \mathrm{enc}^{(i)}(Z^{(i-p)}_b + \tilde Z^{(i-1)}_{b-1})$.
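The two mechanisms above can be checked numerically with a minimal sketch. The helper name `active_layers` and the values $I = 12$, $p = 3$ are assumptions made for illustration. The sketch only enumerates layer indices from the set definition. It does not run any encoder layers.

```python
def active_layers(num_layers, pitch, block_index):
    """Layer set C_s for a 1-based block index b, with s = (b - 1) mod p."""
    s = (block_index - 1) % pitch
    k_max = (num_layers - 1 - s) // pitch
    return [1 + s + k * pitch for k in range(k_max + 1)]


I, p = 12, 3  # hypothetical layer count and skip pitch
sets = {b: active_layers(I, p, b) for b in range(1, p + 2)}
for b, layers in sets.items():
    print(b, layers)

# Over p consecutive blocks, every layer 1..I is evaluated exactly once.
covered = sorted(i for b in range(1, p + 1) for i in sets[b])

# Cache alignment: each layer i evaluated in block b reads layer i - 1 from
# block b - 1, so layer i - 1 must belong to the previous block's set
# (layer 1 has no predecessor).
aligned = all(
    i == 1 or (i - 1) in sets[b - 1]
    for b in range(2, p + 2)
    for i in sets[b]
)
print(covered == list(range(1, I + 1)), aligned)
```

For these values, the sets are [1, 4, 7, 10], [2, 5, 8, 11], and [3, 6, 9, 12], after which the pattern repeats. The cached term $\tilde Z^{(i-1)}_{b-1}$ required by each evaluated layer is always one that the preceding block produced.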

This arrangement ensures, over $p$ consecutive blocks, that all $I$ layers are covered in a "spiral" fashion while each central output frame can be emitted after only a subset of layers has been computed.

2. Early Exiting and Training Regime

A critical latency reduction strategy for streaming is SpiralFormer's early exiting. Instead of waiting for every layer to be computed across $p$ blocks, early exit is performed at the deepest layer index in $C_s$ for each block, emitting its central-portion activations $H^s_b$ immediately. The sequence $H^s = (H^s_1, \ldots, H^s_B)$ is concatenated along time to form the streamed encoder output (Tsunoo et al., 1 Oct 2025).
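As a small worked example, the exit depth per block follows directly from the largest element of $C_s$, namely $1 + s + \lfloor (I-1-s)/p \rfloor \, p$. The function name `exit_layer` and the values $I = 12$, $p = 3$ are illustrative assumptions.

```python
def exit_layer(num_layers, pitch, block_index):
    """Deepest layer index in C_s for a 1-based block index b."""
    s = (block_index - 1) % pitch
    return 1 + s + ((num_layers - 1 - s) // pitch) * pitch


# Hypothetical setting: 12 layers, skip pitch 3, six consecutive blocks.
exits = [exit_layer(12, 3, b) for b in range(1, 7)]
print(exits)
```

The exit depth cycles with the block index, so different blocks are emitted from different layers. This is why every exit needs to be trained to produce usable outputs.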

To ensure discriminative utility at each possible exit, training employs a multi-exit CTC loss:

$$\mathcal{L} = \mathrm{CTC}(Y, \mathrm{CTC}(H)) + \sum_{s=0}^{p-1} \mathrm{CTC}(Y, \mathrm{CTC}(H^s)),$$

where $H$ is the exit at the final (deepest) layer for full-sequence evaluation, and $H^s$ are per-branch emissions.
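The structure of this objective can be sketched in plain Python. The sketch below is not the authors' implementation. It assumes each exit has already been mapped to per-frame label posteriors (standing in for the inner $\mathrm{CTC}(\cdot)$ head), uses the textbook CTC forward recursion in probability space (adequate only for toy lengths), and the function names are hypothetical.

```python
import math


def ctc_nll(probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    probs: list of T frames, each a list of per-symbol probabilities.
    target: list of label ids without blanks.
    """
    ext = [blank]
    for label in target:
        ext += [label, blank]          # interleave blanks: b y1 b y2 b ...
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


def multi_exit_ctc(final_probs, exit_probs, target, blank=0):
    """Final-layer CTC term plus one CTC term per exit branch s = 0..p-1."""
    loss = ctc_nll(final_probs, target, blank)
    for probs in exit_probs:
        loss += ctc_nll(probs, target, blank)
    return loss


# Toy posteriors over {blank, label 1} for two frames (made-up numbers).
frames = [[0.5, 0.5], [0.5, 0.5]]
single = ctc_nll(frames, [1])
total = multi_exit_ctc(frames, [frames, frames], [1])  # p = 2 exits
print(single, total)
```

With two frames and one label, the valid alignments are "1 1", "blank 1", and "1 blank", so the likelihood is 0.75. With identical posteriors at all three outputs, the total is three times the single-exit loss.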

3. Computational and Latency Analysis

SpiralFormer offers a reduction in both theoretical and empirical latency without increasing computation relative to closely matched baselines. Each block computes only $\sim I/p$ layers, and output emission frequency is increased by choosing small $N_c$. With appropriately chosen $p$ and small $N_c$, real-time factors (RTF) remain nearly constant while the system word emission delay (SWD) is significantly reduced.

Key results, as summarized in Table 1 and Table 2 of (Tsunoo et al., 1 Oct 2025):

| Model | Layers (%) | Max Latency (ms) | WER clean / other (%) | SWD_P50 (ms) |
|---|---|---|---|---|
| Baseline B2 | 100 | 640 | 3.4 / 8.6 | 589 |
| SpiralFormer S3 | 50 | 400 | 3.6 / 9.1 | 462 |

On LibriSpeech, median SWD is reduced by 21.6% (589→462 ms), with absolute WER increases of 0.2 (clean) and 0.5 (other) for the configurations tabulated above, and per-block computation reduced by 50%. On CSJ, SWD is reduced by 7.0% under equivalent conditions. The per-token compute is proportional to $I/(pN_c)$, so the system can operate with $N_c = 2$ without excess cost.
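The arithmetic behind these figures can be reproduced with a short sketch. The SWD values come from the table. The layer count and the $(p, N_c)$ pairs passed to `per_token_compute` are hypothetical and serve only to illustrate the $I/(pN_c)$ proportionality.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


def per_token_compute(num_layers, pitch, n_c):
    """Layer evaluations per emitted frame, up to a constant factor."""
    return num_layers / (pitch * n_c)


swd_reduction = relative_reduction(589, 462)
print(round(100 * swd_reduction, 1))

# Hypothetical configurations with I = 12 layers:
full_wide = per_token_compute(12, pitch=1, n_c=4)     # all layers, 4 central frames
full_narrow = per_token_compute(12, pitch=1, n_c=2)   # all layers, 2 central frames
spiral_narrow = per_token_compute(12, pitch=2, n_c=2) # half the layers, 2 central frames
print(full_wide, full_narrow, spiral_narrow)
```

Under these assumed values, halving $N_c$ alone doubles the per-frame cost, whereas pairing it with $p = 2$ restores the original cost.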

4. Multi-Resolution SpiralFormer for Hierarchical Reasoning

A parallel SpiralFormer architecture targets efficient sequence modeling by looped Transformers with hierarchical, multi-resolution recursion (Yu et al., 12 Feb 2026). Here, standard looped Transformers decouple computational depth (via repeated application of a shared "core") from parameter depth, but operate at full sequence resolution in each iteration, limiting efficiency.

SpiralFormer generalizes this via a multi-resolution recursion schedule: for total sequence length $L$, loop iteration $t$ operates at effective length $L_t = \lfloor r_t L \rfloor$ where $r_0 < r_1 < \cdots < r_{T-1} = 1$ (e.g., a "doubling" schedule $r_t = 2r_{t-1}$). Early iterations process highly compressed representations, with the final loop operating at full length. The overall model consists of:

  1. Pre-loop block ($N_{\text{pre}}$ distinct layers)
  2. $T$ recursive invocations of a shared $N_{\text{loop}}$-layer core at resolutions $L_t$
  3. Post-loop block ($N_{\text{post}}$ layers)
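A doubling resolution schedule of the kind described above can be computed as follows. The function name `doubling_schedule` and the example values $L = 2048$, $T = 4$ are assumptions for illustration. Exact fractions are used so that $\lfloor r_t L \rfloor$ is free of rounding error.

```python
from fractions import Fraction


def doubling_schedule(seq_len, num_loops):
    """Return [(r_t, L_t)] with r_t = 2 * r_{t-1}, r_{T-1} = 1, L_t = floor(r_t * L)."""
    schedule = []
    for t in range(num_loops):
        factor = 2 ** (num_loops - 1 - t)   # compression factor 1 / r_t
        schedule.append((Fraction(1, factor), seq_len // factor))
    return schedule


for ratio, length in doubling_schedule(seq_len=2048, num_loops=4):
    print(ratio, length)
```

The first loop sees 256 positions and the last sees all 2048, so most iterations of the shared core run on a fraction of the full sequence.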

State is updated per loop either additively (Anchor) or through “Memory as State Highways” (MeSH), with up/down-sampling and causal shifts to guarantee autoregressive compatibility.

5. Emergent Functional Specialization and Hierarchical Inductive Bias

The multi-resolution design induces an emergent coarse-to-fine processing pattern. Attention-based analyses demonstrate:

  • Key-marginal entropy systematically declines across loops, indicating late-stage attention focuses on fewer, more relevant tokens.
  • Local Attention Mass (LAM) increases over loops, indicating refinement shifts from diffuse, global manipulation (planning) to local, specific refinements.
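The two statistics can be sketched on toy attention matrices. The definitions used here are one plausible reading and are assumptions, since the exact definitions in Yu et al. may differ. Key-marginal entropy is taken as the entropy of the attention distribution averaged over queries, and LAM as the average attention mass falling within a fixed window around each query position. The toy matrices are not causal and carry no empirical meaning.

```python
import math


def key_marginal_entropy(attn):
    """Entropy (nats) of the query-averaged attention over keys (assumed definition)."""
    num_queries, num_keys = len(attn), len(attn[0])
    marginal = [sum(row[j] for row in attn) / num_queries for j in range(num_keys)]
    return -sum(m * math.log(m) for m in marginal if m > 0)


def local_attention_mass(attn, window):
    """Mean attention mass on keys within `window` positions of each query (assumed definition)."""
    masses = [
        sum(w for j, w in enumerate(row) if abs(i - j) <= window)
        for i, row in enumerate(attn)
    ]
    return sum(masses) / len(masses)


n = 4
diffuse = [[1.0 / n] * n for _ in range(n)]                                # uniform attention
local = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]    # each query attends to itself
sink = [[1.0 if j == 0 else 0.0 for j in range(n)] for _ in range(n)]     # all queries attend to key 0

print(key_marginal_entropy(diffuse), key_marginal_entropy(sink))
print(local_attention_mass(diffuse, window=1), local_attention_mass(local, window=1))
```

Under these definitions, concentrating all queries on a single key drives the key-marginal entropy to zero, and purely local attention drives LAM to one.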

Switching to fine-to-coarse or eliminating overlap degrades perplexity and disrupts specialization, providing evidence for architectural bias in hierarchical dependency learning (Yu et al., 12 Feb 2026).

6. Efficiency, Empirical Results, and Algorithmic Outline

Parameter count matches classic looped Transformers but compute is reduced by 7–11% at matched depth, due to sublinear growth of $\sum_t r_t^2$ under the doubling schedule. In large-scale language modeling (Pythia, 160M–1.4B parameters), SpiralFormer outperforms looped and non-looped baselines in perplexity and/or compute, e.g., at 1.4B parameters, SpiralFormer-L yields lower Pile validation perplexity (7.14) with 7% fewer FLOPs than Pythia-24L (perplexity 7.44) (Yu et al., 12 Feb 2026).
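The behavior of $\sum_t r_t^2$ under the doubling schedule can be verified directly. The sketch assumes this sum tracks only the term that scales quadratically with the effective length $L_t$, such as self-attention. It does not model total FLOPs and does not reproduce the reported 7–11% figure. The function name is hypothetical.

```python
from fractions import Fraction


def sum_squared_ratios(num_loops):
    """Sum of r_t^2 for r_t = 2^-(T-1-t), t = 0..T-1 (doubling schedule)."""
    return sum(Fraction(1, 4 ** k) for k in range(num_loops))


for T in (1, 2, 3, 4, 8):
    total = sum_squared_ratios(T)
    # A full-resolution looped baseline would contribute T to the same sum.
    print(T, total, float(total))
```

The sum is a geometric series bounded by 4/3, whereas running every loop at full resolution would make it grow linearly as $T$.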

A high-level training-time algorithm is:

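v = f_pre(x)                 # pre-loop block
h, G = InitTopo(x, v)        # initial state h and topology G
for t in range(T):
    z = S_down(h, r_t)       # down-scale to L_t
    z_hat = f_loop(z)        # shared core
    u = S_up(z_hat, h, r_t)  # up-scale to L
    u_hat = CausalShift(u, s_t)  # shift to keep the update autoregressive
    h, G = U(u_hat, h, G, t)     # topology update
h_out = f_post(h)            # post-loop block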

Here S_down, S_up denote causal down/up-scaling with blockwise aggregation and routers; U is the topology update.
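A runnable toy instance of this loop is sketched below on a scalar sequence. Every component is a stand-in chosen for illustration and is not the authors' design. Down-scaling is block-mean pooling, the shared core is a causal running mean, up-scaling and the causal shift are merged into one function, and the update is a plain additive one with no routers, topology state, or MeSH. The list `block_sizes` holds the hypothetical compression factors $1/r_t$.

```python
def s_down(h, k):
    """Block-mean pooling: one summary per complete block of k tokens."""
    return [sum(h[j:j + k]) / k for j in range(0, len(h) - k + 1, k)]


def f_loop(z):
    """Placeholder shared core: causal running mean over the coarse sequence."""
    out, total = [], 0.0
    for n, value in enumerate(z, start=1):
        total += value
        out.append(total / n)
    return out


def s_up_shifted(z_hat, length, k):
    """Repeat each block output over k positions, delayed by k - 1 positions.

    Position i reads block (i - (k - 1)) // k, whose last token lies at or
    before i, so no future token leaks into position i.
    """
    out = []
    for i in range(length):
        j = (i - (k - 1)) // k  # floor division; negative until a block is complete
        out.append(z_hat[j] if j >= 0 else 0.0)
    return out


def spiral_forward(x, block_sizes):
    h = [float(v) for v in x]                  # f_pre: identity in this toy
    for k in block_sizes:                      # k = 1 / r_t, coarse to fine
        z = s_down(h, k)
        z_hat = f_loop(z)
        u_hat = s_up_shifted(z_hat, len(h), k)
        h = [a + b for a, b in zip(h, u_hat)]  # additive state update
    return h                                   # f_post: identity in this toy


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = spiral_forward(x, [4, 2, 1])
x_future = x[:5] + [100, 7, 8]                 # perturb only position 5
y_future = spiral_forward(x_future, [4, 2, 1])
print(y[:5] == y_future[:5], y[5] != y_future[5])
```

Perturbing position 5 leaves outputs at positions 0 to 4 unchanged, which is the autoregressive property that the causal shift is meant to guarantee.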

7. Implications, Limitations, and Future Directions

SpiralFormer demonstrates that cyclic or multi-resolution computation can act as a strong architectural prior for both low-latency sequence emission (Tsunoo et al., 1 Oct 2025) and for hierarchical dependency learning and efficiency in looped Transformers (Yu et al., 12 Feb 2026). Explicit schedule-based multi-resolution recurrence induces functionally specialized reasoning stages, mimicking planning-to-refinement transitions.

Limitations include slightly non-uniform per-token compute under autoregressive decoding in the overlap regime, hand-designed resolution schedules, and the simplicity of the summarization/aggregation router. Potential directions include learning adaptive compression schedules, integrating more powerful state-management schemes, and extending to broader classes of compressive or recursive attention architectures.

References:

  • "Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting" (Tsunoo et al., 1 Oct 2025)
  • "SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion" (Yu et al., 12 Feb 2026)
