SpiralFormer: Recursive & Multi-Res Transformers

Updated 23 March 2026

SpiralFormer is a family of neural networks that use looped Transformers with cyclic layer skipping and multi-resolution recursion to enhance efficiency and reduce latency.
One variant is a low-latency streaming encoder for ASR that combines circular layer skipping with early exiting and a multi-exit CTC loss to lower computation while maintaining performance.
A multi-resolution looped Transformer variant refines representations hierarchically, yielding improved perplexity and lower FLOPs in large-scale language modeling experiments.

SpiralFormer denotes a family of neural network architectures built on looped or recursive Transformers, frequently combined with hierarchical multi-resolution scheduling and/or blockwise circular layer skipping, with the objective of improving computational efficiency and latency while preserving or enhancing empirical performance. The name has been introduced independently for two architectural innovations: a low-latency encoder for streaming speech recognition via circular layer skipping and early exiting (Tsunoo et al., 1 Oct 2025), and a multi-resolution looped Transformer for improved hierarchical dependency learning and compute efficiency (Yu et al., 12 Feb 2026). Both approaches share the principle of gradually or cyclically refining representations while reducing redundant computation, but they target distinct application domains and design axes.

1. Low-Latency Streaming Encoder: SpiralFormer for ASR

The SpiralFormer encoder for streaming automatic speech recognition (ASR) addresses the bottleneck of encoding latency in blockwise Transformer encoders. Standard streaming ASR pipelines process input frames in overlapping blocks, each containing $N_l$ left-context, $N_c$ central, and $N_r$ right-context frames, and emit only the $N_c$ central frames after $I$ layers of transformation. Reducing $N_c$ lowers latency, but naively doing so increases computational cost because blocks must be processed and emitted more frequently (Tsunoo et al., 1 Oct 2025).
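The trade-off can be made concrete with a small sketch of blockwise segmentation. The function names, the half-open index convention, and the frame counts below are illustrative assumptions, not taken from the paper; the point is only that halving $N_c$ doubles the number of blocks, and hence the number of layer evaluations when every block runs all $I$ layers.

```python
import math

def split_into_blocks(num_frames, n_left, n_center, n_right):
    """Return (context_start, center_start, center_end, context_end) per block.

    Indices are half-open frame ranges; context is clipped at utterance edges.
    """
    blocks = []
    for center_start in range(0, num_frames, n_center):
        center_end = min(center_start + n_center, num_frames)
        context_start = max(0, center_start - n_left)
        context_end = min(num_frames, center_end + n_right)
        blocks.append((context_start, center_start, center_end, context_end))
    return blocks

def full_depth_layer_calls(num_frames, n_center, num_layers):
    """Layer evaluations when every block runs all layers (no skipping)."""
    return math.ceil(num_frames / n_center) * num_layers

# Hypothetical sizes chosen only for illustration.
blocks = split_into_blocks(num_frames=20, n_left=8, n_center=4, n_right=2)
print(blocks)
print(full_depth_layer_calls(20, 4, 12), full_depth_layer_calls(20, 2, 12))
```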

SpiralFormer employs two primary mechanisms to decouple emission frequency from computation:

Circular Layer Skipping: In each block, only a subset $C_s$ of the $I$ encoder layers is evaluated, and the subset shifts cyclically with the block index. Given skip pitch $p$ and block index $b$, $C_s = \{\, i \mid i = 1 + s + kp,\ k = 0, \ldots, \lfloor (I-1-s)/p \rfloor \,\}$ with $s = (b-1) \bmod p$. Each block therefore computes roughly $I/p$ layers, reducing per-block computation by a factor of about $p$.
Spiral Shifting with Cache: To maintain deep context, each block composes newly computed activations with cached results from the previous block. After block $b-1$, the activations $Z^{(i)}_{b-1}$ for $i \in C_{s'}$ are cached, where $s'$ is the offset used by that block, and $\tilde Z$ denotes a cached activation. For $i \leq p$, $Z^{(i)}_b = \mathrm{enc}^{(i)}(X^{t\in b} + \tilde Z^{(i-1)}_{b-1})$; for $i > p$, $Z^{(i)}_b = \mathrm{enc}^{(i)}(Z^{(i-p)}_b + \tilde Z^{(i-1)}_{b-1})$.

This arrangement ensures, over $p$ consecutive blocks, that all $I$ layers are covered in a "spiral" fashion while each central output frame can be emitted after only a subset of layers has been computed.
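The layer schedule can be checked with a few lines of Python. This is a minimal sketch of the set definition of $C_s$ only (no encoder, no cache); the function name and the values $I = 12$, $p = 3$ are assumptions chosen for illustration.

```python
def layer_subset(block_index, num_layers, pitch):
    """Layers C_s evaluated in block b (1-indexed), with s = (b - 1) mod p."""
    s = (block_index - 1) % pitch
    k_max = (num_layers - 1 - s) // pitch
    return [1 + s + k * pitch for k in range(k_max + 1)]

I, p = 12, 3  # illustrative values, not from the paper
subsets = [layer_subset(b, I, p) for b in range(1, p + 1)]
covered = sorted(i for subset in subsets for i in subset)

for b, subset in enumerate(subsets, start=1):
    print(f"block {b}: layers {subset}")
print("all layers covered over p blocks:", covered == list(range(1, I + 1)))
```

With these values each block evaluates four of the twelve layers, and the three consecutive subsets are disjoint and together cover every layer once.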

2. Early Exiting and Training Regime

A key latency-reduction strategy for streaming is SpiralFormer's early exiting. Instead of waiting for every layer to be computed across $p$ blocks, each block exits at the deepest layer index in $C_s$ and immediately emits its central-portion activations $H^s_b$. The sequence $H^s = (H^s_1, \ldots, H^s_B)$ is concatenated along time to form the streamed encoder output (Tsunoo et al., 1 Oct 2025).
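As a rough sketch of the bookkeeping involved, the exit layer of each block is simply the largest index in $C_s$, and the streamed output is the time-wise concatenation of per-block central frames. The helper names and the placeholder frame labels are hypothetical; real activations would be tensors rather than strings.

```python
def exit_layer(block_index, num_layers, pitch):
    """Deepest layer index in C_s for block b (1-indexed, num_layers >= pitch)."""
    s = (block_index - 1) % pitch
    k_max = (num_layers - 1 - s) // pitch
    return 1 + s + k_max * pitch

def stream_encoder_output(block_outputs):
    """Concatenate per-block central-frame outputs along the time axis."""
    stream = []
    for central_frames in block_outputs:
        stream.extend(central_frames)
    return stream

# Illustrative setting: I = 12 layers, pitch p = 3, two central frames per block.
exits = [exit_layer(b, 12, 3) for b in range(1, 7)]
print(exits)

placeholder_blocks = [[f"b{b}f{f}" for f in (1, 2)] for b in range(1, 4)]
print(stream_encoder_output(placeholder_blocks))
```

In this setting the exit depth cycles through layers 10, 11, and 12, so every emitted frame comes from one of the last $p$ layers even though each block evaluates only a third of the stack.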

To ensure discriminative utility at each possible exit, training employs a multi-exit CTC loss:

$\mathcal{L} = \mathrm{CTC}(Y, \mathrm{CTC}(H)) + \sum_{s=0}^{p-1} \mathrm{CTC}(Y, \mathrm{CTC}(H^s)),$

where $Y$ is the target label sequence, $H$ is the output of the final (deepest) layer under full-sequence evaluation, and the $H^s$ are the per-branch early-exit emissions, one for each offset $s$.
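The structure of this objective can be sketched in plain Python. The block below implements the standard CTC forward recursion on per-frame log-probabilities and sums one term for the full-depth output plus one per exit branch. It assumes each encoder output has already been projected to log-probabilities over the vocabulary; the function names are hypothetical, and a real system would use its framework's CTC implementation instead.

```python
import math

NEG_INF = float("-inf")

def log_sum_exp(values):
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood; log_probs is a T x V list of lists."""
    ext = [blank]                      # target interleaved with blanks
    for label in target:
        ext += [label, blank]
    alpha = [NEG_INF] * len(ext)
    alpha[0] = log_probs[0][blank]
    if len(ext) > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        prev, alpha = alpha, [NEG_INF] * len(ext)
        for s, label in enumerate(ext):
            paths = [prev[s]]          # stay on the same symbol
            if s >= 1:
                paths.append(prev[s - 1])  # advance by one
            if s >= 2 and label != blank and label != ext[s - 2]:
                paths.append(prev[s - 2])  # skip the blank in between
            alpha[s] = log_sum_exp(paths) + frame[label]
    return -log_sum_exp(alpha[-2:])    # end on last label or trailing blank

def multi_exit_loss(target, final_log_probs, exit_log_probs):
    """Full-depth CTC term plus one CTC term per exit branch s = 0..p-1."""
    total = ctc_nll(final_log_probs, target)
    for branch in exit_log_probs:
        total += ctc_nll(branch, target)
    return total

# Toy check: 2 frames, vocabulary {blank, 1}, uniform predictions, p = 2 branches.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
loss = multi_exit_loss([1], uniform, [uniform, uniform])
print(loss)
```

For the toy input, three of the four two-frame paths collapse to the target, so each term equals $-\log 0.75$ and the total is three times that value.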

3. Computational and Latency Analysis

SpiralFormer reduces both theoretical and empirical latency without increasing computation relative to closely matched baselines. Each block computes only $\sim I/p$ layers, and output emission frequency is increased by choosing a small $N_c$. With an appropriately chosen $p$ and a small $N_c$, real-time factors (RTF) remain nearly constant while the system word emission delay (SWD) is significantly reduced.

Key results, as summarized in Table 1 and Table 2 of (Tsunoo et al., 1 Oct 2025):

Model	Layers (%)	Max Latency (ms)	WER (clean/other)	SWD_P50 (ms)
Baseline B2	100	640	3.4 / 8.6	589
SpiralFormer S3	50	400	3.6 / 9.1	462

On LibriSpeech, median SWD is reduced by 21.6% (589→462 ms) and per-block computation by 50%, at an absolute WER cost of 0.2 points on clean and 0.5 points on other (3.4→3.6 and 8.6→9.1). On CSJ, SWD is reduced by 7.0% under equivalent conditions. The per-token compute is proportional to $I/(pN_c)$, so the system can operate with $N_c=2$ without excess cost when $p$ is raised accordingly.
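The figures above follow from simple arithmetic, sketched here. The SWD values come from the table; the $(I, p, N_c)$ combinations in the second half are made-up illustrations of the $I/(pN_c)$ proportionality, not configurations from the paper.

```python
def relative_reduction(baseline, proposed):
    """Fractional reduction of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline

swd_reduction = relative_reduction(589, 462)  # median SWD in ms, from the table
print(f"SWD reduction: {100 * swd_reduction:.1f}%")

def relative_compute_per_token(num_layers, pitch, n_center):
    """Quantity proportional to per-token compute: I / (p * N_c)."""
    return num_layers / (pitch * n_center)

# Halving N_c while doubling p leaves the proportionality factor unchanged.
coarse = relative_compute_per_token(num_layers=12, pitch=1, n_center=4)
fine = relative_compute_per_token(num_layers=12, pitch=2, n_center=2)
print(coarse, fine)
```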

4. Multi-Resolution SpiralFormer for Hierarchical Reasoning

A separate SpiralFormer architecture targets efficient sequence modeling with looped Transformers that use hierarchical, multi-resolution recursion (Yu et al., 12 Feb 2026). Standard looped Transformers decouple computational depth from parameter depth by repeatedly applying a shared "core", but they operate at full sequence resolution in every iteration, which limits efficiency.

SpiralFormer generalizes this via a multi-resolution recursion schedule: for total sequence length $L$, loop iteration $t$ operates at effective length $L_t = \lfloor r_t L \rfloor$, where $r_0 < r_1 < \cdots < r_{T-1} = 1$ (e.g., a "doubling" schedule $r_t = 2r_{t-1}$). Early iterations process highly compressed representations, and the final iteration operates at full length. The model consists of:

Pre-loop block ($N_{\text{pre}}$ distinct layers)
$T$ recursive invocations of a shared $N_{\text{loop}}$-layer core at resolutions $L_t$
Post-loop block ($N_{\text{post}}$ layers)

State is updated per loop either additively (Anchor) or through “Memory as State Highways” (MeSH), with up/down-sampling and causal shifts to guarantee autoregressive compatibility.
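The doubling schedule described earlier in this section can be written out directly. This is a minimal sketch assuming a pure doubling schedule that ends at $r_{T-1} = 1$; the function names and the example values $T = 3$, $L = 1000$ are illustrative.

```python
import math

def doubling_ratios(num_loops):
    """r_t = 2 * r_{t-1} with r_{T-1} = 1, i.e. r_t = 2 ** (t - (T - 1))."""
    return [2.0 ** (t - (num_loops - 1)) for t in range(num_loops)]

def effective_lengths(seq_len, ratios):
    """L_t = floor(r_t * L) for each loop iteration."""
    return [math.floor(r * seq_len) for r in ratios]

ratios = doubling_ratios(3)
lengths = effective_lengths(1000, ratios)
print(ratios, lengths)
```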

5. Emergent Functional Specialization and Hierarchical Inductive Bias

The multi-resolution design induces an emergent coarse-to-fine processing pattern. Attention-based analyses demonstrate:

Key-marginal entropy systematically declines across loops, indicating late-stage attention focuses on fewer, more relevant tokens.
Local Attention Mass (LAM) increases over loops, indicating refinement shifts from diffuse, global manipulation (planning) to local, specific refinements.

Switching to fine-to-coarse or eliminating overlap degrades perplexity and disrupts specialization, providing evidence for architectural bias in hierarchical dependency learning (Yu et al., 12 Feb 2026).
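The source does not give formulas for these two diagnostics, so the sketch below uses one plausible reading, stated as an assumption: key-marginal entropy is the entropy of the attention distribution averaged over queries, and Local Attention Mass is the average attention weight that falls within a fixed window around each query position. The function names and the window parameter are hypothetical.

```python
import math

def key_marginal_entropy(attn):
    """Entropy of the per-key attention averaged over queries (assumed definition).

    attn is a list of rows, one per query, each summing to 1 over keys.
    """
    num_queries, num_keys = len(attn), len(attn[0])
    marginal = [sum(row[j] for row in attn) / num_queries for j in range(num_keys)]
    return -sum(m * math.log(m) for m in marginal if m > 0.0)

def local_attention_mass(attn, window):
    """Mean attention weight on keys within `window` of the query (assumed definition)."""
    total = 0.0
    for i, row in enumerate(attn):
        total += sum(w for j, w in enumerate(row) if abs(i - j) <= window)
    return total / len(attn)

n = 4
diffuse = [[1.0 / n] * n for _ in range(n)]                 # spread over all keys
single_key = [[1.0] + [0.0] * (n - 1) for _ in range(n)]    # every query hits key 0
diagonal = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

print(key_marginal_entropy(diffuse), key_marginal_entropy(single_key))
print(local_attention_mass(diffuse, 0), local_attention_mass(diagonal, 0))
```

Under these definitions, attention concentrated on a single key drives the entropy to zero, and purely diagonal attention gives a Local Attention Mass of one.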

6. Efficiency, Empirical Results, and Algorithmic Outline

Parameter count matches classic looped Transformers but compute is reduced by 7–11% at matched depth, due to sublinear growth of $\sum_t r_t^2$ under the doubling schedule. In large-scale language modeling (Pythia, 160M–1.4B parameters), SpiralFormer outperforms looped and non-looped baselines in perplexity and/or compute, e.g., at 1.4B parameters, SpiralFormer-L yields lower Pile validation perplexity (7.14) with 7% fewer FLOPs than Pythia-24L (perplexity 7.44) (Yu et al., 12 Feb 2026).
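The sublinear growth can be verified numerically. Assuming a pure doubling schedule ending at $r_{T-1} = 1$, the squared ratios form a geometric series with ratio $1/4$, so $\sum_t r_t^2 = \tfrac{4}{3}(1 - 4^{-T}) < \tfrac{4}{3}$ no matter how many loops are used, whereas $T$ full-resolution loops would contribute $T$. The sketch below checks only this identity; it does not attempt to reproduce the reported 7–11% figure, which depends on model configuration details not given here.

```python
def doubling_ratios(num_loops):
    """r_t = 2 ** (t - (T - 1)), so the last loop runs at full resolution."""
    return [2.0 ** (t - (num_loops - 1)) for t in range(num_loops)]

def sum_of_squares(num_loops):
    return sum(r * r for r in doubling_ratios(num_loops))

for T in range(1, 7):
    closed_form = (4.0 / 3.0) * (1.0 - 4.0 ** (-T))
    print(T, sum_of_squares(T), closed_form)
```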

A high-level training-time algorithm is:

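v = f_pre(x)                     # pre-loop block
h, G = InitTopo(x, v)            # initial state h and topology G
for t in range(T):               # T passes through the shared core
    z = S_down(h, r_t)           # down-scale to length L_t
    z_hat = f_loop(z)            # shared N_loop-layer core
    u = S_up(z_hat, h, r_t)      # up-scale back to length L
    u_hat = CausalShift(u, s_t)  # shift by s_t to preserve causality
    h, G = U(u_hat, h, G, t)     # topology update of the state
h_out = f_post(h)                # post-loop block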

Here S_down, S_up denote causal down/up-scaling with blockwise aggregation and routers; U is the topology update.
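A runnable toy version of this loop helps show why the causal shift is needed. Everything in it is a simplifying assumption rather than the paper's method: hidden states are scalars, down-scaling is mean pooling over non-overlapping blocks, up-scaling repeats each summary, the shared core is replaced by a causal running mean, the topology $G$ and routers are dropped, and the state update is a plain addition. Shifting by `block - 1` positions ensures a block summary is only used at or after the last token it summarizes.

```python
def f_pre(x):
    return list(x)  # placeholder for the pre-loop block

def f_post(h):
    return list(h)  # placeholder for the post-loop block

def f_loop(z):
    """Stand-in causal core: running mean over the compressed sequence."""
    out, running = [], 0.0
    for i, value in enumerate(z):
        running += value
        out.append(running / (i + 1))
    return out

def s_down(h, block):
    """Mean-pool non-overlapping blocks (len(h) must be divisible by block)."""
    return [sum(h[i:i + block]) / block for i in range(0, len(h), block)]

def s_up(z_hat, block):
    """Repeat each summary `block` times to return to full length."""
    return [value for value in z_hat for _ in range(block)]

def causal_shift(u, shift):
    """Shift right by `shift` positions, padding the front with zeros."""
    return [0.0] * shift + u[:len(u) - shift]

def spiral_forward(x, blocks=(4, 2, 1)):
    """Coarse-to-fine loop; block size k corresponds to ratio r_t = 1 / k."""
    h = f_pre(x)
    for block in blocks:
        z = s_down(h, block)
        z_hat = f_loop(z)
        u = s_up(z_hat, block)
        u_hat = causal_shift(u, block - 1)
        h = [a + b for a, b in zip(h, u_hat)]  # additive update (simplification)
    return f_post(h)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = spiral_forward(x)
y_perturbed = spiral_forward(x[:-1] + [100.0])  # change only the last token
print(y[:-1] == y_perturbed[:-1])               # earlier outputs are unaffected
```

Without the shift, position $i$ would receive a summary that includes later tokens of its own block, which would break autoregressive use.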

7. Implications, Limitations, and Future Directions

SpiralFormer demonstrates that cyclic or multi-resolution computation can act as a strong architectural prior for both low-latency sequence emission (Tsunoo et al., 1 Oct 2025) and for hierarchical dependency learning and efficiency in looped Transformers (Yu et al., 12 Feb 2026). Explicit schedule-based multi-resolution recurrence induces functionally specialized reasoning stages, mimicking planning-to-refinement transitions.

Limitations include slightly non-uniform per-token compute under autoregressive decoding in the overlap regime, hand-designed resolution schedules, and the simplicity of the summarization/aggregation router. Potential directions include learning adaptive compression schedules, integrating more powerful state-management schemes, and extending to broader classes of compressive or recursive attention architectures.

References:

"Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting" (Tsunoo et al., 1 Oct 2025)
"SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion" (Yu et al., 12 Feb 2026)
