
Recurrent Split-Pool Model

Updated 22 January 2026
  • The paper introduces a neural architecture that integrates recurrent gated layers with recursive pooling to achieve linear-time sequence modeling.
  • It replaces quadratic transformer self-attention with recursive SkipBlock constructs that contract and restore sequences for multi-scale temporal processing.
  • Empirical results demonstrate improved training speed, perceptual metrics, and generalization, making it suitable for raw audio, text, and vision applications.

The Recurrent Split-Pool Model is a sequence-to-sequence neural architecture developed to address long-sequence modeling challenges, particularly in domains such as raw audio, text, and dense computer vision inputs. It avoids the quadratic complexity of transformer self-attention layers by using linear-time recurrent gated layers combined with hierarchical pooling, which recursively contracts and restores the sequence length so that information can be integrated efficiently at multiple temporal resolutions. This approach is realized in the Poolformer architecture, which uses recursive “SkipBlock” constructs for multi-scale processing (Fernández, 2 Oct 2025).

1. Architectural Overview

Poolformer is a fully autoregressive sequence-to-sequence model architected for linear-time temporal mixing by replacing O(T²) self-attention with O(T) recurrent gated layers (RG-LRUs) and interleaved pooling operations. The input passes through a sinusoidal embedding, followed by a hierarchy of “levels,” each implemented as a recursive SkipBlock. Every level downsamples the input sequence, processes the contracted representation at a lower resolution, upsamples back to the original sequence length, and applies further recurrent mixing before outputting logits.

Unlike transformer-style pairwise attention, temporal dependencies are modeled sequentially via gated recurrence. Pooling factors (p₁, p₂, …, pₙ) progressively reduce the sequence length as depth increases. Shallow layers operate on the full-resolution sequence and capture fine-grained, short-term patterns, while deep layers act on heavily pooled representations and integrate information across distant time indices.
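The effect of the pooling factors on sequence length can be sketched in a few lines of Python. The function name `level_lengths` and the example factors are illustrative assumptions, not values taken from the paper.

```python
def level_lengths(T, pooling_factors):
    """Return the sequence length seen at each level, from shallow to deep.

    Hypothetical helper: assumes each factor divides the current length.
    """
    lengths = [T]
    for p in pooling_factors:
        if lengths[-1] % p != 0:
            raise ValueError(f"length {lengths[-1]} is not divisible by {p}")
        lengths.append(lengths[-1] // p)
    return lengths


# One second of 16 kHz audio with assumed pooling factors (2, 4, 5).
print(level_lengths(16000, [2, 4, 5]))  # [16000, 8000, 2000, 400]
```

Under these assumed factors the deepest level sees 400 steps instead of 16,000, so one recurrent step there spans 40 input samples.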

2. SkipBlock Definition and Recursive Construction

A depth-d SkipBlock S_d(·) consists of five sequential components:

  1. Pre-pool residual block R₁
  2. Down-pooling layer Π_down, reducing sequence length by pooling factor p
  3. Nested (depth d–1) SkipBlock S_{d–1}
  4. Up-pooling layer Π_up, restoring sequence length
  5. Post-pool residual block R₂ with residual-stream addition

Given an input $X\in \mathbb{R}^{T\times C}$, the computation proceeds as follows:

  • Pre-pool processing: $X_1 = R_1(X)$
  • Down-pooling: $X' = \Pi_\text{down}(X_1)\in \mathbb{R}^{T/p \times C}$
  • Recursive low-resolution processing: $Y' = S_{d-1}(X')$
  • Up-pooling: $\hat{Y} = \Pi_\text{up}(Y')\in \mathbb{R}^{T\times C}$
  • Post-pool processing: $S_d(X) = R_2(\hat{Y}) + \text{skip}(X)$

For the base case ($d=0$), $S_0$ is a stack of fixed recurrent residual layers, with no further pooling or recursion. The recursive composition is formalized as:

$$S_d(X) = \left(R_2 \circ \Pi_\text{up} \circ S_{d-1} \circ \Pi_\text{down} \circ R_1\right)(X) + \text{skip}(X)$$

Pooling operations $\Pi_\text{down}$ and $\Pi_\text{up}$ manage the reduction and restoration of the temporal dimension, respectively.
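The recursion can be illustrated with a minimal, single-channel Python sketch. Everything concrete here is an assumption made for illustration: mean pooling for the down-pooling step, element repetition for the up-pooling step, an identity skip path, and a toy `residual_block` standing in for the recurrent residual layers. The sketch also ignores the causal shifting that a fully autoregressive model would need around the pooling steps.

```python
def down_pool(x, p):
    """Average non-overlapping windows of length p (assumed pooling operator)."""
    assert len(x) % p == 0, "sequence length must be divisible by p"
    return [sum(x[i:i + p]) / p for i in range(0, len(x), p)]


def up_pool(y, p):
    """Repeat each element p times to restore the length (assumed operator)."""
    return [v for v in y for _ in range(p)]


def residual_block(x):
    """Toy stand-in for a recurrent residual block: x + f(x), f a leaky average."""
    out, h = [], 0.0
    for v in x:
        h = 0.5 * h + 0.5 * v  # toy causal recurrence, not the RG-LRU
        out.append(v + h)
    return out


def skip_block(x, pooling_factors):
    """Recursive SkipBlock: depth equals len(pooling_factors)."""
    if not pooling_factors:
        return residual_block(x)            # base case S_0: no pooling
    p, rest = pooling_factors[0], pooling_factors[1:]
    x1 = residual_block(x)                  # pre-pool block R_1
    x_low = down_pool(x1, p)                # length T -> T / p
    y_low = skip_block(x_low, rest)         # nested block S_{d-1}
    y_hat = up_pool(y_low, p)               # length T / p -> T
    r2 = residual_block(y_hat)              # post-pool block R_2
    return [a + b for a, b in zip(r2, x)]   # add skip(X), assumed identity


out = skip_block([float(i) for i in range(16)], [2, 4])
print(len(out))  # 16
```

The output length always matches the input length, because every down-pooling by p inside the recursion is undone by the matching up-pooling on the way back out.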

3. Temporal Mixing via RG-LRU and Computational Complexity

The per-layer, per-time-step recurrence in Poolformer is performed by RG-LRU units. At time $t$ and layer $i$, the update for input $x_t\in\mathbb{R}^D$ and hidden state $h_{t-1}^{(i)}\in\mathbb{R}^D$ is:

$$
\begin{aligned}
r_t &= \sigma(W_a x_t + b_a) \\
i_t &= \sigma(W_x x_t + b_x) \\
a_t &= a^{c r_t} \\
h_t^{(i)} &= a_t \circ h_{t-1}^{(i)} + \sqrt{1-a_t^2} \circ (i_t \circ x_t)
\end{aligned}
$$

Here, $\sigma$ is the sigmoid, $\circ$ denotes the element-wise product, and $a\in(0,1)^D$ (real or complex) controls the temporal decay vs. persistence.
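A minimal pure-Python sketch of this recurrence for real-valued $a$ follows. The function names, the zero initial state, and the default `c=8.0` are illustrative assumptions; the source does not specify the value of $c$ or the initialization.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply a D x D matrix (list of rows) by a length-D vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rg_lru(xs, W_a, b_a, W_x, b_x, a, c=8.0):
    """Run the gated linear recurrence over a sequence of length-D vectors.

    c=8.0 and the zero initial state are assumptions for illustration only.
    """
    h = [0.0] * len(a)
    hs = []
    for x in xs:
        r = [sigmoid(z + b) for z, b in zip(matvec(W_a, x), b_a)]  # recurrence gate
        i = [sigmoid(z + b) for z, b in zip(matvec(W_x, x), b_x)]  # input gate
        a_t = [a_k ** (c * r_k) for a_k, r_k in zip(a, r)]         # gated decay
        h = [
            at * hk + math.sqrt(1.0 - at * at) * ik * xk
            for at, hk, ik, xk in zip(a_t, h, i, x)
        ]
        hs.append(h)
    return hs


# One-dimensional example with zero weights, so both gates equal 0.5.
hs = rg_lru([[1.0], [1.0]], [[0.0]], [0.0], [[0.0]], [0.0], [0.9], c=2.0)
print(hs)
```

Each step costs two $D\times D$ matrix-vector products plus element-wise work, which is where the per-layer $O(TD^2)$ time comes from.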

Time and memory requirements without pooling are $O(TD^2)$ and $O(TD)$ per layer, respectively. Incorporating $n$ pooling levels with factors $p_1,\ldots,p_n$ yields:

$$\sum_{l=0}^n O\left(\frac{T}{\prod_{j=1}^l p_j} D^2\right) = O\left(T D^2 \sum_{l=0}^n p^{-l}\right) \approx O\left(\frac{T D^2}{1-1/p}\right)$$

The last two expressions take a uniform pooling factor $p_j = p$. For modest $p$ (2–5), the total remains linear in $T$, whereas transformer self-attention is $O(T^2 D)$ in time and $O(T^2)$ in memory. This keeps Poolformer tractable for raw audio with $T\sim 10^5$–$10^6$.
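As a quick arithmetic check of the geometric sum, the sketch below compares the summed per-level cost (in units of one full-resolution level) with the closed-form bound. It assumes the same amount of work at every level, which is a simplification, and `relative_cost` is a hypothetical helper name.

```python
def relative_cost(pooling_factors):
    """Total cost across levels, in units of one full-resolution level.

    Assumes equal work per level; level l runs on T / (p_1 * ... * p_l) steps.
    """
    total, length_fraction = 1.0, 1.0
    for p in pooling_factors:
        length_fraction /= p
        total += length_fraction
    return total


factors = [2, 2, 2, 2]               # assumed uniform factor p = 2, four levels
bound = 1.0 / (1.0 - 1.0 / 2)        # geometric-series limit 1 / (1 - 1/p)
print(relative_cost(factors), bound)  # 1.9375 2.0
```

Even with the smallest useful factor, p = 2, adding more levels can never more than double the cost of a single full-resolution level.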

4. Training Protocols and Empirical Findings

Poolformer was evaluated on raw 16 kHz, 8-bit–quantized audio benchmarks:

| Dataset | Length T (samples) | Description |
|---|---|---|
| SC09 | 16,000 | 1 s spoken digits |
| Beethoven | 128,000 | 8 s piano |
| YouTubeMix | ~960,000 | 60 s piano |

Metrics included negative log-likelihood (NLL, bits per token), Fréchet Inception Distance (FID), and Inception Score (IS), all on mel-spectrogram embeddings.

On SC09, Poolformer (7.3M parameters) achieved NLL=1.854, FID=0.46, and IS=6.46. It outperformed SaShiMi (4.8M, NLL=1.891, FID=1.81, IS=3.89) on all three metrics and Mamba (6.1M, NLL=1.852, FID=0.94, IS=6.26) on FID and IS, while Mamba's NLL was marginally lower. With full 4-level pooling, training ran at 11.45 epochs/hour (a speedup of more than 80% over no pooling), perceptual metrics improved, and overfitting was substantially reduced.

5. Layerwise Behavior and Information Integration

Analysis of learned RG-LRU decay parameters revealed depth-dependent memory behavior: deeper layers converged to $|a|\approx 1$ (slow decay), favoring long-range integration across pooled, subsampled representations. Shallow layers exhibited smaller $|a|$ (rapid decay), emphasizing short-term, localized feature processing in the full-resolution sequence.
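One way to read these decay magnitudes is as a memory horizon. For a constant decay $|a|$, a past state's contribution shrinks as $|a|^k$ after $k$ steps and falls to $1/e$ after $k = -1/\ln|a|$ steps. The sketch below applies this heuristic and scales it by the cumulative pooling factor to express the horizon in input samples. This is an interpretive aid rather than an analysis from the paper, and the example values are assumptions.

```python
import math


def horizon_in_samples(a_abs, pool_product=1):
    """Approximate e-folding memory horizon, in full-resolution samples.

    a_abs: decay magnitude |a| in (0, 1), treated as constant over time.
    pool_product: cumulative pooling factor at this level (assumed value).
    """
    steps = -1.0 / math.log(a_abs)  # steps until |a|**k drops to 1/e
    return pool_product * steps


# Assumed examples: a shallow full-resolution layer vs. a deep pooled layer.
shallow = horizon_in_samples(0.9)
deep = horizon_in_samples(0.999, pool_product=64)
print(round(shallow), round(deep))
```

Slow decay and pooling compound: a deep layer gains reach both from a larger $|a|$ and from each of its steps covering many input samples.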

This division of roles between layers supports hierarchical processing, with deep Split-Pool recursion capturing global dependencies and high-level temporal patterns, and shallow levels resolving local structure and fine detail.

6. Design Motivations and Adaptability to Other Domains

The model was motivated by the intractability of self-attention for long $T$ and the observed limitations of raw RNNs in encoding distant interactions. Sequence pooling mediates between fine detail extraction in shallow recurrent blocks and scalable global mixing in deep, downsampled representation layers. This design enables tractable long-sequence modeling with competitive generalization performance.

Potential extensions include:

  • Text modeling: Poolformer can process entire long-form documents by pooling sentences or paragraphs before recurrence, accommodating lengths otherwise infeasible for transformers.
  • Vision modeling: Dense ViT embeddings yield $O(10^4)$ patch tokens; Split-Pool processing admits direct ingestion and multi-scale integration.
  • Multi-modal applications: Pooling supports ingestion of hundreds of image or video frame tokens with text, allowing cross-modal context mixing over extended streams.

A plausible implication is Poolformer’s suitability for multi-modal LLMs consuming dense representations of images, videos, and text.

7. Summary and Comparative Positioning

The Recurrent Split-Pool Model, as realized in Poolformer, is an efficient, pooling-augmented, fully recurrent alternative to self-attention for long-sequence modeling. It achieves linear time and memory scaling in sequence length, experimentally surpassing leading SSMs and transformer variants on long raw audio benchmarks. Pooling accelerates training, improves the perceptual metric scores (FID and IS), and reduces overfitting. The architecture’s hierarchical and recursive pooling mechanism supports robust integration of long-term dependencies while preserving fine local features, making it practical for a broad range of domains, including text, vision, and multi-modal data (Fernández, 2 Oct 2025).
