
ReSSFormer: Scalable Long-Context Transformer

Updated 9 December 2025
  • ReSSFormer is a Transformer variant that integrates a recurrent reasoning & memory unit, adaptive sparse attention, and a self-organizing encoder to efficiently manage long-context tasks.
  • It replaces traditional layered architectures with iterative state refinement and content-driven structure, drastically reducing computational overhead.
  • Empirical evaluations show that ReSSFormer outperforms strong baselines in language modeling, multi-hop question answering, and structure generalization with a compact parameter budget.

ReSSFormer is a Transformer variant designed for scalable, efficient, and long-context reasoning by integrating three synergistic architectural modules: the Recurrent Reasoning & Memory Unit (R2MU), the Adaptive Sparse Attention Module (ASAM), and the Self-Organizing Encoder Structure (SOES). This framework departs fundamentally from standard layer-stacked Transformer architectures, eschewing dense attention and fixed positional encodings in favor of parameter sharing via recurrent inference, adaptive attention sparsity, and content-driven structure induction. These innovations enable robust performance across language modeling, long-context multi-hop question answering, and structure-sensitive tasks with a compact parameter budget and computational footprint (You et al., 2 Oct 2025).

1. Recurrent and Memory-Augmented Architecture

ReSSFormer replaces the conventional $L$-layer deep-stacking paradigm with a recurrent inference approach. Instead of computing representations as $H^{(\ell+1)} = \text{Layer}_\ell(H^{(\ell)})$ for $\ell = 1,\dots,L$, ReSSFormer iterates a single, shared "Block" for $K$ steps:

$$H^{(0)} = X, \qquad H^{(t+1)} = \text{Block}\bigl(H^{(t)}, M^{(t)}\bigr), \quad t = 0,\dots,K-1,$$

where $M^{(t)}$ encapsulates multi-scale memory. With $K \ll L$, parameter sharing constrains growth while iterative reasoning simulates deep inference.
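A minimal PyTorch sketch of this weight-tied recurrence is shown below. The internals of `SharedBlock` (a standard attention + feed-forward block that also attends over memory slots) and the static memory tensor are illustrative assumptions, not the paper's implementation; the memory update itself is sketched under R2MU.

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """One weight-tied Transformer-style block, reused for all K steps (illustrative)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Attend over the concatenation of current token states and memory slots.
        ctx = torch.cat([h, memory], dim=1)
        attn_out, _ = self.attn(self.norm1(h), ctx, ctx, need_weights=False)
        h = h + attn_out
        h = h + self.ffn(self.norm2(h))
        return h

def recurrent_inference(x, memory, block, K: int = 4):
    """H^(0) = X;  H^(t+1) = Block(H^(t), M^(t)),  t = 0 .. K-1."""
    h = x
    for _ in range(K):
        h = block(h, memory)  # the same parameters are reused at every step
    return h

# Example: batch of 2 sequences, 512 tokens, d_model = 256, 128 memory slots.
block = SharedBlock(d_model=256)
h = recurrent_inference(torch.randn(2, 512, 256), torch.randn(2, 128, 256), block)
print(h.shape)  # torch.Size([2, 512, 256])
```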

Recurrent Reasoning & Memory Unit (R2MU)

R2MU equips each recurrent step with:

  • Token-level cache: Stores recent token representations (e.g., via sliding window).
  • Segment-level summary $S^{(t)} \in \mathbb{R}^{m \times d}$: Represents pooled informational context.

Pooling is defined as

$$\hat S^{(t)} = \text{Pool}\bigl(H^{(1)},\dots,H^{(t)}\bigr),$$

where $\text{Pool}$ may be attention-weighted, projected, or top-$k$ selected. A gated update with learned $\alpha^{(t)} \in [0,1]^m$ refreshes the memory:

$$S^{(t)} = \alpha^{(t)} \odot S^{(t-1)} + \bigl(1 - \alpha^{(t)}\bigr) \odot \hat S^{(t)}.$$

The recurrence then becomes

$$H^{(t+1)} = \text{Block}\bigl(H^{(t)}, (C^{(t)}, S^{(t)})\bigr),$$

where $C^{(t)}$ is the token-level cache, enabling iterative state refinement with bounded parameters and controlled memory.
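The gated memory update can be sketched as follows; attention-weighted pooling with learned slot queries and a per-slot sigmoid gate are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class R2MUMemory(nn.Module):
    """Sketch of the gated segment-level memory update:
    S^(t) = alpha^(t) * S^(t-1) + (1 - alpha^(t)) * S_hat^(t)."""
    def __init__(self, d_model: int, m_slots: int = 128):
        super().__init__()
        # Learned slot queries pool H^(t) into m summary vectors (the Pool operator).
        self.slot_queries = nn.Parameter(torch.randn(m_slots, d_model) / d_model ** 0.5)
        self.gate = nn.Linear(2 * d_model, 1)  # produces one alpha per memory slot

    def pool(self, h: torch.Tensor) -> torch.Tensor:
        # Attention-weighted pooling of token states into m slots -> S_hat^(t).
        scores = torch.einsum("md,bnd->bmn", self.slot_queries, h)
        return torch.einsum("bmn,bnd->bmd", scores.softmax(dim=-1), h)

    def forward(self, s_prev: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        s_hat = self.pool(h)
        alpha = torch.sigmoid(self.gate(torch.cat([s_prev, s_hat], dim=-1)))
        return alpha * s_prev + (1 - alpha) * s_hat  # gated update S^(t)

# Example: refresh 128 memory slots from the current hidden states H^(t).
mem = R2MUMemory(d_model=256, m_slots=128)
s = mem(torch.zeros(2, 128, 256), torch.randn(2, 512, 256))
print(s.shape)  # torch.Size([2, 128, 256])
```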

2. Adaptive Sparse Attention Module (ASAM)

ASAM systematically reduces attention complexity and enhances focus by employing:

  • Sparsity-inducing Activation: Replaces softmax with a sparsity-inducing map $\phi$ such as sparsemax or $\text{entmax}_\alpha$,

$$A = \phi\bigl(QK^\top / \sqrt{d}\bigr),$$

resulting in attention matrices with many exact zeros.

  • Top-$k$ Key Pruning: For each query $q_i$, only the top $k$ keys (by dot product) are retained, driving computational and memory cost from $O(n^2)$ to $O(nk)$ with $k \ll n$ (see the sketch after this list):

$$\mathcal{I}_i = \text{TopK}\bigl(q_i K^\top, k\bigr), \qquad \mathrm{Attn}(q_i) = \sum_{j \in \mathcal{I}_i} A_{ij} v_j.$$

  • Per-Head Mixture-of-Experts (MoE) Routing: Each attention head has $E$ small "experts." A router $r: \mathbb{R}^d \to \Delta^E$ produces a sparse distribution $p$; only the top $e \ll E$ experts are activated per query:

$$p = r(q_i) \in \Delta^E, \qquad \mathcal{E}_i = \text{TopE}(p, e).$$

Pruning at both the token and expert levels keeps time at $O(nkd + ned)$ and memory at $O(nk + ne)$.
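A minimal sketch of the top-$k$ pruning step is given below; plain softmax over the retained keys stands in for the sparsemax/entmax activation, and the per-head MoE routing is omitted for brevity.

```python
import torch

def topk_sparse_attention(q, k, v, top_k: int = 32):
    """Each query attends only to its top_k highest-scoring keys, reducing cost
    from O(n^2) to O(n*k). Softmax over the kept keys stands in for sparsemax/entmax."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5              # (b, n_q, n_k)
    top_scores, top_idx = scores.topk(top_k, dim=-1)         # keep k keys per query
    weights = top_scores.softmax(dim=-1)                     # attention over kept keys
    # Gather the selected value vectors and mix them with the sparse weights.
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, v.size(-1))        # (b, n_q, k, d)
    v_sel = v.unsqueeze(1).expand(-1, q.size(1), -1, -1).gather(2, idx)
    return (weights.unsqueeze(-1) * v_sel).sum(dim=2)         # (b, n_q, d)

# Example: 4k-token sequence, 64-dim heads, k = 32 keys per query.
q, k, v = (torch.randn(2, 4096, 64) for _ in range(3))
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([2, 4096, 64])
```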

3. Self-Organizing Encoder Structure (SOES)

SOES eliminates explicit positional encodings, instead letting structure emerge from content-driven graph induction.

  • Content-Driven Edge Weights: Defines a latent graph $G^{(t)}$ over tokens $V = \{x_i\}$, with learned edge scores $e_{ij}^{(t)} = \psi\bigl(q_i^{(t)}, k_j^{(t)}\bigr)$, where $\psi$ is typically an MLP or dot-product kernel.
  • Structural Regularization: A regularization term penalizes abrupt changes in graph topology between recurrent steps (see the sketch after this list): $\mathcal{L}_{\mathrm{struct}} = \sum_{t=1}^{K-1} \sum_{i,j} \bigl\| e_{ij}^{(t)} - e_{ij}^{(t-1)} \bigr\|^2$
  • Position-Free Attention: Attention patterns in ASAM are modulated by the emergent $G^{(t)}$, decoupling inference from token indices or positional priors.
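The edge scoring and structural regularizer can be sketched as below; the dot-product kernel for $\psi$ and the module layout are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ContentGraph(nn.Module):
    """Content-driven edge scores e_ij^(t) = psi(q_i^(t), k_j^(t)),
    with a scaled dot-product kernel standing in for psi."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        q, k = self.q_proj(h), self.k_proj(h)
        return q @ k.transpose(-2, -1) / h.size(-1) ** 0.5   # (b, n, n) edge scores

def structural_regularizer(edges):
    """L_struct = sum_t sum_ij ||e_ij^(t) - e_ij^(t-1)||^2:
    penalizes abrupt topology changes between consecutive recurrent steps."""
    return sum(((e_t - e_prev) ** 2).sum()
               for e_prev, e_t in zip(edges[:-1], edges[1:]))

# Example: edge scores from two consecutive recurrent steps over the same tokens.
graph = ContentGraph(d_model=256)
h = torch.randn(2, 512, 256)
e0, e1 = graph(h), graph(h + 0.1 * torch.randn_like(h))
print(structural_regularizer([e0, e1]))  # scalar regularization term
```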

4. Training Protocols and Experimental Setup

ReSSFormer employs AdamW with linear warmup, cosine decay, mixed precision, and early stopping. Training uses 8× A100 GPUs. Key hyperparameters include $K = 4$ recurrence steps, top-$k = 32$ attention sparsity, $m = 128$ memory slots, $E = 8$ experts per head with $e = 2$ activated per query, and a ~125M parameter cap within a ~120B training-FLOP budget (collected in the configuration sketch at the end of this section). Primary benchmarks:

  • Long-context and multi-hop QA: NarrativeQA, HotpotQA, GSM8K
  • Language modeling: Wikitext-103, PG-19 (4–8k tokens)
  • Structure generalization: TabFact, OGB-Arxiv, shuffled paragraph QA

Baselines include GPT-2, Longformer, BigBird, Performer, Graphformer, and TabTransformer.
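For reference, the reported hyperparameters can be gathered into a single configuration object; the field names below are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class ReSSFormerConfig:
    """Key hyperparameters reported for ReSSFormer (field names are illustrative)."""
    recurrence_steps: int = 4      # K: shared-block iterations
    attention_top_k: int = 32      # keys kept per query in ASAM
    memory_slots: int = 128        # m: segment-level memory slots in R2MU
    experts_per_head: int = 8      # E: small experts per attention head
    active_experts: int = 2        # e: experts activated per query
    param_budget: str = "~125M"    # total parameter cap
    train_flops: str = "~120B"     # training compute budget

print(ReSSFormerConfig())
```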

5. Empirical Evaluation and Comparative Analysis

ReSSFormer exhibits consistent advantages under compute- and parameter-matched conditions.

Long-Context Reasoning

ReSSFormer maintains ~78% accuracy on inputs of up to 8k tokens, compared to 66–74% for strong alternatives. At 4k tokens, performance and computational efficiency are summarized below:

| Model      | Accuracy (%) | Latency (ms) | FLOPs (G) |
|------------|--------------|--------------|-----------|
| GPT-2      | 66.4         | 91           | 220       |
| Longformer | 69.7         | 104          | 190       |
| BigBird    | 71.2         | 97           | 183       |
| RoPE-TF    | 73.5         | 112          | 226       |
| ReSSFormer | 77.8         | 95           | 172       |

Language Modeling Efficiency

On Wikitext-103 and PG-19 with a 120B FLOP budget:

  • ReSSFormer: 17.4/25.8 PPL, 162G FLOPs/step, 74 ms
  • GPT-2: 19.2/28.1 PPL, 215G, 91 ms
  • Performer and MoE baselines underperform in both PPL and efficiency.

Structure Generalization

SOES yields robustness to structural noise, outperforming position-aware models by 3–6% on TabFact, OGB-Arxiv, and shuffled-QA. This demonstrates transfer capacity across table, graph, and permuted text input modalities.

Robustness to Distractors

With injected irrelevant content, ReSSFormer exhibits less than 11% relative accuracy degradation, compared to 15–20% for non-sparse baseline architectures.

6. Ablation and Functional Decomposition

Ablative studies on Wikitext-103 (perplexity, lower is better) and HotpotQA (EM, higher is better):

| Model Variant | PPL  | EM (%) | Δ Rel. |
|---------------|------|--------|--------|
| Full          | 17.4 | 71.2   | –      |
| – R2MU        | 19.0 | 67.8   | –2.5%  |
| – ASAM        | 18.6 | 66.1   | –3.7%  |
| – SOES        | 20.4 | 64.9   | –5.9%  |

Interpretations:

  • R2MU is indispensable for iterative evidence aggregation.
  • ASAM allocates compute to salient content and directly contributes to robustness.
  • SOES confers the largest marginal improvement, evidencing the benefit of replacing fixed positional encodings with content-structured attention.

7. Synthesis and Significance

ReSSFormer, by composing recurrent inference with memory (R2MU), sparse adaptive attention (ASAM), and self-organizing structural encoding (SOES), extends the feasible range and robustness of long-context and structure-sensitive Transformer models within moderate compute and parameter budgets. These components collectively support long-sequence reasoning, efficiency at long sequence lengths, resilience to context distractors, and generalization across unstructured and structured modalities, providing a compelling foundation for further development in scalable neural sequence modeling (You et al., 2 Oct 2025).

References (1)
