
ReSSFormer: Scalable Long-Context Transformer

Updated 9 December 2025
  • ReSSFormer is a Transformer variant that integrates a recurrent reasoning & memory unit, adaptive sparse attention, and a self-organizing encoder to efficiently manage long-context tasks.
  • It replaces traditional layered architectures with iterative state refinement and content-driven structure, drastically reducing computational overhead.
  • Empirical evaluations show that ReSSFormer outperforms strong baselines in language modeling, multi-hop question answering, and structure generalization with a compact parameter budget.

ReSSFormer is a Transformer variant designed for scalable, efficient, and long-context reasoning by integrating three synergistic architectural modules: the Recurrent Reasoning & Memory Unit (R2MU), the Adaptive Sparse Attention Module (ASAM), and the Self-Organizing Encoder Structure (SOES). This framework departs fundamentally from standard layer-stacked Transformer architectures, eschewing dense attention and fixed positional encodings in favor of parameter sharing via recurrent inference, adaptive attention sparsity, and content-driven structure induction. These innovations enable robust performance across language modeling, long-context multi-hop question answering, and structure-sensitive tasks with a compact parameter budget and computational footprint (You et al., 2 Oct 2025).

1. Recurrent and Memory-Augmented Architecture

ReSSFormer replaces the conventional $L$-layer deep-stacking paradigm with a recurrent inference approach. Instead of computing representations as $H^{(\ell+1)} = \text{Layer}_\ell(H^{(\ell)})$ for $\ell = 1,\dots,L$, ReSSFormer iterates a single, shared "Block" for $K$ steps:

$$H^{(0)} = X, \qquad H^{(t+1)} = \text{Block}\bigl(H^{(t)}, M^{(t)}\bigr), \quad t = 0,\dots,K-1,$$

where $M^{(t)}$ encapsulates multi-scale memory. With $K \ll L$, parameter sharing constrains growth while iterative reasoning simulates deep inference.
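A minimal PyTorch sketch of this weight-tied recurrence is shown below. The internals of `SharedBlock` (a standard attention + feed-forward block that also attends over memory slots) and the static memory tensor are illustrative assumptions, not the paper's implementation; the memory update itself is sketched under R2MU.

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """One weight-tied Transformer-style block, reused for all K steps (illustrative)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Attend over the concatenation of current token states and memory slots.
        ctx = torch.cat([h, memory], dim=1)
        attn_out, _ = self.attn(self.norm1(h), ctx, ctx, need_weights=False)
        h = h + attn_out
        h = h + self.ffn(self.norm2(h))
        return h

def recurrent_inference(x, memory, block, K: int = 4):
    """H^(0) = X;  H^(t+1) = Block(H^(t), M^(t)),  t = 0 .. K-1."""
    h = x
    for _ in range(K):
        h = block(h, memory)  # the same parameters are reused at every step
    return h

# Example: batch of 2 sequences, 512 tokens, d_model = 256, 128 memory slots.
block = SharedBlock(d_model=256)
h = recurrent_inference(torch.randn(2, 512, 256), torch.randn(2, 128, 256), block)
print(h.shape)  # torch.Size([2, 512, 256])
```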

Recurrent Reasoning & Memory Unit (R2MU)

R2MU equips each recurrent step with:

  • Token-level cache: Stores recent token representations (e.g., via sliding window).
  • Segment-level summary $S^{(t)} \in \mathbb{R}^{m \times d}$: Represents pooled informational context.

Pooling is defined as

$$\hat S^{(t)} = \text{Pool}\bigl(H^{(1)},\dots,H^{(t)}\bigr),$$

where $\text{Pool}$ may be attention-weighted, projected, or top-$k$ selected. A gated update with learned $\alpha^{(t)} \in [0,1]^m$ refreshes the memory:

$$S^{(t)} = \alpha^{(t)} \odot S^{(t-1)} + \bigl(1 - \alpha^{(t)}\bigr) \odot \hat S^{(t)}.$$

The recurrence then becomes

$$H^{(t+1)} = \text{Block}\bigl(H^{(t)}, (C^{(t)}, S^{(t)})\bigr),$$

where $C^{(t)}$ is the token-level cache, enabling iterative state refinement with bounded parameters and controlled memory.
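The gated memory update can be sketched as follows; attention-weighted pooling with learned slot queries and a per-slot sigmoid gate are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class R2MUMemory(nn.Module):
    """Sketch of the gated segment-level memory update:
    S^(t) = alpha^(t) * S^(t-1) + (1 - alpha^(t)) * S_hat^(t)."""
    def __init__(self, d_model: int, m_slots: int = 128):
        super().__init__()
        # Learned slot queries pool H^(t) into m summary vectors (the Pool operator).
        self.slot_queries = nn.Parameter(torch.randn(m_slots, d_model) / d_model ** 0.5)
        self.gate = nn.Linear(2 * d_model, 1)  # produces one alpha per memory slot

    def pool(self, h: torch.Tensor) -> torch.Tensor:
        # Attention-weighted pooling of token states into m slots -> S_hat^(t).
        scores = torch.einsum("md,bnd->bmn", self.slot_queries, h)
        return torch.einsum("bmn,bnd->bmd", scores.softmax(dim=-1), h)

    def forward(self, s_prev: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        s_hat = self.pool(h)
        alpha = torch.sigmoid(self.gate(torch.cat([s_prev, s_hat], dim=-1)))
        return alpha * s_prev + (1 - alpha) * s_hat  # gated update S^(t)

# Example: refresh 128 memory slots from the current hidden states H^(t).
mem = R2MUMemory(d_model=256, m_slots=128)
s = mem(torch.zeros(2, 128, 256), torch.randn(2, 512, 256))
print(s.shape)  # torch.Size([2, 128, 256])
```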

2. Adaptive Sparse Attention Module (ASAM)

ASAM systematically reduces attention complexity and enhances focus by employing:

  • Sparsity-inducing Activation: Replaces softmax with a sparsity-inducing map $\phi$ such as sparsemax or $\text{entmax}_\alpha$,

$$A = \phi\bigl(QK^\top / \sqrt{d}\bigr),$$

resulting in attention matrices with many exact zeros.

  • Top-$k$ Key Pruning: For each query $q_i$, only the top $k$ keys (by dot product) are retained, driving computational and memory cost from $O(n^2)$ to $O(nk)$ with $k \ll n$ (see the sketch after this list):

$$\mathcal{I}_i = \text{TopK}\bigl(q_i K^\top, k\bigr), \qquad \mathrm{Attn}(q_i) = \sum_{j \in \mathcal{I}_i} A_{ij} v_j.$$

  • Per-Head Mixture-of-Experts (MoE) Routing: Each attention head has $E$ small "experts." A router $r: \mathbb{R}^d \to \Delta^E$ produces a sparse distribution $p$; only the top $e \ll E$ experts are activated per query:

$$p = r(q_i) \in \Delta^E, \qquad \mathcal{E}_i = \text{TopE}(p, e).$$

Pruning at both the token and expert levels keeps time at $O(nkd + ned)$ and memory at $O(nk + ne)$.
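A minimal sketch of the top-$k$ pruning step is given below; plain softmax over the retained keys stands in for the sparsemax/entmax activation, and the per-head MoE routing is omitted for brevity.

```python
import torch

def topk_sparse_attention(q, k, v, top_k: int = 32):
    """Each query attends only to its top_k highest-scoring keys, reducing cost
    from O(n^2) to O(n*k). Softmax over the kept keys stands in for sparsemax/entmax."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5              # (b, n_q, n_k)
    top_scores, top_idx = scores.topk(top_k, dim=-1)         # keep k keys per query
    weights = top_scores.softmax(dim=-1)                     # attention over kept keys
    # Gather the selected value vectors and mix them with the sparse weights.
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, v.size(-1))        # (b, n_q, k, d)
    v_sel = v.unsqueeze(1).expand(-1, q.size(1), -1, -1).gather(2, idx)
    return (weights.unsqueeze(-1) * v_sel).sum(dim=2)         # (b, n_q, d)

# Example: 4k-token sequence, 64-dim heads, k = 32 keys per query.
q, k, v = (torch.randn(2, 4096, 64) for _ in range(3))
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([2, 4096, 64])
```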

3. Self-Organizing Encoder Structure (SOES)

SOES eliminates explicit positional encodings, instead letting structure emerge from content-driven graph induction.

  • Content-Driven Edge Weights: Defines a latent graph $G^{(t)}$ over tokens $V = \{x_i\}$, with learned edge scores $e_{ij}^{(t)} = \psi\bigl(q_i^{(t)}, k_j^{(t)}\bigr)$, where $\psi$ is typically an MLP or dot-product kernel.
  • Structural Regularization: A regularization term penalizes abrupt changes in graph topology between recurrent steps (see the sketch after this list): $\mathcal{L}_{\mathrm{struct}} = \sum_{t=1}^{K-1} \sum_{i,j} \bigl\| e_{ij}^{(t)} - e_{ij}^{(t-1)} \bigr\|^2$
  • Position-Free Attention: Attention patterns in ASAM are modulated by the emergent $G^{(t)}$, decoupling inference from token indices or positional priors.
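The edge scoring and structural regularizer can be sketched as below; the dot-product kernel for $\psi$ and the module layout are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ContentGraph(nn.Module):
    """Content-driven edge scores e_ij^(t) = psi(q_i^(t), k_j^(t)),
    with a scaled dot-product kernel standing in for psi."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        q, k = self.q_proj(h), self.k_proj(h)
        return q @ k.transpose(-2, -1) / h.size(-1) ** 0.5   # (b, n, n) edge scores

def structural_regularizer(edges):
    """L_struct = sum_t sum_ij ||e_ij^(t) - e_ij^(t-1)||^2:
    penalizes abrupt topology changes between consecutive recurrent steps."""
    return sum(((e_t - e_prev) ** 2).sum()
               for e_prev, e_t in zip(edges[:-1], edges[1:]))

# Example: edge scores from two consecutive recurrent steps over the same tokens.
graph = ContentGraph(d_model=256)
h = torch.randn(2, 512, 256)
e0, e1 = graph(h), graph(h + 0.1 * torch.randn_like(h))
print(structural_regularizer([e0, e1]))  # scalar regularization term
```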

4. Training Protocols and Experimental Setup

ReSSFormer employs AdamW with linear warmup, cosine decay, mixed precision, and early stopping. Training uses 8× A100 GPUs. Key hyperparameters include $K = 4$ recurrence steps, top-$k = 32$ attention sparsity, $m = 128$ memory slots, $E = 8$ experts per head with $e = 2$ activated per query, and a ~125M parameter cap within a ~120B training-FLOP budget (collected in the configuration sketch at the end of this section). Primary benchmarks:

  • Long-context and multi-hop QA: NarrativeQA, HotpotQA, GSM8K
  • Language modeling: Wikitext-103, PG-19 (4–8k tokens)
  • Structure generalization: TabFact, OGB-Arxiv, shuffled paragraph QA

Baselines include GPT-2, Longformer, BigBird, Performer, Graphformer, and TabTransformer.
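For reference, the reported hyperparameters can be gathered into a single configuration object; the field names below are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class ReSSFormerConfig:
    """Key hyperparameters reported for ReSSFormer (field names are illustrative)."""
    recurrence_steps: int = 4      # K: shared-block iterations
    attention_top_k: int = 32      # keys kept per query in ASAM
    memory_slots: int = 128        # m: segment-level memory slots in R2MU
    experts_per_head: int = 8      # E: small experts per attention head
    active_experts: int = 2        # e: experts activated per query
    param_budget: str = "~125M"    # total parameter cap
    train_flops: str = "~120B"     # training compute budget

print(ReSSFormerConfig())
```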

5. Empirical Evaluation and Comparative Analysis

ReSSFormer exhibits consistent advantages under compute- and parameter-matched conditions.

Long-Context Reasoning

ReSSFormer maintains ~78% accuracy on inputs of up to 8k tokens, compared to 66–74% for strong alternatives. At 4k tokens, performance and computational efficiency are summarized below:

| Model      | Accuracy (%) | Latency (ms) | FLOPs (G) |
|------------|--------------|--------------|-----------|
| GPT-2      | 66.4         | 91           | 220       |
| Longformer | 69.7         | 104          | 190       |
| BigBird    | 71.2         | 97           | 183       |
| RoPE-TF    | 73.5         | 112          | 226       |
| ReSSFormer | 77.8         | 95           | 172       |

Language Modeling Efficiency

On Wikitext-103 and PG-19 with a 120B FLOP budget:

  • ReSSFormer: 17.4/25.8 PPL, 162G FLOPs/step, 74 ms
  • GPT-2: 19.2/28.1 PPL, 215G, 91 ms
  • Performer and MoE baselines underperform in both PPL and efficiency.

Structure Generalization

SOES yields robustness to structural noise, outperforming position-aware models by 3–6% on TabFact, OGB-Arxiv, and shuffled-QA. This demonstrates transfer capacity across table, graph, and permuted text input modalities.

Robustness to Distractors

With injected irrelevant content, ReSSFormer exhibits less than 11% relative accuracy degradation, compared to 15–20% for non-sparse baseline architectures.

6. Ablation and Functional Decomposition

Ablative studies on Wikitext-103 (perplexity, lower is better) and HotpotQA (EM, higher is better):

| Model Variant | PPL  | EM (%) | Δ Rel. |
|---------------|------|--------|--------|
| Full          | 17.4 | 71.2   | –      |
| – R2MU        | 19.0 | 67.8   | –2.5%  |
| – ASAM        | 18.6 | 66.1   | –3.7%  |
| – SOES        | 20.4 | 64.9   | –5.9%  |

Interpretations:

  • R2MU is indispensable for iterative evidence aggregation.
  • ASAM allocates compute to salient content and directly contributes to robustness.
  • SOES confers the largest marginal improvement, evidencing the benefit of replacing fixed positional encodings with content-structured attention.

7. Synthesis and Significance

ReSSFormer, by composing recurrent inference with memory (R2MU), sparse adaptive attention (ASAM), and self-organizing structural encoding (SOES), extends the feasible range and robustness of long-context and structure-sensitive Transformer models within moderate compute and parameter budgets. These components collectively support long-sequence reasoning, efficiency at long sequence lengths, resilience to context distractors, and generalization across unstructured and structured modalities, providing a compelling foundation for further development in scalable neural sequence modeling (You et al., 2 Oct 2025).

References (1)
