ReSSFormer: Scalable Long-Context Transformer
- ReSSFormer is a Transformer variant that integrates a recurrent reasoning & memory unit, adaptive sparse attention, and a self-organizing encoder to efficiently manage long-context tasks.
- It replaces traditional layered architectures with iterative state refinement and content-driven structure, drastically reducing computational overhead.
- Empirical evaluations show that ReSSFormer outperforms strong baselines in language modeling, multi-hop question answering, and structure generalization with a compact parameter budget.
ReSSFormer is a Transformer variant designed for scalable, efficient, and long-context reasoning by integrating three synergistic architectural modules: the Recurrent Reasoning & Memory Unit (R2MU), the Adaptive Sparse Attention Module (ASAM), and the Self-Organizing Encoder Structure (SOES). This framework departs fundamentally from standard layer-stacked Transformer architectures, eschewing dense attention and fixed positional encodings in favor of parameter sharing via recurrent inference, adaptive attention sparsity, and content-driven structure induction. These innovations enable robust performance across language modeling, long-context multi-hop question answering, and structure-sensitive tasks with a compact parameter budget and computational footprint (You et al., 2 Oct 2025).
1. Recurrent and Memory-Augmented Architecture
ReSSFormer replaces the conventional $L$-layer deep stacking paradigm with a recurrent inference approach. Instead of computing representations as $h^{(\ell)} = \mathrm{Layer}_\ell(h^{(\ell-1)})$ for $\ell = 1, \dots, L$, ReSSFormer iterates a single, shared "Block" for $T$ steps, $(h^{(t)}, M^{(t)}) = \mathrm{Block}(h^{(t-1)}, M^{(t-1)})$, where $M^{(t)}$ encapsulates multi-scale memory. Because one Block is shared across all $T$ steps, parameter growth is constrained while iterative reasoning simulates deep inference.
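For illustration, a minimal PyTorch-style sketch of the shared-block recurrence follows; the `Block` interface, the number of steps, and the memory shape are assumptions for exposition rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class SharedBlockRecurrence(nn.Module):
    """Iterates one shared Block for T steps instead of stacking L distinct layers.

    The Block interface (hidden states + memory in, refined states + memory out)
    mirrors the recurrence described above; its internals are a placeholder here.
    """
    def __init__(self, block: nn.Module, num_steps: int):
        super().__init__()
        self.block = block          # single, parameter-shared Block
        self.num_steps = num_steps  # T recurrence steps (plays the role of depth)

    def forward(self, h: torch.Tensor, memory: torch.Tensor):
        # h:      (batch, seq_len, d_model) token representations
        # memory: (batch, num_slots, d_model) multi-scale memory state M
        for _ in range(self.num_steps):
            h, memory = self.block(h, memory)  # iterative state refinement
        return h, memory
```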
Recurrent Reasoning & Memory Unit (R2MU)
R2MU equips each recurrent step with:
- Token-level cache: Stores recent token representations (e.g., via sliding window).
- Segment-level summary $s^{(t)}$: Represents pooled informational context.
Pooling is defined as $s^{(t)} = \mathrm{Pool}(h^{(t)})$, where $\mathrm{Pool}$ may be attention-weighted, projected, or top-$k$ selected. A gated update with learned gate $g$ updates the memory, $M^{(t)} = g \odot M^{(t-1)} + (1 - g) \odot s^{(t)}$, and the recurrence $(h^{(t)}, M^{(t)}) = \mathrm{Block}(h^{(t-1)}, M^{(t-1)})$ enables iterative state refinement with bounded parameters and controlled memory.
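A minimal PyTorch sketch of this gated memory update is given below, assuming mean pooling as the $\mathrm{Pool}$ operator and a sigmoid gate computed from the concatenated memory and summary; the slot layout and gating parameterization are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedMemoryUpdate(nn.Module):
    """Segment-level pooling plus gated memory update: M <- g*M + (1-g)*s."""
    def __init__(self, d_model: int):
        super().__init__()
        # Learned gate g, computed from the current memory and the new summary.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h: torch.Tensor, memory: torch.Tensor):
        # h:      (batch, seq_len, d_model) token representations at step t
        # memory: (batch, num_slots, d_model) memory slots M^(t-1)
        s = h.mean(dim=1, keepdim=True)               # Pool(h): segment summary s^(t)
        s = s.expand(-1, memory.size(1), -1)          # broadcast summary to each slot
        g = torch.sigmoid(self.gate(torch.cat([memory, s], dim=-1)))  # learned gate
        return g * memory + (1.0 - g) * s             # gated update -> M^(t)
```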
2. Adaptive Sparse Attention Module (ASAM)
ASAM systematically reduces attention complexity and enhances focus by employing:
- Sparsity-inducing Activation: Replaces softmax with sparsity-inducing alternatives such as sparsemax or entmax, resulting in attention matrices with many exact zeros.
- Top-$k$ Key Pruning: For each query $q_i$, only the top-$k$ keys (by dot product) are selected, driving computational and memory cost from $O(n^2)$ to $O(nk)$ with $k \ll n$ (see the sketch after this list).
- Per-Head Mixture-of-Experts (MoE) Routing: Each attention head contains a set of small "experts." A router produces a sparse gating distribution over them, and only the top-scoring experts are activated per query. Pruning at both the token and expert level keeps time and memory costs sub-quadratic in sequence length.
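To make the top-$k$ key pruning concrete, the following single-head sketch masks all but the $k$ highest-scoring keys per query; it uses a masked softmax in place of sparsemax/entmax and omits MoE routing, so it only illustrates the masking semantics.

```python
import torch

def topk_sparse_attention(q, k, v, top_k: int):
    """Single-head attention that attends only to the top-k keys per query.

    q, k, v: (batch, seq_len, d_head); assumes top_k <= seq_len. Scores outside
    the per-query top-k are masked to -inf, so each attention row has (at most
    ties aside) k nonzero entries after normalization.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5              # (batch, n, n) dot-product scores
    kth = scores.topk(top_k, dim=-1).values[..., -1:]         # k-th largest score per query
    scores = scores.masked_fill(scores < kth, float("-inf"))  # prune all but top-k keys
    attn = torch.softmax(scores, dim=-1)                      # softmax stands in for sparsemax
    return attn @ v
```

Note that this sketch still scores all query-key pairs; realizing the $O(nk)$ cost in practice requires retrieving candidate keys without materializing the full score matrix.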
3. Self-Organizing Encoder Structure (SOES)
SOES eliminates explicit positional encodings, instead letting structure emerge from content-driven graph induction.
- Content-Driven Edge Weights: Defines a latent graph over tokens, with learned edge scores $A_{ij} = \phi(h_i, h_j)$, where $\phi$ is typically an MLP or dot-product kernel (a sketch follows this list).
- Structural Regularization: A regularization term penalizes abrupt changes in graph topology between recurrence steps, e.g. $\mathcal{L}_{\mathrm{struct}} = \lVert A^{(t)} - A^{(t-1)} \rVert_F^2$.
- Position-Free Attention: Attention patterns in ASAM are modulated by the emergent graph $A$, decoupling inference from token indices or positional priors.
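The following sketch illustrates content-driven graph induction with a dot-product kernel and a squared-Frobenius smoothness penalty between consecutive steps; both are plausible instantiations of the description above, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ContentGraphInducer(nn.Module):
    """Induces a latent token graph A from content alone (no positional encodings)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # learned kernel: phi(h_i, h_j) = <W h_i, h_j>

    def forward(self, h: torch.Tensor):
        # h: (batch, seq_len, d_model); returns row-normalized edge weights A.
        scores = self.proj(h) @ h.transpose(-2, -1)  # pairwise content scores
        return torch.softmax(scores, dim=-1)         # latent adjacency A

def structural_regularizer(a_prev: torch.Tensor, a_curr: torch.Tensor):
    """Penalizes abrupt topology changes between steps (squared Frobenius norm)."""
    return ((a_curr - a_prev) ** 2).sum(dim=(-2, -1)).mean()
```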
4. Training Protocols and Experimental Setup
ReSSFormer employs AdamW with linear warmup, cosine decay, mixed precision, and early stopping (a minimal optimizer and scheduler sketch appears after the baseline list below). Training uses 8×A100 GPUs. Key hyperparameters include the recurrence depth $T$, the top-$k$ attention sparsity level, the number of memory slots, and the number of experts per head and experts activated per query, all within a 125M parameter cap and a 120B training-FLOP budget. Primary benchmarks:
- Long-context and multi-hop QA: NarrativeQA, HotpotQA, GSM8K
- Language modeling: Wikitext-103, PG-19 (4–8k tokens)
- Structure generalization: TabFact, OGB-Arxiv, shuffled paragraph QA
Baselines include GPT-2, Longformer, BigBird, Performer, Graphformer, and TabTransformer.
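As a concrete illustration of the optimization recipe above, here is a minimal PyTorch AdamW setup with linear warmup and cosine decay; the learning rate, weight decay, and warmup length are placeholders rather than the reported hyperparameters, and mixed precision and early stopping would wrap the training loop itself.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model: torch.nn.Module, total_steps: int,
                    warmup_steps: int = 1000, lr: float = 3e-4):
    """AdamW with linear warmup followed by cosine decay (all values are placeholders)."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.1)

    def lr_lambda(step: int):
        if step < warmup_steps:                      # linear warmup phase
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero

    return optimizer, LambdaLR(optimizer, lr_lambda)
```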
5. Empirical Evaluation and Comparative Analysis
ReSSFormer exhibits consistent advantages under compute- and parameter-matched conditions.
Long-Context Reasoning
ReSSFormer maintains 78% accuracy on up to 8k-token inputs, compared to 66–74% for strong alternatives. At 4k tokens, performance and computational efficiency are summarized below:
| Model | Accuracy (%) | Latency (ms) | FLOPs (G) |
|---|---|---|---|
| GPT-2 | 66.4 | 91 | 220 |
| Longformer | 69.7 | 104 | 190 |
| BigBird | 71.2 | 97 | 183 |
| RoPE-TF | 73.5 | 112 | 226 |
| ReSSFormer | 77.8 | 95 | 172 |
Language Modeling Efficiency
On Wikitext-103 and PG-19 with a 120B FLOP budget:
- ReSSFormer: 17.4/25.8 PPL, 162G FLOPs/step, 74 ms
- GPT-2: 19.2/28.1 PPL, 215G, 91 ms
- Performer and MoE baselines underperform in both PPL and efficiency.
Structure Generalization
SOES yields robustness to structural noise, outperforming position-aware models by 3–6% on TabFact, OGB-Arxiv, and shuffled-QA. This demonstrates transfer capacity across table, graph, and permuted text input modalities.
Robustness to Distractors
When irrelevant content is injected, ReSSFormer exhibits a markedly smaller relative accuracy loss than the 15–20% degradation observed for non-sparse baseline architectures.
6. Ablation and Functional Decomposition
Ablative studies on Wikitext-103 (perplexity, lower is better) and HotpotQA (EM, higher is better):
| Model Variant | PPL | EM (%) | Rel. Δ |
|---|---|---|---|
| Full | 17.4 | 71.2 | – |
| – R2MU | 19.0 | 67.8 | –2.5% |
| – ASAM | 18.6 | 66.1 | –3.7% |
| – SOES | 20.4 | 64.9 | –5.9% |
Interpretations:
- R2MU is indispensable for iterative evidence aggregation,
- ASAM enables compute to be allocated to salient content and yields direct robustness gains,
- SOES confers the largest marginal improvement, evidencing the impact of replacing fixed positional encodings with content-structured attention.
7. Synthesis and Significance
ReSSFormer, by composing recurrent inference with memory (R2MU), sparse adaptive attention (ASAM), and self-organizing structural encoding (SOES), extends the feasible range and robustness of long-context and structure-sensitive Transformer models within moderate compute and parameter budgets. These components collectively facilitate long-sequence reasoning, efficiency at large sequence lengths, resilience to context distractors, and generalization across unstructured and structured modalities, providing a compelling foundation for further development in scalable neural sequence modeling (You et al., 2 Oct 2025).