
Poolingformer: Efficient Long-Doc Attention

Updated 22 May 2026
  • Poolingformer is a neural architecture that employs a two-level attention scheme combining local sliding-window and pooled global attention for long document modeling.
  • It uses sliding-window self-attention to capture detailed local dependencies and pooling self-attention to efficiently aggregate broader context with linear complexity.
  • Empirical results demonstrate state-of-the-art performance in long-sequence question answering and summarization, supporting context lengths up to 16k tokens.

Poolingformer is a neural architecture designed for efficient long document modeling with linear time and memory complexity. Developed as an alternative to standard Transformer models, Poolingformer employs a two-level attention scheme that combines local sliding-window attention with pooled global attention to increase the receptive field while controlling computational costs. Empirical results indicate state-of-the-art performance on multiple long-sequence question answering and summarization tasks, while supporting context lengths up to 16k tokens with efficiency superior to prior linear-time models (Zhang et al., 2021).

1. Architectural Overview

Poolingformer replaces standard full self-attention with a hierarchical two-level attention schema. At each layer, the model computes two distinct outputs for each token:

  • The first level applies sliding-window self-attention to aggregate information from a local neighborhood of size $w_1$.
  • The second level applies pooling self-attention to a broader window of size $w_2$, where keys and values are compressed via a pooling operation with kernel size $\kappa$ and stride $\xi$.

For token $i$, the attention output is given by the sum of the local output $y_i$ (first level) and the pooled global output $z_i$ (second level), with a residual connection from the layer input. This two-level design provides both fine-grained and coarse-grained context aggregation in a computationally efficient manner (Zhang et al., 2021).

2. Detailed Mechanisms

2.1 First-Level: Sliding-Window Self-Attention

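Given input embeddings $\mathbf{X} = [x_1, \ldots, x_n]$, queries $Q$, keys $K$, and values $V$ are computed as standard linear projections. For token $i$, the sliding window $\mathcal{N}(i, w_1)$ spans indices $\{i - w_1, \ldots, i, \ldots, i + w_1\}$. The first-level attention is:

$$y_i = \mathrm{softmax}\left(\frac{Q_i K_{\mathcal{N}(i, w_1)}^{\top}}{\sqrt{d}}\right) V_{\mathcal{N}(i, w_1)}$$

where $d$ is the dimension of the queries and keys. This mechanism restricts computations to $O(n w_1)$ per layer and captures local dependencies.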
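The first-level computation can be sketched in plain Python as follows. This is a minimal, single-head illustration rather than the authors' implementation: the function names are hypothetical, `Q`, `K`, and `V` are assumed to be already-projected lists of row vectors, and the window is assumed to be clipped at the sequence boundaries.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sliding_window_attention(Q, K, V, w1):
    """Token i attends only to positions i-w1 .. i+w1 (clipped at the edges)."""
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - w1), min(n, i + w1 + 1)
        # Scaled dot-product scores against the keys inside the window only.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        weights = softmax(scores)
        # Weighted sum of the values inside the window.
        out.append([sum(wt * V[j][c] for wt, j in zip(weights, range(lo, hi)))
                    for c in range(len(V[0]))])
    return out
```

Each token touches at most $2 w_1 + 1$ keys, so the total work grows as $O(n w_1)$ rather than $O(n^2)$.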

2.2 Second-Level: Pooling Self-Attention

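The intermediate outputs $\mathbf{Y} = [y_1, \ldots, y_n]$ are re-projected to produce new queries $\tilde{Q}$, keys $\tilde{K}$, and values $\tilde{V}$. For token $i$, a larger window $\mathcal{N}(i, w_2)$ is selected, and the sequence within this window undergoes a pooling operation to compress the keys and values:

$$\bar{K} = \mathrm{Pooling}\left(\tilde{K}_{\mathcal{N}(i, w_2)}; \kappa, \xi\right), \qquad \bar{V} = \mathrm{Pooling}\left(\tilde{V}_{\mathcal{N}(i, w_2)}; \kappa, \xi\right)$$

Pooling methods include mean pooling, max pooling, and strided convolution (LDConv). The pooled attention for token $i$ is:

$$z_i = \mathrm{softmax}\left(\frac{\tilde{Q}_i \bar{K}^{\top}}{\sqrt{d}}\right) \bar{V}$$

Because pooling with stride $\xi$ leaves roughly $w_2 / \xi$ keys and values per token, this approach achieves $O(n w_2 / \xi)$ complexity per layer.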
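A minimal sketch of the second level, assuming mean pooling and a single head, is given below. The names `mean_pool` and `pooling_attention` are hypothetical, the inputs are assumed to be the re-projected queries, keys, and values, and pooling is recomputed per token for clarity (an efficient implementation would share pooled results across tokens).

```python
import math

def mean_pool(rows, kernel, stride):
    """Average groups of `kernel` consecutive rows, stepping by `stride`."""
    pooled = []
    for start in range(0, max(len(rows) - kernel, 0) + 1, stride):
        chunk = rows[start:start + kernel]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

def pooling_attention(Q, K, V, w2, kernel, stride):
    """Token i attends to pooled keys/values from positions i-w2 .. i+w2."""
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - w2), min(n, i + w2 + 1)
        K_bar = mean_pool(K[lo:hi], kernel, stride)  # compressed keys
        V_bar = mean_pool(V[lo:hi], kernel, stride)  # compressed values
        scores = [sum(q * k for q, k in zip(Q[i], kb)) / math.sqrt(d)
                  for kb in K_bar]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * vb[c] for e, vb in zip(exps, V_bar))
                    for c in range(len(V[0]))])
    return out
```

The number of score computations per token is the pooled length, roughly the window length divided by the stride, which is what keeps the wider window affordable.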

2.3 Output and Residual Structure

The final output for each token after one Poolingformer layer is the sum $y_i + z_i$, added to the layer input through a residual connection, which integrates both fine and coarse context. The ratio of local to pooled attention can be controlled via the window sizes $w_1$ and $w_2$ and the pooling parameters.

3. Computational Complexity and Comparison

Poolingformer efficiently handles long input sequences by reducing both time and space complexity per layer to linear in sequence length $n$. The table below summarizes complexities for standard baselines:

Model                 Complexity
Transformer           $O(n^2)$
Reformer, Cluster     $O(n \log n)$
Longformer, BigBird   $O(n)$
Poolingformer         $O(n)$

Poolingformer's maximum input length (16k tokens) is achieved with hardware demands similar to Longformer's 4k setting, providing higher throughput and reduced memory when the pooling compression factor is small relative to the window size (Zhang et al., 2021).
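To make the scaling concrete, the following back-of-the-envelope sketch counts query-key score computations per layer. It is only an illustration: the function name and the example settings ($w_1 = 128$, $w_2 = 512$, kernel 5, stride 4) are assumptions chosen for the arithmetic, not values taken from the source, and windows are assumed to span $2w + 1$ positions with edge clipping ignored.

```python
def attention_pairs(n, w1, w2, kernel, stride):
    """Return (full attention pairs, two-level attention pairs) for length n."""
    full = n * n                                   # every token sees every token
    local = n * (2 * w1 + 1)                       # first-level sliding window
    window = 2 * w2 + 1                            # second-level window length
    pooled_len = (window - kernel) // stride + 1   # keys left after pooling
    two_level = local + n * pooled_len
    return full, two_level

# Hypothetical settings for a 16k-token input.
full, two_level = attention_pairs(16384, 128, 512, 5, 4)
print(full, two_level, round(full / two_level, 1))
```

Doubling `n` doubles the two-level count but quadruples the full-attention count, which is the practical meaning of linear versus quadratic complexity here.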

4. Empirical Evaluation

4.1 Datasets and Metrics

Poolingformer was evaluated on:

  • Natural Questions (NQ): F1 score (long/short answers)
  • TyDi QA: F1 score (passage/minimal span, multilingual)
  • arXiv Summarization: ROUGE-1, ROUGE-2, ROUGE-L on long scientific documents

4.2 Main Results

Poolingformer achieved state-of-the-art (SOTA) results on three official leaderboards:

Task                              Poolingformer   Previous SOTA   Gain
NQ Long-Answer F1                 79.8            77.9            +1.9
TyDi QA Passage F1                79.5            77.6            +1.9
TyDi QA Minimal F1                67.6            66.0            +1.6
arXiv Summ. ROUGE-1 (16k input)   48.47           46.63           +1.84
arXiv Summ. ROUGE-2               20.23           19.62           +0.61
arXiv Summ. ROUGE-L               42.69           41.83           +0.86

In ablations, two-level attention outperformed single-window baselines, indicating the utility of distant but coarsely aggregated context.

4.3 Ablation Studies

  • Increasing the pooling window $w_2$ and applying coarse pooling improved long-answer F1 up to a point. Excessive pooling or stacking too many pooling layers (e.g., all 24) led to degraded performance, indicating a trade-off between receptive field and retention of pretrained knowledge.
  • Pooling methods (mean, max, LDConv) delivered similar results; LDConv yielded the best outcomes on QA.
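The difference between the mean and max variants can be illustrated on a one-dimensional sequence. The helper below is a hypothetical sketch: it covers only mean and max pooling and does not attempt the learned LDConv variant.

```python
def pool_1d(values, kernel, stride, mode="mean"):
    """Strided 1D pooling over a list of floats using mean or max reduction."""
    reducers = {"mean": lambda chunk: sum(chunk) / len(chunk), "max": max}
    reduce = reducers[mode]
    return [reduce(values[s:s + kernel])
            for s in range(0, len(values) - kernel + 1, stride)]

seq = [1.0, 5.0, 2.0, 4.0, 3.0, 9.0]
print(pool_1d(seq, kernel=2, stride=2, mode="mean"))  # [3.0, 3.0, 6.0]
print(pool_1d(seq, kernel=2, stride=2, mode="max"))   # [5.0, 4.0, 9.0]
```

Mean pooling smooths each group into its average, while max pooling keeps only the strongest activation, so the two compress the same window into different summaries.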

5. Practical Implications and Use Cases

Poolingformer is suited for tasks requiring very long context windows, such as long-document question answering and summarization, where both local and global context are critical. Its linear time and memory profile makes it suitable for deployments with resource constraints or requirements for high throughput (Zhang et al., 2021).

6. Limitations and Open Problems

Pooling introduces coarse aggregation and may lose fine detail when the stride $\xi$ or window size $w_2$ is large. Overuse of pooling layers can result in forgetting pretrained weights and reduced performance. The architecture also introduces additional implementation complexity, including the management of two attention streams per layer and efficient pooling operations for large windows.

Potential future directions include theoretical analysis contrasting multi-level and single-level attention, extending pooling-based attention to other modalities (vision, audio, music), and developing adaptive pooling strategies or learned window sizes that better balance local and global dependency modeling (Zhang et al., 2021).

References (1)
