Poolingformer: Efficient Long-Doc Attention

Updated 22 May 2026

Poolingformer is a neural architecture that employs a two-level attention scheme combining local sliding-window and pooled global attention for long document modeling.
It uses sliding-window self-attention to capture detailed local dependencies and pooling self-attention to efficiently aggregate broader context with linear complexity.
Empirical results demonstrate state-of-the-art performance in long-sequence question answering and summarization, supporting context lengths up to 16k tokens.

Poolingformer is a neural architecture designed for efficient long document modeling with linear time and memory complexity. Developed as an alternative to standard Transformer models, Poolingformer employs a two-level attention scheme that combines local sliding-window attention with pooled global attention to increase the receptive field while controlling computational costs. Empirical results indicate state-of-the-art performance on multiple long-sequence question answering and summarization tasks, while supporting context lengths up to 16k tokens with efficiency superior to prior linear-time models (Zhang et al., 2021).

1. Architectural Overview

Poolingformer replaces standard full self-attention with a hierarchical two-level attention schema. At each layer, the model computes two distinct outputs for each token:

The first level applies sliding-window self-attention to aggregate information from a local neighborhood of size $w_1$.
The second level applies pooling self-attention to a broader window of size $w_2$, where keys and values are compressed via a pooling operation with kernel size $\kappa$ and stride $\xi$.

For token $i$, the attention output is given by the sum of the local output $y_i$ (first level) and the pooled global output $z_i$ (second level), with a residual connection from the layer input. This two-level design provides both fine-grained and coarse-grained context aggregation in a computationally efficient manner (Zhang et al., 2021).

2. Detailed Mechanisms

2.1 First-Level: Sliding-Window Self-Attention

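Given input embeddings $\mathbf{X} = [x_1, \ldots, x_n]$, queries, keys, and values $Q$, $K$, $V$ are computed as standard linear projections. For token $i$, the sliding window $\mathcal{N}(i, w_1)$ spans indices $\{i - w_1, \ldots, i + w_1\}$. The first-level attention is:

$$y_i = \mathrm{softmax}\left(\frac{Q_i K_{\mathcal{N}(i, w_1)}^{\top}}{\sqrt{d}}\right) V_{\mathcal{N}(i, w_1)}$$

where $d$ is the attention head dimension. This mechanism restricts computations to $O(n w_1)$ per layer and captures local dependencies.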
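As a rough illustration of the first-level attention, the following NumPy sketch computes single-head sliding-window attention with an explicit loop. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the random projection matrices, and the choice to clip windows at the sequence boundaries (rather than pad) are all illustrative.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def sliding_window_attention(Q, K, V, w1):
    """Token i attends only to indices i-w1 .. i+w1.

    Q, K, V have shape (n, d). Windows are clipped at the sequence
    boundaries (an assumption; padding would be another option).
    """
    n, d = Q.shape
    Y = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - w1), min(n, i + w1 + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)  # at most 2*w1 + 1 scores
        Y[i] = softmax(scores) @ V[lo:hi]
    return Y

# Toy usage with random inputs and hypothetical projection weights.
rng = np.random.default_rng(0)
n, d, w1 = 12, 4, 2
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Y = sliding_window_attention(X @ Wq, X @ Wk, X @ Wv, w1)
print(Y.shape)
```

Each token scores at most $2 w_1 + 1$ keys, so the total work grows as $n w_1$ rather than $n^2$. A production implementation would typically use batched, banded matrix operations instead of a Python loop.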

2.2 Second-Level: Pooling Self-Attention

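The intermediate outputs $Y = [y_1, \ldots, y_n]$ are re-projected to produce new queries, keys, and values $\tilde{Q}$, $\tilde{K}$, $\tilde{V}$. For token $i$, a larger window $\mathcal{N}(i, w_2)$ is selected, and the sequence within this window undergoes a pooling operation to compress the keys and values:

$$\bar{K}_i = \mathrm{Pooling}\left(\tilde{K}_{\mathcal{N}(i, w_2)};\, \kappa, \xi\right), \qquad \bar{V}_i = \mathrm{Pooling}\left(\tilde{V}_{\mathcal{N}(i, w_2)};\, \kappa, \xi\right)$$

Pooling methods include mean pooling, max pooling, and strided convolution (LDConv). The pooled attention for token $i$ is:

$$z_i = \mathrm{softmax}\left(\frac{\tilde{Q}_i \bar{K}_i^{\top}}{\sqrt{d}}\right) \bar{V}_i$$

This approach achieves $O(n w_2 / \xi)$ complexity per layer, since each token attends to roughly $2 w_2 / \xi$ pooled positions.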
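The second-level attention can be sketched in NumPy as follows. This is an illustrative sketch rather than the paper's code: the helper names (`pool`, `pooling_attention`), the boundary clipping, the random projections, and the toy hyperparameter values are assumptions, and only mean and max pooling are shown (the learned LDConv variant is omitted).

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def pool(segment, kernel, stride, mode="mean"):
    """Pool the rows of `segment` (m, d) with a given kernel size and stride."""
    reduce = np.mean if mode == "mean" else np.max
    # If the segment is shorter than the kernel, pool it into a single row.
    starts = range(0, max(len(segment) - kernel, 0) + 1, stride)
    return np.stack([reduce(segment[s:s + kernel], axis=0) for s in starts])

def pooling_attention(Q2, K2, V2, w2, kernel, stride, mode="mean"):
    """Token i attends to pooled keys/values from indices i-w2 .. i+w2."""
    n, d = Q2.shape
    Z = np.zeros_like(V2)
    for i in range(n):
        lo, hi = max(0, i - w2), min(n, i + w2 + 1)  # clipped window (assumption)
        K_bar = pool(K2[lo:hi], kernel, stride, mode)
        V_bar = pool(V2[lo:hi], kernel, stride, mode)
        scores = Q2[i] @ K_bar.T / np.sqrt(d)  # one score per pooled position
        Z[i] = softmax(scores) @ V_bar
    return Z

# Toy usage: Y stands in for the first-level outputs.
rng = np.random.default_rng(0)
n, d = 32, 4
Y = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Z = pooling_attention(Y @ Wq, Y @ Wk, Y @ Wv, w2=8, kernel=4, stride=4)
print(Z.shape)
```

With `w2=8`, `kernel=4`, and `stride=4`, an interior token's window of 17 positions is reduced to 4 pooled keys, which is where the savings over a plain window of the same size come from. Setting `kernel=1` and `stride=1` makes pooling the identity, so the function reduces to ordinary windowed attention.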

2.3 Output and Residual Structure

The final output for each token after one Poolingformer layer is $x_i + y_i + z_i$, ensuring integration of both fine and coarse context, along with a residual connection. The balance between local and pooled attention can be controlled via the window sizes $w_1$ and $w_2$ and the pooling kernel size $\kappa$ and stride $\xi$.

3. Computational Complexity and Comparison

Poolingformer efficiently handles long input sequences by reducing both time and space complexity per layer to linear in sequence length $n$. The table below summarizes complexities for standard baselines:

Model	Complexity
Transformer	$O(n^2)$
Reformer, Cluster	$O(n \log n)$
Longformer, BigBird	$O(n)$
Poolingformer	$O(n)$

Poolingformer's maximum input length (16k tokens) is achieved with hardware demands similar to Longformer's 4k setting, providing higher throughput and reduced memory when pooling compresses the second-level window $w_2$ into comparatively few keys and values (Zhang et al., 2021).
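To make the linear scaling concrete, the short Python sketch below counts query-key score computations per layer for full attention and for a two-level scheme. The counting functions are illustrative, and the hyperparameter values (`w1=128`, `w2=512`, `stride=4`) are hypothetical choices for the example, not settings confirmed by the source.

```python
def full_attention_scores(n):
    """Every token scores every token: n^2 scores per layer."""
    return n * n

def two_level_scores(n, w1, w2, stride):
    """Upper bound on scores per layer, ignoring boundary clipping.

    Each token scores 2*w1 + 1 local keys plus at most
    ceil((2*w2 + 1) / stride) pooled keys.
    """
    local = 2 * w1 + 1
    pooled = -(-(2 * w2 + 1) // stride)  # integer ceiling division
    return n * (local + pooled)

# Hypothetical settings, chosen only for illustration.
for n in (4096, 16384):
    print(n, full_attention_scores(n), two_level_scores(n, w1=128, w2=512, stride=4))
```

Going from 4,096 to 16,384 tokens multiplies the full-attention count by 16 but the two-level count only by 4, which matches the quadratic versus linear entries in the table.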

4. Empirical Evaluation

4.1 Datasets and Metrics

Poolingformer was evaluated on:

Natural Questions (NQ): F1 score (long/short answers)
TyDi QA: F1 score (passage/minimal span, multilingual)
arXiv Summarization: ROUGE-1, ROUGE-2, ROUGE-L on long scientific documents

4.2 Main Results

Poolingformer achieved state-of-the-art (SOTA) results on three official leaderboards:

Task	Poolingformer	Previous SOTA	Gain
NQ Long-Answer F1	79.8	77.9	+1.9
TyDi QA Passage F1	79.5	77.6	+1.9
TyDi QA Minimal F1	67.6	66.0	+1.6
arXiv Summ. ROUGE-1 (16k in)	48.47	46.63	+1.84
arXiv Summ. ROUGE-2	20.23	19.62	+0.61
arXiv Summ. ROUGE-L	42.69	41.83	+0.86

In ablations, two-level attention outperformed single-window baselines, indicating the utility of distant but coarsely aggregated context.

4.3 Ablation Studies

Increasing the pooling window $w_2$ and applying coarse pooling improved long-answer F1 up to a point. Excessive pooling or stacking too many pooling layers (e.g., all 24) led to degraded performance, indicating a trade-off between receptive field and retention of pretrained knowledge.
Pooling methods (mean, max, LDConv) delivered similar results; LDConv yielded the best outcomes on QA.

5. Practical Implications and Use Cases

Poolingformer is suited for tasks requiring very long context windows, such as long-document question answering and summarization, where both local and global context are critical. Its linear time and memory profile makes it suitable for deployments with resource constraints or requirements for high throughput (Zhang et al., 2021).

6. Limitations and Open Problems

Pooling introduces coarse aggregation and may lose fine detail when the stride $\xi$ or window size $w_2$ is large. Overuse of pooling layers can result in forgetting pretrained weights and reduced performance. The architecture also introduces additional implementation complexity, including the management of two attention streams per layer and efficient pooling operations for large windows.
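The loss of fine detail can be seen in a small numeric example. The sketch below is purely illustrative (a one-dimensional toy signal and a hypothetical `mean_pool` helper, not anything from the paper): a single salient position is mean-pooled with a small and a large kernel and stride.

```python
import numpy as np

def mean_pool(values, kernel, stride):
    """Mean-pool a 1-D array with the given kernel size and stride."""
    starts = range(0, len(values) - kernel + 1, stride)
    return np.array([values[s:s + kernel].mean() for s in starts])

signal = np.zeros(16)
signal[5] = 1.0  # one salient position among 16

fine = mean_pool(signal, kernel=2, stride=2)    # 8 pooled values
coarse = mean_pool(signal, kernel=8, stride=8)  # 2 pooled values

print(fine.max(), coarse.max())  # 0.5 0.125
```

With the larger kernel and stride, the salient value is diluted from 0.5 to 0.125 and its position is known only to within eight tokens, which is the kind of detail the first-level local attention is meant to preserve.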

Potential future directions include theoretical analysis contrasting multi-level and single-level attention, extending pooling-based attention to other modalities (vision, audio, music), and developing adaptive pooling strategies or learned window sizes that better balance local and global dependency modeling (Zhang et al., 2021).

References

Zhang et al. (2021). Poolingformer: Long Document Modeling with Pooling Attention.
