Poolingformer: Efficient Long-Doc Attention
- Poolingformer is a neural architecture that employs a two-level attention scheme combining local sliding-window and pooled global attention for long document modeling.
- It uses sliding-window self-attention to capture detailed local dependencies and pooling self-attention to efficiently aggregate broader context with linear complexity.
- Empirical results demonstrate state-of-the-art performance in long-sequence question answering and summarization, supporting context lengths up to 16k tokens.
Poolingformer is a neural architecture designed for efficient long document modeling with linear time and memory complexity. Developed as an alternative to standard Transformer models, Poolingformer employs a two-level attention scheme that combines local sliding-window attention with pooled global attention to increase the receptive field while controlling computational costs. Empirical results indicate state-of-the-art performance on multiple long-sequence question answering and summarization tasks, while supporting context lengths up to 16k tokens with efficiency superior to prior linear-time models (Zhang et al., 2021).
1. Architectural Overview
Poolingformer replaces standard full self-attention with a hierarchical two-level attention schema. At each layer, the model computes two distinct outputs for each token:
- The first level applies sliding-window self-attention to aggregate information from a local neighborhood with window size $w_1$.
- The second level applies pooling self-attention to a broader window of size $w_2$, where keys and values are compressed via a pooling operation with kernel size $\kappa$ and stride $\xi$.
For token $i$, the attention output is given by the sum of the local output (first level) and the pooled global output (second level), with a residual connection from the layer input. This two-level design provides both fine-grained and coarse-grained context aggregation in a computationally efficient manner (Zhang et al., 2021).
2. Detailed Mechanisms
2.1 First-Level: Sliding-Window Self-Attention
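Given input embeddings $X = [x_1, \dots, x_n] \in \mathbb{R}^{n \times d}$, queries, keys, and values are computed as standard linear projections: $Q = XW^Q$, $K = XW^K$, $V = XW^V$. For token $i$, the sliding window spans indices $\mathcal{N}(i, w_1) = \{i - w_1, \dots, i, \dots, i + w_1\}$, where $w_1$ is the first-level window size. The first-level attention is:
$$y_i = \mathrm{softmax}\!\left(\frac{Q_i K_{\mathcal{N}(i, w_1)}^{\top}}{\sqrt{d}}\right) V_{\mathcal{N}(i, w_1)}$$
This mechanism restricts computation to $O(n w_1)$ per layer for a sequence of length $n$ and captures local dependencies.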
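The first-level attention described above can be sketched in NumPy as follows. This is a minimal single-head illustration rather than the authors' implementation: the function names, the explicit Python loop, and the choice to clip windows at the sequence boundaries are all assumptions made for readability.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

def sliding_window_attention(X, Wq, Wk, Wv, w1):
    """Token i attends only to tokens i-w1 .. i+w1.

    X is (n, d); Wq, Wk, Wv are (d, d) projection matrices.
    Windows are clipped at the sequence ends (an assumption;
    padding would be another option).
    """
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Y = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - w1), min(n, i + w1 + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)  # one score per window position
        Y[i] = softmax(scores) @ V[lo:hi]
    return Y

# Toy usage with random weights.
rng = np.random.default_rng(0)
n, d, w1 = 12, 4, 2
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Y = sliding_window_attention(X, Wq, Wk, Wv, w1)
print(Y.shape)  # (12, 4)
```

Each token computes at most `2 * w1 + 1` scores, so the number of scores grows linearly with `n`. A practical implementation would typically replace the loop with a banded or chunked batched computation.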
2.2 Second-Level: Pooling Self-Attention
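The intermediate first-level outputs $Y = [y_1, \dots, y_n]$ are re-projected to produce new queries, keys, and values: $\tilde{Q} = Y\tilde{W}^Q$, $\tilde{K} = Y\tilde{W}^K$, $\tilde{V} = Y\tilde{W}^V$. For token $i$, a larger window $\mathcal{N}(i, w_2) = \{i - w_2, \dots, i, \dots, i + w_2\}$ is selected, and the sequence within this window undergoes a pooling operation with kernel size $\kappa$ and stride $\xi$ to compress the keys and values:
$$\bar{K}_i = \mathrm{Pooling}\!\left(\tilde{K}_{\mathcal{N}(i, w_2)};\, \kappa, \xi\right), \qquad \bar{V}_i = \mathrm{Pooling}\!\left(\tilde{V}_{\mathcal{N}(i, w_2)};\, \kappa, \xi\right)$$
Pooling methods include mean pooling, max pooling, and strided convolution (LDConv). The pooled attention for token $i$ is:
$$z_i = \mathrm{softmax}\!\left(\frac{\tilde{Q}_i \bar{K}_i^{\top}}{\sqrt{d}}\right) \bar{V}_i$$
This approach achieves $O(n w_2 / \xi)$ complexity per layer for a sequence of length $n$.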
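A minimal NumPy sketch of the second-level pooling attention is given below, covering mean and max pooling only. The helper `pool1d`, the boundary handling (clipped windows, trailing partial pooling windows dropped), and the argument names are illustrative assumptions, not details taken from the paper; the convolution-based LDConv variant is not shown.

```python
import numpy as np

def pool1d(A, kernel, stride, mode="mean"):
    """Pool the rows of A (length, d) along the sequence axis.

    Pooling windows that would run past the end are dropped; inputs
    shorter than `kernel` collapse to a single pooled row (assumption).
    """
    reduce = np.mean if mode == "mean" else np.max
    starts = range(0, max(len(A) - kernel, 0) + 1, stride)
    return np.stack([reduce(A[s:s + kernel], axis=0) for s in starts])

def pooling_attention(Y, Wq, Wk, Wv, w2, kernel, stride, mode="mean"):
    """Token i attends to pooled keys/values from tokens i-w2 .. i+w2."""
    n, d = Y.shape
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv
    Z = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - w2), min(n, i + w2 + 1)
        K_bar = pool1d(K[lo:hi], kernel, stride, mode)  # compressed keys
        V_bar = pool1d(V[lo:hi], kernel, stride, mode)  # compressed values
        scores = Q[i] @ K_bar.T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        Z[i] = (weights / weights.sum()) @ V_bar
    return Z

# Toy usage: Y stands in for the first-level outputs.
rng = np.random.default_rng(0)
n, d = 32, 4
Y = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Z = pooling_attention(Y, Wq, Wk, Wv, w2=8, kernel=3, stride=2)
print(Z.shape)                               # (32, 4)
print(pool1d(np.ones((17, 4)), 3, 2).shape)  # (8, 4)
```

With `w2=8`, a full window holds 17 positions, which pooling reduces to 8 keys and values, so each token scores far fewer entries than the window length. This reference loop re-pools every window for clarity and therefore does more pooling work than an efficient implementation would; only the number of attention scores reflects the stated cost.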
2.3 Output and Residual Structure
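The final output for each token after one Poolingformer layer is $x_i + y_i + z_i$: the sum of the first-level output $y_i$ and the second-level output $z_i$, plus a residual connection from the layer input $x_i$. This integrates both fine and coarse context. The ratio of local to pooled attention can be controlled via the window sizes $w_1$ and $w_2$ and the pooling stride $\xi$.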
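The data flow of one layer can be illustrated with a short sketch. It assumes, following the description above, that the second level consumes the first-level outputs and that the layer input is added back as a residual. The function name and the two callables are hypothetical stand-ins, not part of any published API.

```python
import numpy as np

def poolingformer_layer_output(X, first_level, second_level):
    """Combine the two attention levels with a residual connection.

    `first_level` and `second_level` are hypothetical callables mapping an
    (n, d) array to an (n, d) array; they stand in for the sliding-window
    and pooling attention modules.
    """
    Y = first_level(X)    # local, fine-grained context from the layer input
    Z = second_level(Y)   # coarse context, computed from Y rather than X
    return X + Y + Z      # residual from the input plus both levels

# Toy stand-ins (not real attention) that make the data flow easy to check.
X = np.ones((3, 2))
out = poolingformer_layer_output(X, lambda A: 0.5 * A, lambda A: 2.0 * A)
print(out[0])  # [2.5 2.5]
```

The toy callables only scale their input, which makes the composition visible: the second level sees `0.5 * X`, not `X`, so the result is `1 + 0.5 + 1.0 = 2.5` in every entry.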
3. Computational Complexity and Comparison
Poolingformer efficiently handles long input sequences by reducing both time and space complexity per layer to linear in sequence length $n$. The table below summarizes complexities for standard baselines:
| Model | Complexity |
|---|---|
| Transformer | $O(n^2)$ |
| Reformer, Cluster | $O(n \log n)$ |
| Longformer, BigBird | $O(n)$ |
| Poolingformer | $O(n)$ |
Poolingformer’s maximum input length (16k tokens) is achieved with hardware demands similar to Longformer's 4k setting, providing higher throughput and reduced memory when pooling compresses the second-level window strongly, that is, when the pooled length $w_2/\xi$ is small relative to the window size $w_2$ (Zhang et al., 2021).
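To make the scaling concrete, the sketch below counts attention scores per layer for full attention and for a two-level scheme. The window sizes, kernel, and stride are hypothetical values chosen for illustration and are not claimed to be the paper's settings; boundary clipping is ignored, so the two-level count is an upper bound.

```python
def full_attention_scores(n):
    """Every token scores every token: n * n entries."""
    return n * n

def two_level_scores(n, w1, w2, kernel, stride):
    """Upper bound on scores per layer for local plus pooled attention."""
    local = 2 * w1 + 1                            # first-level window length
    pooled = (2 * w2 + 1 - kernel) // stride + 1  # pooled keys per token
    return n * (local + pooled)

# Hypothetical settings, for illustration only.
w1, w2, kernel, stride = 128, 512, 5, 4
for n in (4096, 16384):
    full = full_attention_scores(n)
    two = two_level_scores(n, w1, w2, kernel, stride)
    print(f"n={n}: full={full}, two-level={two}, ratio={full / two:.1f}")
```

Under these assumed settings each token scores 257 local positions and 256 pooled positions, so the two-level count grows by a factor of 4 when `n` grows from 4096 to 16384, while the full-attention count grows by a factor of 16.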
4. Empirical Evaluation
4.1 Datasets and Metrics
Poolingformer was evaluated on:
- Natural Questions (NQ): F1 score (long/short answers)
- TyDi QA: F1 score (passage/minimal span, multilingual)
- arXiv Summarization: ROUGE-1, ROUGE-2, ROUGE-L on long scientific documents
4.2 Main Results
Poolingformer achieved state-of-the-art (SOTA) results on three official leaderboards:
| Task | Poolingformer | Previous SOTA | Gain |
|---|---|---|---|
| NQ Long-Answer F1 | 79.8 | 77.9 | +1.9 |
| TyDi QA Passage F1 | 79.5 | 77.6 | +1.9 |
| TyDi QA Minimal F1 | 67.6 | 66.0 | +1.6 |
| arXiv Summ. ROUGE-1 (16k in) | 48.47 | 46.63 | +1.84 |
| arXiv Summ. ROUGE-2 | 20.23 | 19.62 | +0.61 |
| arXiv Summ. ROUGE-L | 42.69 | 41.83 | +0.86 |
On ablation, two-level attention (a first-level window combined with a larger pooled second-level window) outperformed single-window baselines, indicating the utility of distant but coarsely aggregated context.
4.3 Ablation Studies
- Increasing the pooling window $w_2$ and applying coarse pooling improved long-answer F1 up to a point. Excessive pooling or stacking too many pooling layers (e.g., all 24) led to degraded performance, indicating a trade-off between receptive field and retention of pretrained knowledge.
- Pooling methods (mean, max, LDConv) delivered similar results; LDConv yielded the best outcomes on QA.
5. Practical Implications and Use Cases
Poolingformer is suited for tasks requiring very long context windows, such as long-document question answering and summarization, where both local and global context are critical. Its linear time and memory profile makes it suitable for deployments with resource constraints or requirements for high throughput (Zhang et al., 2021).
6. Limitations and Open Problems
Pooling introduces coarse aggregation and may lose fine detail when the stride $\xi$ or window size $w_2$ is large. Overusing pooling layers can erode the knowledge stored in the pretrained weights and reduce performance. The architecture also adds implementation complexity, including the management of two attention streams per layer and efficient pooling operations for large windows.
Potential future directions include theoretical analysis contrasting multi-level and single-level attention, extending pooling-based attention to other modalities (vision, audio, music), and developing adaptive pooling strategies or learned window sizes that better balance local and global dependency modeling (Zhang et al., 2021).