Revisiting Transformer-based Models for Long Document Classification (2204.06683v2)

Published 14 Apr 2022 in cs.CL

Abstract: The recent literature in text classification is biased towards short text sequences (e.g., sentences or paragraphs). In real-world applications, multi-page multi-paragraph documents are common and they cannot be efficiently encoded by vanilla Transformer-based models. We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla transformers to encode much longer text, namely sparse attention and hierarchical encoding methods. We examine several aspects of sparse attention (e.g., size of local attention window, use of global attention) and hierarchical (e.g., document splitting strategy) transformers on four document classification datasets covering different domains. We observe a clear benefit from being able to process longer text, and, based on our results, we derive practical advice of applying Transformer-based models on long document classification tasks.

Enhanced Techniques for Long Document Classification Using Transformer-Based Models

Introduction to Long Document Classification Challenges

Long document classification poses a significant challenge in NLP. Traditional Transformer-based models such as BERT and its variants were designed and pre-trained on relatively short text sequences (up to 512 tokens), so they cannot efficiently handle documents that span multiple pages and paragraphs. Simply truncating documents is not sufficient, because it often discards crucial information, and the quadratic complexity of the self-attention mechanism makes encoding longer inputs computationally expensive. This paper examines alternative methods that mitigate these issues, focusing on sparse attention and hierarchical encoding strategies.
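
To make the truncation baseline concrete, the snippet below sketches how a standard encoder simply discards everything past 512 tokens; the model name and example input are illustrative assumptions, not the paper's exact setup.

```python
# Truncation baseline: encode only the first 512 tokens of each document.
# Model name and inputs are illustrative, not the paper's exact configuration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def encode_truncated(texts):
    # Anything beyond max_length is silently dropped, which is where
    # information loss occurs for multi-page documents.
    return tokenizer(
        texts,
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors="pt",
    )

batch = encode_truncated(["A very long clinical note ..."])
outputs = model(**batch)  # logits over the label set
```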

Evaluating Long Document Classification Approaches

Our investigation centers on comparing two main approaches for processing long documents with Transformer-based models: sparse-attention Transformers, exemplified by models such as Longformer, and hierarchical Transformers. Through experiments conducted across four distinct document classification datasets (MIMIC-III, ECtHR, Hyperpartisan, and 20 News), which span various domains, we provide insights into the efficacy and efficiency of these methods.

Sparse-attention models such as Longformer combine local window-based attention with global attention on a few selected tokens, enabling the processing of up to 4,096 tokens with reduced computational overhead. Our analysis reveals that smaller local attention windows can significantly improve efficiency without compromising effectiveness, and that assigning global attention to a small set of additional tokens helps stabilize training.
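
The following sketch shows how such a configuration might look with the Hugging Face Longformer implementation; the window size of 256 and the choice of global-attention token are illustrative assumptions, not values prescribed by the paper.

```python
# Sketch of configuring Longformer's sparse attention: a local attention window
# plus global attention on a few tokens. Window size and model choice are
# illustrative assumptions.
import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    attention_window=256,  # local window per layer; smaller windows reduce compute
    num_labels=2,
)

inputs = tokenizer(
    "A long document ...", return_tensors="pt", truncation=True, max_length=4096
)

# Global attention on the first (<s>/CLS) token; a small set of such tokens
# is enough to stabilize training.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
```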

Hierarchical Transformers, on the other hand, divide a document into smaller segments, encode each segment separately, and then aggregate the segment representations into a single document representation. The optimal segment length varied across datasets, underscoring the importance of dataset-specific configuration. Allowing segments to overlap alleviated the context fragmentation problem, whereas splitting based on document structure (e.g., by paragraph) did not consistently yield improvements.
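
A minimal sketch of this idea, assuming BERT as the segment encoder and a small Transformer layer as the aggregator, is given below; segment length, stride (overlap), and pooling choices are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal hierarchical classifier sketch: split the document into overlapping
# segments, encode each with BERT, aggregate segment [CLS] vectors with a
# small Transformer, then classify. Hyperparameters are illustrative.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class HierarchicalClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_labels=2,
                 seg_len=128, stride=96):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.seg_len, self.stride = seg_len, stride
        # Second-level encoder aggregates one [CLS] vector per segment.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.doc_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, text):
        ids = self.tokenizer(text, add_special_tokens=False)["input_ids"]
        # Overlapping segments (stride < seg_len) mitigate context fragmentation.
        segments = [ids[i:i + self.seg_len]
                    for i in range(0, max(len(ids), 1), self.stride)]
        batch = self.tokenizer.pad(
            {"input_ids": [[self.tokenizer.cls_token_id] + s for s in segments]},
            return_tensors="pt",
        )
        cls_vecs = self.encoder(**batch).last_hidden_state[:, 0]  # (num_segs, hidden)
        doc_vecs = self.doc_encoder(cls_vecs.unsqueeze(0))        # (1, num_segs, hidden)
        return self.classifier(doc_vecs.mean(dim=1))              # (1, num_labels)

logits = HierarchicalClassifier()("A very long document ...")
```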

Practical Implications and Future Directions

Our findings offer practical guidance for applying Transformers to long document classification tasks. In particular, Longformer is a robust starting point because its default configuration works well out of the box, whereas hierarchical Transformers require more careful tuning but can process texts beyond the 4,096-token limit of current sparse attention models.

The results also highlight the potential of task-adaptive pre-training (TAPT) to improve performance, most notably on MIMIC-III, where the clinical domain poses additional challenges. This suggests that TAPT is a valuable step when adapting Transformer models to specialized domains whose content is not well covered by general pre-training corpora.
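
As an illustration, TAPT amounts to continuing masked-language-model training on the unlabelled task documents before fine-tuning. The sketch below uses the Hugging Face Trainer; the model name, output path, and hyperparameters are assumptions for demonstration, not the paper's settings.

```python
# Task-adaptive pre-training (TAPT) sketch: continue MLM training on the raw
# task documents, then fine-tune from the resulting checkpoint.
# Model name, paths, and hyperparameters are illustrative assumptions.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# task_texts would be the unlabelled documents of the target task.
task_texts = ["example document one ...", "example document two ..."]
dataset = Dataset.from_dict({"text": task_texts}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tapt-checkpoint", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
# The saved checkpoint is then loaded as the encoder for the classification model.
```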

In summary, this paper contributes to a better understanding of the performance of Transformer-based models on long document classification tasks. While we report that Transformer-based models can outperform traditional CNN-based models on MIMIC-III, indicating a significant advancement in handling long documents, the choice between sparse attention and hierarchical approaches depends on specific task requirements and computational resources.

Looking ahead, further research could explore integrating these approaches with advanced classifiers that leverage label hierarchies and other domain-specific knowledge. Additionally, extending the processing capabilities to handle even longer sequences could open new avenues for applying Transformers to a broader range of long document classification tasks.

Authors (4)
  1. Xiang Dai (18 papers)
  2. Ilias Chalkidis (40 papers)
  3. Sune Darkner (24 papers)
  4. Desmond Elliott (53 papers)
Citations (63)