Enhanced Techniques for Long Document Classification Using Transformer-Based Models
Introduction to Long Document Classification Challenges
Transformer-based long document classification (TrLDC) represents a significant challenge in NLP. Traditional Transformer-based models such as BERT and its variants were designed and pre-trained on relatively short text sequences (up to 512 tokens) and therefore struggle to handle long documents spanning multiple pages and paragraphs. The standard workaround of truncating documents is not sufficient, as it often discards crucial information. Moreover, the self-attention mechanism scales quadratically with sequence length, making long inputs computationally expensive. This paper examines alternative methods that mitigate these issues, focusing on sparse attention and hierarchical encoding strategies.
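As a concrete illustration of the truncation problem, the minimal sketch below (assuming the Hugging Face transformers library and a generic BERT checkpoint; the document string is a placeholder) shows how everything beyond the 512-token limit is simply discarded:

```python
# Minimal sketch: the default truncation strategy with a standard BERT tokenizer.
# Everything beyond the 512-token limit is silently dropped.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_document = " ".join(["word"] * 10_000)  # stand-in for a multi-page document

encoded = tokenizer(
    long_document,
    truncation=True,      # cut the sequence at max_length
    max_length=512,       # BERT's pre-training limit
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 512]); the rest of the document is lost
```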
Evaluating Long Document Classification Approaches
Our investigation centers on comparing two main approaches for processing long documents with Transformer-based models: sparse-attention Transformers, exemplified by models such as Longformer, and hierarchical Transformers. Through experiments conducted across four distinct document classification datasets (MIMIC-III, ECtHR, Hyperpartisan, and 20 News), which span various domains, we provide insights into the efficacy and efficiency of these methods.
Sparse-attention models such as Longformer combine local window-based attention with global attention, enabling the processing of up to 4096 tokens with reduced computational overhead. Our analysis reveals that smaller local attention windows can significantly improve efficiency without compromising effectiveness. Furthermore, we find that adding a small number of tokens with global attention can stabilize training.
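A minimal sketch of this setup follows, assuming the Hugging Face transformers implementation of Longformer; the window size of 128, the binary classification head, and the input text are illustrative choices rather than the exact settings evaluated here:

```python
# Hedged sketch: Longformer with a smaller local attention window and global
# attention on the [CLS] token. Window size and label count are illustrative.
import torch
from transformers import AutoTokenizer, LongformerConfig, LongformerForSequenceClassification

config = LongformerConfig.from_pretrained(
    "allenai/longformer-base-4096",
    attention_window=128,   # smaller local window than the default 512
    num_labels=2,
)
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", config=config
)
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer(
    "A very long clinical note ...", truncation=True, max_length=4096, return_tensors="pt"
)

# Global attention only on the first ([CLS]) token; all other positions use local attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.logits)
```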
Hierarchical Transformers, on the other hand, divide a document into smaller segments, encode each segment separately, and then aggregate the segment representations into a document-level representation. The optimal segment length varied across datasets, underscoring the importance of dataset-specific configuration. Allowing segments to overlap appeared to alleviate the context fragmentation problem, whereas splitting along document structure (e.g., by paragraph) did not consistently yield improvements.
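The sketch below illustrates one possible hierarchical encoder, assuming PyTorch and Hugging Face transformers; the segment length, stride, mean-pooling aggregation, and two-layer segment encoder are illustrative choices, not the exact architecture studied here:

```python
# Hedged sketch of a hierarchical encoder: split the document into overlapping
# segments, encode each with BERT, then aggregate the per-segment [CLS] vectors
# with a small segment-level Transformer before classification.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class HierarchicalClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2, seg_len=128, stride=64):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        seg_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.segment_encoder = nn.TransformerEncoder(seg_layer, num_layers=2)
        self.classifier = nn.Linear(hidden, num_labels)
        self.seg_len, self.stride = seg_len, stride

    def forward(self, text):
        # Tokenize the full document, then cut it into overlapping windows.
        ids = self.tokenizer(text, add_special_tokens=False)["input_ids"]
        segments = [ids[i:i + self.seg_len] for i in range(0, max(len(ids), 1), self.stride)]
        batch = self.tokenizer.pad(
            {"input_ids": [[self.tokenizer.cls_token_id] + s for s in segments]},
            return_tensors="pt",
        )
        # Encode each segment independently and keep its [CLS] representation.
        cls_vecs = self.encoder(**batch).last_hidden_state[:, 0, :]          # (num_segments, hidden)
        # Contextualize segment vectors, then mean-pool into a document vector.
        doc_repr = self.segment_encoder(cls_vecs.unsqueeze(0)).mean(dim=1)   # (1, hidden)
        return self.classifier(doc_repr)

model = HierarchicalClassifier()
logits = model("A very long document ... " * 200)
print(logits.shape)  # torch.Size([1, 2])
```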
Practical Implications and Future Directions
Our findings offer practical guidance for applying Transformers to long document classification tasks. In particular, Longformer's effective default configuration makes it a robust starting point, while hierarchical Transformers require more careful tuning but can process texts beyond the 4096-token limit of current sparse-attention models.
The results also highlight the potential of task-adaptive pre-training (TAPT) to improve performance, notably on the MIMIC-III dataset, where domain specificity poses additional challenges. This suggests that TAPT can be a valuable step in adapting Transformer models to specialized domains whose content general pre-training data may not comprehensively cover.
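For reference, a hedged sketch of how TAPT can be run with the Hugging Face Trainer: continued masked-language-model training on the unlabelled task text before classification fine-tuning. The file name, model checkpoint, and hyperparameters below are placeholders, not the settings used in our experiments:

```python
# Hedged sketch of task-adaptive pre-training (TAPT): continue MLM training on the
# unlabelled text of the target task, then fine-tune the resulting checkpoint.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Unlabelled task documents, one per line (hypothetical file).
raw = load_dataset("text", data_files={"train": "task_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="tapt-checkpoint", num_train_epochs=3, per_device_train_batch_size=8)

Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator).train()
# The saved checkpoint is then loaded as the starting point for classification fine-tuning.
```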
In summary, this paper contributes to a better understanding of the performance of Transformer-based models on long document classification tasks. While we report that Transformer-based models can outperform traditional CNN-based models on MIMIC-III, indicating a significant advancement in handling long documents, the choice between sparse attention and hierarchical approaches depends on specific task requirements and computational resources.
Looking ahead, further research could explore integrating these approaches with advanced classifiers that leverage label hierarchies and other domain-specific knowledge. Additionally, extending the processing capabilities to handle even longer sequences could open new avenues for applying Transformers to a broader range of long document classification tasks.