Chinese Long Document Classification with ERNIE-DOC
- Chinese long document classification is the task of automatically categorizing extended Chinese texts, such as news articles and app descriptions, into predefined topics.
- ERNIE-DOC overcomes Transformer limitations by employing a retrospective feed and same-layer recurrence to integrate document-level context without excessive memory use.
- Empirical results on THUCNews and IFLYTEK datasets demonstrate that ERNIE-DOC achieves superior accuracy compared to traditional models in long text classification.
Chinese long document classification refers to the automatic categorization of lengthy Chinese-language textual documents—such as news articles, app descriptions, or other extended discourse—into predefined topic or subject labels using advanced deep learning architectures. Traditional Transformers, while state-of-the-art for short- to medium-length sequences, are ill-suited for documents exceeding several hundred tokens due to quadratic time and memory complexity as well as the context fragmentation problem. An effective solution is provided by ERNIE-DOC, a retrospective long-document modeling Transformer that enables document-level context integration and efficient end-to-end learning for Chinese long document classification, outperforming conventional models across multiple benchmarks (Ding et al., 2020).
1. Limitations of Standard Transformers in Long Document Modeling
Standard bidirectional Transformers, including BERT and RoBERTa, operate under a token-length constraint (typically 512), processing longer texts by truncation or independent segmentation. When a document is divided into non-overlapping chunks, each chunk is modeled independently, causing context fragmentation—critical inter-segment dependencies are missed and global semantics are lost. Furthermore, the cost of full self-attention on a document of length $L$ is $O(L^2)$ in both time and memory, rendering such models impractical for long documents encountered in Chinese NLP applications (Ding et al., 2020).
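To make these two limitations concrete, here is a minimal Python sketch (using stand-in token IDs, not code from the paper) that contrasts the quadratic cost of full self-attention with independent 512-token chunks, and notes what chunking gives up:

```python
def chunk(tokens, window=512):
    """Split a tokenized document into independent, non-overlapping chunks."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

doc = list(range(2000))                            # stand-in token IDs for a ~2,000-token document
chunks = chunk(doc)                                # 4 chunks, each encoded in isolation

full_cost = len(doc) ** 2                          # full self-attention: O(L^2) = 4,000,000 scores
chunked_cost = sum(len(c) ** 2 for c in chunks)    # 3 * 512^2 + 464^2 ≈ 1.0M scores

print(len(chunks), full_cost, chunked_cost)
# Chunking is cheaper, but a token in chunk 0 can never attend to one in chunk 3:
# exactly the context fragmentation that ERNIE-DOC is designed to remove.
```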
2. ERNIE-DOC: Model Architecture and Mechanisms
ERNIE-DOC builds upon Recurrence Transformers by introducing two core enhancements specifically tailored for long-document context integration:
- Retrospective Feed Mechanism: After an initial “skimming” pass caches each segment’s hidden states, a second pass revisits each segment. In this pass, each segment’s input is augmented with a representation summarizing the entire document (obtained from the top-layer hidden states cached for every segment during the skimming pass), enabling document-wide bidirectional information flow without an exponential increase in resources.
- Enhanced Recurrence Mechanism: ERNIE-DOC replaces the traditional shift-down recurrence (where layer $n$ of the current segment attends to memory from layer $n-1$ of the previous segment) with a same-layer recurrence, so each Transformer layer reuses its own output for the prior segment as memory:

$$\tilde{h}_{\tau+1}^{\,n-1} = \left[\,\mathrm{SG}\!\left(h_{\tau}^{\,n}\right) \circ h_{\tau+1}^{\,n-1}\,\right],$$

  where $h_{\tau}^{\,n}$ denotes the hidden state of layer $n$ for segment $\tau$, $\mathrm{SG}(\cdot)$ is the stop-gradient operator, and $\circ$ is concatenation along the sequence dimension.
This allows dependencies to propagate across all segments, expanding the effective context length without added architectural complexity or new attention patterns (Ding et al., 2020); a minimal code sketch of the same-layer recurrence follows the comparison table below.
| Transformer Variant | Segment Recurrence | Context Length Scaling |
|---|---|---|
| Standard Transformer (BERT/RoBERTa) | None | 512 tokens (fixed) |
| Recurrence Transformer (Dai et al. 2019) | Cross-layer | N × segment length |
| ERNIE-DOC | Same-layer | Unbounded (practically) |
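The following PyTorch sketch illustrates the same-layer recurrence idea in isolation. The class name `SameLayerRecurrentLayer`, the layer sizes, and the simple concatenation of detached memory into the keys and values are illustrative assumptions, not the official ERNIE-DOC implementation:

```python
import torch
import torch.nn as nn

class SameLayerRecurrentLayer(nn.Module):
    """One Transformer layer whose memory is its OWN output for the previous segment."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, memory=None):
        # Same-layer recurrence: concatenate this layer's (detached) output for the
        # previous segment to the keys/values, unlike Transformer-XL, which reuses
        # the layer below (shift-down recurrence).
        kv = x if memory is None else torch.cat([memory.detach(), x], dim=1)
        x = self.norm1(x + self.attn(x, kv, kv)[0])
        return self.norm2(x + self.ffn(x))

# Process a document segment by segment, carrying same-layer memories forward.
layers = nn.ModuleList(SameLayerRecurrentLayer() for _ in range(2))
segments = torch.randn(4, 128, 768).split(1, dim=0)   # 4 segments of 128 tokens each
memories = [None] * len(layers)
for seg in segments:
    h = seg
    for i, layer in enumerate(layers):
        h = layer(h, memories[i])
        memories[i] = h               # cache layer i's own output for the next segment
```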
3. Pretraining Objectives for Context and Order
ERNIE-DOC employs two primary pretraining objectives:
- Masked Language Modeling (MLM): Standard BERT-style masking and prediction using document-level context post-retrospective feed and recurrence.
- Document-Aware Segment-Reordering Objective: Documents are randomly partitioned and permuted, then presented as sliding-window segments. At the final segment, the model predicts the applied permutation from the [CLS] token through a multi-layer perceptron (MLP), with the loss

$$\mathcal{L}_{\mathrm{SRO}} = -\log P\!\left(z \mid \hat{D}\right),$$

where $z$ is the correct order and $\hat{D}$ is the permuted document. This objective enforces modeling of long-range, cross-segment dependencies and document coherence, which is critical for long document classification (Ding et al., 2020).
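As a simplified illustration of this objective, the sketch below shuffles m segments and trains a permutation classifier on a stand-in [CLS] vector; the choice m = 3, the head architecture, and the random [CLS] tensor are assumptions for illustration only:

```python
import itertools
import random
import torch
import torch.nn as nn

m = 3                                               # segments per document (assumed)
perms = list(itertools.permutations(range(m)))      # m! = 6 candidate orderings
d_model = 768

cls_state = torch.randn(1, d_model)                 # stand-in for the final-segment [CLS] vector
reorder_head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                             nn.Linear(d_model, len(perms)))   # MLP over permutation classes

target_perm = random.choice(perms)                  # the permutation actually applied to the input
target = torch.tensor([perms.index(target_perm)])   # its class index z

logits = reorder_head(cls_state)                    # shape (1, m!)
loss = nn.functional.cross_entropy(logits, target)  # L_SRO = -log P(z | D_hat)
```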
4. Application to Chinese Long-Document Classification
Fine-tuning for Chinese long text employs the same two-pass, segment-based ERNIE-DOC mechanism:
- Datasets: Evaluation was conducted on IFLYTEK (∼17,000 app descriptions, 119 categories, average ≈2,000 Chinese characters) and a THUCNews subset (10 news topics, 5,000 documents per topic) (Ding et al., 2020).
- Segment Aggregation: Documents are split into 128-token segments for both passes. The [CLS] vector of the last segment in the retrospective (second) pass is input into a two-layer MLP with softmax for classification.
- Optimization: Cross-entropy loss for multi-class classification or binary cross-entropy for binary classification. No explicit pooling or hierarchy beyond the [CLS] selection layer.
Fine-tuning hyperparameters typically include 12 layers, hidden size 768, segment length 128, memory length 128, batch size 32–64, a learning-rate schedule with warmup and decay, dropout 0.1, and 3–5 epochs (Ding et al., 2020).
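A minimal fine-tuning sketch under roughly these settings is shown below. `ernie_doc_encoder` is a hypothetical stand-in for the pretrained two-pass encoder (its `reset_memory`/`phase` interface is assumed, not a released API); only the 128-token segmentation and the [CLS]-plus-MLP head mirror the procedure described above:

```python
import torch.nn as nn

def classify_document(token_ids, ernie_doc_encoder, head, seg_len=128):
    """Two-pass, segment-wise classification of one long Chinese document."""
    segments = [token_ids[i:i + seg_len] for i in range(0, len(token_ids), seg_len)]
    ernie_doc_encoder.reset_memory()                       # assumed: clear recurrence memories
    for phase in ("skim", "retrospective"):                # pass 1 caches states, pass 2 rereads them
        for seg in segments:
            hidden = ernie_doc_encoder(seg, phase=phase)   # assumed output shape (1, seg_len, 768)
    cls_vec = hidden[:, 0]                                 # [CLS] of the last segment, second pass
    return head(cls_vec)                                   # logits over topic labels

num_labels = 119                                           # e.g. the IFLYTEK categories
head = nn.Sequential(nn.Linear(768, 768), nn.Tanh(), nn.Linear(768, num_labels))
# Training minimizes cross-entropy on these logits (binary cross-entropy for two-class tasks).
```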
5. Empirical Performance and Benchmarks
In Chinese long-document classification tasks, ERNIE-DOC demonstrates substantial improvements:
| Model | THUCNews Acc. | IFLYTEK Acc. |
|---|---|---|
| BERT-wwm-ext | 97.6% | 60.3% |
| RoBERTa-wwm-ext | 97.6% | 68.5% |
| ERNIE 2.0 | 98.0% | 61.7% |
| ERNIE-DOC (base) | 98.3% | 62.4% |
- On THUCNews, ERNIE-DOC exceeds BERT-wwm-ext by +0.7 percentage points and ERNIE 2.0 by +0.3 points.
- On IFLYTEK, the gain is +2.1 points over BERT-wwm-ext and +0.7 over ERNIE 2.0. These quantitative results indicate that ERNIE-DOC’s document-level context modeling and reordering objective yield consistent accuracy gains for Chinese long-text classification (Ding et al., 2020).
6. Complexity Analysis and Efficiency
- Self-Attention Reduction: ERNIE-DOC processes segments of length $s$ with memory length $m$, so each segment’s attention cost is $O(s(s+m))$ per pass, and the document is read twice. For a document of length $L$ split into $T = L/s$ segments, the total cost is

$$O\!\left(2\,T\,s(s+m)\right) = O\!\left(L(s+m)\right).$$
This is linear in document length, comparable to sparse-attention approaches, while enabling dense document-context integration (Ding et al., 2020).
- Trade-Offs: The retrospective feed requires a second pass over the document but does not increase peak memory, since segments are processed sequentially. The enhanced recurrence lifts the dependency-length ceiling of Recurrence Transformers (proportional to the number of layers times the segment length) without adding parameters or new attention patterns.
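A back-of-the-envelope check of this linear scaling, with assumed values (a 2,048-token document, segment and memory length 128; these are illustrative, not figures reported in the paper):

```python
L, s, m = 2048, 128, 128
full_attention = L * L                     # dense self-attention: ~4.19M attention scores
ernie_doc = 2 * (L // s) * s * (s + m)     # two passes over T = L/s segments: ~1.05M scores
print(full_attention / ernie_doc)          # 4.0 here; the ratio L / (2*(s+m)) grows linearly with L
```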
7. Significance and Prospects
ERNIE-DOC clarifies core architectural and methodological principles necessary for Chinese long document classification, resolving context fragmentation and practical resource limitations endemic to standard Transformer models. Its retrospective mechanism and recurrence unlock document-level bidirectionality, and the segment-reordering objective explicitly teaches the model to capture inter-segment order—features directly reflected in empirical gains. These advances provide a foundation for further scaling, more linguistically sophisticated pretraining tasks, or adaptations to domain-specific Chinese corpora (Ding et al., 2020). A plausible implication is that similar recurrence and retrospective principles could generalize to other languages or document-level sequence modeling tasks where context fragmentation remains a bottleneck.