Chinese Long Document Classification with ERNIE-DOC
- Chinese long document classification is the task of automatically categorizing extended Chinese texts, such as news articles and app descriptions, into predefined topics.
- ERNIE-DOC overcomes Transformer limitations by employing a retrospective feed and same-layer recurrence to integrate document-level context without excessive memory use.
- Empirical results on THUCNews and IFLYTEK datasets demonstrate that ERNIE-DOC achieves superior accuracy compared to traditional models in long text classification.
Chinese long document classification refers to the automatic categorization of lengthy Chinese-language textual documents—such as news articles, app descriptions, or other extended discourse—into predefined topic or subject labels using advanced deep learning architectures. Traditional Transformers, while state-of-the-art for short- to medium-length sequences, are ill-suited for documents exceeding several hundred tokens due to quadratic time and memory complexity as well as the context fragmentation problem. An effective solution is provided by ERNIE-DOC, a retrospective long-document modeling Transformer that enables document-level context integration and efficient end-to-end learning for Chinese long document classification, outperforming conventional models across multiple benchmarks (Ding et al., 2020).
1. Limitations of Standard Transformers in Long Document Modeling
Standard bidirectional Transformers, including BERT and RoBERTa, operate under a token-length constraint (typically 512), processing longer texts by truncation or independent segmentation. When a document is divided into non-overlapping chunks, each chunk is modeled independently, causing context fragmentation—critical inter-segment dependencies are missed and global semantics are lost. Furthermore, the cost of full self-attention on a document of length $L$ is $O(L^2)$ in both time and memory, rendering such models impractical for long documents encountered in Chinese NLP applications (Ding et al., 2020).
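To make these two limitations concrete, here is a minimal Python sketch (using stand-in token IDs, not code from the paper) that contrasts the quadratic cost of full self-attention with independent 512-token chunks, and notes what chunking gives up:

```python
def chunk(tokens, window=512):
    """Split a tokenized document into independent, non-overlapping chunks."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

doc = list(range(2000))                            # stand-in token IDs for a ~2,000-token document
chunks = chunk(doc)                                # 4 chunks, each encoded in isolation

full_cost = len(doc) ** 2                          # full self-attention: O(L^2) = 4,000,000 scores
chunked_cost = sum(len(c) ** 2 for c in chunks)    # 3 * 512^2 + 464^2 ≈ 1.0M scores

print(len(chunks), full_cost, chunked_cost)
# Chunking is cheaper, but a token in chunk 0 can never attend to one in chunk 3:
# exactly the context fragmentation that ERNIE-DOC is designed to remove.
```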
2. ERNIE-DOC: Model Architecture and Mechanisms
ERNIE-DOC builds upon Recurrence Transformers by introducing two core enhancements specifically tailored for long-document context integration:
- Retrospective Feed Mechanism: After an initial “skimming” pass caches each segment’s hidden states, a second pass revisits each segment. In this pass, each segment’s input is augmented with a representation summarizing the entire document (obtained from the top-layer hidden states cached for every segment during the skimming pass), enabling document-wide bidirectional information flow without an exponential increase in resources.
- Enhanced Recurrence Mechanism: ERNIE-DOC replaces the traditional shift-down recurrence (where layer $n$ of the current segment attends to memory from layer $n-1$ of the previous segment) with a same-layer recurrence, so each Transformer layer reuses its own output for the prior segment as memory:

$$\tilde{h}_{\tau+1}^{\,n-1} = \left[\,\mathrm{SG}\!\left(h_{\tau}^{\,n}\right) \circ h_{\tau+1}^{\,n-1}\,\right],$$

  where $h_{\tau}^{\,n}$ denotes the hidden state of layer $n$ for segment $\tau$, $\mathrm{SG}(\cdot)$ is the stop-gradient operator, and $\circ$ is concatenation along the sequence dimension.
This allows dependencies to propagate across all segments, expanding the effective context length without added architectural complexity or new attention patterns (Ding et al., 2020); a minimal code sketch of the same-layer recurrence follows the comparison table below.
| Transformer Variant | Segment Recurrence | Context Length Scaling |
|---|---|---|
| Standard Transformer (BERT/RoBERTa) | None | 512 tokens (fixed) |
| Recurrence Transformer (Dai et al. 2019) | Cross-layer | N × segment length |
| ERNIE-DOC | Same-layer | Unbounded (practically) |
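The following PyTorch sketch illustrates the same-layer recurrence idea in isolation. The class name `SameLayerRecurrentLayer`, the layer sizes, and the simple concatenation of detached memory into the keys and values are illustrative assumptions, not the official ERNIE-DOC implementation:

```python
import torch
import torch.nn as nn

class SameLayerRecurrentLayer(nn.Module):
    """One Transformer layer whose memory is its OWN output for the previous segment."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, memory=None):
        # Same-layer recurrence: concatenate this layer's (detached) output for the
        # previous segment to the keys/values, unlike Transformer-XL, which reuses
        # the layer below (shift-down recurrence).
        kv = x if memory is None else torch.cat([memory.detach(), x], dim=1)
        x = self.norm1(x + self.attn(x, kv, kv)[0])
        return self.norm2(x + self.ffn(x))

# Process a document segment by segment, carrying same-layer memories forward.
layers = nn.ModuleList(SameLayerRecurrentLayer() for _ in range(2))
segments = torch.randn(4, 128, 768).split(1, dim=0)   # 4 segments of 128 tokens each
memories = [None] * len(layers)
for seg in segments:
    h = seg
    for i, layer in enumerate(layers):
        h = layer(h, memories[i])
        memories[i] = h               # cache layer i's own output for the next segment
```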
3. Pretraining Objectives for Context and Order
ERNIE-DOC employs two primary pretraining objectives:
- Masked Language Modeling (MLM): Standard BERT-style masking and prediction using document-level context post-retrospective feed and recurrence.
- Document-Aware Segment-Reordering Objective: Documents are randomly partitioned and permuted, then presented as sliding-window segments. At the final segment, the model predicts the applied permutation from the [CLS] token through a multi-layer perceptron (MLP), with the loss

$$\mathcal{L}_{\mathrm{SRO}} = -\log P\!\left(z \mid \hat{D}\right),$$

where $z$ is the correct order and $\hat{D}$ is the permuted document. This objective enforces modeling of long-range, cross-segment dependencies and document coherence, which is critical for long document classification (Ding et al., 2020).
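As a simplified illustration of this objective, the sketch below shuffles m segments and trains a permutation classifier on a stand-in [CLS] vector; the choice m = 3, the head architecture, and the random [CLS] tensor are assumptions for illustration only:

```python
import itertools
import random
import torch
import torch.nn as nn

m = 3                                               # segments per document (assumed)
perms = list(itertools.permutations(range(m)))      # m! = 6 candidate orderings
d_model = 768

cls_state = torch.randn(1, d_model)                 # stand-in for the final-segment [CLS] vector
reorder_head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                             nn.Linear(d_model, len(perms)))   # MLP over permutation classes

target_perm = random.choice(perms)                  # the permutation actually applied to the input
target = torch.tensor([perms.index(target_perm)])   # its class index z

logits = reorder_head(cls_state)                    # shape (1, m!)
loss = nn.functional.cross_entropy(logits, target)  # L_SRO = -log P(z | D_hat)
```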
4. Application to Chinese Long-Document Classification
Fine-tuning for Chinese long text employs the same two-pass, segment-based ERNIE-DOC mechanism:
- Datasets: Evaluation was conducted on IFLYTEK (∼17,000 app descriptions, 119 categories, average ≈2,000 Chinese characters) and a THUCNews subset (10 news topics, 5,000 documents per topic) (Ding et al., 2020).
- Segment Aggregation: Documents are split into 128-token segments for both passes. The [CLS] vector of the last segment in the retrospective (second) pass is input into a two-layer MLP with softmax for classification.
- Optimization: Cross-entropy loss for multi-class classification or binary cross-entropy for binary classification. No explicit pooling or hierarchy beyond the [CLS] selection layer.
Fine-tuning hyperparameters typically include 12 layers, hidden size 768, segment length 128, memory length 128, batch size 32–64, a learning-rate schedule with warmup and decay, dropout 0.1, and 3–5 epochs (Ding et al., 2020).
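A minimal fine-tuning sketch under roughly these settings is shown below. `ernie_doc_encoder` is a hypothetical stand-in for the pretrained two-pass encoder (its `reset_memory`/`phase` interface is assumed, not a released API); only the 128-token segmentation and the [CLS]-plus-MLP head mirror the procedure described above:

```python
import torch.nn as nn

def classify_document(token_ids, ernie_doc_encoder, head, seg_len=128):
    """Two-pass, segment-wise classification of one long Chinese document."""
    segments = [token_ids[i:i + seg_len] for i in range(0, len(token_ids), seg_len)]
    ernie_doc_encoder.reset_memory()                       # assumed: clear recurrence memories
    for phase in ("skim", "retrospective"):                # pass 1 caches states, pass 2 rereads them
        for seg in segments:
            hidden = ernie_doc_encoder(seg, phase=phase)   # assumed output shape (1, seg_len, 768)
    cls_vec = hidden[:, 0]                                 # [CLS] of the last segment, second pass
    return head(cls_vec)                                   # logits over topic labels

num_labels = 119                                           # e.g. the IFLYTEK categories
head = nn.Sequential(nn.Linear(768, 768), nn.Tanh(), nn.Linear(768, num_labels))
# Training minimizes cross-entropy on these logits (binary cross-entropy for two-class tasks).
```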
5. Empirical Performance and Benchmarks
In Chinese long-document classification tasks, ERNIE-DOC demonstrates substantial improvements:
| Model | THUCNews Acc. | IFLYTEK Acc. |
|---|---|---|
| BERT-wwm-ext | 97.6% | 60.3% |
| RoBERTa-wwm-ext | 97.6% | 68.5% |
| ERNIE 2.0 | 98.0% | 61.7% |
| ERNIE-DOC (base) | 98.3% | 62.4% |
- On THUCNews, ERNIE-DOC exceeds BERT-wwm-ext by +0.7 percentage points and ERNIE 2.0 by +0.3 points.
- On IFLYTEK, the gain is +2.1 points over BERT-wwm-ext and +0.7 over ERNIE 2.0. These quantitative results indicate that ERNIE-DOC’s document-level context modeling and reordering objective yield consistent accuracy gains for Chinese long-text classification (Ding et al., 2020).
6. Complexity Analysis and Efficiency
- Self-Attention Reduction: ERNIE-DOC processes segments of length $s$ with memory length $m$, so each segment’s attention cost is $O(s(s+m))$ per pass, and the document is read twice. For a document of length $L$ split into $T = L/s$ segments, the total cost is

$$O\!\left(2\,T\,s(s+m)\right) = O\!\left(L(s+m)\right).$$
This is linear in document length, comparable to sparse-attention approaches, while enabling dense document-context integration (Ding et al., 2020).
- Trade-Offs: The retrospective feed requires a second pass over the document but does not increase peak memory, since segments are processed sequentially. The enhanced recurrence lifts the dependency-length ceiling of Recurrence Transformers (proportional to the number of layers times the segment length) without adding parameters or new attention patterns.
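A back-of-the-envelope check of this linear scaling, with assumed values (a 2,048-token document, segment and memory length 128; these are illustrative, not figures reported in the paper):

```python
L, s, m = 2048, 128, 128
full_attention = L * L                     # dense self-attention: ~4.19M attention scores
ernie_doc = 2 * (L // s) * s * (s + m)     # two passes over T = L/s segments: ~1.05M scores
print(full_attention / ernie_doc)          # 4.0 here; the ratio L / (2*(s+m)) grows linearly with L
```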
7. Significance and Prospects
ERNIE-DOC clarifies core architectural and methodological principles necessary for Chinese long document classification, resolving context fragmentation and practical resource limitations endemic to standard Transformer models. Its retrospective mechanism and recurrence unlock document-level bidirectionality, and the segment-reordering objective explicitly teaches the model to capture inter-segment order—features directly reflected in empirical gains. These advances provide a foundation for further scaling, more linguistically sophisticated pretraining tasks, or adaptations to domain-specific Chinese corpora (Ding et al., 2020). A plausible implication is that similar recurrence and retrospective principles could generalize to other languages or document-level sequence modeling tasks where context fragmentation remains a bottleneck.