BertGCN: Transductive Text Classification by Combining GCN and BERT (2105.05727v4)

Published 12 May 2021 in cs.CL

Abstract: In this work, we propose BertGCN, a model that combines large scale pretraining and transductive learning for text classification. BertGCN constructs a heterogeneous graph over the dataset and represents documents as nodes using BERT representations. By jointly training the BERT and GCN modules within BertGCN, the proposed model is able to leverage the advantages of both worlds: large-scale pretraining which takes the advantage of the massive amount of raw data and transductive learning which jointly learns representations for both training data and unlabeled test data by propagating label influence through graph convolution. Experiments show that BertGCN achieves SOTA performances on a wide range of text classification datasets. Code is available at https://github.com/ZeroRin/BertGCN.

Citations (201)

Summary

  • The paper introduces a novel model that fuses BERT's contextual embeddings with GCN's structural learning for transductive text classification.
  • It constructs text graphs (e.g., word-document graphs) to capture relational information and refines node features using a combination of transformer and GCN layers.
  • Empirical evaluations on benchmarks like 20NG, Reuters, and OHSUMED demonstrate the model's improved accuracy over traditional graph and transformer-based approaches.

Based on the title "BertGCN: Transductive Text Classification by Combining GCN and BERT" (BertGCN: Transductive Text Classification by Combining GCN and BERT, 2021), this paper likely introduces a method that integrates the capabilities of BERT (Bidirectional Encoder Representations from Transformers) with Graph Convolutional Networks (GCNs) to perform text classification, specifically in a transductive setting. The provided supplementary material details the datasets and baseline models used for evaluation.

While the core architecture and methodology of BertGCN are not described in the supplementary text, the combination of BERT and GCN suggests an approach that aims to leverage both the powerful contextual word and document representations learned by pre-trained transformers like BERT and the ability of GCNs to exploit structural relationships within the data. In the context of text classification, this structure is often represented as a graph, such as a word-document graph where nodes represent words and documents, and edges indicate co-occurrence or other relationships. A transductive setting implies that the model uses the entire dataset, including the test set (without labels), during the graph construction or message passing phase, allowing it to learn from the structure of the unlabeled data.
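
To make the word-document graph described above concrete, the sketch below builds TF-IDF-weighted document-word edges and positive-PMI word-word edges over sliding windows, the scheme popularized by TextGCN-style models. The function name, the window size, and the exact weighting are illustrative assumptions, not details taken from the BertGCN paper.

```python
# Illustrative TextGCN-style word-document graph; not the paper's exact recipe.
# Nodes 0..n_docs-1 are documents, nodes n_docs.. are vocabulary words.
import numpy as np
import scipy.sparse as sp
from collections import Counter
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer

def build_graph(docs, window=20):
    # Document-word edges weighted by TF-IDF.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)               # (n_docs, n_words), sparse
    n_docs, n_words = tfidf.shape

    # Word-word edges weighted by positive PMI over sliding windows.
    vocab, analyzer = vectorizer.vocabulary_, vectorizer.build_analyzer()
    word_count, pair_count, n_windows = Counter(), Counter(), 0
    for doc in docs:
        ids = [vocab[t] for t in analyzer(doc) if t in vocab]
        for start in range(max(1, len(ids) - window + 1)):
            win = set(ids[start:start + window])
            n_windows += 1
            word_count.update(win)
            pair_count.update(combinations(sorted(win), 2))

    rows, cols, vals = [], [], []
    for (i, j), c_ij in pair_count.items():
        pmi = np.log(c_ij * n_windows / (word_count[i] * word_count[j]))
        if pmi > 0:                                       # keep only positive PMI
            rows += [n_docs + i, n_docs + j]
            cols += [n_docs + j, n_docs + i]
            vals += [pmi, pmi]
    n = n_docs + n_words
    word_word = sp.coo_matrix((vals, (rows, cols)), shape=(n, n))

    # Assemble the heterogeneous adjacency: doc-word blocks, word-word block, self-loops.
    doc_word = sp.bmat([[None, tfidf], [tfidf.T, None]])
    return (doc_word + word_word + sp.eye(n)).tocsr()
```

The resulting matrix is the corpus-level adjacency that the GCN layers propagate information over.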

The supplementary material lists several standard datasets commonly used for text classification benchmarks:

  • 20 Newsgroups (20NG): A well-known dataset for topic classification, split by date into training and test sets (see the loading sketch after this list). Its topical diversity makes it a good test case for classifier generalization.
  • R8 and R52: Subsets of the Reuters dataset, focusing on topic classification with 8 and 52 categories, respectively. These are standard benchmarks for evaluating text classification models on structured news data.
  • OHSUMED: A collection of medical abstracts categorized by disease. This dataset presents a more specialized domain and potentially longer documents, testing the model's ability to handle domain-specific language and content.
  • MR (Movie Review): A dataset specifically for binary sentiment classification. This evaluates the model's performance on sentiment analysis tasks, which often require understanding nuanced language.
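
Of these datasets, 20NG illustrates the standard setup most directly; the minimal sketch below loads it with scikit-learn's built-in by-date split. Pooling train and test documents before graph construction reflects the transductive setting described above; the variable names are illustrative.

```python
# Fetch 20NG with its standard by-date split; the other datasets need their own loaders.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

docs = train.data + test.data            # transductive: the graph covers all documents
labels = list(train.target) + list(test.target)
n_train = len(train.data)                # only the first n_train labels enter the loss
```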

The choice of baselines, also detailed in the supplementary material, situates BertGCN relative to both purely graph-based and purely transformer-based text classification approaches.

Practical Implications & Implementation Considerations (Inferred):

Implementing a model like BertGCN would likely involve several key steps:

  1. Text Preprocessing: Standard steps like tokenization, cleaning, and potentially stop word removal, depending on how the text is fed into the graph construction and the BERT model.
  2. Graph Construction: Building a graph that represents the corpus. Common approaches for text classification include:
    • Word-Document Graph: Nodes are words and documents. Edges represent word occurrences in documents (with weights like TF-IDF) or word-word relationships (like co-occurrence within a sliding window).
    • Document Graph: Nodes are documents. Edges represent document similarity (e.g., cosine similarity of TF-IDF vectors or BERT embeddings) or links based on external metadata if available. The specific structure and edge weighting would be crucial implementation details likely described in the main paper.
  3. Node Feature Initialization:
    • Document nodes might be initialized with embeddings from BERT (e.g., the [CLS] token output).
    • Word nodes might be initialized with pre-trained word embeddings or contextual embeddings aggregated from BERT outputs.
  4. Model Integration: Combining BERT and GCN (a minimal sketch follows this list). This could be done in various ways:
    • Sequential: Use BERT to get initial document/word embeddings, then apply GCN layers on the constructed graph using these embeddings as node features.
    • Parallel/Joint: Process text sequences with BERT and graph structure with GCNs simultaneously, potentially fusing representations at multiple layers or using outputs from both models for final classification.
    • BERT-informed Graph: Use BERT embeddings to inform graph construction or edge weights, then apply GCN.
  5. GCN Layers: Applying graph convolution operations to propagate information across the graph, aggregating features from neighbors. The specific GCN variant (e.g., standard GCN, SGC) and number of layers are hyper-parameters.
  6. Classification Head: A final layer (e.g., a linear layer) that takes the learned document node representations from the GCN (or fused representations) and outputs class probabilities.
  7. Training: Training the combined model end-to-end using a classification loss (e.g., cross-entropy). In a transductive setting, the graph would include both training and test documents, and the loss would be calculated only on the training node labels.
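
To ground steps 3-7, here is a minimal PyTorch sketch of the "sequential" integration option above: document nodes are initialized from BERT's [CLS] output, two GCN layers propagate them over a normalized adjacency, and the cross-entropy loss is masked to training nodes only. For brevity it uses a document-level graph (the "Document Graph" option in step 2) and encodes every document in a single BERT pass; the class names, dimensions, and these simplifications are assumptions for illustration, not the paper's exact BertGCN architecture.

```python
# Minimal sequential BERT -> GCN sketch for transductive document classification.
# Hypothetical names and simplifications; not the official BertGCN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        # adj_norm: symmetrically normalized adjacency over document nodes
        # (a dense tensor here for simplicity).
        return adj_norm @ self.linear(x)

class BertGCNSketch(nn.Module):
    def __init__(self, n_classes, hidden=256, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)       # step 3: feature extractor
        self.gcn1 = GCNLayer(self.bert.config.hidden_size, hidden)
        self.gcn2 = GCNLayer(hidden, n_classes)                # step 6: classification head

    def forward(self, input_ids, attention_mask, adj_norm):
        # Step 3: one feature vector per document node from the [CLS] position.
        # (Real implementations encode documents in mini-batches to fit GPU memory.)
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        # Step 5: propagate features over the graph.
        h = F.relu(self.gcn1(cls, adj_norm))
        return self.gcn2(h, adj_norm)

def train_step(model, optimizer, batch, adj_norm, labels, train_mask):
    # Step 7: transductive training -- the graph holds train and test documents,
    # but the loss only covers the labeled training nodes.
    model.train()
    optimizer.zero_grad()
    logits = model(batch["input_ids"], batch["attention_mask"], adj_norm)
    loss = F.cross_entropy(logits[train_mask], labels[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()
```

Backpropagating through both modules in `train_step` corresponds to the joint training of the BERT and GCN components that the abstract describes, albeit without the batching and caching a full-scale implementation would need.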

Computational Requirements:

Implementing BertGCN would be computationally intensive due to the combination of large models:

  • BERT: Requires significant GPU memory and computation for forward/backward passes.
  • Graph Construction: Building and storing the graph, especially for large datasets like 20NG, can require substantial memory, particularly for word-document graphs. Even stored as a sparse adjacency matrix or adjacency list, the graph can still be large.
  • GCN: Graph convolution operations involve matrix multiplications with the adjacency matrix (or a normalized version), which can be computationally expensive, especially for large graphs with many nodes and edges.

Optimizations like using sparse matrix operations for the adjacency matrix and leveraging parallel processing on GPUs are essential. The size and density of the graph significantly impact the GCN computation cost.
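
As one concrete example of the sparse-matrix optimization mentioned above, the symmetric GCN normalization D^{-1/2}(A + I)D^{-1/2} can be computed entirely in SciPy's sparse formats and handed to PyTorch as a sparse tensor, so the full adjacency is never densified. This is a generic sketch, assuming self-loops were already added during graph construction.

```python
# Keep the adjacency sparse end to end: normalize in SciPy, multiply in PyTorch.
import numpy as np
import scipy.sparse as sp
import torch

def normalize_adj(adj):
    """Symmetric normalization of a scipy.sparse adjacency that already has self-loops."""
    deg = np.asarray(adj.sum(axis=1)).flatten()
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(deg, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
    d_mat = sp.diags(d_inv_sqrt)
    return (d_mat @ adj @ d_mat).tocoo()

def to_torch_sparse(adj_coo):
    indices = torch.tensor(np.vstack((adj_coo.row, adj_coo.col)), dtype=torch.long)
    values = torch.tensor(adj_coo.data, dtype=torch.float32)
    return torch.sparse_coo_tensor(indices, values, adj_coo.shape).coalesce()

# Usage: adj_norm = to_torch_sparse(normalize_adj(adj))
#        h_next   = torch.sparse.mm(adj_norm, h)   # sparse-dense product inside a GCN layer
```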

In summary, BertGCN (BertGCN: Transductive Text Classification by Combining GCN and BERT, 2021), as suggested by its title and the supplementary material, is likely a model that enhances text classification by integrating the powerful feature extraction of BERT with the relational modeling capabilities of GCNs on a corpus-level graph structure, particularly for transductive learning tasks. Its practical implementation involves careful graph construction, feature initialization, model integration strategy, and consideration of the significant computational resources required.
