- The paper introduces seq-CNN and bow-CNN, models that exploit word order by applying convolution directly to high-dimensional one-hot text data for effective categorization.
- The seq-CNN model reduced the IMDB sentiment classification error to 8.04%, further improved to 7.67% by the hybrid seq2-bown-CNN model.
- The bow-CNN demonstrated superior performance in topic categorization on RCV1, achieving micro-F1 of 84.0 and macro-F1 of 64.8 in multi-label settings.
Effective Use of Word Order for Text Categorization with Convolutional Neural Networks
Introduction
The application of Convolutional Neural Networks (CNNs) to text categorization marks a shift away from traditional methods built on bag-of-words (BoW) representations. This paper investigates how effectively CNNs can exploit the inherent 1D structure of text data, specifically word order, for accurate categorization. Unlike the conventional approach of feeding low-dimensional word vectors into a neural network, the proposed method applies convolution directly to high-dimensional one-hot encoded text, learning embeddings of small text regions that feed into the final classifier.
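To make the input representation concrete, here is a minimal sketch (not the authors' code) of a document encoded as a sequence of one-hot vectors; the toy vocabulary and words are hypothetical.

```python
import numpy as np

# Toy vocabulary (hypothetical); the paper's vocabularies hold tens of
# thousands of words.
vocab = {"i": 0, "love": 1, "this": 2, "movie": 3, "hate": 4}

def one_hot_document(tokens, vocab):
    """Encode a document as a |doc| x |vocab| matrix of one-hot rows,
    so each word plays the role a pixel plays in image CNNs."""
    X = np.zeros((len(tokens), len(vocab)))
    for t, word in enumerate(tokens):
        X[t, vocab[word]] = 1.0
    return X

X = one_hot_document(["i", "love", "this", "movie"], vocab)
print(X.shape)  # (4, 5): four words, each a 5-dimensional one-hot vector
```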
Methodology
The authors introduce two CNN architectures tailored to text: seq-CNN and bow-CNN. The seq-CNN adapts the CNN framework from images to text by treating each word in a document as a pixel: it concatenates the one-hot vectors of consecutive words, so word order within each region is preserved. The bow-CNN instead represents each text region by a bag-of-words vector rather than a sequence, reducing the region dimensionality from p·|V| to |V| (for region size p and vocabulary V) and potentially improving computational efficiency.
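A minimal sketch of the two region representations, continuing the toy 5-word vocabulary above; the function names are illustrative, not from the paper's implementation.

```python
import numpy as np

V = 5                        # toy vocabulary size (see the sketch above)
X = np.eye(V)[[0, 1, 2, 3]]  # one-hot rows for a 4-word document

def seq_region(X, start, p):
    """seq-CNN region: concatenate p consecutive one-hot vectors,
    preserving word order (dimension p * |V|)."""
    return X[start:start + p].reshape(-1)

def bow_region(X, start, p):
    """bow-CNN region: sum the same p one-hot vectors into a bag,
    discarding order within the region (dimension |V|)."""
    return X[start:start + p].sum(axis=0)

print(seq_region(X, 0, 2).shape)  # (10,) -- order-sensitive
print(bow_region(X, 0, 2).shape)  # (5,)  -- order-insensitive
```

With region size 2, seq_region distinguishes "love this" from "this love", while bow_region maps both to the same vector.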
The paper also explores combining multiple convolution layers in parallel (parallel CNN), each learning a different type of embedding, to further improve classification accuracy. This architecture lets the network learn from and exploit several region sizes or region representations concurrently.
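A schematic PyTorch sketch of such a parallel architecture, with two convolution branches of different region sizes whose pooled outputs are concatenated; all layer sizes here are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class ParallelSeqCNN(nn.Module):
    """Two seq-convolution branches with different region sizes run in
    parallel over the one-hot input; their max-pooled outputs are
    concatenated before the classification layer."""
    def __init__(self, vocab_size, num_classes, num_maps=100):
        super().__init__()
        self.branch2 = nn.Conv1d(vocab_size, num_maps, kernel_size=2)  # region size 2
        self.branch3 = nn.Conv1d(vocab_size, num_maps, kernel_size=3)  # region size 3
        self.fc = nn.Linear(2 * num_maps, num_classes)

    def forward(self, x):  # x: (batch, vocab_size, doc_len), one-hot columns
        h2 = torch.relu(self.branch2(x)).max(dim=2).values  # max-pool over regions
        h3 = torch.relu(self.branch3(x)).max(dim=2).values
        return self.fc(torch.cat([h2, h3], dim=1))

# Illustrative usage with a small vocabulary; real implementations exploit
# the sparsity of one-hot input rather than materializing dense tensors.
model = ParallelSeqCNN(vocab_size=1000, num_classes=2)
x = torch.zeros(1, 1000, 50)  # a 50-word document
print(model(x).shape)         # torch.Size([1, 2])
```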
Experimental Setup
The experiments were conducted on three datasets: IMDB (movie reviews), Elec (electronics product reviews), and RCV1 (Reuters news articles), the last used for both single-label and multi-label topic categorization. The CNN models were compared against traditional baselines such as support vector machines (SVMs) over bag-of-n-gram vectors and fully-connected neural networks, as well as stronger baselines including NB-LM and earlier text CNNs built on pre-trained word vectors.
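For reference, a baseline of the bag-of-n-gram SVM kind can be sketched with scikit-learn; binary uni/bi/tri-gram features are a simplifying assumption (the paper's exact feature weighting may differ), and the data variables are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Bag-of-n-gram SVM baseline of the kind the CNNs are compared against.
baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), binary=True),  # uni-, bi-, tri-grams
    LinearSVC(C=1.0),
)

# `train_texts`, `train_labels`, etc. are placeholders for a corpus like IMDB:
# baseline.fit(train_texts, train_labels)
# error_rate = 1.0 - baseline.score(test_texts, test_labels)
```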
Results and Analysis
Sentiment Classification
For sentiment classification on the IMDB and Elec datasets, seq-CNN consistently outperformed bow-CNN and the SVM-based methods; the key configuration used region size 3 with max-pooling over regions. Specifically, seq-CNN achieved an error rate of 8.04% on IMDB, surpassing the 8.13% of NB-LM, and the hybrid seq2-bown-CNN model lowered the error further to 7.67%. This indicates the model's ability to exploit small, sentiment-bearing text segments for robust predictions.
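The role of max-pooling can be shown with toy numbers: each convolutional feature keeps only its strongest response across all regions, so a sentiment-bearing trigram influences the document vector wherever it occurs. The values below are invented for illustration.

```python
import numpy as np

# Illustrative region responses for two convolutional features across
# four regions of a document.
region_responses = np.array([
    [0.1, 0.0, 0.9, 0.2],  # feature 1 fires strongly on region 3
    [0.0, 0.7, 0.1, 0.0],  # feature 2 fires on region 2
])
doc_vector = region_responses.max(axis=1)
print(doc_vector)  # [0.9 0.7] -- strongest response per feature, position-free
```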
Topic Categorization
In single-label topic categorization on RCV1, bow-CNN performed best, with error rates below both seq-CNN and the SVM baselines. This suggests that for topic classification it pays to embed larger text regions reflecting topical context rather than small sequential ones. In the multi-label setting, bow-CNN achieved micro-F1 of 84.0 and macro-F1 of 64.8, surpassing the best results reported in the LYRL04 benchmark (Lewis et al., 2004).
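Micro- and macro-averaged F1 for multi-label output can be computed with scikit-learn; the toy label matrices below are illustrative only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label ground truth and predictions (3 documents x 3 topics).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Micro-F1 pools decisions over all (document, topic) pairs;
# macro-F1 averages the per-topic F1 scores.
print(f1_score(y_true, y_pred, average="micro"))  # 0.75
print(f1_score(y_true, y_pred, average="macro"))  # ~0.556
```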
Efficiency
Training time varied considerably with GPU hardware and model complexity, but was generally modest: most models trained within a few hours, making the approach practical for real-world use.
Implications and Future Directions
The results substantiate that CNNs, when properly adapted to high-dimensional textual input, can significantly improve text categorization. By directly embedding high-dimensional one-hot vectors and applying well-chosen pooling strategies, CNNs capture word-order dependencies that traditional BoW methods ignore. The parallel architecture's ability to combine several types of embedding further strengthens the model, yielding state-of-the-art performance on several benchmark datasets.
Future research could tune CNN configurations, explore deeper multi-layer architectures, and extend the approach to other NLP tasks. Additionally, integrating semi-supervised learning, or transfer from large pre-trained models such as BERT or GPT, could further improve performance and generalization.
In conclusion, the effective use of CNNs in leveraging word order for text categorization, as proposed by Johnson and Zhang, marks a notable advancement in NLP, providing a new benchmark for text classification methodologies.