- The paper introduces μTC, a domain-independent and language-independent automated text categorization framework based on hyperparameter optimization.
- μTC frames text classification as a combinatorial optimization problem solved by meta-heuristics that search a configuration space of text transformations, tokenizers, weighting schemes, and a linear SVM.
- Across 30 datasets covering tasks such as topic classification and sentiment analysis, μTC achieves competitive or superior accuracy, attaining the best performance on 20 datasets and competitive results on the remaining 10.
The paper introduces μTC, a domain-independent and language-independent automated text categorization framework based on hyperparameter optimization. The authors posit that many text-related tasks, including topic identification, spam filtering, user profiling, and sentiment analysis, can be framed as supervised learning problems amenable to text classification.
The core idea is to orchestrate simple text transformations, tokenizers, and weighting schemes, together with a Support Vector Machine (SVM) classifier, into an effective text classifier. Building such a classifier is cast as a combinatorial optimization problem: a meta-heuristic searches a space of configurations, each a combination of text transformations, tokenizers, and weighting procedures, for one that yields a highly effective classifier. This model selection procedure is a form of hyperparameter optimization.
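To make this concrete, the sketch below shows how a single configuration fully determines a text classifier. It uses scikit-learn rather than the authors' implementation, and the option names (`lowercase`, `strip_punctuation`, etc.) are hypothetical stand-ins for the paper's actual components:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# One illustrative configuration: a few text transformations, a
# tokenizer choice, and a weighting scheme (names are hypothetical,
# not the paper's exact identifiers).
config = {
    "lowercase": True,
    "strip_punctuation": True,
    "tokenizer": ("char", (3, 3)),   # character 3-grams
    "weighting": "tfidf",            # TF or TF-IDF
}

def build_classifier(config):
    """Instantiate the text classifier described by one configuration."""
    def preprocess(text):
        if config["lowercase"]:
            text = text.lower()
        if config["strip_punctuation"]:
            text = re.sub(r"[^\w\s]", " ", text)
        return text

    analyzer, ngram_range = config["tokenizer"]
    vectorizer = TfidfVectorizer(
        preprocessor=preprocess,
        analyzer=analyzer,
        ngram_range=ngram_range,
        use_idf=(config["weighting"] == "tfidf"),
    )
    return make_pipeline(vectorizer, LinearSVC())
```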
Key components of the μTC framework include:
- A configuration space (T,G,H,Ψ) that defines the possible combinations of text transformation functions (T), tokenizer functions (G), weighting schemes (H), and classifiers (Ψ).
- A μTC graph (C,N), where the vertex set C represents the configuration space, and the edge set N defines the neighborhood of each vertex. The neighborhood function N connects similar configurations, facilitating local search-based meta-heuristics.
- A `score` function that evaluates the performance of the text classifier defined by a specific configuration, using training and test datasets. The `score` function typically employs metrics like F1, accuracy, precision, or recall to measure the classifier's quality.
- An optimization process that navigates the configuration graph (C,N) using a combination of meta-heuristics, such as Random Search and Hill Climbing, to find a near-optimal configuration.
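As a rough sketch (not the authors' code), the two-stage search over the configuration graph might look as follows, with `space` mapping each configuration slot to its option list and `score` standing in for the configuration-evaluation function:

```python
import random

def random_config(space):
    """Draw a uniformly random configuration from the space."""
    return {key: random.choice(options) for key, options in space.items()}

def neighbors(config, space):
    """All configurations at Hamming distance 1 from `config`."""
    for key, options in space.items():
        for option in options:
            if option != config[key]:
                yield {**config, key: option}

def mtc_search(space, score, n_random=32):
    """Sketch of the two-stage optimization: Random Search picks a
    starting point, then Hill Climbing follows improving neighbors
    until a local optimum is reached."""
    best = max((random_config(space) for _ in range(n_random)), key=score)
    best_score = score(best)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best, space):
            cand_score = score(cand)
            if cand_score > best_score:
                best, best_score = cand, cand_score
                improved = True
                break  # greedily restart from the improving neighbor
    return best, best_score
```

In practice each `score` call trains and validates a classifier, so a real implementation would memoize scores to avoid re-evaluating configurations.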
The authors compared μTC against state-of-the-art methods on 30 datasets covering topic and polarity classification, spam detection, user profiling, and authorship attribution. The experimental results show that μTC achieves competitive or superior performance: in terms of accuracy, it attained the best result on 20 datasets and competitive results on the remaining 10.
The configuration space consists of:
- Text transformation functions T = {Tᵢ}. The authors consider functions to handle hashtags (remove, group, identity), numbers (remove, group, identity), URLs (remove, group, identity), user mentions (remove, group, identity), diacritic symbols (remove, identity), duplicated symbols (remove, identity), punctuation (remove, identity), and letter case (lowercase, identity).
- Tokenizer functions G = {Gᵢ}. The authors use word n-grams, character n-grams, and skip-grams.
- Weighting schemes H. Terms are weighted by term frequency (TF) or TF-IDF, optionally followed by a sequential list of `max-filter` and `min-filter` term filters.
- Classifier Ψ. A singleton set containing an SVM with a linear kernel.
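A rough Python sketch of this discrete space (option names paraphrased from the list above; the paper's actual space is larger because tokenizers and filters can be combined) shows how quickly it grows:

```python
from math import prod

# Option sets paraphrased from the paper; each configuration is one
# joint choice across all dimensions.
space = {
    "hashtags":    ["remove", "group", "identity"],
    "numbers":     ["remove", "group", "identity"],
    "urls":        ["remove", "group", "identity"],
    "mentions":    ["remove", "group", "identity"],
    "diacritics":  ["remove", "identity"],
    "duplicates":  ["remove", "identity"],
    "punctuation": ["remove", "identity"],
    "case":        ["lowercase", "identity"],
    "tokenizer":   ["word n-grams", "char n-grams", "skip-grams"],
    "weighting":   ["tf", "tfidf"],
    "classifier":  ["linear-svm"],   # singleton set
}

# 3^4 * 2^4 * 3 * 2 * 1 = 7776 configurations even in this
# simplified rendering of the space.
print(prod(len(options) for options in space.values()))
```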
The authors also investigated the impact of different text normalization stages on μTC's performance. They found that μTC achieves high performance even with raw text input, minimizing the need for sophisticated, language-dependent preprocessing steps. Additionally, the authors explored strategies for mitigating overfitting, such as k-fold cross-validation and binary partition, to ensure the robustness of the model selection process.
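Under the cross-validation safeguard, a `score` function might be sketched as follows, reusing the hypothetical `build_classifier` from the earlier sketch and macro-F1 as the quality measure:

```python
from sklearn.model_selection import cross_val_score

def score(config, texts, labels, k=5):
    """Evaluate one configuration with k-fold cross-validation, one of
    the overfitting safeguards discussed in the paper.  Assumes the
    `build_classifier` sketch defined earlier."""
    clf = build_classifier(config)
    return cross_val_score(clf, texts, labels, cv=k,
                           scoring="f1_macro").mean()
```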
The experiments were run on an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60 GHz with 32 threads and 192 GiB of RAM, running CentOS 7.1 Linux. μTC was implemented in Python.
The Hamming distance over configurations is defined as follows:
$$d_H(u, v) = \sum_{i=1}^{|T| + |G| + 2} \Delta(u_i, v_i)$$
where:
- d_H is the Hamming distance between configurations
- u and v are two configurations, viewed as vectors of component choices
- |T| is the number of text transformation functions and |G| the number of tokenizer functions; the +2 accounts for the weighting scheme and the classifier
- Δ(a, b) = 1 if a and b differ, and 0 otherwise
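In code, assuming configurations are represented as dicts as in the earlier sketches:

```python
def hamming(u, v):
    """d_H(u, v): the number of components in which two configurations
    differ.  u and v are dicts over the same keys: the |T| transformation
    slots, the |G| tokenizer slots, the weighting scheme, and the
    classifier."""
    assert u.keys() == v.keys()
    return sum(u[key] != v[key] for key in u)
```

Under a distance-1 neighborhood, `hamming(a, b) == 1` means a and b are neighbors in the μTC graph, which is exactly the set the `neighbors` sketch above enumerates.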