- The paper introduces μTC, a domain-independent and language-independent automated text categorization framework based on hyperparameter optimization.
- μTC frames text classification as a combinatorial optimization problem solved by meta-heuristics that search a configuration space of text transformations, tokenizers, weighting schemes, and a linear SVM.
- Across 30 datasets covering tasks such as topic classification and sentiment analysis, μTC achieves competitive or superior accuracy, attaining the best performance on 20 datasets and competitive results on the remaining 10.
The paper introduces μTC, a domain-independent and language-independent automated text categorization framework based on hyperparameter optimization. The authors posit that many text-related tasks, including topic identification, spam filtering, user profiling, and sentiment analysis, can be framed as supervised learning problems amenable to text classification.
The core idea is to orchestrate simple text transformations, tokenizers, and weighting schemes, together with a Support Vector Machine (SVM) classifier, into an effective text classifier. Building such a classifier is cast as a combinatorial optimization problem: a meta-heuristic searches a space of configurations, each a combination of text transformations, tokenizers, and weighting procedures, for one that yields a highly effective classifier. This model selection procedure is a form of hyperparameter optimization.
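To make this concrete, the sketch below shows how a single configuration fully determines a text classifier. It uses scikit-learn rather than the authors' implementation, and the option names (`lowercase`, `strip_punctuation`, etc.) are hypothetical stand-ins for the paper's actual components:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# One illustrative configuration: a few text transformations, a
# tokenizer choice, and a weighting scheme (names are hypothetical,
# not the paper's exact identifiers).
config = {
    "lowercase": True,
    "strip_punctuation": True,
    "tokenizer": ("char", (3, 3)),   # character 3-grams
    "weighting": "tfidf",            # TF or TF-IDF
}

def build_classifier(config):
    """Instantiate the text classifier described by one configuration."""
    def preprocess(text):
        if config["lowercase"]:
            text = text.lower()
        if config["strip_punctuation"]:
            text = re.sub(r"[^\w\s]", " ", text)
        return text

    analyzer, ngram_range = config["tokenizer"]
    vectorizer = TfidfVectorizer(
        preprocessor=preprocess,
        analyzer=analyzer,
        ngram_range=ngram_range,
        use_idf=(config["weighting"] == "tfidf"),
    )
    return make_pipeline(vectorizer, LinearSVC())
```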
Key components of the μTC framework include:
- A configuration space (T,G,H,Ψ) that defines the possible combinations of text transformation functions (T), tokenizer functions (G), weighting schemes (H), and classifiers (Ψ).
- A μTC graph (C,N), where the vertex set C represents the configuration space, and the edge set N defines the neighborhood of each vertex. The neighborhood function N connects similar configurations, facilitating local search-based meta-heuristics.
- A `score` function that evaluates the performance of the text classifier defined by a specific configuration, using training and test datasets. The `score` function typically employs metrics like F1, accuracy, precision, or recall to measure the classifier's quality.
- An optimization process that navigates the configuration graph (C,N) using a combination of meta-heuristics, such as Random Search and Hill Climbing, to find a near-optimal configuration.
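As a rough sketch (not the authors' code), the two-stage search over the configuration graph might look as follows, with `space` mapping each configuration slot to its option list and `score` standing in for the configuration-evaluation function:

```python
import random

def random_config(space):
    """Draw a uniformly random configuration from the space."""
    return {key: random.choice(options) for key, options in space.items()}

def neighbors(config, space):
    """All configurations at Hamming distance 1 from `config`."""
    for key, options in space.items():
        for option in options:
            if option != config[key]:
                yield {**config, key: option}

def mtc_search(space, score, n_random=32):
    """Sketch of the two-stage optimization: Random Search picks a
    starting point, then Hill Climbing follows improving neighbors
    until a local optimum is reached."""
    best = max((random_config(space) for _ in range(n_random)), key=score)
    best_score = score(best)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best, space):
            cand_score = score(cand)
            if cand_score > best_score:
                best, best_score = cand, cand_score
                improved = True
                break  # greedily restart from the improving neighbor
    return best, best_score
```

In practice each `score` call trains and validates a classifier, so a real implementation would memoize scores to avoid re-evaluating configurations.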
The authors compared μTC against state-of-the-art methods on 30 datasets covering topic and polarity classification, spam detection, user profiling, and authorship attribution. The experimental results show that μTC achieves competitive or superior performance: in terms of accuracy, it attained the best result on 20 datasets and competitive results on the remaining 10.
The configuration space consists of:
- Text transformation functions T = {Tᵢ}. The authors consider functions to handle hashtags (remove, group, identity), numbers (remove, group, identity), URLs (remove, group, identity), user mentions (remove, group, identity), diacritic symbols (remove, identity), duplicated symbols (remove, identity), punctuation (remove, identity), and letter case (lowercase, identity).
- Tokenizer functions G = {Gᵢ}. The authors use word n-grams, character n-grams, and skip-grams.
- Weighting schemes H. Terms are weighted by term frequency (TF) or TF-IDF, optionally followed by a sequential list of `max-filter` and `min-filter` term filters.
- Classifier Ψ. A singleton set containing an SVM with a linear kernel.
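A rough Python sketch of this discrete space (option names paraphrased from the list above; the paper's actual space is larger because tokenizers and filters can be combined) shows how quickly it grows:

```python
from math import prod

# Option sets paraphrased from the paper; each configuration is one
# joint choice across all dimensions.
space = {
    "hashtags":    ["remove", "group", "identity"],
    "numbers":     ["remove", "group", "identity"],
    "urls":        ["remove", "group", "identity"],
    "mentions":    ["remove", "group", "identity"],
    "diacritics":  ["remove", "identity"],
    "duplicates":  ["remove", "identity"],
    "punctuation": ["remove", "identity"],
    "case":        ["lowercase", "identity"],
    "tokenizer":   ["word n-grams", "char n-grams", "skip-grams"],
    "weighting":   ["tf", "tfidf"],
    "classifier":  ["linear-svm"],   # singleton set
}

# 3^4 * 2^4 * 3 * 2 * 1 = 7776 configurations even in this
# simplified rendering of the space.
print(prod(len(options) for options in space.values()))
```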
The authors also investigated the impact of different text normalization stages on μTC's performance. They found that μTC achieves high performance even with raw text input, minimizing the need for sophisticated, language-dependent preprocessing steps. Additionally, the authors explored strategies for mitigating overfitting, such as k-fold cross-validation and binary partition, to ensure the robustness of the model selection process.
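Under the cross-validation safeguard, a `score` function might be sketched as follows, reusing the hypothetical `build_classifier` from the earlier sketch and macro-F1 as the quality measure:

```python
from sklearn.model_selection import cross_val_score

def score(config, texts, labels, k=5):
    """Evaluate one configuration with k-fold cross-validation, one of
    the overfitting safeguards discussed in the paper.  Assumes the
    `build_classifier` sketch defined earlier."""
    clf = build_classifier(config)
    return cross_val_score(clf, texts, labels, cv=k,
                           scoring="f1_macro").mean()
```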
The experiments were run on an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60 GHz with 32 threads and 192 GiB of RAM, running CentOS 7.1 Linux. μTC was implemented in Python.
The Hamming distance over configurations is defined as follows:
$$d_H(u, v) = \sum_{i=1}^{|T| + |G| + 2} \Delta(u_i, v_i)$$
where:
- d_H is the Hamming distance between configurations
- u and v are two configurations, viewed as vectors of component choices
- |T| is the number of text transformation functions and |G| the number of tokenizer functions; the +2 accounts for the weighting scheme and the classifier
- Δ(a, b) = 1 if a and b differ, and 0 otherwise
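In code, assuming configurations are represented as dicts as in the earlier sketches:

```python
def hamming(u, v):
    """d_H(u, v): the number of components in which two configurations
    differ.  u and v are dicts over the same keys: the |T| transformation
    slots, the |G| tokenizer slots, the weighting scheme, and the
    classifier."""
    assert u.keys() == v.keys()
    return sum(u[key] != v[key] for key in u)
```

Under a distance-1 neighborhood, `hamming(a, b) == 1` means a and b are neighbors in the μTC graph, which is exactly the set the `neighbors` sketch above enumerates.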