Classification Analysis Of Authorship Fiction Texts in The Space Of Semantic Fields (1210.5965v1)

Published 22 Oct 2012 in cs.CL

Abstract: The use of the naive Bayesian classifier (NB) and the k-nearest neighbors classifier (kNN) in the semantic classification analysis of authors' texts of English fiction is analysed. The authors' works are considered in a vector space whose basis is formed by the frequency characteristics of semantic fields of nouns and verbs. Highly precise classification of authors' texts in the vector space of semantic fields indicates the presence of particular spheres of the author's idiolect in this space, which characterize the individual author's style.

Summary

The paper introduces semantic field vectors as a low-dimensional basis for capturing distinct authorial styles in fiction texts.
It evaluates Naive Bayes and kNN classifiers using macro-averaged precision and recall metrics across a dataset of 503 works by 17 authors.
The study suggests that hybrid ensemble models could further enhance text classification and broaden applications in multilingual and diverse datasets.

Analyzing Classification Techniques for Authorship Attribution in Semantic Field Vector Spaces

Bohdan Pavlyshenko's paper applies two classifiers, the Naive Bayesian (NB) classifier and the k-Nearest Neighbors (kNN) classifier, to authorship attribution of English fiction texts. Texts are represented in a low-dimensional vector space whose features are the frequency characteristics of semantic fields of nouns and verbs. The aim is to distinguish individual authorial styles, or idiolects, by their semantic properties rather than by high-dimensional keyword frequency vectors.

Methodology and Design

The research is situated within the lexical semantic framework and uses the WordNet semantic network to define semantic fields comprising synonyms and other related lexemes. The core proposition is that the semantic field frequency vector provides a reduced-dimensionality basis for text classification. Each document is represented as a point in this semantic vector space, where it can be assigned to predefined author categories or examined through exploratory text analysis.
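To make the representation concrete, the following is a minimal sketch of how a text might be mapped to a semantic field frequency vector. The `FIELD_OF` lexicon is a hypothetical toy stand-in for a WordNet-derived mapping, the field names are illustrative labels modeled on WordNet lexicographer file names, and normalizing by the number of matched tokens is an assumption; the summary does not specify the paper's exact field definitions or normalization.

```python
from collections import Counter

# Hypothetical toy lexicon mapping lemmas to semantic fields.
# A real lexicon would be derived from WordNet; these entries are illustrative.
FIELD_OF = {
    "dog": "noun.animal",
    "horse": "noun.animal",
    "house": "noun.artifact",
    "run": "verb.motion",
    "walk": "verb.motion",
    "say": "verb.communication",
}

# Fixed ordering of fields; this defines the basis of the vector space.
FIELDS = sorted(set(FIELD_OF.values()))


def field_vector(tokens):
    """Return the relative frequency of each semantic field in a token list.

    Assumption: frequencies are normalized by the number of tokens that
    belong to any known field. Tokens outside the lexicon are ignored.
    """
    counts = Counter(FIELD_OF[t] for t in tokens if t in FIELD_OF)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(FIELDS)
    return [counts[f] / total for f in FIELDS]


tokens = "the dog and the horse run to the house".split()
vec = field_vector(tokens)
for name, value in zip(FIELDS, vec):
    print(f"{name}: {value:.2f}")
```

The vector has one dimension per semantic field rather than one per distinct word, which is the dimensionality reduction the summary refers to.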

The paper evaluates two well-established classifiers:

Naive Bayesian Classifier: Applies Bayes' theorem for probabilistic classification. Its assumption that features are conditionally independent given the class keeps computation tractable, though it may reduce accuracy when features are correlated.
k-Nearest Neighbors (kNN) Classifier: An instance-based classifier that labels a document according to its k closest training samples in the feature space, with closeness measured by Euclidean distance.
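The two classifiers described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the summary does not say which naive Bayes variant is used, so a multinomial model over field counts with Laplace smoothing is assumed here, and the kNN vote is a simple majority. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=5):
    """Majority vote among the k training vectors nearest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def nb_fit(train):
    """Fit a multinomial naive Bayes model (assumed variant) on field counts.

    Returns {author: (log_prior, [log P(field | author), ...])}.
    """
    docs = Counter()
    field_totals = {}
    for counts, author in train:
        docs[author] += 1
        acc = field_totals.setdefault(author, [0] * len(counts))
        for i, c in enumerate(counts):
            acc[i] += c
    n_docs = sum(docs.values())
    model = {}
    for author, acc in field_totals.items():
        denom = sum(acc) + len(acc)  # Laplace (add-one) smoothing
        model[author] = (
            math.log(docs[author] / n_docs),
            [math.log((c + 1) / denom) for c in acc],
        )
    return model


def nb_predict(model, counts):
    """Return the author with the highest posterior log-score."""
    def score(author):
        log_prior, log_lik = model[author]
        return log_prior + sum(c * l for c, l in zip(counts, log_lik))
    return max(model, key=score)


# Toy data: each vector holds counts for three hypothetical semantic fields.
train = [
    ([8, 1, 1], "A"),
    ([7, 2, 1], "A"),
    ([1, 8, 1], "B"),
    ([2, 7, 1], "B"),
]
query = [6, 3, 1]

print(knn_predict(train, query, k=3))
print(nb_predict(nb_fit(train), query))
```

Working in log space avoids numerical underflow when many field probabilities are multiplied together. The conditional independence assumption appears in `score`, where each field contributes its own additive term.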

Experimental Results

The paper reports results on a dataset of 503 literary works by 17 authors. Bayesian classification over all semantic fields yielded macro-averaged precision (Pr) and recall (Rc) values of 0.7066 and 0.6952, respectively. The results varied when the analysis was restricted to nouns only or verbs only, which suggests that using both parts of speech together matters.
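For readers unfamiliar with macro-averaging: precision and recall are computed separately for each author and then averaged with equal weight, so authors with few works count as much as prolific ones. The sketch below shows the calculation on made-up labels; the function name and the convention of scoring an undefined ratio as 0 are assumptions, and the data is not from the paper.

```python
def macro_precision_recall(y_true, y_pred):
    """Compute macro-averaged precision and recall over all classes.

    Assumption: a class with no predictions (or no true members) gets a
    score of 0.0 for the undefined ratio.
    """
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Made-up example with three authors.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "A"]
pr, rc = macro_precision_recall(y_true, y_pred)
print(f"Pr = {pr:.4f}, Rc = {rc:.4f}")
```

In this example author C is never predicted, so C contributes zero to both averages even though it accounts for only one of the six works.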

The kNN classifier with k=5 achieved Pr and Rc of 0.6119 and 0.6045, respectively; with k=1, the values were 0.5748 and 0.6071. Both classifiers' effectiveness varied from author to author, which suggests that an author's particular semantic field profile affects how reliably that author's texts are recognized.

Implications and Future Research

The findings underscore the value of semantic field vectors in text classification, especially in authorship attribution, where capturing nuanced aspects of writing style is pivotal. However, classifier performance varied across authors, which suggests that classification in the semantic field space may benefit from ensemble strategies that combine the strengths of the Bayesian and kNN approaches.

Future research might explore hybrid models that combine semantic field features with traditional syntactic and lexical features, possibly using deep learning architectures for feature extraction. Extending the framework to multilingual datasets and nonfiction texts could further test the versatility and robustness of semantic field-based text classification.

The implications of this research potentially extend beyond authorship attribution to broader applications in natural language understanding and computational linguistics, providing insights into the automated interpretation of style and content in diverse textual forms.