Text Classification For Authorship Attribution Analysis (1310.4909v1)

Published 18 Oct 2013 in cs.DL, cs.CL, and cs.LG

Abstract: Authorship attribution mainly deals with the undecided authorship of literary texts. It is useful for resolving uncertain authorship, recognizing the authorship of unknown texts, spotting plagiarism, and so on. Statistical methods can be used to characterize an author's style numerically. The basic measures used in computational stylometry are word length, sentence length, vocabulary richness, frequencies, etc. Each author has an inborn style of writing that is particular to that author. The problem can be broken down into three subproblems: author identification, author characterization, and similarity detection. The steps involved are preprocessing, feature extraction, classification, and author identification. Different classifiers can be used for this; here, a fuzzy learning classifier and an SVM are used. In author identification, the SVM was found to be more accurate than the fuzzy classifier. The two classifiers were then combined to obtain better accuracy than either the SVM or the fuzzy classifier alone.

Citations (22)

Summary

The paper proposes using text classification techniques, specifically SVM and fuzzy learning classifiers, to address the problem of authorship attribution.
The study found that combining the SVM and fuzzy learning classifiers achieved a higher accuracy of 76% for authorship identification compared to 70% for the SVM alone.
This research demonstrates that combining different machine learning approaches can enhance the accuracy of authorship attribution, offering a practical method for tasks like plagiarism detection.

The paper "Text Classification for Authorship Attribution Analysis" addresses the problem of authorship attribution using machine learning techniques. The authors frame the problem as identifying the most likely author of a text given a set of candidate authors, which is relevant to plagiarism detection and resolving disputed authorship. They decompose the problem into author identification, author characterization, and similarity detection.

The methodology involves several stages:

Preprocessing: Tokenization and stemming are applied to convert the text into a suitable format for analysis.
Feature Extraction: Several features are extracted, including top-k frequent words, counts of punctuation marks and symbols, character count, sentence count, word count, and the ratio of character count to sentence count.
Classification: Fuzzy learning classifiers and Support Vector Machines (SVM) are used for classification.
Author Identification: The classifiers are used to identify the author of a given text.
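
The feature extraction stage listed above can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: the function name `extract_features`, the regular expressions, and the choice to skip stemming are assumptions made for illustration only.

```python
import re
import string
from collections import Counter


def extract_features(text, k=3):
    """Compute the stylometric features named in the summary for one text.

    Hypothetical helper: tokenization and sentence splitting are deliberately
    naive, and stemming is omitted.
    """
    # Words: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Sentences: split on runs of '.', '!' or '?', dropping empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Punctuation marks and symbols: every ASCII punctuation character.
    punctuation_count = sum(1 for ch in text if ch in string.punctuation)
    char_count = len(text)
    return {
        "top_k_words": [w for w, _ in Counter(words).most_common(k)],
        "punctuation_count": punctuation_count,
        "char_count": char_count,
        "sentence_count": len(sentences),
        "word_count": len(words),
        # Ratio of character count to sentence count (0.0 for empty input).
        "chars_per_sentence": char_count / len(sentences) if sentences else 0.0,
    }


if __name__ == "__main__":
    sample = "The cat sat. The cat ran! Did the dog run?"
    for name, value in extract_features(sample, k=2).items():
        print(name, value)
```

In a fuller pipeline the top-k words would typically be turned into frequency counts over a shared vocabulary so that every text yields a vector of the same length.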

The authors use the Java Development Kit (JDK) and NetBeans IDE for preprocessing and feature extraction, along with the MySQL JDBC Driver and edu.mit.jwi_2.1.4 libraries. MATLAB is used for classification.

The paper references several relevant works, including:

The use of author fuzzy fingerprints for authorship identification [1]. This involves extracting a fingerprint from a set of texts and using it to identify the author of a different text document.
Efficient computation of frequent and top-k elements in data streams [2] using a counter-based Space-Saving algorithm.
The role of statistical text analysis in authorship attribution [3], which uses quantitative methods to characterize an author's style numerically.
The use of authorship analysis in cybercrime investigation [4], which extracts style markers, content-specific features, and structural features to identify the authorship of illegal messages.
Combination of text classifiers [5] to leverage the strengths of different methods.
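
The counter-based Space-Saving idea mentioned in [2] can be sketched as follows. This is a simplified illustration, not the code used in the paper: the names `space_saving` and `top_k` are hypothetical, and the per-item error bounds tracked by the full algorithm are left out.

```python
def space_saving(stream, capacity):
    """Approximate item counts using at most `capacity` counters."""
    counters = {}
    for item in stream:
        if item in counters:
            # Already monitored: increment its counter.
            counters[item] += 1
        elif len(counters) < capacity:
            # Free slot available: start monitoring the item.
            counters[item] = 1
        else:
            # No free slot: replace the item with the smallest count.
            # The newcomer inherits that count plus one, so counts may
            # overestimate true frequencies but never underestimate them.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


def top_k(stream, k, capacity):
    """Return the k monitored items with the highest estimated counts."""
    counters = space_saving(stream, capacity)
    return sorted(counters.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    words = "a b a c a b d a".split()
    print(top_k(words, k=2, capacity=3))
```

A useful property of this scheme is that the counters always sum to the number of items seen, which makes the memory cost independent of vocabulary size.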

The classifiers used are:

Fuzzy Learning Classifier: This groups elements into fuzzy sets based on their characteristics. A membership function $\mu$ indicates the degree to which an individual belongs to a class.
Support Vector Machine (SVM) Classifier: This is a supervised learning model that classifies data using both linear and non-linear methods. A multi-SVM with an RBF kernel is used.
Combined Classifier: This combines the outputs of the fuzzy classifier and the SVM classifier to achieve higher accuracy.
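
The summary does not say how the two outputs are combined, so the following Python sketch only illustrates one plausible scheme. The triangular shape of the membership function, the weighted-average rule, and the names `triangular_membership` and `combine_scores` are all assumptions, not the authors' method.

```python
def triangular_membership(x, low, peak, high):
    """Degree of membership in [0, 1] for a triangular fuzzy set.

    Assumed shape: 0 at or outside [low, high], 1 at `peak`, linear between.
    """
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)


def combine_scores(fuzzy_scores, svm_scores, svm_weight=0.5):
    """Combine per-author scores by weighted average (assumed rule).

    Both inputs map author name -> score in [0, 1]. Returns the author with
    the highest combined score together with all combined scores.
    """
    authors = fuzzy_scores.keys() & svm_scores.keys()
    combined = {
        author: svm_weight * svm_scores[author]
        + (1.0 - svm_weight) * fuzzy_scores[author]
        for author in authors
    }
    # Sort first so that ties are broken alphabetically and deterministically.
    best = max(sorted(combined), key=combined.get)
    return best, combined


if __name__ == "__main__":
    fuzzy = {"author_a": 0.6, "author_b": 0.4}
    svm = {"author_a": 0.3, "author_b": 0.7}
    print(combine_scores(fuzzy, svm))
```

Other combination rules, such as majority voting or taking the maximum score, would fit the same interface; the weighted average is chosen here only because it is easy to read.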

The evaluation used a dataset of texts from different authors. Texts were tokenized into words and punctuation, and additional stylometric features such as the number of words, sentences, and characters were extracted. Classifier accuracy was measured as the percentage of texts whose author was correctly identified. The combined classifier achieved an accuracy of 76%, while the SVM classifier alone achieved 70%. The fuzzy classifier was less accurate than the SVM. The authors conclude that the combined classifier provides better accuracy than either individual classifier for authorship attribution.
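
The accuracy measure described above, the percentage of correctly identified authors, can be written as a short Python sketch. The function name `accuracy` and the example labels are hypothetical and are not data from the paper.

```python
def accuracy(true_authors, predicted_authors):
    """Percentage of texts whose predicted author matches the true author."""
    if len(true_authors) != len(predicted_authors):
        raise ValueError("label lists must have the same length")
    if not true_authors:
        raise ValueError("at least one labeled text is required")
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return 100.0 * correct / len(true_authors)


if __name__ == "__main__":
    # Made-up labels for illustration only.
    truth = ["author_a", "author_b", "author_c", "author_a"]
    predicted = ["author_a", "author_b", "author_c", "author_b"]
    print(accuracy(truth, predicted))
```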
