ALMs: Authorial Language Models for Authorship Attribution (2401.12005v2)

Published 22 Jan 2024 in cs.CL

Abstract: In this paper, we introduce an authorship attribution method called Authorial Language Models (ALMs) that involves identifying the most likely author of a questioned document based on the perplexity of the questioned document calculated for a set of causal language models fine-tuned on the writings of a set of candidate authors. We benchmarked ALMs against state-of-the-art systems using the CCAT50 and Blogs50 datasets. We find that ALMs achieves a macro-average accuracy score of 83.6% on Blogs50, outperforming all other methods, and 74.9% on CCAT50, matching the performance of the best method. To assess the performance of ALMs on shorter texts, we also conducted text ablation testing. We found that to reach a macro-average accuracy of 70%, ALMs needs 40 tokens on Blogs50 and 400 tokens on CCAT50, while to reach 60% ALMs requires 20 tokens on Blogs50 and 70 tokens on CCAT50.

Citations (2)

Summary

  • The paper introduces ALMs, which fine-tune LLMs on candidate texts and use perplexity minimization to determine authorship.
  • It demonstrates high accuracy on Blogs50 (83.6% macro-average) and competitive results on CCAT50 (74.9%), surpassing traditional stylometric models.
  • The study shows ALMs can reliably attribute authorship from short texts, requiring as few as 40 tokens on Blogs50 (400 on CCAT50) to reach 70% macro-average accuracy.

Introduction

The paper under discussion proposes a novel method for authorship attribution, a task that has interested computational linguists for over a century. Authorship attribution involves determining the most probable author of an anonymous or disputed text by comparing its style with the writings of a set of candidate authors. Traditional stylometric techniques struggle as the number of candidate authors grows, document length shrinks, and representative training data becomes scarce. To address these limitations, the present paper introduces Authorial Language Models (ALMs), which leverage causal language models fine-tuned on each candidate author's writings to predict authorship with improved accuracy.

Methodology

The researchers delineate a three-step approach to authorship attribution. First, a causal language model (the paper uses GPT-2) is fine-tuned over numerous epochs on writing samples from each candidate author, yielding one model per candidate. Second, the perplexity of the questioned document is calculated under each of these fine-tuned models; perplexity measures how well a model predicts the token sequence of a text, reflecting how 'surprising' the text is to the model. Third, the text is attributed to the author whose fine-tuned model yields the lowest perplexity. The authors evaluate their method on the widely recognized CCAT50 and Blogs50 datasets, comprising news articles and blog posts respectively, so that register is held roughly constant within each benchmark.
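
The paper itself does not include code; the following is a minimal sketch of steps two and three, assuming Hugging Face transformers and GPT-2 checkpoints already fine-tuned per candidate author. The `author_model_dirs` mapping, the `perplexity` and `attribute` helpers, and all hyperparameters are illustrative, not from the paper.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(model, text, device="cpu"):
    """Perplexity of `text` under `model`:
    PPL = exp(-(1/N) * sum_i log p(token_i | token_<i))."""
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=model.config.n_positions).to(device)
    with torch.no_grad():
        # With labels set to the input ids, the model returns the mean
        # next-token cross-entropy, so exp(loss) is the perplexity.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def attribute(questioned_text, author_model_dirs, device="cpu"):
    """Attribute `questioned_text` to the candidate whose fine-tuned
    model is least 'surprised' by it (lowest perplexity)."""
    scores = {}
    for author, path in author_model_dirs.items():
        # One fine-tuned GPT-2 checkpoint per candidate author.
        model = GPT2LMHeadModel.from_pretrained(path).to(device).eval()
        scores[author] = perplexity(model, questioned_text, device)
    return min(scores, key=scores.get), scores
```

Here `author_model_dirs` would map each candidate's name to the directory of their fine-tuned checkpoint (e.g. as saved by a standard fine-tuning loop or the transformers Trainer).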

Results

The ALMs method delivers strong results, outperforming all other models tested on the Blogs50 dataset with an 83.6% macro-average accuracy and matching the best-performing model on CCAT50 with 74.9%. These results suggest that perplexity-based approaches can compete with, and sometimes surpass, established authorship attribution methods. Notably, the paper reports that ALMs remain robust on shorter texts, needing as few as 40 tokens on Blogs50 and 400 tokens on CCAT50 to attain 70% macro-average accuracy, and only 20 and 70 tokens on each dataset, respectively, to reach 60%. This matters in forensic and other applications where the available text is limited.
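
The exact ablation protocol is not spelled out in this summary; a plausible reconstruction, sketched below, truncates each test document to its first N tokens before scoring and macro-averages accuracy per author, as the paper's reported metric suggests. It reuses `tokenizer` and `attribute` from the earlier sketch; `test_set` (a list of (text, true_author) pairs) and `author_model_dirs` are hypothetical.

```python
from collections import defaultdict

def macro_accuracy_at_length(test_set, author_model_dirs, n_tokens, device="cpu"):
    """Macro-averaged attribution accuracy when each questioned document
    is truncated to its first `n_tokens` tokens."""
    hits = defaultdict(int)    # correct predictions per author
    totals = defaultdict(int)  # test documents per author
    for text, true_author in test_set:
        ids = tokenizer(text, truncation=True, max_length=n_tokens)["input_ids"]
        predicted, _ = attribute(tokenizer.decode(ids), author_model_dirs, device)
        totals[true_author] += 1
        hits[true_author] += (predicted == true_author)
    # Average per-author accuracies, weighting each author equally.
    return sum(hits[a] / totals[a] for a in totals) / len(totals)

# Example usage (hypothetical data):
# for n in (20, 40, 70, 400):
#     print(n, macro_accuracy_at_length(test_set, author_model_dirs, n))
```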

Discussion

The paper posits that the superior results of ALMs stem from access to token-level authorial features, rather than the type-based metrics, such as function-word frequencies or n-grams, common in traditional stylometric analysis. The approach parallels manual forensic linguistic analysis in its attention to fine-grained stylistic detail. The authors also surmise that token-based analysis, by loosening the dependence on topical content, achieves a granularity beyond the reach of type-based methods. Despite its success, the paper acknowledges the need for further exploration of hyperparameter optimization and potential dataset biases, calling for cautious implementation and careful dataset evaluation before widespread application.