
Authorship Verification based on the Likelihood Ratio of Grammar Models (2403.08462v1)

Published 13 Mar 2024 in cs.CL and cs.LG

Abstract: Authorship Verification (AV) is the process of analyzing a set of documents to determine whether they were written by a specific author. This problem often arises in forensic scenarios, e.g., in cases where the documents in question constitute evidence for a crime. Existing state-of-the-art AV methods use computational solutions that are not supported by a plausible scientific explanation for their functioning and that are often difficult for analysts to interpret. To address this, we propose a method relying on calculating a quantity we call $\lambda_G$ (LambdaG): the ratio between the likelihood of a document given a model of the Grammar for the candidate author and the likelihood of the same document given a model of the Grammar for a reference population. These Grammar Models are estimated using $n$-gram language models that are trained solely on grammatical features. Despite not needing large amounts of data for training, LambdaG still outperforms other established AV methods with higher computational complexity, including a fine-tuned Siamese Transformer network. Our empirical evaluation based on four baseline methods applied to twelve datasets shows that LambdaG leads to better results in terms of both accuracy and AUC in eleven cases and in all twelve cases if considering only topic-agnostic methods. The algorithm is also highly robust to important variations in the genre of the reference population in many cross-genre comparisons. In addition to these properties, we demonstrate how LambdaG is easier to interpret than the current state-of-the-art. We argue that the advantage of LambdaG over other methods is due to the fact that it is compatible with Cognitive Linguistic theories of language processing.

Authors (5)
  1. Andrea Nini (3 papers)
  2. Oren Halvani (6 papers)
  3. Lukas Graner (7 papers)
  4. Valerio Gherardi (6 papers)
  5. Shunichi Ishihara (1 paper)
Citations (1)

Summary

  • The paper presents a novel likelihood ratio framework using grammar models to verify authorship with a focus on cognitive linguistic individuality.
  • It employs n-gram models of function tokens to build compact, data-efficient classifiers that outperform complex transformer-based approaches.
  • The method demonstrates robust cross-genre performance and improved forensic interpretability through calibrated log-likelihood ratios.

Authorship Verification Based on the Likelihood Ratio of Grammar Models

The paper "Authorship Verification based on the Likelihood Ratio of Grammar Models" introduces a novel method for authorship verification (AV) that employs a likelihood ratio framework utilizing grammar models. Authorship verification is the process of determining whether a given set of documents was authored by the same individual, a task that finds applications in forensic science, such as the analysis of incriminating or questioned documents. Unlike many existing AV methods that lack interpretability and scientific transparency, this approach emphasizes cognitive linguistic compatibility and interpretability.

Methodology

The proposed method, denoted $\lambda_G$ (LambdaG), calculates the ratio of the likelihood of a document under a grammar model of the candidate author to its likelihood under a grammar model of a reference population. These grammar models are $n$-gram language models trained solely on grammatical features, specifically function tokens, which include function words, morphemes, punctuation marks, and abstract grammatical categories. The method does not require large amounts of training data, yet it outperforms more complex methods, including a fine-tuned Siamese transformer network.
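
In symbols, and using shorthand introduced here rather than the paper's own notation, the quantity can be written as

$\lambda_G = \dfrac{P(D \mid G_{\text{author}})}{P(D \mid G_{\text{reference}})}$

where $D$ is the questioned document represented as a sequence of function tokens, $G_{\text{author}}$ is the $n$-gram grammar model estimated from the candidate author's known writings, and $G_{\text{reference}}$ is the corresponding model estimated from a reference population. In practice the logarithm of this ratio is used, with values above zero supporting the candidate-author hypothesis.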

The approach is grounded in several cognitive linguistic theories, such as the Principle of Linguistic Individuality, suggesting that no two individuals have identical grammars. As a probabilistic model, it leverages the entrenchment of language habits in an author’s procedural memory, thereby distinguishing authors based on their unique grammatical idiosyncrasies.
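
To make the computation concrete, the following is a minimal sketch of the likelihood-ratio idea using add-one-smoothed bigram models over function-token sequences. The function names, the smoothing choice, and the toy data are illustrative assumptions; the paper's actual implementation relies on properly smoothed $n$-gram models of higher order and a principled grammatical feature extraction step.

```python
from collections import Counter
from math import log

def bigram_counts(sentences):
    """Collect unigram and bigram counts from function-token sequences."""
    uni, bi = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>"] + toks + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def log_likelihood(sentences, uni, bi, vocab_size):
    """Add-one-smoothed bigram log-likelihood of a document."""
    ll = 0.0
    for toks in sentences:
        toks = ["<s>"] + toks + ["</s>"]
        for prev, cur in zip(toks, toks[1:]):
            ll += log((bi[(prev, cur)] + 1) / (uni[prev] + vocab_size))
    return ll

def lambda_g(doc, author_corpus, reference_corpus):
    """Log-likelihood ratio of doc: author grammar model vs. reference grammar model."""
    vocab = {t for s in author_corpus + reference_corpus + doc for t in s} | {"<s>", "</s>"}
    a_uni, a_bi = bigram_counts(author_corpus)
    r_uni, r_bi = bigram_counts(reference_corpus)
    return (log_likelihood(doc, a_uni, a_bi, len(vocab))
            - log_likelihood(doc, r_uni, r_bi, len(vocab)))

# Toy usage: each sentence is a list of function tokens, with content words
# already replaced by their part-of-speech tags (e.g. NOUN, VERB, ADJ).
author_known = [["i", "VERB", "that", "the", "NOUN", "is", "ADJ", "."]]
reference    = [["the", "NOUN", "of", "the", "NOUN", "VERB", "ADV", "."]]
questioned   = [["i", "VERB", "that", "the", "NOUN", "is", "ADV", "."]]
print(lambda_g(questioned, author_known, reference))  # > 0 favours the candidate author
```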

Results

The method's efficacy was evaluated against several baseline methods, including the established Impostors Method and a Siamese transformer network. Twelve datasets spanning different text types, genres, and lengths were used for this evaluation. The $\lambda_G$ approach demonstrated superior performance across these datasets, particularly in cross-genre verification tasks, without relying on topic-driven features. Notably, it maintained robust performance even when the genre of the reference corpus differed from that of the test set, showing strong generalizability across genres.

The calibration of likelihood ratios into meaningful log-likelihood ratios ($\Lambda_G$) involved logistic regression, which improved the interpretability and legal applicability of the results, essential in forensic contexts.
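
As an illustration of this calibration step, the sketch below fits a logistic regression on raw $\lambda_G$ scores from held-out same-author and different-author comparisons and converts its output into a base-10 log-likelihood ratio. The toy data and the conversion via prior log-odds are generic assumptions about how such calibration is commonly done, not a reproduction of the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical calibration data: raw lambda_G scores and ground-truth labels
# (1 = same author, 0 = different author) from held-out comparison pairs.
raw_scores = np.array([[2.3], [1.1], [-0.4], [0.8], [-1.7], [-2.2], [0.2], [-0.9]])
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])

calibrator = LogisticRegression()
calibrator.fit(raw_scores, labels)

def calibrated_llr(score):
    """Base-10 log-likelihood ratio for a new raw score.

    The logistic regression's decision function gives posterior log-odds;
    subtracting the prior log-odds of the calibration set leaves the
    log-likelihood ratio, which is then converted from natural log to base 10.
    """
    posterior_log_odds = calibrator.decision_function([[score]])[0]
    prior = labels.mean()
    prior_log_odds = np.log(prior / (1 - prior))
    return (posterior_log_odds - prior_log_odds) / np.log(10)

print(calibrated_llr(1.5))  # > 0 supports the same-author hypothesis
```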

Implications and Future Directions

The research holds significant implications for both practical forensic applications and theoretical linguistic studies. Practically, the interpretability and adaptability of the model make it suitable for forensic investigations where transparency is paramount. Theoretically, it reinforces the understanding of language as a complex, individualized system deeply entrenched in cognitive processes rather than merely learned content.

This paper represents a pivot towards integrating cognitive linguistics with computational methods, emphasizing scientifically rooted approaches that enhance reliability and transparency. Future research might explore linguistic individuality further across different languages and expand the method's application to other cross-linguistic and cultural contexts. Moreover, developing entirely language-independent models could broaden the applicability of authorship verification frameworks globally.

This research underscores the viability of implementing cognitive linguistic theories in computational solutions, advancing the state-of-the-art in authorship verification with practical applications that demand both precision and interpretability.