
Authorship Verification based on the Likelihood Ratio of Grammar Models (2403.08462v1)

Published 13 Mar 2024 in cs.CL and cs.LG

Abstract: Authorship Verification (AV) is the process of analyzing a set of documents to determine whether they were written by a specific author. This problem often arises in forensic scenarios, e.g., in cases where the documents in question constitute evidence for a crime. Existing state-of-the-art AV methods use computational solutions that are not supported by a plausible scientific explanation for their functioning and that are often difficult for analysts to interpret. To address this, we propose a method relying on calculating a quantity we call $\lambda_G$ (LambdaG): the ratio between the likelihood of a document given a model of the Grammar for the candidate author and the likelihood of the same document given a model of the Grammar for a reference population. These Grammar Models are estimated using $n$-gram language models that are trained solely on grammatical features. Despite not needing large amounts of data for training, LambdaG still outperforms other established AV methods with higher computational complexity, including a fine-tuned Siamese Transformer network. Our empirical evaluation based on four baseline methods applied to twelve datasets shows that LambdaG leads to better results in terms of both accuracy and AUC in eleven cases and in all twelve cases if considering only topic-agnostic methods. The algorithm is also highly robust to important variations in the genre of the reference population in many cross-genre comparisons. In addition to these properties, we demonstrate how LambdaG is easier to interpret than the current state-of-the-art. We argue that the advantage of LambdaG over other methods is due to the fact that it is compatible with Cognitive Linguistic theories of language processing.

Authors (5)
  1. Andrea Nini (3 papers)
  2. Oren Halvani (6 papers)
  3. Lukas Graner (7 papers)
  4. Valerio Gherardi (6 papers)
  5. Shunichi Ishihara (1 paper)
Citations (1)

Summary

  • The paper presents a novel likelihood ratio framework using grammar models to verify authorship with a focus on cognitive linguistic individuality.
  • It employs n-gram models of function tokens to build compact, data-efficient classifiers that outperform complex transformer-based approaches.
  • The method demonstrates robust cross-genre performance and improved forensic interpretability through calibrated log-likelihood ratios.

Authorship Verification Based on the Likelihood Ratio of Grammar Models

The paper "Authorship Verification based on the Likelihood Ratio of Grammar Models" introduces a novel method for authorship verification (AV) that employs a likelihood ratio framework utilizing grammar models. Authorship verification is the process of determining whether a given set of documents was authored by the same individual, a task that finds applications in forensic science, such as the analysis of incriminating or questioned documents. Unlike many existing AV methods that lack interpretability and scientific transparency, this approach emphasizes cognitive linguistic compatibility and interpretability.

Methodology

The proposed method, denoted $\lambda_G$ (LambdaG), calculates the ratio of the likelihood of a document under a grammar model of the candidate author to its likelihood under a grammar model of a reference population. These grammar models are $n$-gram language models trained solely on grammatical features, specifically function tokens, which include function words, morphemes, punctuation marks, and abstract grammatical categories. The method does not require large amounts of training data, yet it outperforms more complex methods, including a fine-tuned Siamese transformer network.
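
In symbols, and using shorthand introduced here rather than the paper's own notation, the quantity can be written as

$\lambda_G = \dfrac{P(D \mid G_{\text{author}})}{P(D \mid G_{\text{reference}})}$

where $D$ is the questioned document represented as a sequence of function tokens, $G_{\text{author}}$ is the $n$-gram grammar model estimated from the candidate author's known writings, and $G_{\text{reference}}$ is the corresponding model estimated from a reference population. In practice the logarithm of this ratio is used, with values above zero supporting the candidate-author hypothesis.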

The approach is grounded in several cognitive linguistic theories, such as the Principle of Linguistic Individuality, suggesting that no two individuals have identical grammars. As a probabilistic model, it leverages the entrenchment of language habits in an author’s procedural memory, thereby distinguishing authors based on their unique grammatical idiosyncrasies.
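
To make the computation concrete, the following is a minimal sketch of the likelihood-ratio idea using add-one-smoothed bigram models over function-token sequences. The function names, the smoothing choice, and the toy data are illustrative assumptions; the paper's actual implementation relies on properly smoothed $n$-gram models of higher order and a principled grammatical feature extraction step.

```python
from collections import Counter
from math import log

def bigram_counts(sentences):
    """Collect unigram and bigram counts from function-token sequences."""
    uni, bi = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>"] + toks + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def log_likelihood(sentences, uni, bi, vocab_size):
    """Add-one-smoothed bigram log-likelihood of a document."""
    ll = 0.0
    for toks in sentences:
        toks = ["<s>"] + toks + ["</s>"]
        for prev, cur in zip(toks, toks[1:]):
            ll += log((bi[(prev, cur)] + 1) / (uni[prev] + vocab_size))
    return ll

def lambda_g(doc, author_corpus, reference_corpus):
    """Log-likelihood ratio of doc: author grammar model vs. reference grammar model."""
    vocab = {t for s in author_corpus + reference_corpus + doc for t in s} | {"<s>", "</s>"}
    a_uni, a_bi = bigram_counts(author_corpus)
    r_uni, r_bi = bigram_counts(reference_corpus)
    return (log_likelihood(doc, a_uni, a_bi, len(vocab))
            - log_likelihood(doc, r_uni, r_bi, len(vocab)))

# Toy usage: each sentence is a list of function tokens, with content words
# already replaced by their part-of-speech tags (e.g. NOUN, VERB, ADJ).
author_known = [["i", "VERB", "that", "the", "NOUN", "is", "ADJ", "."]]
reference    = [["the", "NOUN", "of", "the", "NOUN", "VERB", "ADV", "."]]
questioned   = [["i", "VERB", "that", "the", "NOUN", "is", "ADV", "."]]
print(lambda_g(questioned, author_known, reference))  # > 0 favours the candidate author
```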

Results

The method's efficacy was evaluated against several baseline methods, including the established Impostors Method and a Siamese transformer network. Twelve datasets spanning different text types, genres, and lengths were used for this evaluation. The $\lambda_G$ approach demonstrated superior performance across these datasets, particularly in cross-genre verification tasks, without relying on topic-driven features. Notably, it maintained robust performance even when the genre of the reference corpus differed from that of the test set, showing strong generalizability across genres.

The calibration of likelihood ratios into meaningful log-likelihood ratios ($\Lambda_G$) involved logistic regression, which improved the interpretability and legal applicability of the results, essential in forensic contexts.
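
As an illustration of this calibration step, the sketch below fits a logistic regression on raw $\lambda_G$ scores from held-out same-author and different-author comparisons and converts its output into a base-10 log-likelihood ratio. The toy data and the conversion via prior log-odds are generic assumptions about how such calibration is commonly done, not a reproduction of the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical calibration data: raw lambda_G scores and ground-truth labels
# (1 = same author, 0 = different author) from held-out comparison pairs.
raw_scores = np.array([[2.3], [1.1], [-0.4], [0.8], [-1.7], [-2.2], [0.2], [-0.9]])
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])

calibrator = LogisticRegression()
calibrator.fit(raw_scores, labels)

def calibrated_llr(score):
    """Base-10 log-likelihood ratio for a new raw score.

    The logistic regression's decision function gives posterior log-odds;
    subtracting the prior log-odds of the calibration set leaves the
    log-likelihood ratio, which is then converted from natural log to base 10.
    """
    posterior_log_odds = calibrator.decision_function([[score]])[0]
    prior = labels.mean()
    prior_log_odds = np.log(prior / (1 - prior))
    return (posterior_log_odds - prior_log_odds) / np.log(10)

print(calibrated_llr(1.5))  # > 0 supports the same-author hypothesis
```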

Implications and Future Directions

The research holds significant implications for both practical forensic applications and theoretical linguistic studies. Practically, the interpretability and adaptability of the model make it suitable for forensic investigations where transparency is paramount. Theoretically, it reinforces the understanding of language as a complex, individualized system deeply entrenched in cognitive processes rather than merely learned content.

This paper represents a pivot towards integrating cognitive linguistics with computational methods, emphasizing scientifically rooted approaches that enhance reliability and transparency. Future research might explore linguistic individuality further across different languages and expand the method's application to other cross-linguistic and cultural contexts. Moreover, developing entirely language-independent models could broaden the applicability of authorship verification frameworks globally.

This research underscores the viability of implementing cognitive linguistic theories in computational solutions, advancing the state-of-the-art in authorship verification with practical applications that demand both precision and interpretability.