
Document Author Classification Using Parsed Language Structure (2403.13253v1)

Published 20 Mar 2024 in cs.CL and eess.AS

Abstract: Over the years there has been ongoing interest in detecting the authorship of a text based on statistical properties of the text, such as the occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted with a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of "proof texts," The Federalist Papers and Sanditon, which have been used as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of some depth; part of speech; and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
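The feature types listed in the abstract can be illustrated with a minimal Python sketch. It assumes the parser output is available as a Penn-Treebank-style bracketed string; the function names (`parse_bracketed`, `pos_by_level`, `truncate`, `all_subtrees`) and the exact counting conventions are hypothetical choices made for illustration, not the paper's implementation. The projection into a lower dimensional space is not shown.

```python
from collections import Counter

def parse_bracketed(s):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()

def pos_by_level(tree, level=0, counts=None):
    """Count (part-of-speech tag, depth) pairs; a tag is a node whose only child is a word."""
    counts = Counter() if counts is None else counts
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        counts[(label, level)] += 1
    else:
        for child in children:
            if not isinstance(child, str):
                pos_by_level(child, level + 1, counts)
    return counts

def truncate(tree, depth):
    """Label-only shape of the tree cut `depth` levels below its root, words dropped."""
    label, children = tree
    kids = [c for c in children if not isinstance(c, str)]
    if depth == 0 or not kids:
        return label
    return "(" + label + " " + " ".join(truncate(c, depth - 1) for c in kids) + ")"

def all_subtrees(tree, depth, counts=None):
    """Count depth-limited shapes rooted at every node that has non-word children."""
    counts = Counter() if counts is None else counts
    kids = [c for c in tree[1] if not isinstance(c, str)]
    if kids:
        counts[truncate(tree, depth)] += 1
    for child in kids:
        all_subtrees(child, depth, counts)
    return counts

tree = parse_bracketed("(S (NP (DT The) (NN cat)) (VP (VBD sat)))")
print(pos_by_level(tree))     # part of speech by level
print(truncate(tree, 1))      # rooted subtree of depth 1
print(all_subtrees(tree, 1))  # subtrees of depth 1 from any level
```

In this sketch, plain part-of-speech counts follow from `pos_by_level` by summing over levels, and per-document feature vectors would be built by accumulating these counters over every parsed sentence of a text.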

