Optimizing a Data Science System for Text Reuse Analysis (2401.07290v1)

Published 14 Jan 2024 in cs.DB

Abstract: Text reuse is a methodological element of fundamental importance in humanities research: pieces of text that re-appear across different documents, verbatim or paraphrased, provide invaluable information about the historical spread and evolution of ideas. Large modern digitized corpora enable the joint analysis of text collections that span entire centuries and the detection of large-scale patterns, impossible to detect with traditional small-scale analysis. For this opportunity to materialize, it is necessary to develop efficient data science systems that perform the corresponding analysis tasks. In this paper, we share insights from ReceptionReader, a system for analyzing text reuse in large historical corpora. The system is built upon billions of instances of text reuses from large digitized corpora of 18th-century texts. Its main functionality is to perform downstream text reuse analysis tasks, such as finding reuses that stem from a given article or identifying the most reused quotes from a set of documents, with each task expressed as a database query. For the purposes of the paper, we discuss the related design choices including various database normalization levels and query execution frameworks, such as distributed data processing (Apache Spark), indexed row store engine (MariaDB Aria), and compressed column store engine (MariaDB Columnstore). Moreover, we present an extensive evaluation with various metrics of interest (latency, storage size, and computing costs) for varying workloads, and we offer insights from the trade-offs we observed and the choices that emerged as optimal in our setting. In summary, our results show that (1) for the workloads that are most relevant to text-reuse analysis, the MariaDB Aria framework emerges as the overall optimal choice, and (2) big data processing (Apache Spark) is irreplaceable for all processing stages of the system's pipeline.
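To make the phrase "each task expressed as a database query" concrete, the two example tasks named in the abstract can be sketched as SQL over a small normalized schema. This is only an illustration under stated assumptions: the table and column names (`passage`, `reuse`, `doc_id`, and so on) and the demo rows are hypothetical and do not come from the paper, and Python's built-in `sqlite3` stands in for MariaDB so that the sketch runs without a server.

```python
import sqlite3

# Hypothetical schema (an assumption, not the paper's): each row of `reuse`
# links a source passage to a destination passage that repeats it.
SCHEMA = """
CREATE TABLE passage (
    passage_id INTEGER PRIMARY KEY,
    doc_id     TEXT NOT NULL,
    text       TEXT NOT NULL
);
CREATE TABLE reuse (
    src_passage_id INTEGER NOT NULL REFERENCES passage(passage_id),
    dst_passage_id INTEGER NOT NULL REFERENCES passage(passage_id)
);
CREATE INDEX idx_reuse_src ON reuse(src_passage_id);
"""


def build_demo_db():
    """Create an in-memory database with a few made-up passages and reuses."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO passage VALUES (?, ?, ?)",
        [
            (1, "docA", "quote one"),
            (2, "docA", "quote two"),
            (3, "docB", "quote one, reprinted"),
            (4, "docC", "quote one, again"),
            (5, "docC", "quote two, reprinted"),
        ],
    )
    conn.executemany(
        "INSERT INTO reuse VALUES (?, ?)",
        [(1, 3), (1, 4), (2, 5)],
    )
    conn.commit()
    return conn


def reuses_from_document(conn, doc_id):
    """Task 1: list the documents that reuse passages of the given document."""
    rows = conn.execute(
        """
        SELECT DISTINCT dst.doc_id
        FROM passage AS src
        JOIN reuse ON reuse.src_passage_id = src.passage_id
        JOIN passage AS dst ON dst.passage_id = reuse.dst_passage_id
        WHERE src.doc_id = ?
        ORDER BY dst.doc_id
        """,
        (doc_id,),
    ).fetchall()
    return [row[0] for row in rows]


def most_reused_quotes(conn, doc_ids, limit=10):
    """Task 2: rank passages from a set of documents by number of reuses."""
    placeholders = ",".join("?" for _ in doc_ids)
    return conn.execute(
        f"""
        SELECT src.passage_id, src.text, COUNT(*) AS n_reuses
        FROM passage AS src
        JOIN reuse ON reuse.src_passage_id = src.passage_id
        WHERE src.doc_id IN ({placeholders})
        GROUP BY src.passage_id, src.text
        ORDER BY n_reuses DESC, src.passage_id
        LIMIT ?
        """,
        (*doc_ids, limit),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(reuses_from_document(conn, "docA"))
    print(most_reused_quotes(conn, ["docA"]))
```

The index on `reuse(src_passage_id)` hints at why an indexed row store could suit lookups of this kind: both queries start from a small set of source passages and follow the links outward. Whether that carries over to the billions of reuse instances the paper describes is a question for the paper's own evaluation, which this sketch does not reproduce.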

