
TEIMMA: The First Content Reuse Annotator for Text, Images, and Math (2305.13193v2)

Published 22 May 2023 in cs.IR

Abstract: This demo paper presents TEIMMA, the first tool to annotate the reuse of text, images, and mathematical formulae in a document pair. Annotating content reuse is particularly useful for developing plagiarism detection algorithms. Real-world content reuse is often obfuscated, which makes such cases challenging to identify. TEIMMA lets users enter the obfuscation type, enabling novel classifications of confirmed plagiarism cases. It records different reuse types for text, images, and mathematical formulae in HTML and supports users by visualizing the content reuse in a document pair using similarity detection methods for text and math.
