TEIMMA: The First Content Reuse Annotator for Text, Images, and Math (2305.13193v2)
Published 22 May 2023 in cs.IR
Abstract: This demo paper presents the first tool to annotate the reuse of text, images, and mathematical formulae in a document pair -- TEIMMA. Annotating content reuse is particularly useful for developing plagiarism detection algorithms. Real-world content reuse is often obfuscated, which makes such cases challenging to identify. TEIMMA lets annotators enter the obfuscation type, enabling novel classifications for confirmed cases of plagiarism. It records different reuse types for text, images, and mathematical formulae in HTML and supports users by visualizing the content reuse in a document pair using similarity detection methods for text and math.
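To make the kind of record described in the abstract concrete, the following is a minimal sketch, not TEIMMA's actual data model. The class name `ReuseAnnotation`, its fields, the use of character offsets, and the example obfuscation label are all hypothetical. The word n-gram Jaccard measure is a generic stand-in for "a similarity detection method for text"; the abstract does not say which methods the tool uses.

```python
from dataclasses import dataclass

# The three content types named in the abstract.
CONTENT_TYPES = {"text", "image", "math"}


@dataclass(frozen=True)
class ReuseAnnotation:
    """One annotated reuse instance linking a span in each document of a pair.

    Hypothetical structure: field names and offset-based spans are assumptions.
    """
    content_type: str   # "text", "image", or "math"
    source_span: tuple  # (start, end) offsets in the source document
    target_span: tuple  # (start, end) offsets in the reusing document
    obfuscation: str    # annotator-entered label, e.g. "paraphrase" (illustrative)

    def __post_init__(self):
        if self.content_type not in CONTENT_TYPES:
            raise ValueError(f"unknown content type: {self.content_type!r}")


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a, b, n=3):
    """Jaccard overlap of word n-grams (assumed stand-in, not TEIMMA's method)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

A score like this could be used to highlight candidate passages for an annotator, who then confirms the reuse and records it as a `ReuseAnnotation` with the observed obfuscation type.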