Processing the structure of documents: Logical Layout Analysis of historical newspapers in French (2202.08125v2)
Abstract: Background. In recent years, libraries and archives have led major digitisation campaigns that opened access to vast collections of historical documents. While such documents are often available as XML ALTO documents, they lack information about their logical structure. In this paper, we address the problem of Logical Layout Analysis applied to historical documents in French. We propose a rule-based method, which we evaluate and compare with two Machine Learning models, namely RIPPER and Gradient Boosting. Our data set contains French newspapers, periodicals and magazines published in the first half of the twentieth century in the Franche-Comté region. Results. Our rule-based system outperforms the two other models in nearly all evaluations. Its Recall results are especially better, indicating that our system covers more types of every logical label than the other two models. Between RIPPER and Gradient Boosting, Gradient Boosting has better Precision scores while RIPPER has better Recall scores. Conclusions. The evaluation shows that our system outperforms the two Machine Learning models and provides significantly higher Recall. It also confirms that our system can be used to produce annotated data sets that are large enough to envisage Machine Learning or Deep Learning approaches for the task of Logical Layout Analysis. Combining rules and Machine Learning models into hybrid systems could potentially provide even better performance. Furthermore, because the layout of historical documents evolves rapidly, one possible way to handle this variation would be to apply Rule Learning algorithms to bootstrap rule sets adapted to different publication periods.
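Since the comparison above rests on per-label Precision and Recall, the following minimal sketch shows how those two scores can be computed for a block-level labelling. It is an illustration only, not the paper's evaluation code: the function name `per_label_scores`, the label names (`title`, `text`, `author`) and the six example blocks are all hypothetical.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Return {label: (precision, recall)} for two equal-length label sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    gold_count = Counter(gold)        # how often each label occurs in the reference
    pred_count = Counter(predicted)   # how often each label was predicted
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        # Precision: share of predictions of this label that are correct.
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        # Recall: share of reference occurrences of this label that were found.
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical labels for six text blocks (not taken from the paper's data set).
gold = ["title", "title", "text", "text", "text", "author"]
predicted = ["title", "text", "text", "text", "text", "author"]

scores = per_label_scores(gold, predicted)
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the `title` label has perfect Precision but only half the reference titles are recovered, which is the kind of gap a Recall-oriented comparison between systems would expose.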