Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers (2002.06144v4)

Published 14 Feb 2020 in cs.CV, cs.CL, cs.IR, and cs.LG

Abstract: The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research efforts that seek to automatically process facsimiles and extract information from them are multiplying, with document layout analysis as an essential first step. Although the identification and categorization of segments of interest in document images have seen significant progress in recent years thanks to deep learning techniques, many challenges remain, including the use of finer-grained segmentation typologies and the handling of complex, heterogeneous documents such as historical newspapers. Moreover, most approaches consider visual features only, ignoring the textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among other things, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show that multimodal models consistently improve on a strong visual baseline and are more robust to high material variance.

Authors (5)
  1. Raphaël Barman
  2. Maud Ehrmann
  3. Simon Clematide
  4. Sofia Ares Oliveira
  5. Frédéric Kaplan
