Can AI Models Appreciate Document Aesthetics? An Exploration of Legibility and Layout Quality in Relation to Prediction Confidence (2403.18183v1)
Abstract: A well-designed document communicates not only through its words but also through its visual eloquence. Authors utilize aesthetic elements such as colors, fonts, graphics, and layouts to shape the perception of information. Thoughtful document design, informed by psychological insights, enhances both the visual appeal and the comprehension of the content. While state-of-the-art document AI models demonstrate the benefits of incorporating layout and image data, it remains unclear whether they effectively capture the nuances of document aesthetics. To bridge the gap between human cognition and AI interpretation of aesthetic elements, we formulated hypotheses about AI behavior in document understanding tasks, anchored in document design principles. Focusing on legibility and layout quality, we used correlational analysis to test the effects of four aesthetic aspects (noise, font-size contrast, alignment, and complexity) on model confidence. The results and observations highlight the value of model analysis rooted in document design theories. Our work serves as a trailhead for further studies, and we advocate for continued research on this topic to deepen our understanding of how AI interprets document aesthetics.
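To make the idea of correlating an aesthetic factor with model confidence concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say which correlation coefficient or confidence measure was used, so the Spearman rank correlation, the maximum softmax probability as "confidence", and all sample values below are assumptions chosen purely for illustration.

```python
import math


def max_softmax_confidence(logits):
    """Confidence proxy (assumption): the largest softmax probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one noise score and one logit vector per document.
noise_scores = [0.0, 0.1, 0.2, 0.3, 0.4]
doc_logits = [
    [4.0, 0.5, 0.1],
    [3.0, 0.8, 0.2],
    [2.5, 1.0, 0.4],
    [1.5, 1.2, 0.9],
    [1.1, 1.0, 0.9],
]

confidences = [max_softmax_confidence(l) for l in doc_logits]
rho = spearman(noise_scores, confidences)
print(f"Spearman correlation between noise and confidence: {rho:.2f}")
```

A rank-based coefficient is used here only because it does not assume a linear relationship between the aesthetic score and confidence; any other measure of association could be substituted.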
Authors: Hsiu-Wei Yang, Abhinav Agrawal, Pavlos Fragkogiannis, Shubham Nitin Mulay