
XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding (2203.06947v2)

Published 14 Mar 2022 in cs.CV and cs.CL

Abstract: Recently, various multimodal networks for Visually-Rich Document Understanding (VRDU) have been proposed, showing that transformers benefit from integrating visual and layout information with text embeddings. However, most existing approaches rely on position embeddings to incorporate sequence information and overlook the noisy, improper reading orders produced by OCR tools. In this paper, we propose a robust layout-aware multimodal network named XYLayoutLM to capture and leverage rich layout information from proper reading orders produced by our Augmented XY Cut. Moreover, a Dilated Conditional Position Encoding module is proposed to handle input sequences of variable length; it additionally extracts local layout information from both textual and visual modalities while generating position embeddings. Experimental results show that XYLayoutLM achieves competitive results on document understanding tasks.
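
To illustrate what a "proper reading order" means for OCR boxes, here is a minimal sketch of the classic recursive XY cut, which splits a page at empty horizontal and vertical gaps. It is not the paper's Augmented XY Cut, whose details the abstract does not give. The names `xy_cut_order` and `_split_by_gaps`, the `(x0, y0, x1, y1)` box format, and the choice to try horizontal cuts first are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x0, y0, x1, y1), origin at the top-left of the page.
Box = Tuple[float, float, float, float]


def _split_by_gaps(indices: Sequence[int], boxes: Sequence[Box], axis: int) -> List[List[int]]:
    """Group box indices into runs separated by empty gaps along one axis.

    axis=0 projects onto x (vertical cuts), axis=1 projects onto y (horizontal cuts).
    """
    lo, hi = axis, axis + 2
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][hi]  # farthest extent covered by the current group
    for i in ordered[1:]:
        if boxes[i][lo] > reach:   # empty gap found: start a new group
            groups.append([i])
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][hi])
    return groups


def xy_cut_order(boxes: Sequence[Box]) -> List[int]:
    """Return box indices in an estimated reading order (classic XY cut sketch)."""

    def recurse(indices: List[int], axis: int, other_axis_failed: bool) -> List[int]:
        if len(indices) <= 1:
            return list(indices)
        groups = _split_by_gaps(indices, boxes, axis)
        if len(groups) == 1:
            if other_axis_failed:
                # No clean cut in either direction: fall back to top-to-bottom,
                # then left-to-right ordering.
                return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
            return recurse(indices, 1 - axis, True)
        ordered: List[int] = []
        for group in groups:
            ordered.extend(recurse(group, 1 - axis, False))
        return ordered

    # Try horizontal cuts (axis=1) first, then alternate.
    return recurse(list(range(len(boxes))), 1, False)


if __name__ == "__main__":
    # A full-width title above two columns, given in shuffled order.
    page = [
        (55, 52, 100, 65),  # 0: right column, second block
        (0, 20, 45, 40),    # 1: left column, first block
        (0, 0, 100, 10),    # 2: title
        (55, 20, 100, 50),  # 3: right column, first block
        (0, 45, 45, 65),    # 4: left column, second block
    ]
    print(xy_cut_order(page))
```

In this toy layout the title is read first, then the left column top to bottom, then the right column. Real OCR output is noisier (skewed or overlapping boxes can hide the gaps), which is presumably the kind of case an augmented variant would need to handle.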

Authors (7)
  1. Zhangxuan Gu (17 papers)
  2. Changhua Meng (27 papers)
  3. Ke Wang (531 papers)
  4. Jun Lan (30 papers)
  5. Weiqiang Wang (171 papers)
  6. Ming Gu (39 papers)
  7. Liqing Zhang (80 papers)
Citations (66)