
An Augmentation Strategy for Visually Rich Documents (2212.10047v2)

Published 20 Dec 2022 in cs.CL

Abstract: Many business workflows require extracting important fields from form-like documents (e.g. bank statements, bills of lading, purchase orders). Recent techniques for automating this task work well only when trained with large datasets. In this work, we propose a novel data augmentation technique to improve performance when training data is scarce, e.g. 10-250 documents. Our technique, which we call FieldSwap, works by swapping out the key phrases of a source field with the key phrases of a target field to generate new synthetic examples of the target field for use in training. We demonstrate that this approach can yield 1-7 F1 point improvements in extraction performance.
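
To make the swapping idea in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Example` record, the `field_swap` function, the field names, and the plain-string representation of a document snippet are all assumptions made for illustration. The paper works with visually rich documents, whereas this sketch ignores layout entirely and shows only the text-level substitution and relabeling.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical training example: a text snippet with one labeled field value."""
    text: str   # snippet containing a key phrase and a field value
    field: str  # name of the field the value is labeled with
    value: str  # the labeled field value


def field_swap(example, source_phrases, target_phrases, target_field):
    """Create synthetic examples of `target_field` from a source-field example.

    Each source key phrase found in the text is replaced by every target key
    phrase, and the result is relabeled as the target field. The field value
    itself is left unchanged.
    """
    synthetic = []
    for src in source_phrases:
        pattern = re.compile(re.escape(src), flags=re.IGNORECASE)
        if not pattern.search(example.text):
            continue  # this key phrase does not occur in the snippet
        for tgt in target_phrases:
            # A function replacement avoids backslash handling in `tgt`.
            new_text = pattern.sub(lambda _match, t=tgt: t, example.text)
            synthetic.append(Example(new_text, target_field, example.value))
    return synthetic


# Illustrative usage with made-up field names and key phrases.
source = Example("Amount Due: $120.00", "amount_due", "$120.00")
for item in field_swap(source, ["Amount Due"], ["Total", "Balance"], "total_amount"):
    print(item)
```

In this sketch, one source example with one matching key phrase and two target key phrases yields two synthetic examples of the target field. Which source and target field pairs are sensible to swap, and how key phrases are identified, are left open here.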

Authors (5)
  1. Jing Xie (17 papers)
  2. James B. Wendt (4 papers)
  3. Yichao Zhou (33 papers)
  4. Seth Ebner (9 papers)
  5. Sandeep Tata (14 papers)
