FATURA: Synthetic Multi-Layout Invoice Dataset
- FATURA is a large-scale, synthetic invoice image dataset featuring 10,000 images with 24 annotated semantic classes across 50 distinct layouts.
- The dataset uses a combined manual and synthetic annotation process to ensure realistic variability, precise bounding boxes, and strict privacy.
- FATURA supports robust evaluation via intra- and inter-template splits, enabling advanced OCR, key-value extraction, and multimodal document understanding.
FATURA is a large-scale, synthetic, multi-layout invoice image dataset developed for advanced document analysis and understanding, particularly in scenarios where both layout variability and privacy are central concerns. It comprises 10,000 synthetically generated invoice images annotated across 24 semantic classes and spanning 50 distinct layouts, and its authors present it as the largest openly accessible invoice image dataset available to the research community at the time of publication (Limam et al., 2023). FATURA's design enables research not only in optical character recognition (OCR) but also in key-value extraction, layout analysis, and multimodal document understanding, with support for benchmarking under both intra- and inter-layout generalization regimes.
1. Dataset Composition and Generation Process
FATURA consists of exactly 10,000 invoice images in JPEG format, distributed over 50 base templates. Each template incorporates substantial diversity in visual presentation via variations in font styles, text positioning, and graphical elements. Within each template, individual images are further diversified by position-randomization and by replacing textual content with randomized yet semantically plausible fields (e.g., names, totals, dates, addresses).
The dataset generation process combines manual annotation and synthetic augmentation. Genuine invoice scans are first annotated using the VGG Image Annotator to establish bounding boxes for salient invoice regions (e.g., logos, totals, tables). These annotated layouts are subsequently repopulated with synthetic content for each image: text is randomized from curated repositories, and logos are generated using a pre-trained text-to-image latent diffusion model. This combination replicates real-world heterogeneity encountered in invoice documents while assuring that no sensitive user data or proprietary logos are present.
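The repopulation step described above can be illustrated with a minimal sketch. It is not the authors' generator: the template coordinates, the content pools, and the names `TEMPLATE`, `ADDRESSES`, `random_value`, and `populate` are all hypothetical, and logo synthesis with a diffusion model is left out entirely. The sketch only shows the general idea of jittering annotated regions and filling them with randomized but plausible text.

```python
import random

# Hypothetical template: class label -> (x, y, width, height) in pixels.
# In the real pipeline these regions come from manually annotated invoices.
TEMPLATE = {
    "DATE": (420, 60, 140, 24),
    "TOTAL": (420, 700, 140, 28),
    "SELLER ADDRESS": (40, 160, 260, 60),
}

# Hypothetical content pool standing in for the curated text repositories.
ADDRESSES = ["12 Harbor Road, Springfield", "48 Mill Lane, Riverton"]


def random_value(label, rng):
    """Return a plausible random string for the given field label."""
    if label == "DATE":
        return f"{rng.randint(1, 28):02d}/{rng.randint(1, 12):02d}/{rng.randint(2015, 2023)}"
    if label == "TOTAL":
        return f"{rng.uniform(10, 5000):.2f}"
    return rng.choice(ADDRESSES)


def populate(template, seed, max_shift=5):
    """Create one synthetic invoice description from a template.

    Each region is shifted by up to max_shift pixels (position randomization)
    and filled with freshly sampled text (content randomization).
    """
    rng = random.Random(seed)
    fields = []
    for label, (x, y, w, h) in template.items():
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        fields.append({
            "label": label,
            "bbox": [x + dx, y + dy, w, h],
            "text": random_value(label, rng),
        })
    return fields
```

Calling `populate(TEMPLATE, seed=k)` for different values of `k` yields distinct invoices that share one layout, which mirrors how many images can be derived from a single base template. Because the bounding boxes are produced together with the content, the annotations stay aligned with the rendered fields by construction.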
2. Annotation Structure and Supported Formats
For each invoice image, an accompanying JSON file encodes exhaustive per-instance annotation. The annotation schema spans 24 classes, such as ‘TABLE’, ‘LOGO’, ‘DATE’, ‘NUMBER’, ‘SELLER ADDRESS’, ‘TOTAL’, and additional invoice-relevant entities. Bounding boxes for each instance are tightly fit to relevant regions and contain explicit class labels.
The dataset is distributed in three annotation formats:
- COCO format: Standardized for object detection and image segmentation benchmarks.
- HuggingFace Transformers-compatible format: Tuned for direct integration with document understanding architectures such as LayoutLMv3.
- Custom standard format: Flexible for bespoke research pipelines.
The annotation process ensures that each file can be immediately adopted for region-level detection, entity extraction, or key-value matching tasks with minimal reformatting. Figures in the source manuscript illustrate the visual and semantic richness of the annotations, while the dataset documentation details the frequency and definition of each class.
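To make the relationship between per-image annotations and the COCO layout concrete, the following sketch converts a simple region list into a COCO-style dictionary. The input record structure (`file_name`, `width`, `height`, `regions` with corner-format boxes) is an assumption made for illustration and is not FATURA's actual JSON schema; the function name `to_coco` is likewise hypothetical. The released dataset already ships COCO files, so this only shows what the mapping involves.

```python
def to_coco(records, class_names):
    """Convert hypothetical per-image records into a COCO-style dictionary.

    records: list of dicts with keys "file_name", "width", "height", and
             "regions", where each region has a "label" and a "bbox" given
             as [x_min, y_min, x_max, y_max] (an assumed input layout).
    class_names: ordered list of class labels; category ids start at 1.
    """
    cat_ids = {name: i + 1 for i, name in enumerate(class_names)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    ann_id = 1
    for image_id, rec in enumerate(records, start=1):
        coco["images"].append({
            "id": image_id,
            "file_name": rec["file_name"],
            "width": rec["width"],
            "height": rec["height"],
        })
        for region in rec["regions"]:
            x0, y0, x1, y1 = region["bbox"]
            w, h = x1 - x0, y1 - y0
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": cat_ids[region["label"]],
                "bbox": [x0, y0, w, h],  # COCO stores [x, y, width, height]
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco
```

The main point of care is the box convention: COCO expects top-left corner plus width and height, so corner-format boxes must be converted rather than copied.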
3. Addressing Challenges in Document Diversity and Privacy
Invoice documents are characterized by high variability in structure—differing tabular layouts, key-value organization, embedded graphics, and nonuniform field positioning. Existing datasets commonly demonstrate limited template diversity and often entail privacy risks due to their dependence on authentic business documents.
FATURA systematically confronts these challenges:
- Layout variance is realized via 50 distinct templates, each annotated and procedurally re-populated to produce a wide distribution of invoice images.
- The replacement of all original content with randomized, realistic data ensures that confidential financial information never enters the dataset.
- Logos and other emblematic elements are generated entirely synthetically.
- A rigorous annotation protocol delivers bounding box precision suitable for downstream document analysis tasks, supporting both traditional and multimodal models.
The design ensures that FATURA is privacy-friendly without compromising on utility for tasks requiring real-world diversity of invoice appearance and structure.
4. Evaluation Strategies and Benchmarking Paradigms
To rigorously assess model generalization, FATURA supports two evaluation protocols, each targeting a different aspect of layout variability:
- Intra-template split: Each template is partitioned into training, validation, and test splits. Here, model evaluation examines robustness to content but within familiar structural layouts.
- Inter-template split: Entire templates are reserved exclusively for validation and test splits, such that models are challenged to perform on layouts entirely unseen during training.
These protocols collectively probe model performance on both seen and novel layouts, crucial for robust real-world invoice automation. The dataset documentation includes benchmarking baselines and figures that chart class frequency distributions and support evaluation of class imbalance effects.
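The difference between the two protocols can be sketched in a few lines. The code below assumes each image is described by an `(image_id, template_id)` pair. The 80/10/10 ratios, the number of held-out templates, and the function names are illustrative assumptions, not the official FATURA splits.

```python
import random
from collections import defaultdict


def intra_template_split(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split the images of every template into train/val/test.

    All templates appear in all three splits, so the test set probes
    robustness to new content on familiar layouts.
    """
    rng = random.Random(seed)
    by_template = defaultdict(list)
    for image_id, template_id in items:
        by_template[template_id].append(image_id)

    splits = {"train": [], "val": [], "test": []}
    for template_id in sorted(by_template):
        ids = sorted(by_template[template_id])
        rng.shuffle(ids)
        n_train = int(len(ids) * ratios[0])
        n_val = int(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits


def inter_template_split(items, n_val_templates, n_test_templates, seed=0):
    """Hold out whole templates for validation and test.

    No layout in val or test is seen during training, so the test set
    probes generalization to unseen layouts.
    """
    rng = random.Random(seed)
    templates = sorted({template_id for _, template_id in items})
    rng.shuffle(templates)
    val_t = set(templates[:n_val_templates])
    test_t = set(templates[n_val_templates:n_val_templates + n_test_templates])

    splits = {"train": [], "val": [], "test": []}
    for image_id, template_id in items:
        if template_id in val_t:
            splits["val"].append(image_id)
        elif template_id in test_t:
            splits["test"].append(image_id)
        else:
            splits["train"].append(image_id)
    return splits
```

Grouping by template before splitting is the key design choice in both functions: a plain random split over all images would mix layouts across train and test and could only approximate the intra-template regime.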
5. Applications in Document Analysis and Automation
FATURA provides an essential benchmark for several advanced document understanding tasks:
- Document Layout Analysis: Segmentation of images into semantically meaningful regions (e.g., headers, tables, logos).
- Key-Value Extraction: Reliable annotation of fields such as total, date, or invoice number enables model training and evaluation for high-precision information extraction pipelines.
- Multimodal Document Understanding: The multi-format annotation design allows integration with modern transformer-based architectures (e.g., LayoutLMv3), enabling joint reasoning over text, layout, and vision signals.
- Invoice Automation: The diversity of templates and content enhances the ecological validity of research for downstream applications, such as expense management, financial auditing, and robotic process automation.
A plausible implication is that models trained on FATURA are more likely to generalize across the diverse invoice documents encountered in industry applications, given its broad coverage of layout and content variance. This remains to be confirmed on real invoices, since the dataset itself is synthetic.
6. Privacy Safeguards and Dataset Availability
The synthetic nature of FATURA’s textual and graphical content is central to its privacy guarantees. The generative process replaces all original sensitive information, so the dataset contains neither proprietary business data nor real-user identifiers. This distinguishes FATURA from datasets constructed solely from real invoices and supports compliance with privacy regulations and ethical standards for open-access research.
The dataset is openly and freely available to researchers at https://zenodo.org/record/8261508, accompanied by comprehensive documentation including annotation format specifications, evaluation scripts, and sample visualization tools.
7. Summary and Significance
FATURA represents a substantive advance in resources for machine learning-driven document analysis and understanding, specifically within the domain of financial document processing. Its methodological foundations (multi-template diversity, rich annotations, privacy-aware synthesis, and support for robust generalization evaluation) establish a strong basis for benchmarking algorithms tasked with OCR, key-value identification, and layout parsing. The resource directly addresses long-standing limitations of prior datasets and underpins future research in end-to-end document automation, hybrid vision-language modeling, and real-world invoice understanding (Limam et al., 2023).