
EATEN: Entity-aware Attention for Single Shot Visual Text Extraction (1909.09380v1)

Published 20 Sep 2019 in cs.CV

Abstract: Extracting entity from images is a crucial part of many OCR applications, such as entity recognition of cards, invoices, and receipts. Most of the existing works employ classical detection and recognition paradigm. This paper proposes an Entity-aware Attention Text Extraction Network called EATEN, which is an end-to-end trainable system to extract the entities without any post-processing. In the proposed framework, each entity is parsed by its corresponding entity-aware decoder, respectively. Moreover, we innovatively introduce a state transition mechanism which further improves the robustness of entity extraction. In consideration of the absence of public benchmarks, we construct a dataset of almost 0.6 million images in three real-world scenarios (train ticket, passport and business card), which is publicly available at https://github.com/beacandler/EATEN. To the best of our knowledge, EATEN is the first single shot method to extract entities from images. Extensive experiments on these benchmarks demonstrate the state-of-the-art performance of EATEN.

Citations (43)

Summary

  • The paper introduces an end-to-end framework that integrates entity-aware attention to directly extract text entities without manual post-processing.
  • It employs multiple specialized decoders with a state transition mechanism, enhancing accuracy by leveraging contextual relationships among text entities.
  • Evaluations show significant performance gains, achieving a mEA of 95.8% on train tickets and robust precision on mixed-language business cards.

Entity-aware Attention for Single Shot Visual Text Extraction: A Review

The paper "EATEN: Entity-aware Attention for Single Shot Visual Text Extraction" introduces an innovative approach to optical character recognition (OCR) focused on extracting specific entities of interest from images. Common applications of this technology include the recognition of data fields in images of passports, business cards, and train tickets. The authors propose an end-to-end trainable system named EATEN, which they present as the first single-shot method for visual text extraction.

The proposed framework advances beyond traditional OCR methodologies that typically require extensive handcrafted post-processing rules following character recognition. EATEN integrates an entity-aware attention mechanism directly into the recognition process, allowing it to simultaneously identify and associate text with predefined entities such as "Country" or "Name" without requiring additional manipulations after initial detection.

Technical Contributions

The EATEN framework is built on several key innovations:

  1. End-to-end Trainable System: Unlike conventional techniques that separate detection and recognition phases with manual rules, EATEN combines these steps into a cohesive process that learns to extract specific entities in one operation.
  2. Entity-aware Attention Network: The solution introduces multiple specialized decoders, each responsible for parsing distinct text entities from images. Such decoders incorporate an entity-aware attention mechanism, facilitating precise and efficient extraction of relevant text fields.
  3. State Transition among Decoders: To bolster the robustness of extraction, EATEN employs a state transition mechanism allowing information derived from one decoder to influence subsequent decoding phases. This introduces context between contiguous text entities, improving accuracy.
  4. Real-world Scenario Dataset: In recognition of the scarcity of publicly available datasets for specific text entity extraction, the authors created a substantial compilation of both real and synthetic images representing train tickets, passports, and business cards. The collection entails approximately 0.6 million images, markedly enhancing the dataset landscape for OCR research.
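The interplay between the per-entity decoders and the state transition mechanism can be sketched roughly as follows. This is an illustrative toy, not the paper's architecture: the dot-product attention, the `tanh` update, and all dimensions are assumptions made for the sake of a runnable example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    """Toy dot-product attention: weight feature-map positions by
    similarity to the decoder state and return a context vector."""
    scores = features @ query          # (T,) one score per position
    alpha = softmax(scores)            # attention weights
    return features.T @ alpha          # (d,) context vector

def decode_entities(features, W_per_entity, h0, steps=3):
    """Run one toy decoder per entity. The final hidden state of each
    decoder initializes the next one -- a crude stand-in for EATEN's
    state transition mechanism, which passes context between entities."""
    h = h0
    outputs = []
    for W in W_per_entity:             # one weight matrix per entity decoder
        tokens = []
        for _ in range(steps):
            ctx = attend(h, features)  # entity-aware attention over features
            h = np.tanh(W @ ctx)       # hypothetical recurrent update
            tokens.append(int(np.argmax(h)))  # toy "character" prediction
        outputs.append(tokens)
    return outputs
```

In the actual system the decoders are recurrent attention decoders operating on CNN feature maps; the sketch only illustrates the structural idea that each entity has its own decoder and that one decoder's final state seeds the next.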

Evaluations and Results

Extensive experimentation on the constructed dataset shows that EATEN substantially outperforms existing approaches. The framework achieves notable mean entity accuracy (mEA) improvements across test cases, demonstrating its efficacy in both fixed-layout scenarios (train tickets, passports) and variable-layout scenarios (business cards). Specific results include an mEA of 95.8% on train tickets and a mean entity precision (mEP) of 90.0% on business cards, underscoring its ability to handle complex entity extraction in mixed-language contexts spanning Chinese and English.
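Under a common reading of these metrics (exact string match per entity, averaged per image for mEA; precision over the entities the model actually emits for mEP), the two scores might be computed as sketched below. The exact definitions used in the paper may differ in detail.

```python
def mean_entity_accuracy(preds, gts):
    """mEA sketch: per-image fraction of ground-truth entities whose
    predicted string matches exactly, averaged over the dataset."""
    scores = []
    for pred, gt in zip(preds, gts):
        correct = sum(1 for k, v in gt.items() if pred.get(k) == v)
        scores.append(correct / len(gt))
    return sum(scores) / len(scores)

def mean_entity_precision(preds, gts):
    """mEP sketch: among entities the model emits, the fraction
    that exactly match the ground truth."""
    correct = total = 0
    for pred, gt in zip(preds, gts):
        for k, v in pred.items():
            total += 1
            correct += (gt.get(k) == v)
    return correct / total if total else 0.0
```

For example, a prediction that gets one of two entities right on each of two images would score an mEA of 0.5, while mEP only penalizes entities the model chose to output.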

The EATEN framework displays substantial computational efficiency gains, significantly reducing processing time when compared to traditional OCR methods. This acceleration is attributed to its streamlined design that circumvents the need for separate detection and recognition stages, enhancing real-time extractive capabilities in practical applications.

Implications and Future Prospects

The introduction of EATEN presents significant implications for both academic and commercial deployment of OCR technologies. By eliminating rigid template dependencies and manual post-processing requirements, this method provides a scalable solution adaptable to diverse real-world document types. As digital content continues to proliferate, such adaptive tools are crucial for enabling seamless automated processing across industries.

Potential future developments may include extensions to further expand the robustness of EATEN under more severe occlusion and noise conditions. Furthermore, integration with advanced neural architectures and exploration of multimodal extraction, where text is correlated with accompanying imagery or audio data, represent promising avenues for broadening applicability and advancing the state of the art in entity-aware text recognition.

In conclusion, EATEN delineates a significant stride in improving the accuracy, speed, and adaptability of text entity extraction from images, encouraging further exploration and refinement to address emerging challenges in the computer vision and natural language processing domains.
