Watermark Text Pattern Spotting in Document Images (2401.05167v2)
Abstract: Watermark text spotting in document images can offer access to an often unexplored source of information, providing crucial evidence about a record's scope, audience, and sometimes even authenticity. Stemming from the problem of text spotting, detecting and understanding watermarks in documents inherits the same hardships: in the wild, writing can come in various fonts, sizes, and forms, making generic recognition a very difficult problem. To address the lack of resources in this field and propel further research, we propose a novel benchmark (K-Watermark) containing 65,447 data samples generated using Wrender, a procedure for rendering watermark text patterns. A validity study using human raters yields an authenticity score of 0.51 against pre-generated watermarked documents. To prove the usefulness of the dataset and rendering technique, we developed an end-to-end solution (Wextract) for detecting the bounding box instances of watermark text while predicting the depicted text. To deal with this specific task, we introduce a variance minimization loss and a hierarchical self-attention mechanism. To the best of our knowledge, we are the first to propose an evaluation benchmark and a complete solution for retrieving watermarks from documents, surpassing baselines by 5 AP points in detection and 4 points in character accuracy.