CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition
This paper proposes CDistNet, a novel approach aimed at improving scene text recognition in challenging scenarios involving text distortion and complex character layouts. Scene text recognition has long been a major focus within the computer vision community owing to its critical role in numerous vision-related tasks. Despite substantial advances, the irregular nature of real-world text remains a significant hurdle. CDistNet addresses this challenge with a module called Multi-Domain Character Distance Perception (MDCDP), which aligns visual and semantic recognition clues more cohesively.
The foundational architecture of CDistNet builds on the popular Transformer-based encoder-decoder paradigm, which can manage visual and semantic clues concurrently. However, existing approaches often treat the two domains separately or sequentially, leading to character misalignment, the so-called "attention drift," where visual features are not correctly synchronized with the decoded characters, particularly for text with unusual spatial layouts or severe deformations.
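To make the baseline concrete, the sketch below (not the authors' code; all shapes and names are illustrative assumptions) shows the standard Transformer decoder step that this paradigm relies on: embedded characters from the semantic domain query the encoder's visual features via cross-attention. Attention drift occurs when these attention weights land on the wrong image regions for irregular text.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Hypothetical shapes: batch of 2 images, 25 decoded characters, 64 visual tokens.
sem = torch.randn(2, 25, d_model)   # semantic clues: embedded characters
vis = torch.randn(2, 64, d_model)   # visual clues: flattened encoder features

# Semantic queries attend over visual keys/values; attn holds the alignment
# between each character position and the image regions.
out, attn = cross_attn(query=sem, key=vis, value=vis)
print(out.shape, attn.shape)  # torch.Size([2, 25, 512]) torch.Size([2, 25, 64])
```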
MDCDP uses position embeddings as a conduit for synchronizing visual and semantic features through a cross-attention mechanism. By perceiving character spacing, orientation variations, and semantic affinities among characters, it produces a more integrated and contextually aware character representation; the authors term this capability perceiving multi-domain character distance.
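A minimal, hedged sketch of that idea follows. The module and parameter names are assumptions rather than the paper's exact design: a position embedding serves as the query that cross-attends into both the visual and the semantic streams, and the two results are fused into one position-aware character representation (a simple gated linear fusion stands in for the paper's fusion step).

```python
import torch
import torch.nn as nn

class MDCDPSketch(nn.Module):
    """Illustrative stand-in for an MDCDP block, not the authors' implementation."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.pos_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)  # assumed fusion step

    def forward(self, pos, vis, sem):
        # pos: (B, T, D) position embeddings; vis: (B, N, D) visual tokens;
        # sem: (B, T, D) semantic (character) embeddings.
        q, _ = self.pos_self_attn(pos, pos, pos)       # reinforce the position query
        v_feat, _ = self.vis_cross_attn(q, vis, vis)   # position -> visual domain
        s_feat, _ = self.sem_cross_attn(q, sem, sem)   # position -> semantic domain
        # Fuse both domain views into one position-aware representation.
        return self.fuse(torch.cat([v_feat, s_feat], dim=-1))
```

Using the position embedding as the shared query is the key design choice: both domains are forced to report to the same positional reference frame, which is what keeps visual and semantic clues aligned character by character.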
The robustness of CDistNet becomes apparent when it is evaluated across challenging benchmarks, including ten public datasets and newly created augmented datasets capturing varying degrees of text deformation and orientation. The experiments validate CDistNet's performance, showing clear superiority in handling irregular text and outperforming several recent state-of-the-art models.
CDistNet has compelling implications for both the practice and the theory of text recognition in complex scenes. On the practical side, the methodology promises to substantially improve the accuracy of automated optical character recognition (OCR) systems across varied contexts. From a theoretical perspective, CDistNet provides a framework for integrating and exploiting multi-domain information in sequence models more thoroughly, potentially inspiring future research into deeper interactions between visual and semantic cues.
This research work contributes significantly to the field by:
- Proposing a novel module MDCDP that fosters a more comprehensive feature representation through simultaneous cross-attention between visual and semantic domains.
- Developing CDistNet, which stacks multiple MDCDPs to progressively refine feature-character alignment, enhancing robustness across diverse text irregularities (see the sketch after this list).
- Demonstrating pronounced improvements over existing models through rigorous experimental validation, especially in datasets designed to simulate text scenarios with intricate deformations.
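The stacking idea can be sketched as below, reusing the hypothetical MDCDPSketch class from the earlier example; the depth of 3 is an arbitrary choice for illustration, not a figure from the paper. Each stage re-queries both domains with the positions refined by the previous stage, which is how alignment is progressively sharpened.

```python
import torch
import torch.nn as nn

class StackedMDCDP(nn.Module):
    """Illustrative stack of MDCDP blocks; assumes MDCDPSketch from the sketch above."""
    def __init__(self, d_model=512, n_heads=8, depth=3):
        super().__init__()
        self.blocks = nn.ModuleList(MDCDPSketch(d_model, n_heads) for _ in range(depth))

    def forward(self, pos, vis, sem):
        x = pos
        for block in self.blocks:
            # Each stage refines the position-aware representation from the last.
            x = block(x, vis, sem)
        return x  # (B, T, D): progressively aligned character features

pos = torch.randn(2, 25, 512)
vis = torch.randn(2, 64, 512)
sem = torch.randn(2, 25, 512)
print(StackedMDCDP()(pos, vis, sem).shape)  # torch.Size([2, 25, 512])
```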
While CDistNet demonstrates notable advances, it also invites future work on the efficiency and scalability of such methods, particularly for real-time applications. Further exploration might focus on reducing computational demands while maintaining high accuracy, or on extending multi-domain perception to other potentially informative cues such as color and texture. The publicly available code also facilitates community engagement and may catalyze further innovation in the OCR landscape.