Zero-Shot Chinese Character Recognition with Hierarchical Multi-Granularity Image-Text Aligning (2505.24837v1)

Published 30 May 2025 in cs.CV

Abstract: Chinese Character Recognition (CCR) is a fundamental technology for intelligent document processing. Unlike Latin characters, Chinese characters exhibit unique spatial structures and compositional rules, allowing for the use of fine-grained semantic information in representation. However, existing approaches are usually based on auto-regressive as well as edit distance post-process and typically rely on a single-level character representation. In this paper, we propose a Hierarchical Multi-Granularity Image-Text Aligning (Hi-GITA) framework based on a contrastive paradigm. To leverage the abundant fine-grained semantic information of Chinese characters, we propose multi-granularity encoders on both image and text sides. Specifically, the Image Multi-Granularity Encoder extracts hierarchical image representations from character images, capturing semantic cues from localized strokes to holistic structures. The Text Multi-Granularity Encoder extracts stroke and radical sequence representations at different levels of granularity. To better capture the relationships between strokes and radicals, we introduce Multi-Granularity Fusion Modules on the image and text sides, respectively. Furthermore, to effectively bridge the two modalities, we further introduce a Fine-Grained Decoupled Image-Text Contrastive loss, which aligns image and text representations across multiple granularities. Extensive experiments demonstrate that our proposed Hi-GITA significantly outperforms existing zero-shot CCR methods. For instance, it brings about 20% accuracy improvement in handwritten character and radical zero-shot settings. Code and models will be released soon.

Summary

Hierarchical Multi-Granularity Image-Text Aligning: Implications in Zero-Shot Chinese Character Recognition

Chinese Character Recognition (CCR) presents a unique set of challenges, particularly in zero-shot scenarios, due to the intricate structure and large number of symbols in the Chinese writing system. This paper introduces the Hierarchical Multi-Granularity Image-Text Aligning (Hi-GITA) framework as a novel approach to address these challenges, leveraging the ideographic nature of Chinese characters to improve recognition accuracy.

Core Innovations of Hi-GITA

The Hi-GITA framework is built on several key components designed to enhance the alignment between visual and textual representations of Chinese characters:

Multi-Granularity Representation Learning: The framework employs separate encoders for image and text to extract hierarchical representations at varying semantic levels. On the image side, the model learns stroke, radical, and structure representations progressively, while the text side encodes stroke and radical sequences into distinct representations. The novelty lies in the integration of these hierarchical, multi-granular information layers, which provide a richer and more nuanced understanding of character components.
Multi-Granularity Fusion Modules: The Image Multi-Granularity Fusion Module (MGFM-I) and the Text Multi-Granularity Fusion Module (MGFM-T) are designed to facilitate mutual refinement among different granularities. These modules incorporate cross-attention mechanisms allowing stroke-level details to inform radical-level representations and vice versa, ensuring robust component interaction and feature enhancement.
Fine-Grained Decoupled Image-Text Contrastive Loss: Aligning at multiple semantic levels, this loss function performs component-level matching between image and text representations. It decouples detailed description components from structural components within text sequences, leveraging their respective contributions to refine cross-modal alignment.
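
The fusion and alignment ideas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, tensor shapes, mean pooling, the temperature of 0.07, and the absence of learned projections are all assumptions made for illustration, and the paper's decoupling of descriptive and structural text components is not reproduced. The sketch only shows stroke and radical tokens refining each other through cross-attention, followed by a symmetric contrastive loss summed over granularities.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def cross_attention(queries, context):
    # Single-head cross-attention with a residual connection.
    # queries: (B, Nq, D), context: (B, Nc, D). No learned projections (assumption).
    scale = np.sqrt(queries.shape[-1])
    scores = queries @ context.transpose(0, 2, 1) / scale  # (B, Nq, Nc)
    weights = np.exp(log_softmax(scores, axis=-1))
    return queries + weights @ context

def fuse(stroke_tokens, radical_tokens):
    # Mutual refinement: each granularity attends to the other.
    new_stroke = cross_attention(stroke_tokens, radical_tokens)
    new_radical = cross_attention(radical_tokens, stroke_tokens)
    return new_stroke, new_radical

def info_nce(img, txt, temperature=0.07):
    # Symmetric image-text contrastive loss; matched pairs share a batch index.
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B)
    diag = np.arange(len(img))
    i2t = log_softmax(logits, axis=1)[diag, diag].mean()
    t2i = log_softmax(logits, axis=0)[diag, diag].mean()
    return -0.5 * (i2t + t2i)

def multi_granularity_loss(img_feats, txt_feats):
    # One contrastive term per granularity (hypothetical keys: "stroke", "radical").
    return sum(info_nce(img_feats[k], txt_feats[k]) for k in img_feats)

rng = np.random.default_rng(0)
B, D = 8, 16
img_stroke, img_radical = fuse(rng.normal(size=(B, 12, D)),
                               rng.normal(size=(B, 4, D)))
img_feats = {"stroke": img_stroke.mean(axis=1),
             "radical": img_radical.mean(axis=1)}

# Perfectly aligned text features versus unrelated random ones.
aligned = multi_granularity_loss(img_feats, img_feats)
unrelated = multi_granularity_loss(
    img_feats, {k: rng.normal(size=(B, D)) for k in img_feats})
print(aligned < unrelated)
```

The loss is low when each image representation matches its own text representation at every granularity, and high when the pairs are unrelated, which is the behavior a contrastive objective of this kind is meant to reward.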

Performance Metrics and Methodological Advancements

Hi-GITA is reported to significantly outperform existing zero-shot CCR methods. According to the abstract, it improves accuracy by about 20% in the handwritten character and radical zero-shot settings. This gain supports the use of hierarchical multi-granularity representations in CCR, particularly for characters that were not seen during training.
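
To make the zero-shot setting concrete, the following toy sketch shows how a contrastive model can recognize a character by matching an image embedding against text embeddings built from component sequences. Everything here is an assumption for illustration: the `encode_text` function is a stand-in for a trained text encoder, the image embedding is simulated by adding noise to the matching text embedding, and the small `IDS` table is a hand-written set of ideographic description sequences. Because a character's text embedding is computed from its components, a character absent from training can still appear in the candidate set.

```python
import numpy as np

# Ideographic description sequences: a layout symbol followed by components.
IDS = {
    "好": ["⿰", "女", "子"],
    "明": ["⿰", "日", "月"],
    "林": ["⿰", "木", "木"],
    "休": ["⿰", "亻", "木"],
    "杏": ["⿱", "木", "口"],
}

rng = np.random.default_rng(0)
DIM = 32
vocab = sorted({tok for seq in IDS.values() for tok in seq})
token_emb = {tok: rng.normal(size=DIM) for tok in vocab}
pos_emb = rng.normal(size=(3, DIM))

def encode_text(seq):
    # Toy text encoder: position-modulated sum of component embeddings.
    vec = sum(token_emb[tok] * pos_emb[i] for i, tok in enumerate(seq))
    return vec / np.linalg.norm(vec)

def simulate_image_embedding(char, noise=0.02):
    # Stand-in for an image encoder that was trained to align with the text side.
    return encode_text(IDS[char]) + noise * rng.normal(size=DIM)

def recognize(image_vec, candidates):
    # Pick the candidate whose text embedding is most similar to the image.
    text_matrix = np.stack([encode_text(IDS[c]) for c in candidates])
    scores = text_matrix @ image_vec
    return candidates[int(np.argmax(scores))]

candidates = list(IDS)
predictions = {c: recognize(simulate_image_embedding(c), candidates)
               for c in candidates}
print(predictions)
```

Recognition here is a single similarity lookup over the candidate set, in contrast to the auto-regressive decoding with edit-distance post-processing that the abstract attributes to earlier approaches.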

Implications for AI Development

The implications of Hi-GITA's methodological advancements are manifold. At a theoretical level, it underscores the potential of hierarchical representation learning in enhancing the robustness and accuracy of neural networks in complex linguistic tasks. Practically, its successful application in CCR suggests that similar approaches could be employed in other domain-specific recognition tasks, benefiting fields such as historical document analysis and document intelligence.

Future Perspectives

This research opens several avenues for exploration. One promising direction is extending the Hi-GITA framework beyond Chinese characters to other East Asian scripts, such as Japanese and Korean. Recognition of ancient scripts, which pose their own challenges, could also benefit from the principles laid out here. Finally, modeling tree-structured character decomposition directly within the text encoder could further improve efficiency and recognition accuracy.

In conclusion, the paper presents a substantial step forward for zero-shot CCR, and its multi-granularity image-text alignment may carry over to other multimodal alignment problems.