Glyce: Glyph-vectors for Chinese Character Representations

Published 29 Jan 2019 in cs.CL, cs.AI, and cs.CV | (1901.10125v5)

Abstract: It is intuitive that NLP tasks for logographic languages like Chinese should benefit from the use of the glyph information in those languages. However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found. In this paper, we address this gap by presenting Glyce, the glyph-vectors for Chinese character representations. We make three major innovations: (1) We use historical Chinese scripts (e.g., bronzeware script, seal script, traditional Chinese, etc.) to enrich the pictographic evidence in characters; (2) We design CNN structures (called tianzige-CNN) tailored to Chinese character image processing; and (3) We use image-classification as an auxiliary task in a multi-task learning setup to increase the model's ability to generalize. We show that glyph-based models are able to consistently outperform word/char ID-based models in a wide range of Chinese NLP tasks. We are able to set new state-of-the-art results for a variety of Chinese NLP tasks, including tagging (NER, CWS, POS), sentence pair classification, single sentence classification tasks, dependency parsing, and semantic role labeling. For example, the proposed model achieves an F1 score of 80.6 on the OntoNotes dataset of NER, +1.5 over BERT; it achieves an almost perfect accuracy of 99.8% on the Fudan corpus for text classification. Code is available at https://github.com/ShannonAI/glyce.




Summary

The paper presents a novel method that leverages historical Chinese scripts to enhance glyph-based character representations.
It introduces the Tianzige-CNN architecture and an auxiliary image-classification objective to improve model generalization.
Experimental results show gains on NER, text classification, and other Chinese NLP tasks, including an F1 score of 80.6 on OntoNotes NER (+1.5 over BERT).

Glyce: Glyph-vectors for Chinese Character Representations

The paper "Glyce: Glyph-vectors for Chinese Character Representations" presents an approach to incorporating glyph information into NLP models for Chinese, a logographic language. Earlier attempts to exploit glyphs had limited success for two reasons: standard computer vision models generalize poorly on character images, and contemporary Chinese scripts retain little pictographic evidence.

Key Contributions

Historical Scripts for Enhanced Pictographic Evidence: The authors draw on a variety of historical Chinese scripts, including bronzeware, seal, clerical, and cursive scripts, alongside traditional and simplified Chinese. The older scripts preserve pictographic detail that contemporary characters have lost, which gives the model richer visual evidence for each character.
Tianzige-CNN Architecture: The paper introduces the Tianzige-CNN, a convolutional neural network tailored specifically for Chinese character images. This architecture addresses the distinct challenges posed by the smaller size and limited variety of Chinese character images, enhancing local feature capture and preventing overfitting.
Auxiliary Image-Classification Objective: An image-classification task is added as a regularizer in the multi-task learning setup. Training on this auxiliary task alongside the main NLP task improves the model's ability to generalize.
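
The last two contributions can be illustrated with a minimal sketch. The summary gives neither layer sizes nor the way the two losses are combined, so everything numeric here is an assumption for illustration: a 12x12 input image, a 5x5 convolution, a 4x4 max-pool that leaves a 2x2 "tianzige" grid, and a convex combination of the task loss and the image-classification loss whose auxiliary weight decays exponentially with the epoch. The function names (`tianzige_shapes`, `aux_weight`, `combined_loss`) are hypothetical and are not taken from the Glyce codebase.

```python
from typing import List


def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def tianzige_shapes(image_size: int = 12, conv_kernel: int = 5,
                    pool_kernel: int = 4) -> List[int]:
    """Trace the spatial size through conv -> max-pool -> 2x2 conv.

    All default sizes are illustrative assumptions, not values from the summary.
    """
    after_conv = conv_out(image_size, conv_kernel)                      # 12 -> 8
    after_pool = conv_out(after_conv, pool_kernel, stride=pool_kernel)  # 8 -> 2
    after_final = conv_out(after_pool, 2)                               # 2 -> 1
    return [image_size, after_conv, after_pool, after_final]


def aux_weight(epoch: int, start: float = 0.1, decay: float = 0.8) -> float:
    """Weight of the auxiliary loss, decaying exponentially (assumed schedule)."""
    return start * decay ** epoch


def combined_loss(task_loss: float, cls_loss: float, epoch: int,
                  start: float = 0.1, decay: float = 0.8) -> float:
    """Convex combination of the main task loss and the auxiliary loss."""
    lam = aux_weight(epoch, start, decay)
    return (1.0 - lam) * task_loss + lam * cls_loss


if __name__ == "__main__":
    print(tianzige_shapes())  # [12, 8, 2, 1]
    for epoch in range(3):
        print(epoch, combined_loss(task_loss=2.0, cls_loss=4.0, epoch=epoch))
```

Under these assumptions the pooled feature map is a 2x2 grid, mirroring the four-quadrant tianzige practice grid, and the auxiliary objective acts mostly as an early-training regularizer because its weight shrinks every epoch.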

Experimental Results and Implications

The Glyce model demonstrates significant improvements across a spectrum of Chinese NLP tasks:

Named Entity Recognition (NER): On the OntoNotes dataset, Glyce surpasses BERT, achieving an F1 score of 80.6, marking a +1.5 improvement.
Text Classification: An accuracy of 99.8% is reported on the Fudan corpus for text classification tasks.
Word Segmentation and POS Tagging: Glyce consistently outperforms existing models by integrating character-level semantics with glyph vector information.

The paper also reports state-of-the-art (SOTA) results on sentence pair classification, single-sentence classification, dependency parsing, and semantic role labeling. Together, these results indicate that glyph information is a useful signal for NLP models of logographic languages.

Theoretical and Practical Implications

Theoretically, the integration of glyph information advances the semantic modeling framework for logographic languages, challenging the limitations of traditional character ID-based models. Practically, Glyce offers a generalizable approach that can be embedded in any deep learning system, similar to how word embeddings are utilized.
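
As a rough illustration of the "used like word embeddings" point, the sketch below joins a glyph vector to an ordinary character-ID embedding. Concatenation is an assumption here; the summary does not say how the two are fused. The lookup tables are filled with random numbers and stand in for a trained embedding matrix and for the output of a glyph CNN. The names `build_table` and `embed_with_glyphs` are hypothetical.

```python
import random
from typing import Dict, List


def build_table(vocab: List[str], dim: int, seed: int) -> Dict[str, List[float]]:
    """Random stand-in for a trained lookup table (one vector per character)."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for ch in vocab}


def embed_with_glyphs(chars: List[str],
                      id_table: Dict[str, List[float]],
                      glyph_table: Dict[str, List[float]]) -> List[List[float]]:
    """Concatenate the char-ID vector and the glyph vector for each character."""
    return [id_table[ch] + glyph_table[ch] for ch in chars]


if __name__ == "__main__":
    vocab = ["我", "爱", "北", "京"]
    id_table = build_table(vocab, dim=4, seed=0)     # char-ID embeddings
    glyph_table = build_table(vocab, dim=3, seed=1)  # placeholder for CNN outputs
    vectors = embed_with_glyphs(["北", "京"], id_table, glyph_table)
    print(len(vectors), len(vectors[0]))  # 2 7
```

A downstream tagger or classifier would consume these combined vectors exactly as it would consume plain character embeddings, which is what makes the representation easy to drop into an existing model.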

Future Directions

Considering the promising results, future research could explore:

Extending Glyce to other logographic writing systems, such as Japanese kanji.
Refining architectural designs and enhancing training strategies for better efficiency.
Incorporating glyph vectors into more complex tasks such as machine translation and dialogue models.

Glyce shows that glyph information, when combined with historical scripts, a tailored CNN, and an auxiliary classification objective, improves Chinese character representations. The same approach may carry over to other logographic writing systems.
