Glyce: Glyph-vectors for Chinese Character Representations (1901.10125v5)

Published 29 Jan 2019 in cs.CL, cs.AI, and cs.CV

Abstract: It is intuitive that NLP tasks for logographic languages like Chinese should benefit from the use of the glyph information in those languages. However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found. In this paper, we address this gap by presenting Glyce, the glyph-vectors for Chinese character representations. We make three major innovations: (1) We use historical Chinese scripts (e.g., bronzeware script, seal script, traditional Chinese, etc.) to enrich the pictographic evidence in characters; (2) We design CNN structures (called tianzige-CNN) tailored to Chinese character image processing; and (3) We use image-classification as an auxiliary task in a multi-task learning setup to increase the model's ability to generalize. We show that glyph-based models are able to consistently outperform word/char ID-based models in a wide range of Chinese NLP tasks. We are able to set new state-of-the-art results for a variety of Chinese NLP tasks, including tagging (NER, CWS, POS), sentence pair classification, single sentence classification tasks, dependency parsing, and semantic role labeling. For example, the proposed model achieves an F1 score of 80.6 on the OntoNotes dataset of NER, +1.5 over BERT; it achieves an almost perfect accuracy of 99.8% on the Fudan corpus for text classification. Code found at https://github.com/ShannonAI/glyce.


Glyce: Glyph-vectors for Chinese Character Representations

The paper "Glyce: Glyph-vectors for Chinese Character Representations" presents an approach to incorporating glyph information into NLP tasks for Chinese, a language written with logographic characters. It targets two obstacles that have limited earlier glyph-based methods: standard computer vision models generalize poorly on character images, and contemporary Chinese scripts retain little pictographic evidence.

Key Contributions

Historical Scripts for Enhanced Pictographic Evidence: The authors draw on a variety of historical Chinese scripts, including bronzeware, seal, clerical, and cursive scripts, alongside traditional and simplified Chinese. Rendering each character in several scripts recovers pictographic information that the modern simplified forms have lost and gives the model more visual evidence per character.
Tianzige-CNN Architecture: The paper introduces the Tianzige-CNN, a convolutional neural network tailored to Chinese character images. Character images are much smaller than typical computer vision inputs, and the set of distinct characters is limited, so the architecture is designed to capture local features while reducing the risk of overfitting.
Auxiliary Image-Classification Objective: An image-classification task is added as an auxiliary objective in a multi-task learning setup. It acts as a regularizer and is intended to improve the model's ability to generalize.
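
As a rough illustration of the second and third contributions, the sketch below traces how a small glyph image could shrink to a 2x2 tianzige-style grid, and how an auxiliary classification loss might be mixed into the task loss. The 12x12 input, the 5x5 convolution, the 4x4 max-pooling window, and the decaying weight schedule are all assumptions made for this example; this summary does not state them, and the function names are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def max_pool2d(grid, kernel):
    """Non-overlapping max pooling over a square 2-D list of numbers."""
    n = len(grid)
    pooled = []
    for r in range(0, n - kernel + 1, kernel):
        row = []
        for c in range(0, n - kernel + 1, kernel):
            window = [grid[i][j]
                      for i in range(r, r + kernel)
                      for j in range(c, c + kernel)]
            row.append(max(window))
        pooled.append(row)
    return pooled


def aux_weight(step, lam0=0.1, lam1=0.9):
    """Assumed schedule: the auxiliary weight decays geometrically."""
    return lam0 * lam1 ** step


def multitask_loss(task_loss, cls_loss, step, lam0=0.1, lam1=0.9):
    """Blend the main task loss with the image-classification loss."""
    lam = aux_weight(step, lam0, lam1)
    return (1 - lam) * task_loss + lam * cls_loss


# Assumed sizes: 12x12 glyph image, 5x5 convolution, 4x4 max pooling.
after_conv = out_size(12, kernel=5)                      # 8
after_pool = out_size(after_conv, kernel=4, stride=4)    # 2

# Toy 8x8 feature map pooled down to a 2x2 grid.
feature_map = [[i * 8 + j for j in range(8)] for i in range(8)]
tianzige_grid = max_pool2d(feature_map, kernel=4)

loss_at_start = multitask_loss(task_loss=1.0, cls_loss=3.0, step=0)
```

Under these assumptions, each of the four cells in the pooled grid summarizes one quadrant of the character, and the auxiliary loss contributes most at the start of training and fades as `step` grows.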

Experimental Results and Implications

The Glyce model demonstrates significant improvements across a spectrum of Chinese NLP tasks:

Named Entity Recognition (NER): On the OntoNotes dataset, Glyce achieves an F1 score of 80.6, which is +1.5 over BERT.
Text Classification: An accuracy of 99.8% is reported on the Fudan corpus for text classification tasks.
Word Segmentation and POS Tagging: The paper reports that glyph-based models consistently outperform word/char ID-based models on these tagging tasks, which it attributes to combining glyph vectors with character-level representations.
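
For readers unfamiliar with how an NER F1 score is typically computed, here is a minimal sketch. It assumes the common convention that a predicted entity counts as correct only when its span and type both match a gold entity exactly; the summary does not say which scoring script the paper used, and the spans below are invented for illustration.

```python
def span_f1(gold, pred):
    """Entity-level F1 over (start, end, type) tuples, exact match only."""
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented toy example: one of two predictions matches one of three gold spans.
gold_spans = [(0, 2, "PER"), (5, 7, "LOC"), (9, 10, "ORG")]
pred_spans = [(0, 2, "PER"), (5, 7, "ORG")]
score = span_f1(gold_spans, pred_spans)

# The reported numbers imply the BERT baseline: 80.6 minus the 1.5 gain.
bert_f1 = round(80.6 - 1.5, 1)
```

Here precision is 1/2 and recall is 1/3, so the toy F1 is 0.4; the reported 80.6 and +1.5 imply a BERT baseline of 79.1 on the same 0 to 100 scale.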

The paper also reports new state-of-the-art (SOTA) results on sentence pair classification, single sentence classification, dependency parsing, and semantic role labeling. Taken together, these results suggest that glyph information is a useful signal for NLP in languages with logographic writing systems.

Theoretical and Practical Implications

Theoretically, the integration of glyph information extends semantic modeling for logographic languages beyond what character ID-based models capture, since an ID alone carries no information about a character's visual form. Practically, Glyce produces character vectors that can be plugged into a deep learning model as an embedding layer, much as word embeddings are.
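
A minimal sketch of the "use it like an embedding" idea follows. It assumes glyph vectors are combined with character-ID embeddings by concatenation; the random lookup tables stand in for trained parameters and for the output of a glyph CNN, and all names and dimensions are hypothetical.

```python
import random


def make_table(vocab, dim, seed):
    """Stand-in lookup table: one random vector per character."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in vocab}


def embed_sentence(chars, id_table, glyph_table):
    """Concatenate the ID embedding and the glyph vector of each character."""
    return [id_table[ch] + glyph_table[ch] for ch in chars]


vocab = ["我", "爱", "北", "京"]
id_table = make_table(vocab, dim=4, seed=0)      # character-ID embeddings
glyph_table = make_table(vocab, dim=3, seed=1)   # placeholder glyph vectors

sentence_vectors = embed_sentence("我爱北京", id_table, glyph_table)
```

Each character ends up with a 7-dimensional vector (4 + 3), which a downstream tagger or classifier could consume exactly as it would a plain embedding.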

Future Directions

Considering the promising results, future research could explore:

Extending Glyce to other languages written with logographic characters, such as Japanese (kanji).
Refining architectural designs and enhancing training strategies for better efficiency.
Incorporating glyph vectors into more complex tasks such as machine translation and dialogue models.

Glyce's methodology is a notable step forward in modeling Chinese characters, and its combination of historical scripts, a tailored CNN, and an auxiliary objective may carry over to other logographic writing systems.


Authors (10)

Yuxian Meng
Wei Wu
Fei Wang
Xiaoya Li
Ping Nie
Fan Yin
Muyu Li
Qinghong Han
Xiaofei Sun
Jiwei Li


GitHub

GitHub - ShannonAI/glyce: Code for NeurIPS 2019 - Glyce: Glyph-vectors for Chinese Character Representations (419 stars)
