Papers
Topics
Authors
Recent
2000 character limit reached

Block the Label and Noise: An N-Gram Masked Speller for Chinese Spell Checking

Published 5 May 2023 in cs.CL | (2305.03314v1)

Abstract: Recently, Chinese Spell Checking(CSC), a task to detect erroneous characters in a sentence and correct them, has attracted extensive interest because of its wide applications in various NLP tasks. Most of the existing methods have utilized BERT to extract semantic information for CSC task. However, these methods directly take sentences with only a few errors as inputs, where the correct characters may leak answers to the model and dampen its ability to capture distant context; while the erroneous characters may disturb the semantic encoding process and result in poor representations. Based on such observations, this paper proposes an n-gram masking layer that masks current and/or surrounding tokens to avoid label leakage and error disturbance. Moreover, considering that the mask strategy may ignore multi-modal information indicated by errors, a novel dot-product gating mechanism is proposed to integrate the phonological and morphological information with semantic representation. Extensive experiments on SIGHAN datasets have demonstrated that the pluggable n-gram masking mechanism can improve the performance of prevalent CSC models and the proposed methods in this paper outperform multiple powerful state-of-the-art models.

Citations (1)

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (1)

Collections

Sign up for free to add this paper to one or more collections.