BSpell: A CNN-Blended BERT Based Bangla Spell Checker (2208.09709v2)
Abstract: Bangla typing is mostly performed using an English keyboard and can be highly erroneous due to the presence of compound and similarly pronounced letters. Correcting a misspelled word requires understanding both the word's typing pattern and the context in which the word is used. This paper proposes BSpell, a specialized BERT model targeted at word-for-word correction at the sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet along with a specialized auxiliary loss. This allows BSpell to specialize in the highly inflected Bangla vocabulary in the presence of spelling errors. Furthermore, a hybrid pretraining scheme is proposed for BSpell that combines word-level and character-level masking. Comparisons on two Bangla and one Hindi spelling correction datasets show the superiority of our proposed approach. BSpell is available as a Bangla spell checking tool via GitHub: https://github.com/Hasiburshanto/Bangla-Spell-Checker
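The hybrid pretraining idea mentioned in the abstract (mixing word-level and character-level masking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hybrid_mask`, the mask symbols, and the masking probabilities are all assumptions chosen for illustration, and the sketch masks single Unicode code points, which is a simplification for Bangla script.

```python
import random

WORD_MASK = "[MASK]"  # hypothetical word-level mask token (assumption)
CHAR_MASK = "#"       # hypothetical character-level mask symbol (assumption)


def hybrid_mask(words, word_prob=0.15, char_prob=0.15, rng=None):
    """Return (masked_words, targets) for one tokenized sentence.

    Each word is either replaced entirely by WORD_MASK, has one
    code point replaced by CHAR_MASK, or is left unchanged.
    The probabilities are illustrative, not taken from the paper.
    """
    rng = rng or random.Random()
    masked = []
    for word in words:
        draw = rng.random()
        if draw < word_prob:
            # Word-level masking: hide the whole word.
            masked.append(WORD_MASK)
        elif draw < word_prob + char_prob and word:
            # Character-level masking: hide one code point of the word.
            pos = rng.randrange(len(word))
            masked.append(word[:pos] + CHAR_MASK + word[pos + 1:])
        else:
            masked.append(word)
    # The prediction target is always the original, unmasked sentence.
    return masked, list(words)


if __name__ == "__main__":
    sentence = ["আমি", "ভাত", "খাই"]
    noisy, target = hybrid_mask(sentence, rng=random.Random(0))
    print(noisy, target)
```

A model pretrained on such pairs would be asked to recover the original word at every position, so it sees both fully hidden words (forcing reliance on context) and partially hidden words (forcing reliance on the remaining characters).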