Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers
Abstract: Identifying linguistic differences between dialects of a language often requires expert knowledge and meticulous human analysis. This is largely due to the complexity and nuance involved in studying various dialects. We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers, even in the absence of human experts. We explore both post-hoc and intrinsic approaches to interpretability, conduct experiments on Mandarin, Italian, and Low Saxon, and experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations.
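As a rough illustration of how an intrinsically interpretable classifier can surface distinguishing lexical features, the sketch below uses a two-class naive Bayes model whose per-word log-odds act as feature weights. This is not the paper's method: the model choice, the function names `train_log_odds` and `top_features`, and the toy English sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_log_odds(docs_a, docs_b, alpha=1.0):
    """Return per-word log-odds weights (variety A vs. variety B).

    Positive weights mark words more typical of A, negative of B.
    Uses add-alpha smoothing over the shared vocabulary.
    """
    counts_a = Counter(w for d in docs_a for w in d.split())
    counts_b = Counter(w for d in docs_b for w in d.split())
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    return {
        w: math.log((counts_a[w] + alpha) / total_a)
        - math.log((counts_b[w] + alpha) / total_b)
        for w in vocab
    }


def top_features(weights, k=1):
    """Return the k most A-typical and the k most B-typical words."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1])
    most_a = [w for w, _ in ranked[-k:][::-1]]
    most_b = [w for w, _ in ranked[:k]]
    return most_a, most_b


# Hypothetical toy data: two varieties that differ in a few words.
variety_a = ["the lorry is in the car park", "the colour of the lorry"]
variety_b = ["the truck is in the parking lot", "the color of the truck"]

weights = train_log_odds(variety_a, variety_b)
most_a, most_b = top_features(weights, k=1)
```

Words shared equally by both varieties (such as "the" here) receive a weight of zero, so the ranking highlights only the vocabulary that separates the two corpora. A post-hoc approach would instead explain the predictions of an already trained, less transparent classifier.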