TookaBERT: A Step Forward for Persian NLU (2407.16382v1)

Published 23 Jul 2024 in cs.CL

Abstract: The field of NLP has seen remarkable advancements, thanks to the power of deep learning and foundation models. Language models, and specifically BERT, have been key players in this progress. In this study, we trained and introduced two new BERT models using Persian data. We put our models to the test, comparing them to seven existing models across 14 diverse Persian natural language understanding (NLU) tasks. The results speak for themselves: our larger model outperforms the competition, showing an average improvement of at least +2.8 points. This highlights the effectiveness and potential of our new BERT models for Persian NLU tasks.
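
Since the abstract describes pretraining BERT-style encoders for Persian and evaluating them on downstream NLU tasks, a minimal usage sketch may help illustrate how such a model is typically applied. The checkpoint identifier below (PartAI/TookaBERT-Large) is an assumption not stated in the abstract; the sketch uses the standard Hugging Face transformers fill-mask pipeline with whatever mask token the model's tokenizer defines.

```python
# Minimal sketch: masked-token prediction with a Persian BERT encoder.
# The model id is a hypothetical Hugging Face Hub path, not confirmed by the paper.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PartAI/TookaBERT-Large")

# Build the input around the tokenizer's own mask token so the sketch
# works regardless of the exact mask string.
mask = fill_mask.tokenizer.mask_token
# Persian sentence: "The weather is ____ today."
text = f"هوا امروز {mask} است."

# Print the top-3 predicted fillers with their scores.
for pred in fill_mask(text)[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```

For the downstream evaluation the paper describes, the same encoder would typically be loaded with a task-specific head (e.g., AutoModelForSequenceClassification) and fine-tuned on each Persian NLU dataset.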
