Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT (2403.05519v1)
Abstract: Authorship attribution is the task of characterizing a text so that the author's writing style is captured well enough to identify the original author of a given piece of text. With increased anonymity on the internet, this task has become increasingly important in security and plagiarism detection. Despite significant advances for other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field because of its complex linguistic features and sentence structure. Moreover, existing systems do not scale when the number of authors increases, and their performance drops when only a small number of samples is available per author. In this paper, we propose the use of the Averaged Stochastic Gradient Descent Weight-Dropped Long Short-Term Memory (AWD-LSTM) architecture and an effective transfer learning approach that addresses the problems of complex linguistic feature extraction and scalability for authorship attribution in Bangla literature (AABL). We analyze the effect of different tokenization schemes (word, sub-word, and character level) and demonstrate their effectiveness in the proposed model. Moreover, we introduce the publicly available Bangla Authorship Attribution Dataset of 16 authors (BAAD16), containing 17,966 sample texts and 13.4+ million words, to address the scarcity of standard datasets, and we release six variations of pre-trained language models for use in any Bangla NLP downstream task. For evaluation, we used our BAAD16 dataset as well as other publicly available datasets. Empirically, our proposed model outperformed state-of-the-art models and achieved 99.8% accuracy on the BAAD16 dataset. Furthermore, we show that the proposed system scales much better as the number of authors increases, and that performance remains steady even with few training samples.
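The three tokenization granularities compared in the abstract can be illustrated with a small sketch. This is only a toy example, not the authors' preprocessing pipeline: the function names, the punctuation set, and the tiny sub-word vocabulary are all hypothetical, and the greedy longest-match segmentation stands in for a learned sub-word model (for instance one trained with SentencePiece), which a real system would more likely use.

```python
import re

def word_tokens(text):
    # Word level: runs of non-space, non-punctuation characters,
    # with punctuation (including the Bangla danda "।") kept as separate tokens.
    return re.findall(r"[^\s।,;:!?.]+|[।,;:!?.]", text)

def char_tokens(text):
    # Character level: one token per Unicode code point.
    # Spaces are replaced by a visible marker so word boundaries are not lost.
    return ["▁" if ch == " " else ch for ch in text]

def subword_tokens(word, vocab):
    # Sub-word level (toy stand-in): greedy longest-match against a fixed
    # vocabulary. A single character is always accepted as a fallback,
    # so segmentation never fails on unknown input.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

if __name__ == "__main__":
    sentence = "আমি ভাত খাই।"
    toy_vocab = {"লেখ", "ক"}  # hypothetical vocabulary, for illustration only
    print(word_tokens(sentence))
    print(char_tokens("আমি"))
    print(subword_tokens("লেখক", toy_vocab))
```

Word-level tokens give short sequences but a large, sparse vocabulary for a highly inflected language; character-level tokens give the opposite trade-off; sub-word tokens sit in between, which is presumably why the paper evaluates all three.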