A Novel Contrastive Learning Method for Clickbait Detection on RoCliCo: A Romanian Clickbait Corpus of News Articles (2310.06540v1)
Abstract: To increase revenue, news websites often resort to using deceptive news titles, luring users into clicking on the title and reading the full article. Clickbait detection is the task that aims to automatically detect this form of false advertisement and avoid wasting the precious time of online users. Despite the importance of the task, to the best of our knowledge, there is no publicly available clickbait corpus for the Romanian language. To this end, we introduce a novel Romanian Clickbait Corpus (RoCliCo) comprising 8,313 news samples which are manually annotated with clickbait and non-clickbait labels. Furthermore, we conduct experiments with four machine learning methods, ranging from handcrafted models to recurrent and transformer-based neural networks, to establish a line-up of competitive baselines. We also carry out experiments with a weighted voting ensemble. Among the considered baselines, we propose a novel BERT-based contrastive learning model that learns to encode news titles and contents into a deep metric space such that titles and contents of non-clickbait news have high cosine similarity, while titles and contents of clickbait news have low cosine similarity. Our data set and code to reproduce the baselines are publicly available for download at https://github.com/dariabroscoteanu/RoCliCo.
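The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the code below assumes a common cosine-embedding formulation: non-clickbait pairs are pushed toward similarity 1, and clickbait pairs are penalized only when their similarity exceeds a margin. The function names, the `margin` and `threshold` parameters, and the toy vectors are all hypothetical; in the actual model the vectors would come from a BERT encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_loss(title_vec, content_vec, is_clickbait, margin=0.0):
    """Assumed cosine-embedding loss, not necessarily the one used in the paper.

    Non-clickbait: loss = 1 - sim, so title and content are pulled together.
    Clickbait:     loss = max(0, sim - margin), so they are pushed apart.
    """
    sim = cosine_similarity(title_vec, content_vec)
    if is_clickbait:
        return max(0.0, sim - margin)
    return 1.0 - sim


def predict_clickbait(title_vec, content_vec, threshold=0.5):
    """Hypothetical decision rule: low title/content similarity suggests clickbait."""
    return cosine_similarity(title_vec, content_vec) < threshold


# Toy embeddings standing in for encoder outputs (illustrative values only).
aligned_title, aligned_content = [1.0, 0.0], [1.0, 0.0]
mismatched_title, mismatched_content = [1.0, 0.0], [0.0, 1.0]

loss_aligned_non_clickbait = contrastive_loss(aligned_title, aligned_content, is_clickbait=False)
loss_mismatched_clickbait = contrastive_loss(mismatched_title, mismatched_content, is_clickbait=True)
loss_aligned_clickbait = contrastive_loss(aligned_title, aligned_content, is_clickbait=True)
```

Under these assumptions, a non-clickbait pair with identical embeddings and a clickbait pair with orthogonal embeddings both incur zero loss, while a clickbait pair whose title and content embeddings coincide incurs the maximum penalty.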