TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu (2404.11349v1)
Abstract: News headline generation is a crucial task for increasing the productivity of both readers and producers of news, and it can readily be aided by automated headline generation models. However, irrelevant headlines in scraped news articles lead to sub-optimal performance of generation models. We propose that relevance-based headline classification, which categorizes news headlines by their relevance to the corresponding news articles, can greatly aid the generation of relevant headlines. While this task is well established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present TeClass, the first human-annotated Telugu news headline classification dataset, containing 78,534 annotations across 26,178 article-headline pairs. We experiment with various baseline models and provide a comprehensive analysis of their results. We further demonstrate the impact of this work by fine-tuning various headline generation models on the TeClass dataset. Headlines generated by models fine-tuned on highly relevant article-headline pairs showed an increase of about 5 points in ROUGE-L scores. To encourage future research, the annotated dataset and the annotation guidelines will be made publicly available.
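For readers unfamiliar with the ROUGE-L metric mentioned in the abstract, the following is a minimal sketch of how a sentence-level ROUGE-L score can be computed from the longest common subsequence (LCS) of a reference headline and a generated headline. It is an illustration only, not the authors' evaluation code: the whitespace tokenization, the balanced F1 weighting, and the function names `lcs_length` and `rouge_l_f1` are assumptions made here, and the abstract does not state which ROUGE implementation or tokenizer was used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Rolling-row dynamic programming: prev[j] holds the LCS length of the
    # tokens of `a` processed so far against the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L F1 in [0, 1].

    Assumption: tokens are obtained by whitespace splitting, which is a
    simplification and may differ from the tokenization used in the paper.
    """
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of candidate tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical placeholder strings, not taken from the dataset.
    print(rouge_l_f1("a b c d", "a c d"))
```

ROUGE scores are commonly reported multiplied by 100, so a function like this returns values in the range 0 to 1 that would be scaled before comparison with reported point differences.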
- Gopichand Kanumolu
- Lokesh Madasu
- Nirmal Surange
- Manish Shrivastava