
TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu (2404.11349v1)

Published 17 Apr 2024 in cs.CL

Abstract: News headline generation is a crucial task in increasing productivity for both the readers and producers of news. This task can easily be aided by automated news headline-generation models. However, the presence of irrelevant headlines in scraped news articles results in sub-optimal performance of generation models. We propose that relevance-based headline classification can greatly aid the task of generating relevant headlines. Relevance-based headline classification involves categorizing news headlines based on their relevance to the corresponding news articles. While this task is well-established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present TeClass, the first-ever human-annotated Telugu news headline classification dataset, containing 78,534 annotations across 26,178 article-headline pairs. We experiment with various baseline models and provide a comprehensive analysis of their results. We further demonstrate the impact of this work by fine-tuning various headline generation models using the TeClass dataset. The headlines generated by the models fine-tuned on highly relevant article-headline pairs showed an increase of about 5 points in ROUGE-L scores. To encourage future research, the annotated dataset as well as the annotation guidelines will be made publicly available.
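
For readers unfamiliar with the metric mentioned in the abstract, ROUGE-L scores a generated headline by the longest common subsequence (LCS) it shares with a reference headline. The following is a minimal sketch, not the authors' evaluation code: it assumes whitespace tokenization and a balanced F1 combination of LCS precision and recall, and the function names `lcs_length` and `rouge_l_f1` are hypothetical. Published scores are usually reported on a 0-100 scale, so a 5-point gain would correspond to roughly 0.05 on the 0-1 scale used here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 on a 0-1 scale (assumption: whitespace tokens, balanced F1)."""
    ref = reference.split()
    cand = candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy English strings for illustration only; not taken from TeClass.
    print(rouge_l_f1("a b c d", "a c d e"))
```

Whitespace splitting is a simplification; for Telugu text a language-appropriate tokenizer would likely change the absolute scores.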

Authors (4)
  1. Gopichand Kanumolu
  2. Lokesh Madasu
  3. Nirmal Surange
  4. Manish Shrivastava
