Hindi Text Summarization Datasets
- Hindi text summarization datasets are curated corpora pairing articles with summaries, enabling robust development and evaluation of summarization models.
- They encompass diverse formats—from news and government statements to code-switched conversations—with both extractive and abstractive summary types.
- State-of-the-art datasets employ advanced filtering, quality assurance, and cross-lingual alignment techniques to ensure high-quality benchmarks.
Hindi text summarization datasets are curated corpora consisting of Hindi documents paired with corresponding summaries, constructed to support the development, benchmarking, and deployment of monolingual and cross-lingual text summarization systems. These datasets exhibit considerable diversity in size, annotation protocol, summary type (extractive vs. abstractive), genre, and format, with recent releases covering news, parliamentary discourse, open-domain conversation, headline generation, and cross-lingual document alignment. The last five years have seen a substantial expansion in both scale and methodological rigor, with improved filtering, quality assurance, and public availability forming the backbone of state-of-the-art resources for Hindi text summarization.
1. Major Hindi Summarization Datasets: Survey and Taxonomy
Hindi summarization resources fall into several broad classes: news-article corpora with extractive or abstractive summaries, parallel headline–body corpora, code-switched conversational sets, and synthetic datasets created via translation of English benchmarks.
News and Article Summarization Corpora
- ILSUM (2022, 2023): Derived from major online Hindi news portals (e.g., indiatvnews.com), these datasets contain ≈8,000 article–summary pairs in standardized CSV format, comprising the raw article text and a “summary” field. For the 2022 release, gold summaries are strictly extractive, with annotators instructed to select the single most salient sentence. The 2023 release maintains this structure but applies rigorous filtering (removing empty, duplicate, prefix-copied, and low-compression pairs; see the filtering sketch after this list) to yield a high-quality “filtered” subset of 5,390 pairs from the original 7,957 (Urlana et al., 2023, Tangsali et al., 2022).
- Mukhyansh (2023): A large-scale headline-generation dataset comprising 600,623 Hindi article–headline pairs post-filtering, sourced from eight major Hindi news portals. Duplicates, prefix-copied pairs, and short/low-content samples are removed. Article lengths average 14.5 sentences (303 tokens); headlines average 13.5 tokens (Madasu et al., 2023).
- PMIndiaSum (2023): Harvested from the Prime Minister of India’s official website, this dataset provides 4,936 Hindi article–headline pairs with explicit parallelism to 13 other Indian languages. Domain is restricted to government news and statements; each body-headline pair is cleaned, tokenized, and filtered for language conformity (Urlana et al., 2023).
- CrossSum-News-Aligned (2023): This corpus aligns news articles from Indian websites with video summaries (YouTube descriptions), resulting in 6,853 monolingual hi–hi and 7,306 cross-lingual en–hi pairs, with elaborate filtering (embedding similarity, unigram overlap, time window) to ensure alignment (Bhatnagar et al., 2023).
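The pair-level filtering described for ILSUM and Mukhyansh can be approximated with a few simple rules over (article, summary) pairs. The sketch below is illustrative only: the 50% compression threshold and the field handling are assumptions based on the descriptions in this section, not the released filtering scripts.

```python
# Minimal sketch of ILSUM/Mukhyansh-style pair filtering (assumed thresholds,
# not the official scripts): drop empty, duplicate, prefix-copied, and
# low-compression article-summary pairs.
from typing import Iterable

def keep_pair(article: str, summary: str, max_compression: float = 0.5) -> bool:
    article, summary = article.strip(), summary.strip()
    if not article or not summary:                     # empty fields
        return False
    if article.startswith(summary):                    # trivial prefix copy
        return False
    art_tokens, sum_tokens = article.split(), summary.split()
    # Summary too long relative to the article (low compression).
    if len(sum_tokens) >= max_compression * len(art_tokens):
        return False
    return True

def filter_pairs(pairs: Iterable[tuple[str, str]]) -> list[tuple[str, str]]:
    seen, kept = set(), []
    for article, summary in pairs:
        key = (article.strip(), summary.strip())
        if key in seen:                                 # exact duplicates
            continue
        seen.add(key)
        if keep_pair(article, summary):
            kept.append((article, summary))
    return kept
```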
Synthetic and Cross-Lingual Resources
- Hindi XSUM (2026): This resource comprises over 226,000 Hindi document–summary pairs generated by translating the English XSUM dataset with a multi-stage pipeline (neural MT, auto-correction, LLM reranking, and human post-editing); a pipeline skeleton follows this list. Quality is validated with COMET, BERTScore F₁, and manual annotation of edge cases (Katwe et al., 4 Jan 2026).
- GupShup (2021): A code-switched conversational summarization corpus, GupShup includes 6,831 Hindi–English conversations with parallel English and code-switched summaries, created via manual translation of the SAMSum set into Roman-script Hindi–English (Mehnaz et al., 2021).
- iNLTK and Saaranshak (surveyed in (Sinha et al., 2022)): Early Hindi-specific corpora with parallel news articles and summaries (iNLTK), or multi-document ontology-driven summaries (Saaranshak). Public access and metadata are lacking for Saaranshak; iNLTK is available but poorly documented.
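A translate-then-validate pipeline in the spirit of Hindi XSUM can be sketched as below. The `translate` and `quality_score` callables are hypothetical placeholders for an MT system and a COMET-style quality estimator; only the escalation thresholds (COMET < 0.2 or TER > 100%, affecting roughly 5% of pairs) come from the dataset description cited in this article.

```python
# Skeleton of a translate-then-validate dataset construction pipeline.
# `translate` and `quality_score` are hypothetical placeholders; only the
# escalation thresholds are taken from the Hindi XSUM description.
from dataclasses import dataclass

@dataclass
class TranslatedPair:
    en_document: str
    en_summary: str
    hi_document: str
    hi_summary: str
    quality: float      # COMET-style score
    ter: float          # translation edit rate, in percent

def needs_human_review(pair: TranslatedPair) -> bool:
    # Pairs below these thresholds are routed to manual post-editing.
    return pair.quality < 0.2 or pair.ter > 100.0

def build_dataset(english_pairs, translate, quality_score):
    auto_accepted, review_queue = [], []
    for en_doc, en_sum in english_pairs:
        hi_doc, hi_sum = translate(en_doc), translate(en_sum)
        quality, ter = quality_score(en_doc, hi_doc)
        pair = TranslatedPair(en_doc, en_sum, hi_doc, hi_sum, quality, ter)
        (review_queue if needs_human_review(pair) else auto_accepted).append(pair)
    return auto_accepted, review_queue
```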
Summary of Key Public Corpora
| Dataset | Domain | Size (pairs) | Summary Type | Public Access |
|---|---|---|---|---|
| ILSUM (2022/23) | News | ~8k | Extractive (1 sent) | Yes |
| Mukhyansh | News | 600k | Headline | Yes |
| PMIndiaSum | Govt news | 4.9k | Headline | Yes |
| Hindi XSUM | News (BBC) | 226k | Abstractive (1 sent) | Yes |
| GupShup | Conversation | 6.8k | Abstractive, mixed | On request |
| CrossSum | News/videos | 7k+ | Abstractive | Yes |
| iNLTK | News | n/a (undoc.) | Abstractive | Yes |
| Saaranshak | Multi-domain | n/a | Ontology-based | No |
2. Data Acquisition, Filtering, and Quality Assurance Strategies
Acquisition and preprocessing protocols vary by dataset, with several best practices emerging:
- Automated Crawling: Most datasets are derived via web crawling—ILSUM and Mukhyansh from major news portals, PMIndiaSum from government sites, CrossSum-News-Aligned using news and YouTube video crawlers.
- Filtering: Rigorous post-crawl filters remove empty pairs, duplicates, prefix-copied summaries, “multi-article” concatenations, very short texts, and, for some datasets, pairs failing a compression check (e.g., summaries longer than 50% of the article).
- Alignment: Cross-lingual datasets leverage publication-date windows, unigram overlap (including bilingual dictionary translation), and embedding-based cosine similarity thresholds (e.g., ≥0.70 for content and titles); a minimal alignment sketch follows this list (Bhatnagar et al., 2023).
- Human-in-the-loop: For high-quality synthetic sets, problematic pairs (e.g., COMET < 0.2 or TER > 100%) undergo manual review and correction (≈5% in Hindi XSUM) (Katwe et al., 4 Jan 2026).
- Language and Content Checks: Unicode script checks, deduplication, and removal of code-switched or mixed-language outliers are common, especially in PMIndiaSum (Urlana et al., 2023).
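The alignment step can be illustrated as a conjunction of the three checks above. In the sketch below, the encoder, the 3-day window, and the 0.2 unigram-overlap threshold are assumptions for illustration; the ≥0.70 cosine threshold is the value reported for content and titles in CrossSum-News-Aligned.

```python
# Illustrative article-to-video alignment in the style of CrossSum-News-Aligned:
# candidates must fall within a date window, share enough unigrams, and exceed a
# cosine-similarity threshold on embeddings. Window size, overlap threshold, and
# encoder are assumptions; the 0.70 cosine threshold is from the paper.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def unigram_overlap(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(1, min(len(ta), len(tb)))

def is_aligned(article: dict, video: dict, encode,
               window_days: int = 3,
               min_overlap: float = 0.2,
               min_cosine: float = 0.70) -> bool:
    # `article` and `video` are dicts with "text", "title", and "date" fields;
    # `encode` is any sentence-embedding function returning a vector.
    if abs((article["date"] - video["date"]).days) > window_days:
        return False
    if unigram_overlap(article["title"], video["title"]) < min_overlap:
        return False
    return cosine(encode(article["text"]), encode(video["text"])) >= min_cosine
```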
3. Structural and Linguistic Characteristics
Dataset statistics reveal consistent patterns for the Hindi summarization task but diverge by genre and target summary form.
News Articles and Headlines
- Articles: Typically 9–18 sentences (mean 18.1 for ILSUM, 14.5 for Mukhyansh), 150–600 tokens (Mukhyansh mean ≈303, ILSUM ≈553), with a broad range (17–5,034 tokens for ILSUM) (Urlana et al., 2023, Madasu et al., 2023).
- Summaries / Headlines: Summaries are a single sentence (ILSUM: strictly extractive, mean 40.2 tokens); headlines average 13–15 tokens (Mukhyansh, PMIndiaSum); mean summary character length is ≈80 for Hindi XSUM (Madasu et al., 2023, Katwe et al., 4 Jan 2026, Urlana et al., 2023).
- Abstractiveness: Mukhyansh headlines show a high proportion of novel n-grams (20% unigrams, 81.3% 4-grams), in contrast to the highly extractive ILSUM, where summaries can largely be reproduced by copying source sentences (illustrated in the sketch after this list) (Madasu et al., 2023).
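Abstractiveness figures such as the Mukhyansh 20% unigram / 81.3% 4-gram values are typically computed as the percentage of summary n-grams that never appear in the source article. A minimal whitespace-tokenized sketch (a simplification of the tokenization used in the cited papers):

```python
# Percentage of novel summary n-grams (n-grams absent from the article),
# a standard abstractiveness measure; whitespace tokenization is a simplification.
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(article: str, summary: str, n: int) -> float:
    art, summ = ngrams(article.split(), n), ngrams(summary.split(), n)
    if not summ:
        return 0.0
    return 100.0 * len(summ - art) / len(summ)
```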
Conversational and Code-Switched Data
- GupShup: Dialogues average 347.8 words and 11.2 utterances; code-switch density is high (58.86% of utterances are mixed), with code-mixing quantified via standard indices such as the I-index (see the sketch below). Summaries are 1–3 sentences, mirroring the format of the English SAMSum source (Mehnaz et al., 2021).
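Code-mixing density is commonly quantified with switch-point-based indices over per-token language tags. The sketch below computes an I-index-style ratio of language switches to token transitions; it is a generic formulation for illustration, not the exact metric definitions used in the GupShup paper.

```python
# Generic switch-point ratio (I-index-style): fraction of adjacent token pairs
# whose language tags differ. Language-independent tokens ("other") are skipped.
# Illustrative formulation, not GupShup's exact metric.
def switch_point_ratio(lang_tags: list[str]) -> float:
    tags = [t for t in lang_tags if t != "other"]
    if len(tags) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    return switches / (len(tags) - 1)

# Example: a Hindi-English code-switched utterance tagged token by token.
tags = ["hi", "hi", "en", "en", "hi", "other", "en"]
print(switch_point_ratio(tags))  # 0.6 -> 3 switches over 5 transitions
```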
Cross-Lingual/Synthetic
- Hindi XSUM: Documents inherit XSUM’s diversity—226k+ news articles, single-sentence abstractive summaries, broad topic coverage (World, Business, Politics, etc.) (Katwe et al., 4 Jan 2026).
- CrossSum-News-Aligned: Articles range 50–2,000 words (modal 200–400), summaries 20–200 words (typical 40–80); sentence counts 10–25 in articles, 3–5 in summaries (Bhatnagar et al., 2023).
4. Annotation Protocols and Evaluation Benchmarks
Annotation strategies and benchmark reporting show substantial variation:
- Extractive vs. Abstractive: ILSUM and PMIndiaSum are extractive (or lead/description-based); Mukhyansh headlines and XSUM Hindi are abstractive, curated for semantic compression and informativeness (Urlana et al., 2023, Madasu et al., 2023, Katwe et al., 4 Jan 2026).
- Quality Control: Mukhyansh and ILSUM filter out trivial prefix-copies; CrossSum and Hindi XSUM apply similarity, coverage, and error rate thresholds, with escalation to LLMs or human editors as needed (Bhatnagar et al., 2023, Katwe et al., 4 Jan 2026).
Evaluation Metrics
- ROUGE-N: Most datasets report ROUGE-1, ROUGE-2, and ROUGE-L for automatic evaluation, in recall or F₁ variants (a minimal ROUGE sketch follows this list). For example, Hindi XSUM additionally reports BERTScore F₁, COMET (a cross-lingual regression metric), BLEU-4, and chrF (Katwe et al., 4 Jan 2026).
- Baseline Results: For ILSUM, fine-tuned IndicBART reaches ROUGE-1 ≈ 0.56; in Mukhyansh, SSIB achieves ROUGE-1 41.05 and ROUGE-L 36.18 on headlines; PMIndiaSum’s mBART-50-large delivers ROUGE-L 77.8 on its headline task (Urlana et al., 2023, Madasu et al., 2023, Urlana et al., 2023). GupShup’s best ROUGE-L for code-switched summarization is 39.80 (multi-view model) (Mehnaz et al., 2021).
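ROUGE-N and ROUGE-L reduce to n-gram overlap and longest common subsequence, respectively. The minimal F₁ implementation below uses whitespace tokenization, which matters for Devanagari text since some English-centric ROUGE tokenizers drop non-Latin characters; it is a sketch, not the scorers used in the cited papers.

```python
# Minimal ROUGE-N and ROUGE-L F1 on whitespace tokens. Illustrative only:
# the cited papers use their own (sometimes Indic-aware) ROUGE implementations.
from collections import Counter

def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    if overlap == 0:
        return 0.0
    p, r = overlap / cand_total, overlap / ref_total
    return 2 * p * r / (p + r)

def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    ref_tok, cand_tok = reference.split(), candidate.split()
    ref = Counter(tuple(ref_tok[i:i + n]) for i in range(len(ref_tok) - n + 1))
    cand = Counter(tuple(cand_tok[i:i + n]) for i in range(len(cand_tok) - n + 1))
    overlap = sum((ref & cand).values())
    return _f1(overlap, sum(cand.values()), sum(ref.values()))

def rouge_l(reference: str, candidate: str) -> float:
    a, b = reference.split(), candidate.split()
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return _f1(dp[len(a)][len(b)], len(b), len(a))
```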
5. Public Distribution, Licensing, and Use Considerations
Most new Hindi summarization resources are released with open access, frequently under CC BY 4.0. This trend facilitates broad reuse, benchmarking, and extension.
- Data Format: CSV or JSONL with explicit id, document, summary, and metadata fields is standard (a loading sketch follows this list). Mukhyansh and PMIndiaSum include URLs instead of full text to comply with copyright where necessary (Madasu et al., 2023, Urlana et al., 2023).
- Access Points: Official repositories on HuggingFace or GitHub serve as main download channels—Mukhyansh (URL+scraper), PMIndiaSum (all scripts/data), CrossSum-News-Aligned, Hindi XSUM (Madasu et al., 2023, Urlana et al., 2023, Bhatnagar et al., 2023, Katwe et al., 4 Jan 2026).
- Benchmark Integrity: Use of provided splits is recommended to prevent data contamination and ensure comparability. Length normalization and standardized tokenization are required for robust model evaluation (Madasu et al., 2023).
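A typical loading pattern for a JSONL release with predefined splits is sketched below. The file paths and the exact field names are placeholders; each dataset's repository documents its official schema and split files.

```python
# Typical loading pattern for a JSONL summarization release with predefined
# splits. Paths and field names are placeholders, not official release paths.
import json
from pathlib import Path

def load_split(path: str) -> list[dict]:
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            records.append({"id": rec["id"],
                            "document": rec["document"],
                            "summary": rec["summary"]})
    return records

splits = {name: load_split(f"data/{name}.jsonl")          # hypothetical layout
          for name in ("train", "validation", "test")
          if Path(f"data/{name}.jsonl").exists()}
```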
6. Limitations, Challenges, and Prospects
Significant limitations characterize most extant Hindi summarization corpora:
- Domain Bias: Many resources are news-centric (XSUM, ILSUM, Mukhyansh), limiting generalization to other registers such as social media or debate (Urlana et al., 2023, Katwe et al., 4 Jan 2026).
- Quality Issues: Datasets relying on web “descriptions” or lead paragraphs risk trivial copying and inflate corpus size with noisy or overly extractive examples; for example, roughly 32% of the original ILSUM pairs were dropped as low-quality during filtering (Urlana et al., 2023).
- Metadata and Transparency: Early corpora lack published statistics and standard splits (iNLTK, Saaranshak). Surveyed work emphasizes the need for better centralization and standardized reporting (Sinha et al., 2022).
- Synthetic Pipeline Risks: Automated translation (XSUM Hindi) introduces morphological and factuality errors that are only partly mitigated by MT/LLM selection and human curation (Katwe et al., 4 Jan 2026).
- Code-Switch Complexity: As seen in GupShup, generating high-quality summaries for code-switched data remains challenging for current models, with significant drops in ROUGE for Hindi–English sentences (Mehnaz et al., 2021).
Recommendations for Future Work
- Expand to multi-domain coverage (parliamentary, social, medical, legal).
- Prefer dedicated post hoc human annotation over reliance on website descriptions or headlines as proxy summaries.
- Integrate richer error-detection and factuality-checking modules into automated pipelines.
- Mandate the release of full metadata (token distributions, summaries, splits) and relevant benchmarks.
- Encourage the establishment of centralized repositories and recurring shared tasks for Hindi and broad Indic summarization (Sinha et al., 2022).
7. Practical Impact and Research Applications
Hindi summarization datasets underpin a range of NLP tasks:
- Monolingual and Multilingual Model Benchmarks: Provided splits allow for comparison of seq2seq, transformer, and prompting paradigms—pretrained models such as IndicBART, mBART, and mT5 consistently outperform simpler RNN-based baselines (e.g., FastText+GRU) (Madasu et al., 2023, Urlana et al., 2023).
- Cross-lingual and Transfer Learning: Synthetic pipelines (XSUM Hindi, CrossSum) enable knowledge transfer from English, lowering barriers for low-resource languages (Katwe et al., 4 Jan 2026, Bhatnagar et al., 2023).
- Downstream Use Cases: Hindi summarization supports information access (mobile news digests, voice assistants), document retrieval, evidence summarization for QA, and headline/title generation. Datasets also facilitate research on code-switch-aware generation and evaluation.
- Open Research Questions: High extractiveness and domain bias remain primary challenges; increasing the proportion of abstractiveness and domain diversity is critical for further progress.
References
- "Indian Language Summarization using Pretrained Sequence-to-Sequence Models" (Urlana et al., 2023)
- "Implementing Deep Learning-Based Approaches for Article Summarization in Indian Languages" (Tangsali et al., 2022)
- "Automatic Data Retrieval for Cross Lingual Summarization" (Bhatnagar et al., 2023)
- "Bridging the Data Gap: Creating a Hindi Text Summarization Dataset from the English XSUM" (Katwe et al., 4 Jan 2026)
- "Mukhyansh: A Headline Generation Dataset for Indic Languages" (Madasu et al., 2023)
- "An Overview of Indian Language Datasets used for Text Summarization" (Sinha et al., 2022)
- "GupShup: An Annotated Corpus for Abstractive Summarization of Open-Domain Code-Switched Conversations" (Mehnaz et al., 2021)
- "PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India" (Urlana et al., 2023)