- The paper introduces a comprehensive framework for generating culturally grounded synthetic data using translation-based, backtranslation, and retrieval-augmented strategies.
- The paper demonstrates that combining native speaker evaluation with LLM-based assessment confirms high data quality, and that the resulting data effectively narrows resource gaps in low-resource Indic languages.
- The paper reveals that context-aware, culturally sensitive synthetic data can boost generative task performance and improve instruction-following capabilities in multilingual AI models.
Synthetic Data for Multilingual, Multi-cultural AI: Empirical Insights from Indic Languages
Introduction and Motivation
The paper addresses the persistent challenge of building AI systems that are both multilingual and culturally grounded, with a particular focus on low-resource languages. It critiques the prevailing English-centric paradigm in data curation, fine-tuning, and evaluation, arguing that such practices perpetuate global power imbalances and fail to capture linguistic and cultural diversity. The authors propose synthetic data generation as a viable strategy to supplement scarce resources, but emphasize that its effectiveness in multilingual and multicultural contexts is underexplored. The work introduces a comprehensive framework for synthetic data generation, quality assessment, and downstream evaluation, instantiated through the creation of the Updesh dataset—9.5M instruction-following samples across 13 Indian languages.
Framework for Multilingual, Multicultural Synthetic Data Generation
The framework delineates key factors for effective synthetic data generation (a minimal configuration sketch follows this list):
- Base Model Capability: Selection of LLMs with demonstrated proficiency in target languages, considering licensing, cost, and openness.
- Seed Data Selection: Prioritization of tasks with cultural relevance and linguistic diversity, involving native speakers in the process.
- Data Generation Strategies: Comparison of translation-based, backtranslation, and retrieval-augmented generation. The latter leverages native language Wikipedia content to ensure cultural and linguistic grounding.
- Quality Metrics: Multi-dimensional evaluation including language correctness, linguistic acceptability, cultural appropriateness, and safety/bias.
- Downstream Evaluation: Use of diverse, non-translated benchmarks covering all target languages and domains, with attention to benchmark contamination.
- Native Speaker Involvement: Ensuring consent, privacy, and data sovereignty, with native speakers engaged in both seed selection and evaluation.
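As an illustration of how these factors compose into a single pipeline specification, the sketch below encodes them as a configuration object; all class, field, and value names are assumptions for illustration, not artifacts of the paper.

```python
from dataclasses import dataclass, field

@dataclass
class SyntheticDataConfig:
    """Illustrative container for the framework's key factors (names are hypothetical)."""
    base_model: str                       # LLM with demonstrated proficiency in the target languages
    target_languages: list[str]           # e.g. the 13 Indic languages covered by Updesh
    seed_sources: list[str]               # culturally relevant seed tasks chosen with native speakers
    strategy: str                         # "translation" | "backtranslation" | "retrieval_augmented"
    quality_dimensions: tuple[str, ...] = (
        "language_correctness",
        "linguistic_acceptability",
        "cultural_appropriateness",
        "safety_and_bias",
    )
    eval_benchmarks: list[str] = field(default_factory=list)  # non-translated, contamination-checked

# Placeholder instantiation, not the paper's exact settings
config = SyntheticDataConfig(
    base_model="Llama-3.1-405B-Instruct",
    target_languages=["hi", "bn", "ta"],
    seed_sources=["native_language_wikipedia"],
    strategy="retrieval_augmented",
)
```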
Updesh Dataset Construction
Updesh comprises two complementary subsets:
- Reasoning Data: Translation of high-quality reasoning datasets (OrcaAgent-Instruct, OrcaMath) into 13 Indic languages using Llama-3.1-405B-Instruct, with rigorous quality filtering.
- Open-Domain Generative Data: Generation of culturally contextualized data using Qwen3-235B-A22B, grounded in Wikipedia content. The process involves multi-phase LLM inference for tasks such as multi-hop QA, creative writing, and multi-turn dialogue, with explicit curation of cultural artifacts (see the sketch below).
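A minimal sketch of one retrieval-grounded generation phase follows, assuming a generic chat-completion client; the prompt wording, function names, and `llm.complete` call are illustrative stand-ins rather than the paper's actual pipeline, which chains several such phases (e.g. question synthesis, answer drafting, dialogue continuation) with Qwen3-235B-A22B.

```python
def build_grounded_prompt(passages: list[str], language: str, task: str) -> str:
    """Tie the requested task to retrieved native-language Wikipedia passages."""
    context = "\n\n".join(passages)
    return (
        f"Context (from {language} Wikipedia):\n{context}\n\n"
        f"Task: {task}\n"
        f"Respond entirely in {language}, staying faithful to the context "
        f"and to local cultural conventions."
    )

def generate_sample(llm, passages: list[str], language: str, task: str) -> str:
    """Single generation phase; `llm.complete` is a placeholder client method."""
    return llm.complete(build_grounded_prompt(passages, language, task))
```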
Automated filtering (IndicLID, repetition ratio) ensures high data integrity, with drop rates below 2% for most subsets. Notably, the dataset emphasizes long-context and multi-turn capabilities, addressing a gap in existing resources.
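The sketch below illustrates how such filters might be applied; the repetition-ratio heuristic and the 0.5 threshold are assumptions, and language identification is abstracted behind a placeholder `predict_language` callable rather than IndicLID's actual API.

```python
from collections import Counter
from typing import Callable

def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of character n-grams occurring more than once; a crude proxy for degenerate output."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)

def keep_sample(text: str, expected_lang: str,
                predict_language: Callable[[str], str],
                max_repetition: float = 0.5) -> bool:
    """Drop samples whose detected language is wrong or whose text is highly repetitive.
    `predict_language` stands in for a language-ID model such as IndicLID."""
    if predict_language(text) != expected_lang:
        return False
    return repetition_ratio(text) <= max_repetition
```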
Data Quality Analysis
Quality assessment combines LLM-based evaluation (GPT-4o) with human annotation by native speakers, using stratified sampling and detailed rubrics across multiple dimensions (instruction adherence, fluency, narrative coherence, answer adequacy, persona consistency, etc.). Across 10,000 human assessments, only 0.27% of scores were zero, indicating high overall quality.
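A hedged sketch of how a rubric-based LLM-judge score might be collected is shown below; the rubric wording, the 0-5 scale, the JSON output format, and the `judge_llm.complete` call are assumptions for illustration, not the paper's exact protocol.

```python
import json

RUBRIC = {
    "instruction_adherence": "Does the response follow the given instruction?",
    "fluency": "Is the text fluent and natural in the target language?",
    "narrative_coherence": "Is the response internally coherent?",
    "answer_adequacy": "Does the response fully answer the question?",
    "persona_consistency": "Is the speaker persona maintained across turns?",
}

def judge_sample(judge_llm, sample: str, language: str) -> dict:
    """Ask an LLM judge (e.g. GPT-4o) to score one sample on each rubric dimension."""
    prompt = (
        f"Score the following {language} sample from 0 to 5 on each criterion.\n"
        + "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
        + f"\n\nSample:\n{sample}\n\nReturn a JSON object mapping each criterion to its score."
    )
    return json.loads(judge_llm.complete(prompt))  # placeholder client call
```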
Inter-annotator agreement analysis reveals substantial variance across metrics:
Figure 1: Agreement between human annotators and the LLM judge across evaluation metrics, revealing clear differences between dimensions.
Agreement is robust for objective criteria (toxicity, problematic content), but deteriorates for culturally and linguistically nuanced assessments (linguistic plausibility, persona consistency in long dialogues). This highlights limitations of current LLM-judges in evaluating culturally sensitive content.
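One way to quantify this per-metric agreement is Cohen's kappa over paired human and LLM ratings, as in the sketch below; scikit-learn and integer-valued scores are assumptions here, and the paper's exact agreement statistic may differ.

```python
from sklearn.metrics import cohen_kappa_score

def agreement_per_metric(human_scores: dict, llm_scores: dict) -> dict:
    """Cohen's kappa between human and LLM ratings for each evaluation metric.
    Expect high kappa for objective criteria (e.g. toxicity) and lower kappa
    for nuanced ones (e.g. linguistic plausibility, persona consistency)."""
    return {
        metric: cohen_kappa_score(human_scores[metric], llm_scores[metric])
        for metric in human_scores
    }

# Toy data only, not the paper's results
human = {"toxicity": [0, 0, 1, 0], "persona_consistency": [4, 3, 5, 2]}
llm   = {"toxicity": [0, 0, 1, 0], "persona_consistency": [5, 2, 3, 4]}
print(agreement_per_metric(human, llm))
```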
Distributional analysis of LLM and human scores across tasks and languages further elucidates these trends:
Figure 2: LLM evaluations across 5 synthetically generated tasks.
Figure 3: Expert human evaluations across 5 synthetically generated tasks.
Confusion matrices and agreement analysis per task/language provide granular insights into areas of disagreement:
Figure 4: Confusion matrices showing agreement between human and LLM evaluators.
Figure 5: Agreement between human and LLM evaluators, broken down by task and by language.
Downstream Task Evaluation
Fine-tuning experiments were conducted on Llama-3.1-8B and Phi4-14B, comparing Updesh against Aya-Collection, IndicAlign, and Bactrian-X baselines. Evaluation spans NLU (multiple-choice), NLG (translation, summarization), and instruction-following (IFEval, IFBench) tasks, using both native and translated benchmarks.
Updesh yields significant improvements on NLG tasks, with Llama-Updesh and Phi4-Updesh achieving the highest ChrF scores across translation and summarization. NLU results are more nuanced: Phi4-Updesh attains the best overall scores on several benchmarks (MMLU-I, MILU, BoolQ-I, BeleBele, INCL, GlobalMMLU), but no single configuration dominates across all NLU tasks.
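ChrF scores of the kind reported here can be computed with sacrebleu; the snippet below is a generic illustration of the metric rather than the paper's evaluation harness, and the example strings are placeholders.

```python
from sacrebleu.metrics import CHRF

chrf = CHRF()  # character n-gram F-score; CHRF(word_order=2) would give chrF++

hypotheses = ["model output one", "model output two"]          # placeholder system outputs
references = [["reference text one", "reference text two"]]    # one reference per hypothesis

result = chrf.corpus_score(hypotheses, references)
print(result.score)  # higher is better
```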
Language-wise analysis demonstrates that Updesh delivers the largest relative gains in low- and mid-resource languages, effectively narrowing the gap to high-resource languages:
Figure 6: NLU and NLG task performance grouped by language–resource class taxonomy from Joshi et al. (2020). Updesh yields the largest relative gains in low/mid-resource languages.
Instruction-Following Capabilities
On instruction-following benchmarks, Updesh provides robust performance with minimal catastrophic forgetting, especially for Phi4. Fine-tuning generally degrades Llama's instruction-following performance, but Updesh incurs smaller drops than the other baselines. The results underscore the importance of training data format and distributional alignment with downstream tasks.
Discussion
- Quality Evaluation: LLM-based evaluators are insufficient for nuanced multilingual and multicultural quality assessment, necessitating scalable human-in-the-loop protocols.
- Task-Type Sensitivity: Updesh-trained models excel in NLG tasks due to long-context, generative training data, while NLU gains are attenuated by format mismatches.
- Distributional Mismatch: The composition shift in Updesh (long-form, multi-step, generative) amplifies NLG gains but limits NLU improvements, highlighting the need for task-aligned data curation.
- Resource Gap Bridging: Updesh demonstrates that context-aware, culturally grounded synthetic data can effectively bridge resource gaps, especially in low-resource languages.
Ethical Considerations
The paper details institutional oversight, annotator demographics, and rigorous quality assurance protocols. Native speakers were involved in annotation, with explicit attention to privacy, consent, and data sovereignty.
Conclusion
This work provides a comprehensive empirical study of synthetic data generation for multilingual, multicultural AI, with a focus on Indic languages. The Updesh dataset, constructed via a bottom-up, culturally grounded pipeline, demonstrates that synthetic data can substantially improve generative task performance and reduce resource disparities. However, the results indicate that no single strategy is universally optimal; effective multilingual AI development requires multi-faceted, context-aware data curation and evaluation methodologies. The public release of Updesh and associated protocols will facilitate transparent, reproducible research in this domain.