HinTel-AlignBench: Multimodal AI for Hindi & Telugu
- The paper introduces a scalable, semi-automated pipeline for curating high-fidelity multimodal datasets in Hindi and Telugu, addressing longstanding gaps in Indian language benchmarks.
- It integrates diverse vision–language tasks with systematically aligned English samples, enabling robust cross-lingual performance diagnostics and cultural context evaluation.
- Empirical results reveal significant performance drops when transitioning from English to Indian languages, underscoring the need for culturally anchored AI models.
HinTel-AlignBench is a framework and benchmark suite designed for rigorous evaluation of vision–language models (VLMs) in Hindi and Telugu, with systematically aligned English samples. It quantifies the cross-lingual and cross-cultural performance of both open-weight and closed-source models, addressing a longstanding deficit in high-quality multimodal benchmarks for Indian languages. HinTel-AlignBench introduces pipeline innovations for dataset curation, covers a broad range of vision–language tasks, and enables fine-grained diagnostics of model failures and cross-lingual regressions (Chigrupaatii et al., 19 Nov 2025).
1. Motivation and Design Objectives
The Indian linguistic landscape features 1.5 billion speakers and 120+ major languages, yet vision–language benchmarks have historically focused on English or other high-resource European languages. Existing multilingual VQA benchmarks suffer from four principal deficiencies: reliance on error-prone automated translations, restricted task and domain scope (e.g., only real-world VQA), minimal per-language sample sizes (often fewer than 300 QA pairs), and lack of culturally anchored native content. HinTel-AlignBench was established to address these gaps by providing:
- A scalable, semi-automated pipeline for evaluation set creation in Hindi and Telugu with English alignment.
- The largest multi-domain VQA benchmark for these languages, integrating adapted English datasets and original native Indic sets.
- Systematic cross-lingual model analysis with error mode categorization and performance quantification.
These design choices set a new standard for evaluating VLMs in the context of under-resourced, culturally rich languages (Chigrupaatii et al., 19 Nov 2025).
2. Dataset Creation Pipeline
HinTel-AlignBench uses a three-phase semi-automated framework to balance efficiency, scale, and linguistic fidelity:
2.1 Initial Generation
- For English-origin data (VQAv2, RealWorldQA, CLEVR-Math), both questions and answers are machine-translated using four MT systems (IndicTrans, Google, AWS, Azure). A held-out 50-sample subset is used to select the best system per language (Azure for Hindi, AWS for Telugu) based on highest BLEU and chrF scores; a selection sketch follows this list.
- For the VAANI cultural dataset, a text-only LLM (GPT-4.1) generates Hindi/Telugu MCQs from image captions.
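A minimal sketch of the system-selection step referenced above, assuming the four candidate systems' translations of the 50-sample held-out set are already collected. The sacrebleu scoring calls are standard; the data layout, helper name, and ranking rule (sum of BLEU and chrF) are illustrative assumptions.

```python
# Pick the best MT system per target language from held-out translations.
# Assumes: hypotheses maps system name -> candidate translations, in the
# same order as the gold references for that language.
import sacrebleu

def pick_best_system(hypotheses: dict[str, list[str]],
                     references: list[str]) -> str:
    scores = {}
    for system, hyps in hypotheses.items():
        bleu = sacrebleu.corpus_bleu(hyps, [references]).score
        chrf = sacrebleu.corpus_chrf(hyps, [references]).score
        scores[system] = bleu + chrf  # illustrative combined ranking rule
    return max(scores, key=scores.get)

# e.g. pick_best_system({"IndicTrans": ..., "Google": ..., "AWS": ..., "Azure": ...},
#                       gold_hindi)  # the paper reports Azure winning for Hindi
```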
2.2 Automated Filtering
- VAANI-H/T (Hindi/Telugu) sets are filtered using a GPT-4.1 judge to eliminate items answerable from the question text alone, ensuring genuine multimodal dependency (sketched below).
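A hedged sketch of this text-only answerability filter. The prompt wording, helper names, and pass/fail rule are assumptions; only the OpenAI chat-completions call itself is a real API.

```python
# Drop MCQs that a text-only judge can answer without seeing the image.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You see only the question and options below, not the image.\n"
    "Question: {question}\nOptions: {options}\n"
    "If you can confidently pick the correct option from the text alone, "
    "reply with that option; otherwise reply UNANSWERABLE."
)

def needs_image(question: str, options: list[str], gold: str) -> bool:
    """Keep an item only when the text-only judge fails to recover the answer."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, options=" | ".join(options))}],
    )
    verdict = resp.choices[0].message.content.strip()
    return verdict != gold  # text-answerable items are filtered out

# kept = [ex for ex in vaani_items if needs_image(ex["q"], ex["opts"], ex["ans"])]
```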
2.3 Human Verification
- All samples undergo native-speaker review for semantic accuracy, idiomaticity, and fluency. For VQAv2, 79% of machine translations are accepted as-is, with the remainder split between minor and major edits. Verification is measured to be 5× faster than authoring items from scratch.
This pipeline creates high-fidelity datasets and drastically improves scalability over purely manual construction, while mitigating translation drift and capturing linguistic/cultural nuances absent from automated translation alone.
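For illustration, the verification bookkeeping could be as simple as the following; the three outcome labels mirror the accept/minor/major split above, while the record layout is an assumption.

```python
# Track native-speaker review outcomes and the as-is acceptance rate.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Review:
    item_id: str
    outcome: str                       # "accepted" | "minor_edit" | "major_edit"
    corrected_text: str | None = None  # filled in when an edit was made

def acceptance_rate(reviews: list[Review]) -> float:
    """Fraction of machine translations accepted as-is (~79% for VQAv2)."""
    counts = Counter(r.outcome for r in reviews)
    return counts["accepted"] / max(len(reviews), 1)
```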
3. Benchmark Composition
HinTel-AlignBench encompasses adapted English datasets and original Indic datasets for comprehensive domain/task coverage across three languages (EN, HI, TE):
| Subset | EN | HI | TE |
|---|---|---|---|
| VQAv2 (OE-VQA) | 1000 | 1000 | 1000 |
| RealWorldQA (MC) | 765 | 765 | 765 |
| CLEVR-Math (OE) | 1000 | 1000 | 1000 |
| JEE-Vision (STEM) | 317 | 192 | 325 |
| VAANI (Cultural) | 1965 | 945 | 1020 |
- Adapted English Benchmarks: VQAv2 tests open-ended visual question answering on COCO images; RealWorldQA contains multiple-choice spatial-reasoning questions; CLEVR-Math assesses compositional visual-mathematical reasoning.
- Native Indic Sets: JEE-Vision targets diagram-dependent STEM problems from JEE-Advanced (Hindi) and JEE-Mains (Telugu); VAANI contains culturally-grounded MCQs probing Indian art, tradition, and festival knowledge.
This selection produces approximately 4,000 QA items per language, spanning open-ended, multiple-choice, STEM, and cultural domains (Chigrupaatii et al., 19 Nov 2025).
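As a hypothetical record layout, an aligned item might carry the following fields; all names are illustrative, but the content mirrors the subsets described above.

```python
# One benchmark item with its English-aligned counterpart referenced by id.
item = {
    "id": "vqav2-hi-000123",            # illustrative identifier
    "subset": "VQAv2",                  # VQAv2 | RealWorldQA | CLEVR-Math | JEE-Vision | VAANI
    "language": "hi",                   # en | hi | te
    "image": "coco/000000123456.jpg",   # image reference (COCO for VQAv2)
    "question": "...",                  # question text in the target language
    "choices": None,                    # option list for MC subsets, None for open-ended
    "answer": "...",
    "aligned_en_id": "vqav2-en-000123", # link to the systematically aligned English sample
}
```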
4. Evaluation Protocol and Metrics
Task-specific metrics are tightly defined to enable fair, reproducible cross-lingual and cross-model comparisons.
- Open-Ended VQA (VQAv2, CLEVR-Math): Uses hybrid Exact-Match (EM) plus LLM-Judge scoring. EM is computed as

$$\mathrm{EM}(\hat{a}, a) = \mathbb{1}\big[\mathrm{norm}(\hat{a}) = \mathrm{norm}(a)\big],$$

where $\mathrm{norm}(\cdot)$ denotes answer normalization (lowercasing, punctuation stripping). For non-matching responses, an LLM (gpt-4.1-2025-04-14) assesses semantic equivalence.
- Multiple-Choice (RealWorldQA, VAANI): Scored by exact-match accuracy of the predicted option against the gold option.
- JEE-Vision (MC, multi-MC, integer): Employs regex answer parsing, with binary scoring for single-choice and integer answers and partial credit (0.25 per correct option) for multi-select questions (see the scoring sketch below).
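A minimal scoring sketch under stated assumptions: the normalization rule, the judge interface, and the treatment of wrong multi-select picks (no negative marking here) are illustrative stand-ins, not the paper's exact implementation.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def score_open_ended(pred: str, gold: str, llm_judge) -> float:
    """Hybrid EM + LLM-Judge: exact match first, semantic check as fallback."""
    if normalize(pred) == normalize(gold):
        return 1.0
    return 1.0 if llm_judge(pred, gold) else 0.0  # judge returns True on equivalence

def score_multi_select(pred: set[str], gold: set[str]) -> float:
    """JEE-Vision multi-select: 0.25 credit per correctly chosen option."""
    return min(0.25 * len(pred & gold), 1.0)
```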
The regression in model accuracy from English to Hindi ($\Delta_{\mathrm{HI}}$) and Telugu ($\Delta_{\mathrm{TE}}$) is formally quantified as

$$\Delta_{\mathrm{HI}} = \mathrm{Acc}_{\mathrm{EN}} - \mathrm{Acc}_{\mathrm{HI}}, \qquad \Delta_{\mathrm{TE}} = \mathrm{Acc}_{\mathrm{EN}} - \mathrm{Acc}_{\mathrm{TE}}.$$
This protocol enables robust, granular measurement of cross-lingual and cross-modal performance, critical in multilingual vision–language analysis (Chigrupaatii et al., 19 Nov 2025).
5. Empirical Results and Error Analysis
Extensive benchmarking reveals a consistent regression in VLM performance when moving from English to Indian languages for all evaluated tasks and models.
- Aggregate Regression: Accuracy drops consistently from English to both Hindi and Telugu. On aligned subsets, Hindi and Telugu scores differ by roughly a point, confirming that the regression is systemic from English rather than specific to one language.
- Model-Specific Drops: GPT-4.1 and Gemini-2.5-Flash see drops of 3.8–8.6 points in HI/TE, with the highest task-level deficits on CLEVR-Math and RealWorldQA.
- Error Mode Taxonomy: On VAANI-T, GPT-4.1’s errors are classified as follows:
- Lack of Indian Knowledge: 16–17%
- Visual Grounding Errors: 49–50%
- Visual Perception Failures: 18–19%
- Failure to Contextualize Culturally: 15–16%
Error frequencies are highly consistent between English and Telugu, indicating that multimodal grounding, rather than language-specific knowledge, is the dominant bottleneck.
6. Dataset and Model Recommendations
Concrete areas for enhancement are identified:
- Dataset Quality: For VAANI-H/T, improved MCQ distractors (e.g., via Multi-Binary Accuracy) and richer LLM-based filtering. For JEE-Vision, expansion toward additional diagram-rich domains (e.g., JEE-Mains).
- Evaluation Tools: Replacement of paid LLM evaluators with open-weight judges for open-source accessibility; expansion of evaluation methodology to video-based tasks covering cultural variety.
- Model Architecture: Embedding explicit stepwise reasoning via Chain-of-Thought (CoT) prompting yields a larger performance boost in English than in Hindi on reasoning tasks, suggesting a training-corpus bias effect. Further directions include enriching training with native-script and cultural corpora, and adding modular cultural-knowledge adapters to strengthen region-specific grounding; a prompting sketch follows this list.
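As a concrete but hypothetical illustration of the CoT comparison above, a prompt wrapper and answer extractor might look like this; the template wording is an assumption, not the paper's exact prompt.

```python
COT_TEMPLATE = (
    "{question}\n"
    "Think step by step about what the image shows before answering, "
    "then give the final answer on its own line prefixed with 'Answer:'."
)

def extract_answer(completion: str) -> str:
    """Pull the final answer line out of a CoT-style completion."""
    for line in reversed(completion.splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return completion.strip()  # fall back to the raw completion
```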
7. Impact and Future Work
HinTel-AlignBench provides the most comprehensive and methodologically rigorous evaluation resource for vision–language modeling in Hindi and Telugu, facilitating performance tracking on native, aligned, and culturally nuanced tasks. By releasing a reproducible pipeline and detailed empirics, HinTel-AlignBench sets a foundation for community-driven extension across India’s 122+ languages and multimodal domains such as video.
This benchmark highlights the persistent gap in cross-lingual vision–language understanding, centralizes Indian multimodal challenges, and points to the necessity of culturally anchored evaluation for equitable AI development in low-resource settings (Chigrupaatii et al., 19 Nov 2025).