
HinTel-AlignBench: Multimodal AI for Hindi & Telugu

Updated 26 November 2025
  • The paper introduces a scalable, semi-automated pipeline for curating high-fidelity multimodal datasets in Hindi and Telugu, addressing longstanding gaps in Indian language benchmarks.
  • It integrates diverse vision–language tasks with systematically aligned English samples, enabling robust cross-lingual performance diagnostics and cultural context evaluation.
  • Empirical results reveal significant performance drops when transitioning from English to Indian languages, underscoring the need for culturally anchored AI models.

HinTel-AlignBench is a framework and benchmark suite designed for rigorous evaluation of Vision-LLMs (VLMs) in Hindi and Telugu, with systematically aligned English samples. It targets the quantification of cross-lingual and cross-cultural performance of both open-weight and closed-source models, addressing a longstanding deficit in high-quality benchmarks for Indian languages in the context of multimodal AI. HinTel-AlignBench introduces pipeline innovations for dataset curation, covers a broad range of vision–language tasks, and enables fine-grained diagnostics of model failures and cross-lingual regressions (Chigrupaatii et al., 19 Nov 2025).

1. Motivation and Design Objectives

The Indian linguistic landscape spans 1.5 billion speakers and more than 120 major languages, yet vision–language benchmarks have historically focused on English and other high-resource European languages. Existing multilingual VQA benchmarks suffer from four principal deficiencies: reliance on error-prone automated translations, restricted task and domain scope (e.g., only real-world VQA), minimal per-language sample sizes (often fewer than 300 QA pairs), and a lack of culturally anchored native content. HinTel-AlignBench was established to address these gaps by providing:

  • A scalable, semi-automated pipeline for evaluation set creation in Hindi and Telugu with English alignment.
  • The largest multi-domain VQA benchmark for these languages, integrating adapted English datasets and original native Indic sets.
  • Systematic cross-lingual model analysis with error mode categorization and performance quantification.

These design choices set a new standard for evaluating VLMs in the context of under-resourced, culturally rich languages (Chigrupaatii et al., 19 Nov 2025).

2. Dataset Creation Pipeline

HinTel-AlignBench uses a three-phase semi-automated framework to balance efficiency, scale, and linguistic fidelity:

2.1 Initial Generation

  • For English-origin data (VQAv2, RealWorldQA, CLEVR-Math), both questions and answers are translated automatically using four MT systems (IndicTrans, Google, AWS, Azure). A held-out 50-sample subset is used to select Azure (Hindi) and AWS (Telugu) based on the highest BLEU and chrF scores; a selection sketch follows this list.
  • For the VAANI cultural dataset, a text-only LLM (GPT-4.1) generates Hindi/Telugu MCQs from image captions.
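
A minimal sketch of the MT-system selection step, assuming candidate translations and gold references for the 50-sample held-out subset are already collected; the sacrebleu calls are standard, but the data wiring is illustrative rather than the paper's exact implementation:

```python
# Sketch: pick the best MT system per language by corpus BLEU on the
# held-out subset, with chrF as a tiebreaker. Requires `pip install sacrebleu`.
import sacrebleu

def score_system(hypotheses: list[str], references: list[str]) -> tuple[float, float]:
    """Return (BLEU, chrF) for one system's translations against gold references."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    return bleu, chrf

def pick_best_system(outputs: dict[str, list[str]], references: list[str]) -> str:
    """outputs maps system name -> its translations of the held-out subset,
    e.g. keys {'IndicTrans', 'Google', 'AWS', 'Azure'}."""
    return max(outputs, key=lambda name: score_system(outputs[name], references))
```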

2.2 Automated Filtering

  • VAANI-H/T (Hindi/Telugu) sets are filtered using a GPT-4.1 judge to eliminate items answerable from the text alone, ensuring genuine multimodal dependency.
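
A minimal sketch of this text-only answerability filter; the judge prompt, YES/NO protocol, and client wiring are assumptions for illustration, not the paper's exact recipe:

```python
# Filter: keep only items the judge says cannot be answered without the image.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You will see a multiple-choice question WITHOUT its image.\n"
    "Answer YES if the question can be answered correctly from the text alone,\n"
    "and NO if the image is required.\n\nQuestion:\n{question}\nOptions:\n{options}"
)

def requires_image(question: str, options: str, model: str = "gpt-4.1") -> bool:
    """True if the judge says the image is needed, i.e. the item should be kept."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, options=options)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("NO")

# filtered = [ex for ex in vaani_items if requires_image(ex["question"], ex["options"])]
```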

2.3 Human Verification

  • All samples undergo native-speaker review for semantic accuracy, idiomaticity, and fluency. For VQAv2, 79% of machine translations are accepted as-is, with the remainder split between minor and major edits. Manual verification is measured to be 5× faster than original authoring.

This pipeline creates high-fidelity datasets and drastically improves scalability over purely manual construction, while mitigating translation drift and capturing linguistic/cultural nuances absent from automated translation alone.

3. Benchmark Composition

HinTel-AlignBench encompasses adapted English datasets and original Indic datasets for comprehensive domain/task coverage across three languages (EN, HI, TE):

Subset              EN     HI     TE
VQAv2 (OE-VQA)      1000   1000   1000
RealWorldQA (MC)    765    765    765
CLEVR-Math (OE)     1000   1000   1000
JEE-Vision (STEM)   317    192    325
VAANI (Cultural)    1965   945    1020
  • Adapted English Benchmarks: VQAv2 tests open-ended visual question answering on COCO images; RealWorldQA poses multiple-choice spatial-reasoning questions; CLEVR-Math assesses compositional visual–mathematical reasoning.
  • Native Indic Sets: JEE-Vision targets diagram-dependent STEM problems drawn from JEE-Advanced (Hindi) and JEE-Mains (Telugu); VAANI contains culturally grounded MCQs probing Indian art, tradition, and festival knowledge.

This selection produces approximately 4,000 QA items per language, spanning open-ended, multiple-choice, STEM, and cultural domains (Chigrupaatii et al., 19 Nov 2025).

4. Evaluation Protocol and Metrics

Task-specific metrics are tightly defined to enable fair, reproducible cross-lingual and cross-model comparisons.

  • Open-Ended VQA (VQAv2, CLEVR-Math): Uses hybrid Exact-Match (EM) plus LLM-Judge. EM is computed as

\mathrm{EM} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\hat{a}_i = a_i\}

For non-matching responses, an LLM (gpt-4.1-2025-04-14) assesses semantic equivalence.
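
A minimal sketch of this hybrid scorer, with a placeholder normalization and the LLM judge abstracted as a black-box callable (both assumptions, not the paper's exact implementation):

```python
# Hybrid open-ended scoring: exact match first, LLM-judge fallback for
# answers that differ on the surface but may be semantically equivalent.
from typing import Callable

def normalize(text: str) -> str:
    """Placeholder normalization; the real pipeline may do more."""
    return text.strip().lower()

def exact_match(preds: list[str], golds: list[str]) -> float:
    """EM = (1/N) * #{i : pred_i == gold_i}, as in the formula above."""
    return sum(normalize(p) == normalize(g) for p, g in zip(preds, golds)) / len(golds)

def hybrid_correct(pred: str, gold: str, judge: Callable[[str, str], bool]) -> bool:
    """True if pred matches gold exactly or the judge deems it equivalent."""
    if normalize(pred) == normalize(gold):
        return True
    return judge(pred, gold)  # e.g. a gpt-4.1 semantic-equivalence prompt
```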

  • Multiple-Choice (RealWorldQA, VAANI):

\mathrm{Acc} = \frac{\#\text{correct}}{\#\text{total}}

  • JEE-Vision (MC, multi-MC, integer): Employs regex parsing of model outputs, with binary scoring for single-answer MC and integer questions, and partial credit (0.25 per correct option) for multi-select questions.
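
A sketch of this scoring step, assuming answers are extracted with a simple regex; the parsing pattern, and the rule that a wrong selection voids partial credit, are assumptions of this sketch:

```python
import re

def parse_options(response: str) -> set[str]:
    """Extract selected options (A-D) from a free-form model response."""
    return set(re.findall(r"\b([A-D])\b", response.upper()))

def score_single(response: str, gold: str) -> float:
    """Binary credit for single-answer MC (integer answers scored analogously)."""
    return 1.0 if parse_options(response) == {gold} else 0.0

def score_multi_select(response: str, gold: set[str]) -> float:
    """0.25 credit per correctly selected option; treating any wrong
    selection as voiding the answer is an assumption here."""
    pred = parse_options(response)
    if pred - gold:
        return 0.0
    return 0.25 * len(pred & gold)
```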

The regression in model accuracy from English to Hindi (\Delta_{HI}) and Telugu (\Delta_{TE}) is formally quantified as:

\Delta_{HI} = \mathrm{Acc}_{EN} - \mathrm{Acc}_{HI}, \qquad \Delta_{TE} = \mathrm{Acc}_{EN} - \mathrm{Acc}_{TE}
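
For concreteness, a tiny worked computation of these deltas; the accuracies below are placeholders chosen only to reproduce the average drops reported in Section 5, not actual per-model results:

```python
# Cross-lingual regression: Delta_XX = Acc_EN - Acc_XX, in accuracy points.
acc = {"EN": 72.4, "HI": 64.1, "TE": 66.9}  # placeholder accuracies

delta_hi = acc["EN"] - acc["HI"]  # ~8.3 points
delta_te = acc["EN"] - acc["TE"]  # ~5.5 points
```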

This protocol enables robust, granular measurement of cross-lingual and cross-modal performance, critical in multilingual vision–language analysis (Chigrupaatii et al., 19 Nov 2025).

5. Empirical Results and Error Analysis

Extensive benchmarking reveals a consistent regression in VLM performance when moving from English to Indian languages for all evaluated tasks and models.

  • Aggregate Regression: Average drops of \overline{\Delta}_{HI} \approx 8.3 points for Hindi and \overline{\Delta}_{TE} \approx 5.5 points for Telugu. On aligned subsets, Hindi–Telugu differences are below 1 point, confirming a systemic regression from English.
  • Model-Specific Drops: GPT-4.1 and Gemini-2.5-Flash see drops of 3.8–8.6 points in Hindi/Telugu, with the highest task-level deficits on CLEVR-Math and RealWorldQA.
  • Error Mode Taxonomy: On VAANI-T, GPT-4.1’s errors are classified as follows:

    1. Lack of Indian Knowledge: 16–17%
    2. Visual Grounding Errors: 49–50%
    3. Visual Perception Failures: 18–19%
    4. Failure to Contextualize Culturally: 15–16%

Error frequencies are highly consistent between English and Telugu, indicating that multimodal grounding, rather than language-specific knowledge, is the dominant bottleneck.

6. Dataset and Model Recommendations

Concrete areas for enhancement are identified:

  • Dataset Quality: For VAANI-H/T, improved MCQ distractors (e.g., via Multi-Binary Accuracy) and richer LLM-based filtering. For JEE-Vision, expansion toward additional diagram-rich domains (e.g., JEE-Mains).

  • Evaluation Tools: Replacement of paid LLM evaluators with open-weight judges for open-source accessibility; expansion of evaluation methodology to video-based tasks covering cultural variety.
  • Model Architecture: Embedding explicit stepwise reasoning via Chain-of-Thought (CoT) prompting yields a larger boost in English (+3.78 points) than in Hindi (+0.79 points) on reasoning tasks, suggesting a training-corpus bias. Further model directions include enrichment with native-script and cultural corpora, and modular cultural-knowledge adapters for stronger region-specific grounding.

7. Impact and Future Work

HinTel-AlignBench provides the most comprehensive and methodologically rigorous evaluation resource for vision–language modeling in Hindi and Telugu, facilitating performance tracking on native, aligned, and culturally nuanced tasks. By releasing a reproducible pipeline and detailed empirics, HinTel-AlignBench sets a foundation for community-driven extension across India’s 122+ languages and multimodal domains such as video.

This benchmark highlights the persistent gap in cross-lingual vision–language understanding, centralizes Indian multimodal challenges, and points to the necessity of culturally anchored evaluation for equitable AI development in low-resource settings (Chigrupaatii et al., 19 Nov 2025).
