- The paper introduces the LLM Probe framework that rigorously evaluates LLMs on low-resource, morphologically complex languages using Tigrinya as a case study.
- It employs lexicon-based alignment, POS tagging, morphosyntactic probing, and BLEU-based translation analysis to assess diverse LLM capabilities.
- Findings reveal that sequence-to-sequence models outperform causal LMs on morphosyntactic tasks, while causal LMs remain competitive on lexical alignment, indicating decoupled lexical and grammatical competencies.
Systematic Evaluation of LLMs for Low-Resource and Morphologically Complex Languages: The LLM Probe Framework
Motivation and Background
The paper "LLM Probe: Evaluating LLMs for Low-Resource Languages" (2603.29517) addresses the lack of systematic evaluation protocols for LLMs operating on low-resource, morphologically rich languages, taking Tigrinya as a representative case. Tigrinya presents significant challenges for NLP because of its complex root-and-pattern morphology, extensive verbal inflection, and rich agreement systems, all coupled with the scarcity of machine-readable resources and foundational NLP tools. The authors identify the critical issue of inflated performance metrics in multilingual LLMs, as standard benchmarks often overlap with LLM training data and do not capture typologically diverse grammatical phenomena.
Framework and Dataset Construction
The authors introduce LLM Probe, a lexicon-based, linguistically calibrated framework for fine-grained evaluation of LLMs. The evaluation centers on four competencies (a toy scoring sketch follows the list):
- Lexical Alignment: Assesses bidirectional word-to-word correspondence to measure the precision and consistency of the model's lexical mapping between English and Tigrinya.
- POS Tagging: Targets token-level syntactic recognition to probe morphosyntactic sensitivity.
- Morphosyntactic Probing: Examines recognition of critical grammatical features like agreement, gender, number, and noun class.
- Translation Fidelity: Evaluates both semantic adequacy and surface fluency using BLEU and token-level metrics.
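The paper releases its own evaluation harness, which is not reproduced here. As a rough illustration only, the sketch below scores the first task, lexical alignment, as exact-match accuracy over a gold lexicon; the `query_model` wrapper and the prompt template are hypothetical assumptions, not the authors' code.

```python
# Minimal sketch of an exact-match scorer for the lexical alignment task.
# The prompt template and query_model wrapper are illustrative assumptions.
from typing import Callable

def lexical_alignment_accuracy(
    lexicon: dict[str, str],            # English -> Tigrinya gold pairs
    query_model: Callable[[str], str],  # wraps a single LLM call (assumption)
) -> float:
    """Fraction of lexicon entries the model translates exactly."""
    correct = 0
    for en_word, ti_word in lexicon.items():
        prediction = query_model(f"Translate the English word to Tigrinya: {en_word}")
        correct += int(prediction.strip() == ti_word)
    return correct / len(lexicon)
```

The same pattern, with a different gold table and comparison function, generalizes to the POS tagging and morphosyntactic probing tasks.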
A notable methodological contribution is the dataset: 7,234 annotated phrase pairs between English and Tigrinya, including 967 multi-word English expressions and 2,000 multi-word Tigrinya expressions. Construction involved a hybrid pipeline of OCR, manual correction, and integration of printed dictionaries, curated by native Tigrinya-speaking linguists. Annotation integrity is ensured via high inter-annotator agreement (Cohen's κ of 0.84–0.91), and the dataset is precisely aligned for bidirectional tasks, encompassing fine-grained morphosyntactic detail.
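For reference, Cohen's κ corrects raw agreement for the agreement expected by chance. A minimal sketch with hypothetical label sequences (the paper reports κ of 0.84–0.91 on its real annotations):

```python
# Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
# and p_e is chance agreement. The label sequences below are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN"]
annotator_b = ["NOUN", "VERB", "ADJ",  "ADJ", "VERB", "NOUN"]

print(f"kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```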
Experimental Design
The study benchmarks several open-source models from both the causal (Falcon-10B, Gemma-2B/7B, Mistral-7B, Qwen-7B) and sequence-to-sequence (mT5-base, mT5-large, ByT5) families. Each model is evaluated under strictly controlled conditions, with standardized prompts, deterministic decoding (temperature zero), and fixed input formatting.
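The paper's exact harness is not reproduced here; as a minimal sketch, "temperature zero" is typically realized as greedy decoding, e.g. with Hugging Face transformers. The model name and prompt template below are illustrative assumptions.

```python
# Minimal sketch of deterministic (greedy) decoding with Hugging Face
# transformers. Model name and prompt are illustrative, not the paper's setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/mt5-base"  # one of the evaluated seq2seq families
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

prompt = "Translate English to Tigrinya: The children are playing."  # hypothetical template
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=False yields greedy decoding: the argmax token is taken at each
# step, so repeated runs produce identical, reproducible outputs.
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```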
Task-specific accuracy, BLEU for translation, and detailed confusion matrices are used for quantitative assessment. All experiment artifacts, scripts, and logs are open-sourced, ensuring high reproducibility and transparency.
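Both headline metrics are standard. A minimal sketch with toy data (sentences and tags are hypothetical, not drawn from the paper's dataset), using sacrebleu and scikit-learn:

```python
# Toy illustration of the two quantitative metrics named above.
import sacrebleu
from sklearn.metrics import confusion_matrix

# Corpus-level BLEU: a list of hypotheses and a list of reference streams.
hypotheses = ["the children play in the yard"]
references = [["the children are playing in the yard"]]
print(sacrebleu.corpus_bleu(hypotheses, references).score)

# Confusion matrix over gold (rows) vs. predicted (columns) POS tags,
# e.g. to surface the noun-verb confusions discussed in the results.
gold = ["NOUN", "VERB", "NOUN", "ADJ", "VERB"]
pred = ["NOUN", "NOUN", "NOUN", "ADJ", "VERB"]
print(confusion_matrix(gold, pred, labels=["NOUN", "VERB", "ADJ"]))
```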
Empirical Results and Analysis
Sequence-to-sequence multilingual models (ByT5, mT5) consistently outperform causal LMs in morphosyntactic analysis, POS tagging, and translation (BLEU up to 26.5 for mT5-large; ByT5 and mT5-base/large reach ≥75% morphosyntactic accuracy). In contrast, causal LMs such as Qwen-7B, Falcon-10B, and the Gemma models show markedly lower morphosyntactic and translation accuracy while remaining competitive on lexical alignment, indicating that lexical and grammatical competencies are decoupled across LLM families. Notably, semantic-pragmatic confusion persists, especially for morphologically ambiguous categories, as shown by the confusion matrix analyses.
A critical empirical finding is that strong lexical alignment (accuracy of 1.0 for many models) can coexist with systematic morphosyntactic errors that surface-level translation metrics fail to reveal, underscoring the necessity of targeted probing beyond typical BLEU-based evaluation. The bidirectional structure of the dataset further exposes architectural asymmetries when models translate in opposite directions, such as distinct noun-verb confusion patterns.
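To make BLEU's blind spot concrete, consider a toy example (hypothetical English sentences, not from the paper): a single broken subject-verb agreement costs only a handful of n-grams, so an ungrammatical hypothesis still earns a high score.

```python
# Toy illustration of why BLEU can mask a systematic agreement error:
# one wrong inflection perturbs only the n-grams that cross it.
import sacrebleu

reference = "the two girls are singing a song in the yard"
agreement_error = "the two girls is singing a song in the yard"  # agreement broken

score = sacrebleu.sentence_bleu(agreement_error, [reference])
print(score.score)  # still high, despite an ungrammatical output
```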
Implications for Multilingual NLP
The LLM Probe framework sets a methodological standard for resource construction and benchmarking in typologically diverse, underrepresented languages. By offering linguistically principled, manually verified resources, the approach mitigates training set contamination and overfitting, a persistent concern in evaluating LLMs on low-resource languages. Integration of morphosyntactic probing enables the identification of fundamental deficiencies—such as agreement errors and category confusion—typically missed by corpus-derived or generative metrics.
The framework’s open resources facilitate controlled cross-model and cross-paradigm comparisons, advancing the reproducibility, interpretability, and generalizability of multilingual LLM evaluation. This positions LLM Probe as a template for broader adaptation to other morphologically rich, low-resource scenarios.
Limitations and Future Directions
Despite substantial methodological advances, extending the framework is constrained by the cost of manual resource curation, limited domain and register coverage, and the absence of automated tools (e.g., tokenizers, morphological analyzers) for Tigrinya. Proprietary LLMs are not evaluated, and the dataset does not represent specialized or informal linguistic domains. Prompt-based evaluation, though standardized, may remain sensitive to minor prompt and context changes. The authors identify automated annotation pipelines and expanded linguistic coverage (including dialects, pragmatic features, and domain specificity) as critical directions for future research.
Conclusion
LLM Probe provides a rigorous, reproducible, and linguistically calibrated framework for the evaluation of LLMs on morphologically rich, low-resource languages. Through comprehensive, high-quality bilingual resources and multi-dimensional probing tasks, this work establishes a foundation for systematic benchmarking and error analysis. The methodology and dataset enable the identification of cross-architectural strengths and weaknesses, revealing that byte-level and multilingual encoder-decoder models (ByT5, mT5) are currently most robust on Tigrinya. The framework’s open-source nature and emphasis on annotation quality and morphosyntactic awareness make it a model for inclusive and equitable advancement in multilingual NLP.
Future research must advance toward automated, scalable annotation, expansion into domain-specific registers, and the integration of richer semantic and pragmatic diagnostic tasks. Collaboration with speaker communities and sustained infrastructural investment will be prerequisites for closing the LLM performance gap across the world’s linguistic diversity.