Yiri Dataset for French–Bambara Translation
- The Yiri dataset is a purpose-built, large-scale French–Bambara parallel corpus aggregating 353,629 sentence pairs from diverse domains, including agriculture and medicine.
- The dataset’s rigorous preprocessing, including text normalization, duplicate removal, and systematic partitioning, facilitates robust training and evaluation for translation pipelines.
- Benchmark evaluations using BLEU and chrF metrics demonstrate that transformer-based models achieve substantially higher translation performance on Yiri than on existing French–Bambara benchmarks.
The Yiri dataset is a purpose-built, large-scale parallel corpus for French–Bambara translation designed to address challenges in low-resource neural machine translation. Created specifically for comparative evaluation of transformer-based translation pipelines, Yiri assembles 353,629 French–Bambara aligned sentence pairs from diverse domains, notably agriculture and medicine. This extensive, curated dataset is distinguished from established benchmarks by its size, heterogeneity, and meticulous preprocessing, enabling rigorous assessment of model architectures and highlighting the impact of dataset quality on translation performance.
1. Compilation and Structural Properties
The Yiri dataset was constructed by aggregating sentence pairs from several heterogeneous sources, drawing upon both the agricultural and medical domains to ensure broad relevance and social impact in communities where Bambara is spoken. Unlike extant benchmarks such as Bayelemagaba and Mafand-MT, Yiri was developed as an original resource to facilitate cross-domain translation experiments. The preprocessing pipeline employed text normalization, duplicate removal, and exclusion of outlier content such as hyperlinks and emojis to ensure data quality.
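The paper does not publish its cleaning scripts, so the following is only a minimal sketch of such a pipeline, assuming tab-separated French–Bambara pairs; the regular expressions for hyperlinks and emoji are illustrative, not the paper's exact filters:

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Illustrative emoji/symbol ranges; the paper's exact filter is unspecified.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def normalize(text: str) -> str:
    """Unicode-normalize and collapse internal whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def clean_pairs(pairs):
    """Normalize, drop outlier pairs (hyperlinks, emoji), and deduplicate."""
    seen, cleaned = set(), []
    for fr, bam in pairs:
        fr, bam = normalize(fr), normalize(bam)
        if not fr or not bam:
            continue
        if URL_RE.search(fr) or URL_RE.search(bam):
            continue
        if EMOJI_RE.search(fr) or EMOJI_RE.search(bam):
            continue
        if (fr, bam) in seen:
            continue
        seen.add((fr, bam))
        cleaned.append((fr, bam))
    return cleaned
```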
After preprocessing, the corpus was split into fixed partitions: 80% for training, 10% for validation, and 10% for testing. Yiri encompasses 353,629 parallel sentence pairs, with detailed statistics, including token counts for both Bambara and French, presented in Table 2 of the source paper.
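A fixed 80/10/10 partition can be reproduced with a simple shuffled split; the seed below is an arbitrary choice for illustration, not one reported in the paper:

```python
import random

def split_corpus(pairs, seed=42):
    """Shuffle and partition into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])
```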
2. Function as a Benchmark in Translation Pipelines
Within the referenced comparative paper, Yiri serves both as a principal training corpus and as a primary evaluation benchmark for three distinct translation pipelines:
- A transformer-based model trained from scratch for French–Bambara translation.
- Fine-tuned LLaMA3 instructor models (3B and 8B variants) employing decoder-only architectures.
- The LoReB hybrid pipeline integrating LaBSE embeddings with a T5-based decoder.
The dataset’s substantial size and multi-domain composition support the assessment of model generalization under realistic low-resource conditions. Its use as test data allows direct comparison of pipeline performance, contrasting its impact with that of other benchmarks of more limited scope.
3. Evaluation Metrics and Comparative Model Performance
Model evaluation on the Yiri dataset utilizes BLEU and chrF metrics, in accordance with neural machine translation standards. Notably, the transformer-based approach (specifically the T2 configuration) achieved a BLEU score of 33.81% and a chrF score of 41.00% on the Yiri test set. These results contrast sharply with scores observed for alternative benchmarks:
| Dataset | BLEU (%) | chrF (%) |
|---|---|---|
| Yiri (T2 config) | 33.81 | 41.00 |
| Bayelemagaba | ≈10.28 | 21.01 |
| Mafand-MT | ≈9.44 | 20.12 |
| FLORES+ | ≈7.63 | 18.07 |
The pronounced differential suggests that Yiri’s extensive and well-aligned content enables more effective modeling of nuanced language features in Bambara and French, significantly improving translation quality compared to existing benchmarks.
4. Mathematical Models and Loss Function Specification
The referenced paper employs established toolkits, including JoeyNMT, for the computation of BLEU and chrF metrics. No explicit mathematical formulas for these metrics are provided within the documentation.
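Both metrics are implemented in the widely used sacrebleu package, which JoeyNMT builds on; as a minimal standalone sketch (the sentence pairs below are placeholders, not data from Yiri):

```python
import sacrebleu

# Illustrative system outputs and references (placeholder Bambara text).
hypotheses = ["dumuni ka di", "an bɛ taa so"]
references = [["dumuni ka di kosɛbɛ", "an bɛ taa so"]]

# Corpus-level BLEU and chrF, as reported in the paper's evaluations.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```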
A loss function, central to the LoReB pipeline’s cross-lingual distillation, is defined as follows (Equation 1 from the paper):

$$\mathcal{L} \;=\; \frac{1}{|\mathcal{B}|}\sum_{j \in \mathcal{B}} \Big[ \big(M(s_j) - \hat{M}(s_j)\big)^{2} + \big(M(s_j) - \hat{M}(t_j)\big)^{2} \Big]$$

where $M(s_j)$ is the teacher model embedding for the $j$-th source sentence, $\hat{M}(s_j)$ and $\hat{M}(t_j)$ are the student model embeddings for source and target respectively, and $|\mathcal{B}|$ denotes the minibatch size. This objective strengthens semantic alignment between French and Bambara representations under low-resource training constraints.
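A direct PyTorch rendering of this objective might look like the sketch below, assuming the squared term is the mean-squared error between embedding vectors (a common convention in LaBSE-style cross-lingual distillation); the function name and tensor layout are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_src: torch.Tensor,
                      student_src: torch.Tensor,
                      student_tgt: torch.Tensor) -> torch.Tensor:
    """Eq. 1 sketch: pull the student's source (French) and target (Bambara)
    embeddings toward the teacher's source embedding, averaged over the
    minibatch. All tensors have shape (batch, dim)."""
    return (F.mse_loss(student_src, teacher_src)
            + F.mse_loss(student_tgt, teacher_src))
```

Here `teacher_src` would come from a frozen teacher encoder such as LaBSE, while `student_src` and `student_tgt` come from the trainable student, so minimizing the loss aligns the two languages in a shared embedding space.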
5. Significance and Insights for Low-Resource Translation
The creation of the Yiri dataset represents a substantial advance for low-resource French–Bambara machine translation. Data scarcity has long limited progress in translation quality for Bambara; Yiri’s scale and curation confront this barrier, enabling richer language modeling and more robust translation pipelines. The paper’s findings demonstrate that, on a comprehensive and well-curated dataset such as Yiri, simpler architectures (e.g., the balanced Transformer T2) can outperform more intricate models, yielding substantially higher BLEU and chrF scores.
Additionally, analysis revealed that aggregated, multi-source datasets such as Yiri enable improved generalization, in contrast to highly domain-specific benchmarks, where instructor-based models capture patterns more effectively but may not generalize as broadly. This suggests that investment in high-quality dataset construction yields direct benefits for translation-system development in underrepresented languages.
Overall, the Yiri dataset functions both as a novel resource and essential experimental benchmark, underscoring the critical role of data quality and representativeness in advancing neural translation technologies for low-resource languages. Its methodology and integration in evaluation frameworks provide a template for future data-centric efforts in the field.