Aya Dataset: Multilingual Instruction Tuning

Updated 3 February 2026
  • Aya Dataset is a participatory, open-access corpus that combines rigorously curated human annotations with an extensive, automatically constructed multilingual collection.
  • It features a comprehensive annotation platform, templated collections, and an evaluation suite designed to ensure linguistic fidelity and robust quality control.
  • The dataset supports multilingual generative language modeling and cross-lingual applications with approximately 203 million prompt-completion pairs spanning 101 languages.

The Aya Dataset is an open-access, large-scale, and participatory collection of high-quality instruction-tuning data designed to support multilingual generative language modeling across diverse linguistic and cultural contexts. Developed by Cohere For AI, Aya constitutes both a rigorously curated human-annotated dataset and an extensive, automatically constructed multilingual collection, accompanied by a dedicated annotation platform and evaluation suite. The initiative prioritizes linguistic diversity, data openness under permissive licensing, and methodological transparency (Üstün et al., 2024, Singh et al., 2024).

1. Core Components and Structure

The Aya initiative is organized into four principal components:

  1. Aya Annotation Platform: A web-based interface supporting 182 languages, facilitating broad community participation in prompt-completion annotation, re-annotation, and peer review.
  2. Aya Dataset: A human-curated corpus of 204,114 prompt-completion pairs in 65 languages, focusing on linguistic and cultural authenticity.
  3. Aya Collection: A comprehensive assembly of 513,579,625 templated and translated instances covering 114 languages, drawn from existing datasets enhanced via community templating or automatic translation.
  4. Aya Evaluation Suite: Multi-tiered evaluation benchmarks for multilingual instruction following, including human-annotated and machine-translated test prompts.

Each component is released under Apache 2.0, reflecting a unified commitment to open science and inclusiveness (Singh et al., 2024).

2. Dataset Composition and Sources

The Aya instruction-tuning mixture combines six major data sources, detailed as follows (post-filtering counts):

Dataset Source                              Languages   #Examples (M)
xP3x (pruned public templates)              101         168
Data Provenance (commercial corpora)        14          1.65
Aya Collection – templated                  61          18.9
Aya Dataset – human-annotated               64          0.20
Aya Collection – translated                 93          7.53
ShareGPT-Command (synthetic → translated)   93          6.80

  • xP3x: Extends the xP3 pool to 101 languages with human-pruned template curation, resulting in 168M examples from 56 supervised datasets and 16 task categories.
  • Data Provenance: Drawn from 161 supervised public corpora, with sampling capped at 20,000 examples per dataset; only corpora with self-reported commercially permissive licenses are included, and overlap with evaluation tasks is removed.
  • Aya Collection – templated: Around 18.9M prompt-response pairs from ~34 multilingual tasks, authored by native speakers to ensure cultural appropriateness.
  • Aya Dataset – human-annotated: 204,114 participatory, high-quality prompt-completion pairs created and reviewed by fluent speakers, covering 65 languages.
  • Aya Collection – translated: 7.53M prompts constructed by translating 19 English IFT datasets (including Dolly v2, Flan, HotpotQA, NQ-Open) via the NLLB model, with up to 3,000 samples per task/language except for Dolly’s full set.
  • ShareGPT-Command: Synthetic set in which English prompt-completion pairs come from ShareGPT or are generated by Cohere’s Command model, then NLLB-translated into 93 languages and filtered for language integrity and output length.

The resultant instruction mix contains approximately 203 million prompt-completion pairs spanning 101 languages at instruction-tuning time (Üstün et al., 2024).
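
The per-source caps described above (e.g. 20,000 examples per Data Provenance corpus, or up to 3,000 samples per task/language for translated sets) amount to a grouped truncation pass. A minimal sketch, with hypothetical field names and a toy corpus:

```python
import random
from collections import defaultdict

def cap_per_group(examples, key_fn, cap, seed=0):
    """Group examples by key_fn and keep at most `cap` per group,
    sampling uniformly when a group exceeds the cap."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[key_fn(ex)].append(ex)
    kept = []
    for group in groups.values():
        if len(group) > cap:
            group = rng.sample(group, cap)
        kept.extend(group)
    return kept

# Hypothetical usage: cap a corpus at 10 examples per source dataset.
corpus = [{"source": f"ds{i % 3}", "text": "..."} for i in range(100)]
capped = cap_per_group(corpus, lambda ex: ex["source"], cap=10)
```

The same pass works per (task, language) pair by changing the key function.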

3. Data Curation, Quality Control, and Preprocessing

Curation emphasizes both linguistic fidelity and data quality:

  • Annotation Workflow: The "find-fix-verify" workflow consists of original annotation, re-annotation (editing for increased fluency and informativeness), and peer review. A minimum Levenshtein distance of d ≥ 5 between the original and edited text ensures that edits are substantive.
  • Quality Metrics: Peer-review ratings yield a per-annotator mean quality Q̂_a ∈ [1, 5]; a gamified score Score_a = w1·E_a + w2·C_a is used, where w1 = max(0, Q̂_a − 3) and w2 = T+,a / T_a.
  • Review and Pruning: Two reviewers per template in xP3x; criteria for removal include brevity, duplication, or errors. 50.2% of English and 35.9% of multilingual templates were pruned, increasing average prompt length by 7% and 16.8%, respectively.
  • Filtering: Data Provenance includes only datasets with self-reported commercial permissiveness, with maximum sample capping per source. Synthetic and translated corpora also undergo output and translation quality filtering.
  • Tokenization and Packing: Standardized with the T5x→mT5 pipeline (SentencePiece vocabulary, 1024-token maximum).
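
The packing step can be illustrated with a greedy first-fit sketch. This is an assumed strategy for illustration, not the documented T5x implementation, and it takes pre-computed token counts as input:

```python
def pack_sequences(token_counts, max_len=1024):
    """Greedily pack example lengths into bins of at most max_len tokens,
    starting a new packed sequence whenever the next example won't fit.
    Examples longer than max_len are assumed truncated to max_len."""
    packs, current, used = [], [], 0
    for i, n in enumerate(token_counts):
        n = min(n, max_len)
        if used + n > max_len:
            packs.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        packs.append(current)
    return packs

# Each pack holds example indices whose lengths sum to at most 1024.
packs = pack_sequences([600, 300, 200, 1024, 50], max_len=1024)
```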

Human-annotated prompt completions average 56 characters (prompt) and 177 (completion); re-annotation increases length by ~25%. Longer, more thoroughly edited examples correlate with higher peer approval rates (correlation coefficient = 0.27) (Singh et al., 2024).
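
The edit-distance gate and the gamified score above can be sketched directly. Treating E_a as the annotator's edit count and C_a as their contribution count is an assumption; only the formula's shape comes from the source:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_substantive_edit(original, edited, min_dist=5):
    """Gate re-annotations: require Levenshtein distance d >= 5."""
    return levenshtein(original, edited) >= min_dist

def annotator_score(mean_quality, edits, contributions, thumbs_up, total):
    """Score_a = w1 * E_a + w2 * C_a, with w1 = max(0, Q_hat_a - 3)
    and w2 = T+,a / T_a (assumed thumbs-up over total reviews)."""
    w1 = max(0.0, mean_quality - 3.0)
    w2 = thumbs_up / total
    return w1 * edits + w2 * contributions
```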

4. Multilingual and Participatory Research Design

Aya's design and workflow prioritize global representation and participatory data creation:

  • Recruitment and Diversity: 2,997 registered contributors from 119 countries; demographic breakdown is 68% male, 28% female (ages 18–35 predominant).
  • Platform Accessibility: Desktop and mobile web access, with single sign-on via Discord/Google.
  • Gamification and Engagement: Points, leaderboards, badges, and Discord bot announcements incentivize sustained participation across eight months.
  • Language and Task Coverage: 65 languages (22 high-resource, 12 mid-resource, 31 low-resource) in the human-annotated dataset, and 114 languages in the full Collection; even low-resource, Wikipedia-sparse languages have 10^4–10^6 examples in the templated subset.
  • Quality Control: Peer review yields average approval ratios: original Aya ≈ 0.81, translated ≈ 0.70, templated ≈ 0.62, xP3 ≈ 0.50. All examples are reviewed by multiple annotators; the Ψ (thumb-up ratio) per dataset and language informs pruning.
  • Bias and Limitation Awareness: Risk of idiosyncratic bias, under-representation for under-contributed languages (e.g., Sindhi, Zulu), and cultural bias in prompt content are acknowledged.
  • Recommendations include: quotas, rotating annotator assignments, enhanced toxicity filtering, and dynamic, data-centric evaluation (Singh et al., 2024).
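
The Ψ thumb-up ratio used for pruning can be computed per (dataset, language) track with a simple tally; the tuple-based review format here is hypothetical:

```python
from collections import defaultdict

def approval_ratios(reviews):
    """reviews: iterable of (dataset, language, thumbs_up_bool).
    Returns the thumb-up ratio (Psi) per (dataset, language) track."""
    up = defaultdict(int)
    total = defaultdict(int)
    for dataset, lang, ok in reviews:
        key = (dataset, lang)
        up[key] += bool(ok)
        total[key] += 1
    return {k: up[k] / total[k] for k in total}

def prune(ratios, threshold=0.5):
    """Keep only tracks whose approval ratio meets the threshold."""
    return {k: v for k, v in ratios.items() if v >= threshold}
```

With the reported averages (original Aya ≈ 0.81, xP3 ≈ 0.50), a threshold of 0.5 would retain most tracks while flagging the weakest ones.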

5. Dataset Statistics, Coverage, and Licensing

The Aya instruction-tuning mixture is characterized by the following statistics:

  • Total size: ~203 million instruction examples (post-filtering and pruning) for model tuning.
  • Language coverage: 101 languages for training, classified as 23 high-resource, 26 mid-resource, and 52 low-resource per Joshi et al. (2020) typology. The Aya Collection reaches 114 languages (inclusive of dialects).
  • English balance: English constitutes 21.5% of the final mix (39% in xP3x alone).
  • Input and Output Lengths:
Source                 Avg Input (chars)   Avg Target (chars)
xP3x                   1048                780
Data Provenance        998                 78
Aya Templated          1864                209
Aya Human-Annotated    178                 501
Aya Translations       496                 219
ShareGPT-Command       385                 1080
  • Sampling Discipline: Final mix approximates 30% high-resource, 15% mid-resource, and 55% low-resource languages.
  • Safety distillation: Adds approximately 0.75 million (3%) safe instruction pairs in 11 evaluated languages.
  • Licensing: All content and models are released under Apache 2.0. License metadata is provided per file for datasets from templating, translation, and crowd-sourcing. xP3x and all other significant collections are likewise covered under Apache 2.0 (Üstün et al., 2024).
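
One way to realize the 30/15/55 sampling discipline is to assign each language a resampling weight so the expected tier proportions match the targets. This is an illustrative construction, not the documented sampling procedure:

```python
def tier_weights(lang_counts, lang_tier, targets):
    """Per-example sampling weight per language so that the expected mix
    matches per-tier targets (e.g. high=0.30, mid=0.15, low=0.55),
    splitting each tier's budget across its languages by raw count."""
    tier_totals = {}
    for lang, n in lang_counts.items():
        tier = lang_tier[lang]
        tier_totals[tier] = tier_totals.get(tier, 0) + n
    total = sum(lang_counts.values())
    # Weight relative to uniform sampling: a tier's examples are
    # up- or down-weighted by targets[tier] / (tier share of the pool).
    return {lang: targets[lang_tier[lang]] * total / tier_totals[lang_tier[lang]]
            for lang in lang_counts}
```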

6. Finetuning Mixture Weights and Downstream Use

Instruction-tuning is performed for 30,000 steps (approx. 25 million packed sequences) with mT5-13B using Adafactor (learning rate 3×10^-4), exploring three ablation mixtures:

Mixture Component        Aya-HA-Heavy   Aya-TM-Heavy   Aya-TR-Heavy (Final)
Human-Annotated (HA)     25%            4%             10%
Templates (TM)           4%             10%            1.5%
xP3x                     20%            30%            15%
Data Provenance          6%             10%            3.5%
Aya Translations (TR)    30%            20%            47.5%
ShareGPT-Command         15%            10%            22.5%

The Translation-Heavy configuration demonstrated superior performance on FLORES and open-ended benchmarks, and thus forms the production Aya model. This reflects strong reliance on automatic translation to rapidly expand multilingual reach.
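
The final Aya-TR-Heavy column can be read as a categorical distribution over source components. A sketch of drawing a training example's source from it (the component keys are hypothetical names):

```python
import random

# Final (Aya-TR-Heavy) mixture weights from the ablation table.
MIX = {
    "human_annotated":  0.10,
    "templates":        0.015,
    "xp3x":             0.15,
    "data_provenance":  0.035,
    "aya_translations": 0.475,
    "sharegpt_command": 0.225,
}

def sample_component(rng=random):
    """Draw one example's source component per the mixture weights."""
    names = list(MIX)
    return rng.choices(names, weights=[MIX[n] for n in names], k=1)[0]
```

A seeded `random.Random` instance can be passed for reproducible sampling.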

Intended applications include instruction fine-tuning for LLMs in diverse languages, cross-lingual question answering, open-domain dialogue, and generation in under-served linguistic communities. Limitations include underrepresentation of dialectal variation and continued risk of skew where individual annotator influence is high in certain language tracks (Üstün et al., 2024, Singh et al., 2024).

7. Evaluation, Benchmarking, and Future Directions

Aya’s evaluation suite comprises three prompt tiers:

  1. Human-Annotated Test Set: 250 prompts each in 7 typologically diverse languages (1,750 total).
  2. Dolly Machine-Translated: 200 English prompts translated automatically into 101 languages (20,200 total).
  3. Dolly Human-Edited: Post-edited versions in 6 languages by professional annotators (1,200 total).

Focus is on open-ended evaluation—brainstorming, planning, and long-form generation. Translation quality is assessed using post-edit HTER (up to ~37% word edits for Russian) and HChrF metrics. Human post-editing ensures that test prompts are both linguistically and culturally coherent.
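
HTER is essentially word-level edit distance between the raw machine translation and its human post-edit, normalized by the post-edit length. A minimal sketch (TER proper also counts phrase shifts, which this omits):

```python
def hter(machine_output, post_edited):
    """Word-level edit distance between a machine translation and its
    human post-edit, divided by the post-edit's word count. Assumes a
    non-empty post-edited reference."""
    hyp, ref = machine_output.split(), post_edited.split()
    prev = list(range(len(ref) + 1))
    for i, hw in enumerate(hyp, 1):
        cur = [i]
        for j, rw in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (hw != rw)))  # substitution
        prev = cur
    return prev[-1] / len(ref)
```

Under this reading, the reported ~37% HTER for Russian means roughly one word in three of the machine output was edited.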

Recommendations for future work highlight the need for expanding language and dialect coverage, addressing speaker and content bias, and implementing enhanced automated quality and toxicity screening.

The Aya Dataset, Aya Collection, and associated resources define a framework for scalable, participatory instruction-tuning, addressing persistent multilingual and sociolinguistic gaps within current large language modeling paradigms (Singh et al., 2024, Üstün et al., 2024).
