Sunflower 14B and 32B: Ugandan Language Models
- Sunflower 14B and 32B are instruction-tuned Qwen-3 models designed for high-accuracy translation across 31 Ugandan languages.
- They employ LoRA-based fine-tuning and reinforcement learning with Direct Preference Optimization to reduce hallucinations and improve cultural context handling.
- Applications span government, healthcare, education, and community services, with an emphasis on preserving cultural nuance and bridging digital language gaps.
Sunflower 14B and 32B refer to a pair of instruction-tuned LLMs developed to achieve state-of-the-art comprehension and practical utility across the major Ugandan languages. Their distinguishing feature is a rigorous regional focus, which shapes both the technical approach and the composition of the training data. Both models employ the Qwen-3 transformer architecture and are open-sourced to facilitate deployment in high-impact, multilingual settings such as government, healthcare, and education.
1. Model Architecture and Instruction Fine-Tuning
The Sunflower models exist in two parameter scales—14 billion (14B) and 32 billion (32B)—with the latter generally yielding higher translation accuracy for low-resource languages. Both architectures are derived from Qwen-3 and utilize standard transformer mechanisms.
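Because both checkpoints are open-sourced, they can be loaded with standard Hugging Face tooling. The following is a minimal sketch assuming a `transformers`-based setup; the repository id and prompt wording are illustrative placeholders rather than the official release details.

```python
# Minimal sketch: loading a Sunflower checkpoint with Hugging Face transformers.
# The repository id below is a hypothetical placeholder; substitute the actual
# published model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Sunbird/Sunflower-14B"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # the 14B/32B scales typically need bf16 and a large GPU
    device_map="auto",
)

# Chat-style translation request (prompt wording is illustrative, not an official template).
messages = [{"role": "user", "content": "Translate to Luganda: Where is the nearest health centre?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```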
Instruction fine-tuning is performed to adapt the models for a wide range of tasks, including:
- Bidirectional translation (xx→eng, eng→xx) across 31 Ugandan languages
- Context-sensitive question answering and summarization
- Culturally-specific and creative queries
Supervised fine-tuning uses LoRA (Low-Rank Adaptation) with rank 16, and the training loss is computed only on the response-side tokens of the chat data. This design reduces VRAM requirements and acts as an additional regularization mechanism, discouraging instruction echoing and other artifacts common in multi-turn conversation modeling.
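A minimal sketch of this setup is given below, assuming a `peft`/`transformers` stack; apart from the rank-16 setting, the base-model id, target modules, and other hyperparameters are illustrative assumptions rather than the Sunflower training recipe.

```python
# Sketch: rank-16 LoRA adapters with the loss restricted to response tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "Qwen/Qwen3-14B"  # Sunflower builds on Qwen-3; the exact base id may differ

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# Rank-16 adapters on the usual attention/MLP projections (typical, not confirmed, choices).
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

def build_example(prompt: str, response: str) -> dict:
    """Tokenize one chat turn and mask the prompt so loss falls only on the response."""
    prompt_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], add_generation_prompt=True
    )
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [-100] * len(prompt_ids) + response_ids,  # -100 tokens are ignored by the loss
    }

example = build_example(
    "Translate to English: Webale nnyo okutuyamba.",  # Luganda prompt
    "Thank you very much for helping us.",
)
```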
A subsequent reinforcement learning phase applies a variant of Direct Preference Optimization, Iterative Reasoning Preference Optimization (IRPO), in which each prompt is paired with both preferred and dispreferred candidate completions. The DPO loss is augmented with an additional likelihood term weighted by a mixing parameter to support complex preference ranking, with specific targeting of repetitive glitch loops and hallucinations that remained after initial fine-tuning.
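A hedged statement of this combined objective, following the general IRPO formulation rather than a confirmed Sunflower-specific recipe, is

$$\mathcal{L}(\theta) \;=\; -\log \sigma\!\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big) \;-\; \alpha\,\frac{\log \pi_\theta(y_w \mid x)}{\lvert y_w \rvert},$$

where $y_w$ and $y_l$ are the preferred and dispreferred completions for prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\beta$ is the DPO temperature, and $\alpha$ is the mixing parameter weighting the negative log-likelihood term on the preferred completion.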
2. Training Data Acquisition and Curation
The training data for Sunflower 14B and 32B is drawn from diverse and multimodal sources:
- Digital corpora including MADLAD-400, FLORES-200, Makerere MT Corpus, and SALT (parallel datasets for Luganda, Acholi, etc.)
- Web-scraped Ugandan news pieces, blogs, and community forums
- Printed educational and literary materials, digitized via OCR, with normalization to correct for diacritics and scanning errors
- Over 500 hours of talk-show and other audio, transcribed with a Whisper-Large v3 model fine-tuned for ten Ugandan languages (see the transcription sketch after this list)
- Community-sourced cultural documents, including folklore, proverbs, and phrasebooks
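A minimal sketch of such a transcription step is shown below, using the standard `transformers` ASR pipeline; the public `openai/whisper-large-v3` id is only a stand-in for the project's fine-tuned ten-language checkpoint, and the file name is a placeholder.

```python
# Sketch: transcribing talk-show audio with a Whisper-Large v3 checkpoint.
# "openai/whisper-large-v3" is the public base model; the fine-tuned
# ten-language checkpoint would be substituted here.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",   # stand-in for the fine-tuned checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    chunk_length_s=30,                 # long-form audio is processed in 30-second chunks
)

# Transcribe one recording; timestamps help align long talk-show segments.
result = asr("talkshow_episode_001.wav", return_timestamps=True)
print(result["text"])
```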
For languages with especially sparse resources, back-translation augmentation is utilized: synthetic examples are generated using an NLLB-based translation engine, which helps mitigate extreme data imbalance.
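A hedged sketch of this augmentation step follows, using the public `facebook/nllb-200-distilled-600M` checkpoint as a stand-in for the NLLB-based engine; the language codes and augmentation direction shown are illustrative.

```python
# Sketch: back-translation augmentation with an NLLB checkpoint.
# Monolingual Luganda sentences are translated into English so that the
# authentic Luganda text can serve as the target side of synthetic pairs.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

NLLB_ID = "facebook/nllb-200-distilled-600M"  # stand-in for the NLLB-based engine
tokenizer = AutoTokenizer.from_pretrained(NLLB_ID, src_lang="lug_Latn")  # Luganda as source
model = AutoModelForSeq2SeqLM.from_pretrained(NLLB_ID)

def back_translate(lug_sentences):
    """Return synthetic (English, Luganda) pairs from monolingual Luganda text."""
    inputs = tokenizer(lug_sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),  # decode into English
        max_new_tokens=128,
    )
    english = tokenizer.batch_decode(generated, skip_special_tokens=True)
    return list(zip(english, lug_sentences))

pairs = back_translate(["Omusawo ajja kukulaba enkya."])  # "The doctor will see you tomorrow."
```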
3. Regionally Focused Comprehension and Linguistic Transfer
The models are engineered to exploit the linguistic structure typical of the region (agglutinative morphology, shared phonetics, and interwoven cultural context) rather than attempting pan-African coverage.
This focus facilitates substantial cross-lingual transfer within the group. For example, the presence of shared grammatical constructs across many Ugandan languages yields denser coverage and improved performance even for dialects with few native training samples. The inclusion of oral, printed, and community-sourced data enables the models to process practical or culturally grounded queries, such as legal procedures, healthcare instructions, or local idioms.
Performance is evaluated with automatic translation metrics. For instance, Sunflower-32B attains a mean xx→eng chrF score of approximately $0.435$, outperforming generalist models (e.g., Gemini 2.5 Pro, GPT-4o) on test suites scored with BLEU, chrF, CER, and WER across 31 languages.
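These corpus-level scores can be computed with standard open-source tooling; the sketch below, using `sacrebleu` for BLEU/chrF and `jiwer` for CER/WER with placeholder hypothesis and reference lists, illustrates the metric definitions rather than the exact evaluation harness.

```python
# Sketch: computing the reported translation metrics (BLEU, chrF, CER, WER)
# with standard libraries; the sentences are placeholders.
import sacrebleu
from jiwer import cer, wer

hypotheses = ["The doctor will see you tomorrow."]   # model outputs (xx -> eng)
references = ["The doctor will see you tomorrow."]   # gold English translations

bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score   # 0-100 scale
chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score   # 0-100 scale
print(f"BLEU={bleu:.1f}  chrF={chrf / 100:.3f}")  # /100 to match the 0.435-style scale

# Character- and word-level error rates over the same corpus.
print(f"CER={cer(references, hypotheses):.3f}  WER={wer(references, hypotheses):.3f}")
```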
4. Practical Applications and Impact
Sunflower 14B and 32B address key needs:
| Application Domain | Impact | Features |
|---|---|---|
| Machine Translation | Reduces barriers in government, healthcare, education | Bidirectional support for 31 languages; leading chrF performance |
| Community Access | Empowers users in civic, legal, commercial engagements | Handles culturally-grounded queries, proverbs, idioms |
| Preservation of Culture | Supports digitization and interaction with oral/printed forms | Handles non-standard, low-resource forms and folklore |
Community-driven evaluation (in-person and online) is incorporated into the feedback loop, making the models responsive to native speaker concerns and deployment realities.
5. Developmental Challenges
Several significant challenges were encountered in constructing Sunflower 14B and 32B:
- Scarcity of High-Quality Digital Data: Many languages required text digitization or transcription, resulting in OCR artifacts and difficulties with consistent normalization.
- Resource Imbalance: Languages with fewer speakers and documents necessitated synthetic augmentation via back-translation, but balancing against overfitting to high-resource languages remained problematic.
- Multidomain, Multilingual Robustness: The data mix across domains (news, education, informal conversation) forced the models to handle rapid context shifts and code-switching.
- Mitigation of Glitching/Hallucination: Infinite loops and hallucinated outputs persisted despite instruction fine-tuning, leading to the adoption of preference optimization in the RL phase (e.g., DPO with mixing parameter control).
These issues are mitigated to various extents but are still active areas of research and development.
6. Evaluation and Metrics
Performance metrics are reported in terms of translation and comprehension, with comparisons to other open and proprietary LLMs:
| Model | Mean xx→eng chrF | BLEU & Other Metrics | Domains Evaluated |
|---|---|---|---|
| Sunflower-32B | 0.435 | Highest in 31 langs | Gov, Health, Edu |
| Gemini 2.5 Pro | Lower | Varied | Generalist |
| GPT-4o | Lower | Varied | Generalist |
Evaluation also includes step-by-step mathematical instructions, such as a worked algebraic solution that first subtracts $2$ and then takes a square root. Such examples demonstrate both technical instructional capability and syntactic accuracy for mathematical content.
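A hypothetical worked solution of the kind described (the specific equation is an assumption made only for illustration) would read:

$$x^2 + 2 = 6 \;\Rightarrow\; x^2 = 4 \quad (\text{subtract } 2) \;\Rightarrow\; x = \pm 2 \quad (\text{take the square root}).$$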
7. Future Directions and Limitations
A plausible implication is that the regionally focused paradigm seen in Sunflower 14B and 32B could be extended to other multi-language localities with similar sociolinguistic structures. Nonetheless, scalability to pan-African or global coverage is constrained by data availability, transfer learning limitations, and resource requirements.
Issues such as glitching, hallucination, and uneven language resource representation suggest the necessity for continued innovation in RL phase design, feedback integration, and synthetic data generation. Hallucination reduction and robust code-switching remain persistent challenges.
In summary, Sunflower 14B and 32B represent a comprehensive approach to localized language modeling, leveraging targeted data strategies and instruction-tuned architectures to set new performance standards for Ugandan languages—with broader implications for culturally-sensitive NLP and the design of multilingual LLMs (Akera et al., 8 Oct 2025).